Linux IP Virtual Server for Netfilter and Linux 2.4/2.6/3.x/4.x

Details and notes

History:
- 2018-JUN-17:
    - Add SCTP states
    - Add section for Requirements
- 2014-JAN-25:
    - Updated for Linux 3.13: hooks
    - Info for CONFIG_IP_VS_NFCT, xt_ipvs.ko

Requirements

# Using 'echo 1 > /proc/sys/net/ipv4/vs/conntrack' ?
From 4.10+ conntrack hooks are registered only when netfilter rules
require state inspection; without such rules, conntracks are not
created. conntrack=1 does not force hook registration, but IPVS-FTP
registers the conntrack hooks for mangling purposes starting from 4.18.

# Do not create netfilter conntracks for IPVS traffic (faster, not
# for FTP-NAT):
# Remote clients:
iptables -t raw -A PREROUTING -p tcp -d VIP --dport VPORT -j CT --notrack
# If using local clients to VIP:VPORT:
iptables -t raw -A OUTPUT -o lo -p tcp -d VIP --dport VPORT -j CT --notrack
# Now, if needed, instead of RELATED,ESTABLISHED use
# -m state --state UNTRACKED

- ip_vs_ftp (only for NAT method):

# Requires Netfilter conntracks
modprobe nf_conntrack_ftp
modprobe iptable_filter
modprobe nf_nat_ftp
# iptable_nat is needed from 2.6.36 (with NFCT in kernel):
modprobe iptable_nat

# For Passive FTP over NAT (works only on module load),
# before Linux 6.0:
echo "options nf_conntrack nf_conntrack_helper=1" > /etc/modprobe.d/helper.conf
or
echo 1 > /proc/sys/net/netfilter/nf_conntrack_helper
# Available from 3.5, needed for IPVS 4.7 - 5.19

# nf_conntrack_helper is not present in Linux 6.0.
# To assign a helper starting from Linux 6.0, compile
# 'raw table support (required for NOTRACK/TRACE)' and
# '"CT" target support (NEW)' and assign the 'ftp' helper for the
# FTP services:
iptables -t raw -A PREROUTING -p tcp -d VIP --dport 21 -j CT --helper ftp
# The same can be needed in OUTPUT for local clients.

# Make sure conntrack hooks are registered in kernels 4.10-4.17;
# add the needed netfilter rules for this. Otherwise, the PASV response
# from the real server is not mangled.
# IPv6 support for ip_vs_ftp (EPRT/EPSV): Linux 4.18+

# Using a local client? Avoid VIP as source address, use a different
# director IP:
ip route replace local VIP dev DEV proto kernel scope host src DIP

# Avoid drops when ip_vs_ftp is used. IPVS adjusts seq numbers,
# which confuses the TCP conntrack. After 2.6.22:
echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal
or before 2.6.22:
echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

# IPv4 tunnels on real server
modprobe ipip
ifconfig tunl0 0.0.0.0 up
echo 0 > /proc/sys/net/ipv4/conf/tunl0/rp_filter
# Add VIP on lo
ip addr add VIP/32 dev lo

# IPv6 tunnels on real server
modprobe ip6_tunnel
ifconfig ip6tnl0 up
# Add VIP on lo
ip -6 addr add VIP/128 dev lo preferred_lft 0

Configuration

CONFIG_IP_VS_NFCT (2.6.37+):
    - if enabled, allows IPVS to keep the netfilter conntrack created
      from the first packet for as long as the IPVS connection is
      alive. By default, IPVS destroys the conntrack on every packet.
      Tests show that when netfilter conntracking is active,
      CONFIG_IP_VS_NFCT=y with conntrack=1 works faster than
      CONFIG_IP_VS_NFCT=n or conntrack=0.
    - Values:
      y:
        - Use /proc/sys/net/ipv4/vs/conntrack=1 when IPVS packets need
          inspection by netfilter in POST_ROUTING or later in the
          INPUT hook. It allows xt_ipvs.ko to work in 2.6.37+
      n:
        - destroy the conntrack (skb->nfct) if present. Can be slow
          because the next packet creates a conntrack again.
xt_ipvs.ko:
    - In 2.6.37+ /proc/sys/net/ipv4/vs/conntrack must be set to 1
    - fixed to work in 3.0 (commit afb523c54718da57)

The Netfilter hooks in 3.13:

Priorities:
    NF_IP_PRI_CONNTRACK_DEFRAG = -400,
    NF_IP_PRI_RAW = -300,
    NF_IP_PRI_SELINUX_FIRST = -225,
    NF_IP_PRI_CONNTRACK = -200,
    NF_IP_PRI_MANGLE = -150,
    NF_IP_PRI_NAT_DST = -100,
    NF_IP_PRI_FILTER = 0,
    NF_IP_PRI_SECURITY = 50,
    NF_IP_PRI_NAT_SRC = 100,
    NF_IP_PRI_SELINUX_LAST = 225,
    NF_IP_PRI_CONNTRACK_HELPER = 300,
    NF_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,

IPv4 hooks:

PRE_ROUTING (ip_input.c:ip_rcv()):
    CONNTRACK_DEFRAG=-400, nf_defrag_ipv4.c:ipv4_conntrack_defrag()
        - hold IP fragments, return reassembled packet
    RAW=-300, iptable_raw.c:iptable_raw_hook()
        - before conntrack, useful for the NOTRACK target
    CONNTRACK=-200, nf_conntrack_l3proto_ipv4.c:ipv4_conntrack_in()
        - find or create conntrack
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
        - alter packet
    NAT_DST=-100, iptable_nat.c:nf_nat_ipv4_in()
        - DNAT: change daddr, dport
    FILTER=0:
        - 2.4: ip_fw_compat.c:fw_in, defrag, firewall, demasq, redirect
    FILTER+1=1:
        - 2.4: net/sched/sch_ingress.c:ing_hook

LOCAL_IN (ip_input.c:ip_local_deliver):
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
    FILTER=0, iptable_filter.c:iptable_filter_hook()
    LVS=98 (NAT_SRC-2), ip_vs_reply4():
        - check for reply from remote real server to client
        - must be before ip_vs_remote_request4()
    LVS=99 (NAT_SRC-1), ip_vs_remote_request4():
        - forward/schedule packets from remote clients to real server
        - must be before NAT_SRC
    NAT_SRC=100, iptable_nat.c:nf_nat_ipv4_fn()
        - SNAT: change saddr, sport
    LAST-1:
        - 2.4: ip_fw_compat.c:fw_confirm
    CONNTRACK_HELPER=300, nf_conntrack_l3proto_ipv4.c:ipv4_helper()
    CONNTRACK_CONFIRM=INT_MAX, nf_conntrack_l3proto_ipv4.c:ipv4_confirm()

FORWARD (ip_forward.c:ip_forward):
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
    FILTER=0:
        - iptable_filter.c:iptable_filter_hook()
        - 2.4: ip_fw_compat.c:fw_in(), firewall,
          LVS:check_for_ip_vs_out(), masquerade
    LVS=99:
        ip_vs_forward_icmp():
        - forward related ICMP errors from internal client to
          transparent proxy server
    LVS=100:
        ip_vs_reply4():
        - forward replies from remote real server to client

LOCAL_OUT (ip_output.c):
    CONNTRACK_DEFRAG=-400, nf_defrag_ipv4.c:ipv4_conntrack_defrag()
    RAW=-300, iptable_raw.c:iptable_raw_hook()
    CONNTRACK=-200, nf_conntrack_l3proto_ipv4.c:ipv4_conntrack_local()
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
    NAT_DST=-100, iptable_nat.c:nf_nat_ipv4_local_fn()
    LVS=-99 (NAT_DST+1), ip_vs_local_reply4():
        - check for reply from local real server
        - must be after NAT_DST
        - must be before ip_vs_local_request4()
    LVS=-98 (NAT_DST+2), ip_vs_local_request4():
        - forward/schedule packets from local clients
        - must be after NAT_DST
        - must be after ip_vs_local_reply4()
    FILTER=0, iptable_filter.c:iptable_filter_hook()

POST_ROUTING (ip_output.c:ip_finish_output):
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
    FILTER=0:
        - 2.4: ip_fw_compat.c:fw_in, firewall, unredirect,
          mangle ICMP replies
    LVS=NAT_SRC-1, ip_vs_post_routing (STOP and avoid double NAT)
    NAT_SRC=100, iptable_nat.c:nf_nat_ipv4_out()
    CONNTRACK_HELPER=300, nf_conntrack_l3proto_ipv4.c:ipv4_helper()
    CONNTRACK_CONFIRM=INT_MAX, nf_conntrack_l3proto_ipv4.c:ipv4_confirm()

Hooks per handler:
    CONNTRACK_DEFRAG: PRE_ROUTING, LOCAL_OUT
    CONNTRACK: PRE_ROUTING, LOCAL_OUT
    FILTER: LOCAL_IN, FORWARD, LOCAL_OUT
    MANGLE: PRE_ROUTING, LOCAL_IN, FORWARD, LOCAL_OUT, POST_ROUTING
    DNAT: PRE_ROUTING, LOCAL_OUT
    SNAT: LOCAL_IN, POST_ROUTING

The chains:

The out->in LVS requests (for any forwarding method) walk:
pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING

    LOCAL_IN
        ip_vs_in -> ip_route_output/dst cache ->
        mark skb->nfcache with special bit value ->
        2.4: ip_send -> POST_ROUTING
        2.6: -> LOCAL_OUT
    POST_ROUTING
        ip_vs_post_routing
        - Avoid double NAT.
This is done by maintaining skb->ipvs_property for 2.6 and
skb->nfcache & NFC_IPVS_PROPERTY for 2.4.

The in->out LVS replies (for LVS/NAT) walk:
pre_routing -> FORWARD -> POST_ROUTING

    FORWARD (check for related ICMP):
        ip_vs_forward_icmp -> local delivery ->
        mark skb->nfcache/ipvs_property ->
        (LOCAL_OUT for 2.6) -> POST_ROUTING
    FORWARD
        ip_vs_out -> NAT -> mark skb->nfcache/ipvs_property -> NF_ACCEPT
    POST_ROUTING
        ip_vs_post_routing (before 2.6.37)
        - Avoid double NAT. This is done by maintaining
          skb->nfcache & NFC_IPVS_PROPERTY for 2.4.

Why LVS is placed there:

- LVS creates connections after the packets are marked, i.e. after
PRE_ROUTING/LOCAL_IN:MANGLE:-150 or PRE_ROUTING:FILTER:0. LVS can use
the skb->nfmark as a virtual service ID.

- For 2.4, LVS must be after PRE_ROUTING:FILTER+1:sch_ingress.c - QoS
setups. This way the incoming traffic can be policed before reaching
LVS.

- LVS creates connections after the input routing because the routing
can decide to deliver locally packets that are marked, or other packets
specified with routing rules. Transparent proxying handled from the
netfilter NAT code is not always a good solution because it mangles the
destination address in the IP header.

- LVS needs to forward packets without looking in the IP header (the
direct routing method), so calling ip_route_input with arguments taken
only from the IP header is not useful for LVS. Also, netfilter can
reroute IPVS packets if some field is mangled with rules; we try to
avoid that, as it can be fatal for LVS-DR where daddr=VIP.

- LVS is after any firewall rules in LOCAL_IN and FORWARD

*** Requirements for the PRE_ROUTING chain ***

Sorry, we can't waste time here. The netfilter connection tracking can
mangle packets here, and at this point we don't know whether a packet
is for our virtual service (a new connection) or for an existing
connection (which needs a lookup in the LVS connection table).
We are sure that we can't make decisions about creating new connections
at this place, but a lookup for existing connections is possible under
some conditions: the packets must be defragmented (2.4 only), etc.
There are many nice modules in this chain that can feed LVS with
packets (probably modified).

*** Requirements for the LOCAL_IN chain ***

The conditions when the sk_buff comes:

- ip_local_deliver() defragments the packets (ip_defrag) for us
- the incoming sk_buff can be non-linear
- when the incoming sk_buff comes, only read access is guaranteed

What we do:

- packets generated locally are not considered because there is no
known forwarding method that can establish a connection initiated from
the director. 2.6.37+ can schedule requests in LOCAL_OUT.

- only TCP, UDP, SCTP and the ICMP errors related to them are
considered

- the protocol header must be present before doing any work based on
fields from the IP or protocol header. In 2.6 we extract parts of the
header on demand.

- we detect here packets for the virtual services or packets for
existing connections, and then the transmitter function for the used
forwarding method is called

- the NAT transmitter performs the following actions:

We try to make some optimizations for most of the traffic we see: the
normal traffic that is not bound to any application helper, i.e. when
the data part (payload) of the packets is not written or even not read
at all. In such a case, we change the addresses and the ports in the IP
and the protocol header but we don't do any checksum checking for them.
We perform an incremental checksum update after the packet is mangled
and rely on the real server to perform the full check (headers and
payload). If the connection is bound to some application helper (FTP
for example), we always perform checksum checking, with the assumption
that the data is usually changed and with the additional assumption
that the traffic using application helpers is low.
To perform such a check, the whole payload should be present in the
provided sk_buff. For this, we call functions to linearize the sk_buff
data by assembling all its data fragments. Before the addresses and
ports are changed, we should have write access to the packet data
(headers and payload). This guarantees that any other readers see the
packet data unchanged. The copy-on-write is performed by the
linearization function for packets that had many fragments. For all
other packets, we should copy the packet data (headers and payload) if
it is used by someone else (the sk_buff was cloned). The packets not
bound to application helpers need such write access only for the first
fragment, because for them only the IP and the protocol headers are
changed and we guarantee that these are in the first fragment. For the
packets using application helpers, the linearization is already done
and we are sure that there is only one fragment. As a result, we need
write access (copy if cloned) only for the first fragment. After the
application helper is called to update the packet data, we perform a
full checksum calculation.

- the DR transmitter performs the following actions:

Nothing special; it may be the shortest function. The only action is to
reroute the packet to the bound real server. If the packet is
fragmented, then ip_send_check() should be called to refresh the
checksum.

- the TUN transmitter performs the following actions:

Copies the packet if it is already referred to by someone else or when
there is no space for the IPIP prefix header. The packet is rerouted to
the real server. If the packet is fragmented, then ip_send_check()
should be called to refresh the checksum in the old IP header.

- if the packets must leave the box, we send them to POST_ROUTING via
ip_send and return NF_STOLEN for 2.4, or we send them to LOCAL_OUT and
return NF_STOP for 2.6 in ip_vs_post_routing.
This means that we remove the packet from the LOCAL_IN chain before it
reaches priority LAST-1. The LocalNode feature just returns NF_ACCEPT
without mangling the packet.

In this chain, if a packet is for an LVS connection (even a newly
created one), LVS calls ip_route_output (or uses a destination cache),
marks the packet as LVS property (sets a bit in skb->nfcache) and calls
ip_send() to jump to the POST_ROUTING chain (2.4), or jumps to
LOCAL_OUT and considers skb->ipvs_property in POST_ROUTING (2.6 before
2.6.37) to avoid double NAT. There our ip_vs_post_routing hook must
call the okfn for the packets with our special nfcache bit value (is
skb->nfcache used after the routing calls? We rely on the fact that it
is not) and return NF_STOLEN (2.4) or NF_STOP (2.6).

One side effect: LVS can forward packets even when ip_forward=0, but
only for the DR and TUN methods. For these methods not even the TTL is
decremented, nor is the data checksum checked.

*** Requirements for the FORWARD chain ***

FORWARD:99 - LVS checks first for ICMP packets related to TCP or UDP
connections. Such packets are handled as if they were received in the
LOCAL_IN chain - they are locally delivered. Used for transparent proxy
setups.

LVS looks in this chain for in->out packets, but only for the LVS/NAT
method. We can still see replies from the local real server, even for
the LVS/DR method. In any case, new connections are not created here;
the lookup is for existing connections only. In this chain the
ip_vs_out function can be called from many places:

FORWARD:0 (2.4) - the ipfw compat mode calls ip_vs_out between the
forward firewall and the masquerading. This way LVS can grab the
outgoing packets for its connections and avoid them being used by the
netfilter NAT code.

FORWARD:100 - ip_vs_out is registered after FILTER=0. We can come here
twice if the ipfw compat module is used, because ip_vs_out is called
once from FORWARD:0 (fw_in) and after that from pri=100, where LVS
always registers the ip_vs_out function.
We detect this second call by looking at the skb->nfcache bit value. If
the bit is set, we return NF_ACCEPT. In fact, the second ip_vs_out call
is avoided if the first returns NF_STOLEN after calling the okfn
function. The actions we perform are the same as in the LOCAL_IN chain
for the NAT transmitter, with the exception that we should call
ip_defrag(). The other difference is that we have write access to the
first fragment (it is not referred to by someone else) after
ip_forward() calls skb_cow().

*** Requirements for the POST_ROUTING chain ***

LVS marks the packets for debugging so they appear to come from
LOCAL_OUT, but this chain is not traversed in 2.4. The LVS requirements
for the POST_ROUTING chain include the fragmentation code only. But
even the ICMP messages are generated and mangled ready for sending long
before the POST_ROUTING chain: ip_send() does not call ip_fragment()
for the LVS packets because LVS returns ICMP_FRAG_NEEDED when the MTU
is shorter. 2.6 has different handling for fragmentation: LVS makes MTU
checks when accepting packets and selecting the output device, so the
ip_refrag POST_ROUTING hook is not used by LVS.

The result is: LVS must hook POST_ROUTING first (maybe only after the
ipfw compat filter) and return NF_STOLEN for its packets (detected by
checking the special skb->nfcache bit value). In 2.6.37 and above we do
not hook at POST_ROUTING (commit cf356d69db0afef6).
IPVS with SCTP

- ipvsadm with --sctp-service option: from version 1.28

INPUT-ONLY (DR/TUN):

    Client initiated:
        C: INIT => sI1
        C: COOKIE-ECHO => sES
    Server initiated:
        C: INIT-ACK => sCE
        C: COOKIE-ACK => sES
    Client shutdown:
        C: SHUTDOWN => sSR
        C: SHUTDOWN-COMPLETE => sCL
    Server shutdown:
        C: SHUTDOWN-ACK => sCL

INPUT+OUTPUT (NAT):

    Client initiated:
        C: INIT => sI1, sIN
        S: INIT-ACK => sCS
        C: COOKIE-ECHO => sCR
        S: COOKIE-ACK => sES
    Server initiated:
        S: INIT => sCW
        C: INIT-ACK => sCO
        S: COOKIE-ECHO => sCE
        C: COOKIE-ACK => sES
    Client shutdown:
        C: SHUTDOWN => sSR
        S: SHUTDOWN-ACK => sSA
        C: SHUTDOWN-COMPLETE => sCL
    Server shutdown:
        S: SHUTDOWN => sSS
        C: SHUTDOWN-ACK => sCL
        S: SHUTDOWN-COMPLETE => sCL
    Dual shutdown, client first:
        C: SHUTDOWN => sSR
        S: SHUTDOWN => sSR
        C: SHUTDOWN-ACK => sSR
        S: SHUTDOWN-ACK => sSA
        C: SHUTDOWN-COMPLETE => sCL
        S: SHUTDOWN-COMPLETE => sCL
    Dual shutdown, server first:
        S: SHUTDOWN => sSS
        C: SHUTDOWN => sSS
        S: SHUTDOWN-ACK: sSS => sSA, sCL => sCL
        or
        C: SHUTDOWN-ACK: sSS => sCL, sSA => sCL, sCL => sCL
        S: SHUTDOWN-COMPLETE: (impossible?: sSA => sSA), sCL => sCL
        or
        C: SHUTDOWN-COMPLETE: sSA => sCL, sCL => sCL

IPVS TODO (a wish list):

- redesign LVS to work in setups with multiple default routes (this
requires changes in the kernels, calling ip_route_input with different
arguments). The end goal: one routing call in any direction (as before)
but correct routing in the in->out direction. The problems:

    * In 2.6+ the Ingress qdisc is not at PRE_ROUTING:FILTER+1, it is
    just before bridging, so it is good for us

    * fwmark virtual services and the need to work at prerouting.
    Solution: hook at PREROUTING after the filter and do the connection
    creation there (after QoS, fwmark setup). Hook at prerouting,
    listen for traffic for established connections and call
    ip_route_input with the right arguments (possibly in the routing
    chain). Goal: always pass one filter chain in each direction
    (FORWARD). The fwmark is used only for connection setup and is then
    ignored.
    * hash the NAT connections twice in the same table (at prerouting
    we can see both requests and replies); compare with cp->vaddr to
    detect the right direction

    * working at PRE_ROUTING is dangerous. Some things do not work:
        - icmp_send() can be called after the input routing, or ICMP
        errors cannot be generated (port unreachable, need to fragment)

    * Current kernels are not helpful for VIP-less directors; the ICMP
    errors are not sent if the VIP is not a local IP address.

    * now IPVS exits the POST_ROUTING chain to avoid NAT processing in
    netfilter. Long ago the problem was that ip_vs_ftp mangles FTP
    data, and it was a disaster to pass IPVS packets into netfilter's
    hands. Now, after many changes in the latest kernels, I'm not sure
    what happens if netfilter sees IPVS traffic in POST_ROUTING. Such a
    change requires testing of ip_vs_ftp in both passive and active
    LVS-NAT mode, with different lengths of the IP address:port
    representation in the FTP commands, to check whether the resulting
    packets survive double NAT when the payload size is changed. It is
    the best test for IPVS to see whether netfilter additionally
    changes FTP packets, leading to a wrong payload. Other tests
    include testing fragmentation (an output device with lower MTU, in
    different forwarding methods, for both in->out and out->in
    traffic).

    * some users use local routes to deliver selected traffic to IPVS
    for scheduling. This was one of the ways to route traffic to real
    servers with transparent proxy support.

- help from Netfilter to redesign the kernel hooks:

    * ROUTING hook (used by netfilter's NAT, LVS-DR and in->out
    LVS-NAT)

    * fixed ip_route_input to do source routing with the masquerade
    address as source (lsrc argument), see
    http://ja.ssi.bg/dgd-usage.txt

    * more control over what to walk in the netfilter hooks?

- different timeouts for each virtual server (more control over the
connection timeouts)

- Allow LVS to be used as a NAT router/balancer for outgoing traffic:
in addition to TCP and UDP, add support to balance ICMP messages (not
ICMP errors)