Linux IP Virtual Server for Netfilter and Linux 2.4/2.6/3.x/4.x

Details and notes

History:
- 2018-JUN-17:
    - Add SCTP states
    - Add section for Requirements
- 2014-JAN-25:
    - Updated for Linux 3.13: hooks
    - Info for CONFIG_IP_VS_NFCT, xt_ipvs.ko

Requirements

# Using 'echo 1 > /proc/sys/net/ipv4/vs/conntrack' ?
From 4.10+ conntrack hooks are registered only when netfilter rules
require state inspection; without such rules, conntracks are not
created. conntrack=1 does not force hook registration, but IPVS-FTP
registers the conntrack hooks for mangling purposes starting from 4.18.

# Do not create netfilter conntracks for IPVS traffic (faster, not
# for FTP-NAT):
# Remote clients:
iptables -t raw -A PREROUTING -p tcp -d VIP --dport VPORT -j CT --notrack
# If using local clients to VIP:VPORT:
iptables -t raw -A OUTPUT -o lo -p tcp -d VIP --dport VPORT -j CT --notrack
# Now, if needed, instead of RELATED,ESTABLISHED use
# -m state --state UNTRACKED

- ip_vs_ftp (only for NAT method):

# Requires Netfilter conntracks
modprobe nf_conntrack_ftp
modprobe iptable_filter
modprobe nf_nat_ftp
# iptable_nat is needed from 2.6.36 (with NFCT in kernel):
modprobe iptable_nat

# For Passive FTP over NAT (works only on module load),
# before Linux 6.0:
echo "options nf_conntrack nf_conntrack_helper=1" > /etc/modprobe.d/helper.conf
or
echo 1 > /proc/sys/net/netfilter/nf_conntrack_helper
# Available from 3.5, needed for IPVS 4.7 - 5.19

# nf_conntrack_helper is not present in Linux 6.0.
# To assign a helper starting from Linux 6.0, compile
# 'raw table support (required for NOTRACK/TRACE)' and
# '"CT" target support (NEW)' and assign the 'ftp' helper for the
# FTP services:
iptables -t raw -A PREROUTING -p tcp -d VIP --dport 21 -j CT --helper ftp
# The same can be needed in OUTPUT for local clients.

# Make sure conntrack hooks are registered in kernels 4.10-4.17;
# add the needed netfilter rules for this. Otherwise, the PASV response
# from the real server is not mangled.
# IPv6 support for ip_vs_ftp (EPRT/EPSV): Linux 4.18+

# Using a local client? Avoid VIP as source address, use a different
# director IP:
ip route replace local VIP dev DEV proto kernel scope host src DIP

# Avoid drops when ip_vs_ftp is used. IPVS adjusts seq numbers,
# which confuses the TCP conntrack. After 2.6.22:
echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal
or before 2.6.22:
echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

# IPv4 tunnels on real server
modprobe ipip
ifconfig tunl0 0.0.0.0 up
echo 0 > /proc/sys/net/ipv4/conf/tunl0/rp_filter
# Add VIP on lo
ip addr add VIP/32 dev lo

# IPv6 tunnels on real server
modprobe ip6_tunnel
ifconfig ip6tnl0 up
# Add VIP on lo
ip -6 addr add VIP/128 dev lo preferred_lft 0

Configuration

CONFIG_IP_VS_NFCT (2.6.37+):
    - if enabled, allows IPVS to keep the netfilter conntrack created
      from the first packet for as long as the IPVS connection is
      alive. By default, IPVS destroys the conntrack on every packet.
      Tests show that when netfilter conntracking is active,
      CONFIG_IP_VS_NFCT=y with conntrack=1 works faster than
      CONFIG_IP_VS_NFCT=n or conntrack=0.
    - Values:
      y:
        - Use /proc/sys/net/ipv4/vs/conntrack=1 when IPVS packets need
          inspection by netfilter in POST_ROUTING or later in the
          INPUT hook. It allows xt_ipvs.ko to work in 2.6.37+
      n:
        - destroy the conntrack (skb->nfct) if present. Can be slow
          because the next packet creates a conntrack again.
xt_ipvs.ko:
    - In 2.6.37+ /proc/sys/net/ipv4/vs/conntrack must be set to 1
    - fixed to work in 3.0 (commit afb523c54718da57)

The Netfilter hooks in 3.13:

Priorities:
    NF_IP_PRI_CONNTRACK_DEFRAG = -400,
    NF_IP_PRI_RAW = -300,
    NF_IP_PRI_SELINUX_FIRST = -225,
    NF_IP_PRI_CONNTRACK = -200,
    NF_IP_PRI_MANGLE = -150,
    NF_IP_PRI_NAT_DST = -100,
    NF_IP_PRI_FILTER = 0,
    NF_IP_PRI_SECURITY = 50,
    NF_IP_PRI_NAT_SRC = 100,
    NF_IP_PRI_SELINUX_LAST = 225,
    NF_IP_PRI_CONNTRACK_HELPER = 300,
    NF_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,

IPv4 hooks:

PRE_ROUTING (ip_input.c:ip_rcv()):
    CONNTRACK_DEFRAG=-400, nf_defrag_ipv4.c:ipv4_conntrack_defrag()
        - hold IP fragments, return reassembled packet
    RAW=-300, iptable_raw.c:iptable_raw_hook()
        - before conntrack, useful for the NOTRACK target
    CONNTRACK=-200, nf_conntrack_l3proto_ipv4.c:ipv4_conntrack_in()
        - find or create conntrack
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
        - alter packet
    NAT_DST=-100, iptable_nat.c:nf_nat_ipv4_in()
        - DNAT: change daddr, dport
    FILTER=0:
        - 2.4: ip_fw_compat.c:fw_in, defrag, firewall, demasq, redirect
    FILTER+1=1:
        - 2.4: net/sched/sch_ingress.c:ing_hook

LOCAL_IN (ip_input.c:ip_local_deliver):
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
    FILTER=0, iptable_filter.c:iptable_filter_hook()
    LVS=98 (NAT_SRC-2), ip_vs_reply4():
        - check for reply from remote real server to client
        - must be before ip_vs_remote_request4()
    LVS=99 (NAT_SRC-1), ip_vs_remote_request4():
        - forward/schedule packets from remote clients to real server
        - must be before NAT_SRC
    NAT_SRC=100, iptable_nat.c:nf_nat_ipv4_fn()
        - SNAT: change saddr, sport
    LAST-1:
        - 2.4: ip_fw_compat.c:fw_confirm
    CONNTRACK_HELPER=300, nf_conntrack_l3proto_ipv4.c:ipv4_helper()
    CONNTRACK_CONFIRM=INT_MAX, nf_conntrack_l3proto_ipv4.c:ipv4_confirm()

FORWARD (ip_forward.c:ip_forward):
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
    FILTER=0:
        - iptable_filter.c:iptable_filter_hook()
        - 2.4: ip_fw_compat.c:fw_in(), firewall,
          LVS:check_for_ip_vs_out(), masquerade
    LVS=99:
        ip_vs_forward_icmp():
        - forward related ICMP errors from internal client to
          transparent proxy server
    LVS=100:
        ip_vs_reply4():
        - forward replies from remote real server to client

LOCAL_OUT (ip_output.c):
    CONNTRACK_DEFRAG=-400, nf_defrag_ipv4.c:ipv4_conntrack_defrag()
    RAW=-300, iptable_raw.c:iptable_raw_hook()
    CONNTRACK=-200, nf_conntrack_l3proto_ipv4.c:ipv4_conntrack_local()
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
    NAT_DST=-100, iptable_nat.c:nf_nat_ipv4_local_fn()
    LVS=-99 (NAT_DST+1), ip_vs_local_reply4():
        - check for reply from local real server
        - must be after NAT_DST
        - must be before ip_vs_local_request4()
    LVS=-98 (NAT_DST+2), ip_vs_local_request4():
        - forward/schedule packets from local clients
        - must be after NAT_DST
        - must be after ip_vs_local_reply4()
    FILTER=0, iptable_filter.c:iptable_filter_hook()

POST_ROUTING (ip_output.c:ip_finish_output):
    MANGLE=-150, iptable_mangle.c:iptable_mangle_hook()
    FILTER=0:
        - 2.4: ip_fw_compat.c:fw_in, firewall, unredirect,
          mangle ICMP replies
    LVS=NAT_SRC-1, ip_vs_post_routing (STOP and avoid double NAT)
    NAT_SRC=100, iptable_nat.c:nf_nat_ipv4_out()
    CONNTRACK_HELPER=300, nf_conntrack_l3proto_ipv4.c:ipv4_helper()
    CONNTRACK_CONFIRM=INT_MAX, nf_conntrack_l3proto_ipv4.c:ipv4_confirm()

Hooks per handler:
    CONNTRACK_DEFRAG: PRE_ROUTING, LOCAL_OUT
    CONNTRACK: PRE_ROUTING, LOCAL_OUT
    FILTER: LOCAL_IN, FORWARD, LOCAL_OUT
    MANGLE: PRE_ROUTING, LOCAL_IN, FORWARD, LOCAL_OUT, POST_ROUTING
    DNAT: PRE_ROUTING, LOCAL_OUT
    SNAT: LOCAL_IN, POST_ROUTING

The chains:

The out->in LVS requests (for any forwarding method) walk:
pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING

    LOCAL_IN
        ip_vs_in -> ip_route_output/dst cache ->
        mark skb->nfcache with special bit value ->
        2.4: ip_send -> POST_ROUTING
        2.6: -> LOCAL_OUT
    POST_ROUTING
        ip_vs_post_routing
        - Avoid double NAT.
This is done by maintaining skb->ipvs_property for 2.6 and
skb->nfcache & NFC_IPVS_PROPERTY for 2.4.

The in->out LVS replies (for LVS/NAT) walk:
pre_routing -> FORWARD -> POST_ROUTING

    FORWARD (check for related ICMP):
        ip_vs_forward_icmp -> local delivery ->
        mark skb->nfcache/ipvs_property ->
        (LOCAL_OUT for 2.6) -> POST_ROUTING
    FORWARD
        ip_vs_out -> NAT -> mark skb->nfcache/ipvs_property -> NF_ACCEPT
    POST_ROUTING
        ip_vs_post_routing (before 2.6.37)
        - Avoid double NAT. This is done by maintaining
          skb->nfcache & NFC_IPVS_PROPERTY for 2.4.

Why LVS is placed there:

- LVS creates connections after the packets are marked, i.e. after
PRE_ROUTING/LOCAL_IN:MANGLE:-150 or PRE_ROUTING:FILTER:0. LVS can use
the skb->nfmark as a virtual service ID.

- For 2.4, LVS must be after PRE_ROUTING:FILTER+1:sch_ingress.c - QoS
setups. This way the incoming traffic can be policed before reaching
LVS.

- LVS creates connections after the input routing because the routing
can decide to deliver locally packets that are marked, or other packets
specified with routing rules. Transparent proxying handled from the
netfilter NAT code is not always a good solution because it mangles the
destination address in the IP header.

- LVS needs to forward packets without looking in the IP header (the
direct routing method), so calling ip_route_input with arguments taken
only from the IP header is not useful for LVS. Also, netfilter can
reroute IPVS packets if some field is mangled with rules; we try to
avoid that, as it can be fatal for LVS-DR where daddr=VIP.

- LVS is after any firewall rules in LOCAL_IN and FORWARD

*** Requirements for the PRE_ROUTING chain ***

Sorry, we can't waste time here. The netfilter connection tracking can
mangle packets here, and at this point we don't know whether a packet
is for our virtual service (a new connection) or for an existing
connection (which needs a lookup in the LVS connection table).
We are sure that we can't make decisions about creating new connections
at this place, but a lookup for existing connections is possible under
some conditions: the packets must be defragmented (2.4 only), etc.
There are many nice modules in this chain that can feed LVS with
packets (probably modified).

*** Requirements for the LOCAL_IN chain ***

The conditions when the sk_buff comes:

- ip_local_deliver() defragments the packets (ip_defrag) for us
- the incoming sk_buff can be non-linear
- when the incoming sk_buff comes, only read access is guaranteed

What we do:

- packets generated locally are not considered because there is no
known forwarding method that can establish a connection initiated from
the director. 2.6.37+ can schedule requests in LOCAL_OUT.

- only TCP, UDP, SCTP and the ICMP errors related to them are
considered

- the protocol header must be present before doing any work based on
fields from the IP or protocol header. In 2.6 we extract parts of the
header on demand.

- we detect here packets for the virtual services or packets for
existing connections, and then the transmitter function for the used
forwarding method is called

- the NAT transmitter performs the following actions:

We try to make some optimizations for most of the traffic we see: the
normal traffic that is not bound to any application helper, i.e. when
the data part (payload) of the packets is not written or even not read
at all. In such a case, we change the addresses and the ports in the IP
and the protocol header but we don't do any checksum checking for them.
We perform an incremental checksum update after the packet is mangled
and rely on the real server to perform the full check (headers and
payload). If the connection is bound to some application helper (FTP
for example), we always perform checksum checking, with the assumption
that the data is usually changed and with the additional assumption
that the traffic using application helpers is low.
To perform such a check, the whole payload should be present in the
provided sk_buff. For this, we call functions to linearize the sk_buff
data by assembling all its data fragments. Before the addresses and
ports are changed, we should have write access to the packet data
(headers and payload). This guarantees that any other readers see the
packet data unchanged. The copy-on-write is performed by the
linearization function for packets that had many fragments. For all
other packets, we should copy the packet data (headers and payload) if
it is used by someone else (the sk_buff was cloned). The packets not
bound to application helpers need such write access only for the first
fragment, because for them only the IP and the protocol headers are
changed and we guarantee that these are in the first fragment. For the
packets using application helpers, the linearization is already done
and we are sure that there is only one fragment. As a result, we need
write access (copy if cloned) only for the first fragment. After the
application helper is called to update the packet data, we perform a
full checksum calculation.

- the DR transmitter performs the following actions:

Nothing special; it may be the shortest function. The only action is to
reroute the packet to the bound real server. If the packet is
fragmented, then ip_send_check() should be called to refresh the
checksum.

- the TUN transmitter performs the following actions:

Copies the packet if it is already referred to by someone else or when
there is no space for the IPIP prefix header. The packet is rerouted to
the real server. If the packet is fragmented, then ip_send_check()
should be called to refresh the checksum in the old IP header.

- if the packets must leave the box, we send them to POST_ROUTING via
ip_send and return NF_STOLEN for 2.4, or we send them to LOCAL_OUT and
return NF_STOP for 2.6 in ip_vs_post_routing.
This means that we remove the packet from the LOCAL_IN chain before it
reaches priority LAST-1. The LocalNode feature just returns NF_ACCEPT
without mangling the packet.

In this chain, if a packet is for an LVS connection (even a newly
created one), LVS calls ip_route_output (or uses a destination cache),
marks the packet as LVS property (sets a bit in skb->nfcache) and calls
ip_send() to jump to the POST_ROUTING chain (2.4), or jumps to
LOCAL_OUT and considers skb->ipvs_property in POST_ROUTING (2.6 before
2.6.37) to avoid double NAT. There our ip_vs_post_routing hook must
call the okfn for the packets with our special nfcache bit value (is
skb->nfcache used after the routing calls? We rely on the fact that it
is not) and return NF_STOLEN (2.4) or NF_STOP (2.6).

One side effect: LVS can forward packets even when ip_forward=0, but
only for the DR and TUN methods. For these methods not even the TTL is
decremented, nor is the data checksum checked.

*** Requirements for the FORWARD chain ***

FORWARD:99 - LVS checks first for ICMP packets related to TCP or UDP
connections. Such packets are handled as if they were received in the
LOCAL_IN chain - they are locally delivered. Used for transparent proxy
setups.

LVS looks in this chain for in->out packets, but only for the LVS/NAT
method. We can still see replies from the local real server, even for
the LVS/DR method. In any case, new connections are not created here;
the lookup is for existing connections only. In this chain the
ip_vs_out function can be called from many places:

FORWARD:0 (2.4) - the ipfw compat mode calls ip_vs_out between the
forward firewall and the masquerading. This way LVS can grab the
outgoing packets for its connections and avoid them being used by the
netfilter NAT code.

FORWARD:100 - ip_vs_out is registered after FILTER=0. We can come here
twice if the ipfw compat module is used, because ip_vs_out is called
once from FORWARD:0 (fw_in) and after that from pri=100, where LVS
always registers the ip_vs_out function.
We detect this second call by looking at the skb->nfcache bit value. If
the bit is set, we return NF_ACCEPT. In fact, the second ip_vs_out call
is avoided if the first returns NF_STOLEN after calling the okfn
function. The actions we perform are the same as in the LOCAL_IN chain
for the NAT transmitter, with the exception that we should call
ip_defrag(). The other difference is that we have write access to the
first fragment (it is not referred to by someone else) after
ip_forward() calls skb_cow().

*** Requirements for the POST_ROUTING chain ***

LVS marks the packets for debugging so they appear to come from
LOCAL_OUT, but this chain is not traversed in 2.4. The LVS requirements
for the POST_ROUTING chain include the fragmentation code only. But
even the ICMP messages are generated and mangled ready for sending long
before the POST_ROUTING chain: ip_send() does not call ip_fragment()
for the LVS packets because LVS returns ICMP_FRAG_NEEDED when the MTU
is shorter. 2.6 has different handling for fragmentation: LVS makes MTU
checks when accepting packets and selecting the output device, so the
ip_refrag POST_ROUTING hook is not used by LVS.

The result is: LVS must hook POST_ROUTING first (maybe only after the
ipfw compat filter) and return NF_STOLEN for its packets (detected by
checking the special skb->nfcache bit value). In 2.6.37 and above we do
not hook at POST_ROUTING (commit cf356d69db0afef6).
IPVS with SCTP

- ipvsadm with --sctp-service option: from version 1.28

INPUT-ONLY (DR/TUN):

    Client initiated:
        C: INIT => sI1
        C: COOKIE-ECHO => sES
    Server initiated:
        C: INIT-ACK => sCE
        C: COOKIE-ACK => sES
    Client shutdown:
        C: SHUTDOWN => sSR
        C: SHUTDOWN-COMPLETE => sCL
    Server shutdown:
        C: SHUTDOWN-ACK => sCL

INPUT+OUTPUT (NAT):

    Client initiated:
        C: INIT => sI1, sIN
        S: INIT-ACK => sCS
        C: COOKIE-ECHO => sCR
        S: COOKIE-ACK => sES
    Server initiated:
        S: INIT => sCW
        C: INIT-ACK => sCO
        S: COOKIE-ECHO => sCE
        C: COOKIE-ACK => sES
    Client shutdown:
        C: SHUTDOWN => sSR
        S: SHUTDOWN-ACK => sSA
        C: SHUTDOWN-COMPLETE => sCL
    Server shutdown:
        S: SHUTDOWN => sSS
        C: SHUTDOWN-ACK => sCL
        S: SHUTDOWN-COMPLETE => sCL
    Dual shutdown, client first:
        C: SHUTDOWN => sSR
        S: SHUTDOWN => sSR
        C: SHUTDOWN-ACK => sSR
        S: SHUTDOWN-ACK => sSA
        C: SHUTDOWN-COMPLETE => sCL
        S: SHUTDOWN-COMPLETE => sCL
    Dual shutdown, server first:
        S: SHUTDOWN => sSS
        C: SHUTDOWN => sSS
        S: SHUTDOWN-ACK: sSS => sSA, sCL => sCL
        or
        C: SHUTDOWN-ACK: sSS => sCL, sSA => sCL, sCL => sCL
        S: SHUTDOWN-COMPLETE: (impossible?: sSA => sSA), sCL => sCL
        or
        C: SHUTDOWN-COMPLETE: sSA => sCL, sCL => sCL

IPVS TODO (a wish list):

- redesign LVS to work in setups with multiple default routes (this
requires changes in the kernels, calling ip_route_input with different
arguments). The end goal: one routing call in any direction (as before)
but correct routing in the in->out direction. The problems:

    * In 2.6+ the Ingress qdisc is not at PRE_ROUTING:FILTER+1, it is
    just before bridging, so it is good for us

    * fwmark virtual services and the need to work at prerouting.
    Solution: hook at PREROUTING after the filter and do the connection
    creation there (after QoS, fwmark setup). Hook at prerouting,
    listen for traffic for established connections and call
    ip_route_input with the right arguments (possibly in the routing
    chain). Goal: always pass one filter chain in each direction
    (FORWARD). The fwmark is used only for connection setup and is then
    ignored.
    * hash the NAT connections twice in the same table (at prerouting
    we can see both requests and replies); compare with cp->vaddr to
    detect the right direction

    * working at PRE_ROUTING is dangerous. Some things do not work:
        - icmp_send() can be called after the input routing, or ICMP
        errors cannot be generated (port unreachable, need to fragment)

    * Current kernels are not helpful for VIP-less directors; the ICMP
    errors are not sent if the VIP is not a local IP address.

    * now IPVS exits the POST_ROUTING chain to avoid NAT processing in
    netfilter. Long ago the problem was that ip_vs_ftp mangles FTP
    data, and it was a disaster to pass IPVS packets into netfilter's
    hands. Now, after many changes in the latest kernels, I'm not sure
    what happens if netfilter sees IPVS traffic in POST_ROUTING. Such a
    change requires testing of ip_vs_ftp in both passive and active
    LVS-NAT mode, with different lengths of the IP address:port
    representation in the FTP commands, to check whether the resulting
    packets survive double NAT when the payload size is changed. It is
    the best test for IPVS to see whether netfilter additionally
    changes FTP packets, leading to a wrong payload. Other tests
    include testing fragmentation (an output device with lower MTU, in
    different forwarding methods, for both in->out and out->in
    traffic).

    * some users use local routes to deliver selected traffic to IPVS
    for scheduling. This was one of the ways to route traffic to real
    servers with transparent proxy support.

- help from Netfilter to redesign the kernel hooks:

    * ROUTING hook (used by netfilter's NAT, LVS-DR and in->out
    LVS-NAT)

    * fixed ip_route_input to do source routing with the masquerade
    address as source (lsrc argument), see
    http://ja.ssi.bg/dgd-usage.txt

    * more control over what to walk in the netfilter hooks?

- different timeouts for each virtual server (more control over the
connection timeouts)

- Allow LVS to be used as a NAT router/balancer for outgoing traffic:
in addition to TCP and UDP, add support to balance ICMP messages (not
ICMP errors)