Alternative Routes and Dead Gateway Detection for Linux

Julian Anastasov, October 2001

CONTENTS:

0. History and Introduction
1. Static Routes
2. Alternative routes and dead gateway detection
3. Always use the preferred source address as source in our ARP probes
4. Use the gateway as routing key
5. Search for input route by local source IP address (Linux 2.4)
6. Masquerading connection rerouting and incremental checksum updates
   (Linux 2.2)
7. Netfilter selects the right route (Linux 2.4)


0. History and Introduction

29-NOV-2002 - New patches for 2.4.20
    05_nf_reroute-2.4.20-9.diff
    routes-2.4.20-9.diff

04-MAY-2002 - New patches for 2.4.19 (actually against pre8):
    00_static_routes-2.4.19-8.diff
    01_alt_routes-2.4.19-8.diff
    05_nf_reroute-2.4.19-8.diff
    routes-2.4.19-8.diff

03-FEB-2002 - New patches for 2.2:
    routes-2.2.20-7.diff
    01_alt_routes-2.2.20-7.diff
    02_masq_csum_reroute-2.2.20-7.diff
    05_key_gw-2.2.20-7.diff
    routes-2.2.20-IPVS-1.0.8-7.diff
    02_masq_csum_reroute-2.2.20-IPVS-1.0.8-7.diff

14-DEC-2001 - New patches for 2.4 and 2.2:
    routes-2.4.16-6.diff
    routes-hidden-forward_shared-noarp-2.4.16-2.diff
    00_static_routes-2.4.16-6.diff
    routes-2.2.20-6.diff
    00_static_routes-2.2.20-6.diff

24-NOV-2001 - New patches
    05_nf_reroute-2.4.14-5pre2.diff replaces pre1
    routes-2.4.14-5.diff

16-NOV-2001 - New patch
    05_nf_reroute-2.4.14-5pre1.diff

10-NOV-2001 - New patches
    02_masq_csum_reroute-2.2.20-IPVS-1.0.8-4.diff
    06_key_gw-2.2.20-IPVS-1.0.8-4.diff
    Note: IPVS 1.0.8 is applied before all these patches

24-OCT-2001 - New patches added:
    00_static_routes-2.4.12-5.diff
    01_alt_routes-2.4.12-5.diff
    01_arp_prefsrc-2.4.12-5.diff

14-OCT-2001 - New patches added, now routes-2.2.19-4.diff contains:
    00_static_routes-2.2.19-4.diff
    01_alt_routes-2.2.19-4.diff
    01_arp_prefsrc-2.2.19-4.diff
    02_masq_csum_reroute-2.2.19-4.diff
    05_key_gw-2.2.19-4.diff

11-OCT-2001 - Document created

This document contains information about extending the routing
functionality in Linux to support alternative routes and better
dead gateway detection. The current status of these extensions is:

- single patch against 2.2.19+, completed, publicly available
- single patch against 2.4.16+, completed (mostly), publicly available

The patch includes the following independent parts:

- static routes
- alternative routes and dead gateway/neighbour detection
- always use prefsrc as source address in our ARP probes
- masquerading connection rerouting and incremental checksum updates
- use the gateway as routing key
- define local source IP address for the input routes, used by NAT

Patches for Linux 2.2:

  one of:

    - for non-LVS users: routes-2.2.20-7.diff
    - for LVS users: routes-2.2.20-IPVS-1.0.8-7.diff

  or, if applied one by one:

    00_static_routes-2.2.20-6.diff

    01_alt_routes-2.2.20-7.diff

    01_arp_prefsrc-2.2.19-4.diff

    02_masq_csum_reroute-2.2.20-7.diff
        Requires 01_alt_routes-2.2.20-7.diff
        Not used by LVS users

    02_masq_csum_reroute-2.2.20-IPVS-1.0.8-7.diff
        Used by LVS users instead of
        02_masq_csum_reroute-2.2.20-7.diff
        Requires 01_alt_routes-2.2.20-7.diff

    05_key_gw-2.2.20-7.diff
        Requires 00_static_routes-2.2.20-6.diff,
        01_alt_routes-2.2.20-7.diff, 01_arp_prefsrc-2.2.19-4.diff
        and one of 02_masq_csum_reroute-2.2.20-7.diff or
        02_masq_csum_reroute-2.2.20-IPVS-1.0.8-7.diff

    06_key_gw-2.2.20-IPVS-1.0.8-4.diff
        Used by LVS users only

Patches for Linux 2.4:

    00_static_routes-2.4.19-8.diff

    01_alt_routes-2.4.19-8.diff

    01_arp_prefsrc-2.4.12-5.diff

    05_nf_reroute-2.4.20-9.diff
        Requires 00_static_routes-2.4.19-8.diff,
        01_alt_routes-2.4.19-8.diff and 01_arp_prefsrc-2.4.12-5.diff
        Job: new routing keys (gw, lsrc), netfilter
        prerouting/rerouting


1. Static routes

The current Linux versions distinguish routes by the way they are
installed and maintained. In the include/linux/rtnetlink.h header file
we can see types such as "KERNEL", "BOOT", "STATIC" and other types
mostly identifying different routing daemons.
For type "STATIC" we can see a description such as "Route installed by
administrator" and later "Values of protocol >= RTPROT_STATIC are not
interpreted by kernel". By default, the kernel creates routes of type
"KERNEL" and "BOOT". The "STATIC" routes are not touched (maybe only
by Gated). The interesting part is that some "KERNEL" routes can be
deleted when a device's status changes and, of course, created
automatically when the device goes up. But what we see is that only
"KERNEL" routes are recreated automatically by the kernel. The reason
is that the kernel manages them as a result of adding or deleting IP
addresses.

The problem we see in some situations is that we can't create a route
that survives device state changes. This happens because the routes we
create are of type "BOOT" by default and the kernel can't recreate
them after they are removed. You can ask "Why are they deleted and not
marked dead?". Good question. There can be many routes in the system.
The question we answer with this patch is "Can the administrator
install a route that survives device state changes?".

To support such routes we add code in the kernel to handle routes of
type "STATIC". We assume that the routing daemons do not change such
routes because they use their own route types. So, we can safely add
"STATIC" routes while the device they use is down, a state probably
caused by the device driver trying to tell us that we should avoid
sending traffic at the moment.

The proposed support for static routes handles the following
situations:

- when a static route is created, each path defined in this route is
  checked and if the used device is down the path is marked as dead.
  The current kernel refuses to create such routes when the device is
  down.
- when a new address is added (followed by creating a "KERNEL" link
  route that validates the gateway in our static route) or a device
  state is changed, each path in all static routes is checked and
  possibly marked alive if its gateway becomes reachable.

- when a device goes down, all paths in static routes that use the
  device are marked as dead. The current kernel versions remove all
  routes with all paths marked dead. In our additions we don't allow
  it.

- when a device is unregistered, it is recommended that the
  administrator recreate the static routes that use this device (by
  removing the paths containing this device). The current device
  scheme does not allow marking as alive the paths in static routes if
  a newly created device has the same device name as the old
  unregistered one. When all paths in a multipath static route contain
  unregistered devices, the route is automatically removed.

- as a result, the static routes are preserved as much as possible
  according to the presence and state of the used devices and
  preferred source IP addresses

The examples:

With "proto static" the following command succeeds even when device
ppp0 is DOWN:

    ip route add 10.0.0.0/24 table 100 proto static \
        nexthop dev ppp0 \
        nexthop via 192.168.0.1 dev eth0

With "proto static" the following route will not be deleted when
device wan0 goes down:

    ip route add 10.0.0.0/24 table 100 proto static dev wan0


2. Alternative routes and dead gateway detection

Many users create two or more routes, each on a different device. They
expect that when one of the devices fails the others will be used.
After reading the part about "static routes" we know that the second
and following routes will be used only after the first is removed
(automatically by the kernel). We know that this can happen only when
the device is marked down or a preferred IP address is removed. Let's
assume we have two routes for one network, each with a different
device, maybe connected to different hubs.
In such a setup we can see that even when our two links are operating
properly and we see the other hosts through each device, we use only
the first route for outgoing traffic. No problem. The problem comes
when the link of another host fails and it becomes unreachable through
our first route while the second is still working properly. The
question is "Can we use all routes to one network according to the
neighbour reachability status we observe over these links?". Such a
solution will allow us to use one route for some of the neighbours and
at the same time other routes for other neighbours. We answer these
questions with the support for alternative routes and the passive
inspection of the neighbours' status.

Maybe we already know examples of alternative routes, some of them
supported only by this patch:

Alternative direct routes:

1> One network on many devices:

    ifconfig eth0:0 192.168.0.1
    ifconfig eth1:0 192.168.0.2

2> Similar setup with "ip":

    ip route append 10.0.0.0/24 dev eth0
    ip route append 10.0.0.0/24 dev eth1

Alternative gatewayed routes:

    ip route append 10.0.0.0/24 via 192.168.0.1 dev eth0
    ip route append 10.0.0.0/24 via 192.168.0.2 dev eth0
    ip route append 10.0.0.0/24 via 192.168.1.1 dev eth1

Mixed alternative routes (we have many default routes, each has its
preferred source address, some of them are multipath):

    ip route append default table 100 src 192.168.0.100 \
        via 192.168.0.1 dev eth0
    ip route append default table 100 src 192.168.1.100 \
        nexthop via 192.168.1.1 dev eth0 \
        nexthop via 192.168.1.2 dev eth0
    ip route append default table 100 src 192.168.2.100 \
        dev wan1

Maybe the most used example (masquerade through many ISPs):

    WORLD                   WORLD
     |                       |
    ISP1                    ISP2
    10.0.1.1                10.0.2.1
       \wan0            wan1/
        10.0.1.2    10.0.2.2
            NAT ROUTER
               |192.168.0.1
    ----+-------------------+---
    Internal Boxes
    192.168.0.3

    # First, check the link routes by destination
    # No default routes in table main
    ip rule add prio 10 table main

    # Public IP ranges, use strict paths to universe, by source
    ip rule add \
        prio 20 from 10.0.1.0/24 table 20
    ip route append default via 10.0.1.1 dev wan0 src 10.0.1.2 table 20

    ip rule add prio 30 from 10.0.2.0/24 table 30
    ip route append default via 10.0.2.1 dev wan1 src 10.0.2.2 table 30

    # Cycle through all gateways for traffic coming from the
    # reserved range. Now the NAT-ed hosts can utilize all links
    ip rule add prio 100 from 192.168.0.0/16 table 100
    ip route add default table 100 \
        nexthop via 10.0.1.1 dev wan0 \
        nexthop via 10.0.2.1 dev wan1

    # Select proper source address through each gateway. The locally
    # generated traffic can select source IP address from the following
    # routes, by their order and according to the neighbour state
    ip rule add prio 200 table 200
    ip route append default via 10.0.1.1 dev wan0 src 10.0.1.2 table 200
    ip route append default via 10.0.2.1 dev wan1 src 10.0.2.2 table 200

The essential requirement is that we use different source IPs when
talking through each ISP (10.0.1.2 or 10.0.2.2). We rely on the
multipath scheduler to balance the masqueraded traffic. Of course,
don't forget the NAT rules for 192.168.0.0/16. In Linux 2.2 we can add
NAT rules in two ways:

- by adding "nat 0" to the ip rules (in our case we modify our rule):

    ip rule add prio 100 from 192.168.0.0/16 table 100 nat 0

- by using the ipchains command:

    ipchains -A forward -s 192.168.0.0/16 -j MASQ

For Linux 2.4 and above the equivalent iptables commands can be used.

There is another solution where the router itself can use all links,
i.e. this becomes possible not only for the internal hosts. For this,
we have to use one table for the routes from tables 100 and 200. As a
result, the setup could be:

    ip rule add prio 10 table main

    ip rule add prio 20 from 10.0.1.0/24 table 20
    ip route append default via 10.0.1.1 dev wan0 src 10.0.1.2 table 20

    ip rule add prio 30 from 10.0.2.0/24 table 30
    ip route append default via 10.0.2.1 dev wan1 src 10.0.2.2 table 30

    # Don't specify preferred source address here.
    # The kernel, after choosing one path (output device and gateway),
    # will select the best primary IP address from the output device
    # suitable for the used gateway
    ip rule add prio 100 table 100
    ip route add default table 100 \
        nexthop via 10.0.1.1 dev wan0 \
        nexthop via 10.0.2.1 dev wan1

The above solution is suitable only for setups that use gateways from
different subnets in the multipath route and when the gateway IP and
our local primary IP from the gateway's net are public IP addresses.
In all other cases, where the source IP needs to be different from the
primary IP address, the previous solution should be used.

How can we summarize these different setups? We usually represent our
routing setup as a sequence of ip rules matching the FROM and/or TO
fields and selecting a route table, where a list of routes selects the
output device and optionally the nexthop neighbour. Sometimes we use
this information only to select a source address for this route: the
preferred source address specified in the route.

The current Linux versions contain most of the requirements to support
alternative routes. In fact, alternative routes exist but only for
default routes, i.e. for 0/0. The other drawback is that they are used
only when the route lookup does not specify an output device. As a
result, the usage is reduced to selecting output default routes for
locally generated traffic that is not bound to any device, i.e.
something like:

    ip route append default via 192.168.0.1 dev eth0 src 192.168.0.2
    ip route append default via 192.168.1.1 dev eth0 src 192.168.1.2

At least, we have a proper selection of the source address according
to the used route and gateway. This is useful in setups where we must
use different IP addresses through each of the gateways.
This is better than the other solution, where a multipath route is
used, such as:

    ip route add default table 100 \
        nexthop via 192.168.0.1 dev eth0 \
        nexthop via 192.168.1.1 dev eth0

In any case, the selection of one alternative route uses the neighbour
reachability state, the device state of each path and other
information to select the best path. In this way we can avoid failed
devices and failed gateways, or reach a neighbour through another
device. The latter is achieved only after applying the patch.

As a result, the support for alternative routes covers the following
things:

* alternative routes are those routes that are created in the same
  route table and have equal route destination, prefix length, metric
  and scope. This is an essential requirement for the reverse path
  (rp_filter) protection because it avoids reaching less specific
  routes when looking up by a specific output device.

* the alternative routes are usually created by using the
  "ip route append" command as shown in the above examples

* the alternative routes are considered when an input route (incoming
  packet) or an output route (locally generated packet) is selected,
  with a specific or non-specific output device

* the alternative routes check the neighbour state not only for
  gateways but also for hosts, i.e. for any kind of neighbour. Note
  that in some cases the neighbour can remain in reachable state while
  its nexthops have failed. For example, the gateway may even be a
  proxy ARP server, so that the gateway IP always remains in reachable
  state. In such a case we cannot notice the real state of the
  gateway's IP.

* the alternative routes can be a list of unipath or multipath routes,
  using NOARP and ARP devices. As a result, the first alive or first
  suspected (but not dead) route is selected by inspecting the state
  of the gateways in each path or the neighbours through the device
  used by the path.
* as a result we take care of the state of each path in a multipath
  route and we try to use only the alive paths, considering their
  relative weights. That is why a kernel patched with support for
  alternative routes should be used together with user-space tools
  that refresh the neighbour state in the ARP cache. Without such
  refresh the kernel can select random nexthops from multipath routes
  (preferring NOARP devices) or from alternative routes.

* as a result the masquerading can select the source address from the
  proper route path by specific output device

* as a result the reverse path protection (rp_filter) restricts the
  access for the input traffic only to the devices specified in the
  routes, not to all devices. We note this especially for the
  multipath routes.


3. Always use the preferred source address as source in our ARP probes

To avoid inappropriate announcements of one IP address in our ARP
probes through many devices, we change the policy to always use the
preferred source address obtained from the route to the ARP probe
recipient.


4. Use the gateway as routing key

This helps the masquerading to select the right source address when
the packet follows a multipath input route with two paths that use the
same device. The lookup by gateway allows us to hit the same path
already attached to the packet.

In Linux 2.2 it is used only by masquerading, for the first packet in
each connection.

In Linux 2.4 it is used in this way:

- MASQUERADE: in postrouting, when the connection selects the
  masquerading address

- when ipchains and ipfwadm masquerade the first packet in each
  connection


5. Search for input route by local source IP address (Linux 2.4)

A new argument is added to ip_route_input: lsrc. It is cached and
allows the NAT to specify the external IP address that will be used
after routing (in postrouting). In this way the MASQUERADE and DNAT
packets that leave the box with a new source address (and some SNAT
packets) are routed to the right neighbour.
For SNAT the first packet can be routed to the wrong neighbour, so a
new match is required that will select the right SNAT rule with the
right external address for the output device and the gateway already
selected by the input routing. SNAT will work when all paths in the
multipath routes have different devices. And, as a result, no ICMP
redirects should be sent when the source and the destination are
reachable through the same device.


6. Masquerading connection rerouting and incremental checksum updates
   (Linux 2.2)

With the ability to reroute the masqueraded packets for each
connection we try to select the right route for each packet when
multiple gateways are used. In this way we avoid problems after the
routes are changed and the cache is flushed.

The patch additionally adds incremental checksum updates for the
masqueraded traffic. They are applied only when the addresses and the
ports are translated. This includes most of the TCP/UDP traffic, for
which the payload is not changed. If the masquerading applications
change the payload we perform a full checksum calculation. Such is the
case with the FTP data connections, for example.


7. Netfilter selects the right route (Linux 2.4)

The changes include:

- key "gw" for the output route calls - find route using the specified
  gateway in addition to the output device

- key "lsrc" for the input route call - use this source IP

- IPTables MASQUERADE target:

  At POSTROUTING, when the first packet for the connection comes, we
  select the right masquerading address for the output device and the
  gateway bound to the packet at routing time. All following packets
  for these connections are routed in the in->out direction from a new
  PRE_ROUTING hook function (with last priority, just before the input
  routing) by using the lsrc argument to ip_route_input. In this way
  we do something like an output route but with most of the checks for
  an input route.
- IPTables DNAT target:

  We perform the same manipulations as for MASQUERADE: the NAT
  information is already initialized at PRE_ROUTING time and at
  routing time we know what is scheduled for POST_ROUTING as source
  address manipulation. We use the same call to ip_route_input to
  select the right route for the already bound masquerade (external)
  address in the in->out direction.

- IPTables SNAT target:

  The problem here is worse: all packets leave the box after their
  source address is changed to a predefined masquerade address. The
  problem comes from routing time: we don't know what SNAT rule will
  be selected (this job is scheduled for POST_ROUTING time), so we
  don't know the masquerade (external) address. Without a mechanism to
  select the SNAT rule (at POST_ROUTING) by matching the gateway
  already selected at routing time (the output device is not enough),
  we may select the wrong route when we have two paths using the same
  device (but different gateways, which is not noticed by the lookup
  for the SNAT rule). There is no problem if all paths from the used
  multipath route use different devices. Without adding a match by the
  gateway attached to the packet's route, using SNAT rules with
  multiple gateways on the same output device is dangerous (when
  multipath routes are used).

- IPChains/IPfwadm masquerade:

  For the first packet the ipchains and ipfwadm masquerade select the
  right masquerading address for the output device and the gateway
  bound to the packet at routing time. For the following packets from
  the connection we always make an output route call and check whether
  the route is changed. We try to reroute the packet if needed (if the
  routing selects the wrong path from a multipath route). This can
  happen because the routing cache entries expire after a period of
  time or when the user changes the routing rules. The packet
  processing is slower but guarantees correct path selection.
  If performance is needed and the above output routing call is blamed
  for the delays, there are other methods for masquerading that select
  the right route without such a drawback (MASQUERADE, for example).

  Note: we rely on the fact that there are no other functions in the
  FORWARD hook after us, because sometimes we change the output device
  and this is not noticed by the netfilter code that walks the chain.
  Stupid hack, no more words.

- It is recommended that all SNAT/MASQUERADE/ipchains rules specify
  the output device. If this setting is missing, it could be fatal for
  multipath routes whose paths have distinct public IP ranges. OTOH,
  this is the only difference we can specify for all SNAT rules (until
  a match by gateway is implemented) to match each of the paths.
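
The recommendation above can be illustrated with the two-ISP example
from section 2. What follows is only a sketch and not part of the
patches: the helper name emit_nat_rules is hypothetical, and the
device names and addresses are the ones assumed in that example. It
prints one SNAT rule per output device instead of applying anything,
so the result can be reviewed first.

```shell
#!/bin/sh
# Sketch only: print one SNAT rule per output device for the two-ISP
# example (wan0 -> 10.0.1.2, wan1 -> 10.0.2.2). emit_nat_rules is a
# hypothetical helper name; nothing is applied to the running system.
emit_nat_rules() {
    internal_net="$1"
    shift
    # The remaining arguments are "device=external_address" pairs.
    for pair in "$@"; do
        dev="${pair%%=*}"     # text before the first '='
        addr="${pair#*=}"     # text after the first '='
        # Every rule names its output device with -o, as recommended.
        echo "iptables -t nat -A POSTROUTING -s $internal_net -o $dev -j SNAT --to-source $addr"
    done
}

emit_nat_rules 192.168.0.0/16 wan0=10.0.1.2 wan1=10.0.2.2
```

Piping the printed lines to a root shell would install the rules. As
the text notes, matching the output device is enough only while each
path of the multipath route uses a different device.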