Alternative Routes and Dead Gateway Detection for Linux

		Julian Anastasov <ja@ssi.bg>, October 2001


	CONTENTS:

	0. History and Introduction
	1. Static Routes
	2. Alternative routes and dead gateway detection
	3. Always  use the preferred source address as source in our ARP
probes
	4. Use the gateway as routing key
	5. Search for input route by local source IP address (Linux 2.4)
	6. Masquerading  connection  rerouting and  incremental checksum
updates (Linux 2.2)
	7. Netfilter selects the right route (Linux 2.4)


	0. History and Introduction

29-NOV-2002

	- New patches for 2.4.20

		05_nf_reroute-2.4.20-9.diff
		routes-2.4.20-9.diff

04-MAY-2002

	- New patches for 2.4.19 (actually against pre8):

		00_static_routes-2.4.19-8.diff
		01_alt_routes-2.4.19-8.diff
		05_nf_reroute-2.4.19-8.diff
		routes-2.4.19-8.diff

03-FEB-2002

	- New patches for 2.2:

		routes-2.2.20-7.diff
		01_alt_routes-2.2.20-7.diff
		02_masq_csum_reroute-2.2.20-7.diff
		05_key_gw-2.2.20-7.diff
		routes-2.2.20-IPVS-1.0.8-7.diff
		02_masq_csum_reroute-2.2.20-IPVS-1.0.8-7.diff

14-DEC-2001

	- New patches for 2.4 and 2.2:

		routes-2.4.16-6.diff
		routes-hidden-forward_shared-noarp-2.4.16-2.diff
		00_static_routes-2.4.16-6.diff
		routes-2.2.20-6.diff
		00_static_routes-2.2.20-6.diff

24-NOV-2001

	- New patches
		05_nf_reroute-2.4.14-5pre2.diff replaces pre1
		routes-2.4.14-5.diff

16-NOV-2001

	- New patch
		05_nf_reroute-2.4.14-5pre1.diff

10-NOV-2001

	- New patches
		02_masq_csum_reroute-2.2.20-IPVS-1.0.8-4.diff
		06_key_gw-2.2.20-IPVS-1.0.8-4.diff
		Note: IPVS 1.0.8 is applied before all these patches

24-OCT-2001
	- New patches added:
		00_static_routes-2.4.12-5.diff
		01_alt_routes-2.4.12-5.diff
		01_arp_prefsrc-2.4.12-5.diff

14-OCT-2001
	- New patches added, now routes-2.2.19-4.diff contains:
		00_static_routes-2.2.19-4.diff
		01_alt_routes-2.2.19-4.diff
		01_arp_prefsrc-2.2.19-4.diff
		02_masq_csum_reroute-2.2.19-4.diff
		05_key_gw-2.2.19-4.diff

11-OCT-2001

	- Document created


	This  document contains information  about extending the routing
functionality  in Linux  to support  alternative routes  and better dead
gateway detection.

	The current status of these extensions is:

- single patch against 2.2.19+, completed, publicly available

- single patch against 2.4.16+, mostly completed, publicly available

	The patch includes the following independent parts:

- static routes

- alternative routes and dead gateway/neighbour detection

- always use prefsrc as source address in our ARP probes

- masquerading connection rerouting and incremental checksum updates

- use the gateway as routing key

- define local source IP address for the input routes, used by NAT

Patches for Linux 2.2:

one of:

- for non-LVS users: routes-2.2.20-7.diff
- for LVS users: routes-2.2.20-IPVS-1.0.8-7.diff

or if applied one by one:

00_static_routes-2.2.20-6.diff

01_alt_routes-2.2.20-7.diff

01_arp_prefsrc-2.2.19-4.diff

02_masq_csum_reroute-2.2.20-7.diff
	Requires 01_alt_routes-2.2.20-7.diff
	Not used by LVS users

02_masq_csum_reroute-2.2.20-IPVS-1.0.8-7.diff
	Used by LVS users instead of 02_masq_csum_reroute-2.2.20-7.diff
	Requires 01_alt_routes-2.2.20-7.diff

05_key_gw-2.2.20-7.diff
	Requires 00_static_routes-2.2.20-6.diff, 01_alt_routes-2.2.20-7.diff,
	01_arp_prefsrc-2.2.19-4.diff and one of the
	02_masq_csum_reroute-2.2.20-7.diff or
	02_masq_csum_reroute-2.2.20-IPVS-1.0.8-7.diff

06_key_gw-2.2.20-IPVS-1.0.8-4.diff
	Used by LVS users only

Patches for Linux 2.4:

00_static_routes-2.4.19-8.diff
01_alt_routes-2.4.19-8.diff
01_arp_prefsrc-2.4.12-5.diff

05_nf_reroute-2.4.20-9.diff
	Requires 00_static_routes-2.4.19-8.diff,
	01_alt_routes-2.4.19-8.diff and 01_arp_prefsrc-2.4.12-5.diff
	Job: new routing keys (gw, lsrc), netfilter prerouting/rerouting


	1. Static routes

	Current Linux versions distinguish routes by the way they are
installed and maintained. In the include/linux/rtnetlink.h header file
we can see types such as "KERNEL", "BOOT" and "STATIC", plus other types
that mostly identify different routing daemons. For type "STATIC" we can
see the description "Route installed by administrator" and later "Values
of protocol >= RTPROT_STATIC are not interpreted by kernel". By default,
the kernel creates routes of type "KERNEL" and "BOOT". The "STATIC"
routes are not touched (maybe only by Gated). The interesting part is
that some "KERNEL" routes can be deleted when a device's status changes
and, of course, are created automatically when the device goes up. But
what we see is that only "KERNEL" routes are recreated automatically by
the kernel. The reason is that the kernel manages them as a result of
adding or deleting IP addresses. The problem we see in some situations
is that we can't create a route that survives device state changes. This
happens because the routes we create are of type "BOOT" by default and
the kernel can't recreate them after they are removed. You can ask "Why
are they deleted and not marked dead?". Good question. There can be many
routes in the system.

	The question we answer with this patch is "Can the administrator
install a route that survives device state changes?". To support such
routes we add code in the kernel to handle routes of type "STATIC".
We assume that such routes are not changed by the routing daemons
because they use their own route types. So, we can safely add "STATIC"
routes while the device they use is down, a state probably caused by
the device driver trying to tell us that we should avoid sending traffic
at the moment.

	The  proposed support  for static  routes handles  the following
situations:

- when a static route is created, each path defined in this route is
checked and if the used device is down the path is marked as dead. The
current kernel refuses to create such routes when the device is down.

- when a new address is added (followed by the creation of a "KERNEL"
link route that validates the gateway in our static route) or a device
state is changed, each path in all static routes is checked and possibly
marked alive if its gateway has become reachable.

- when a device goes down, all paths in static routes that use the
device are marked as dead. The current kernel versions remove all routes
with all paths marked dead. In our additions we don't allow it.

- it is recommended that when a device is unregistered the
administrator recreates the static routes that use this device (by
removing the paths containing this device). The current device scheme
does not allow marking as alive the paths in static routes if a newly
created device has the same device name as the old unregistered one.
When all paths in a multipath static route contain unregistered devices
the route is automatically removed.

- as a result, the static routes are preserved as much as possible
according to the presence and state of the used devices and preferred
source IP addresses

	The examples:

	With  "proto static"  the following  command succeeds  even when
device ppp0 is DOWN:

	ip route add 10.0.0.0/24 table 100 proto static \
		nexthop dev ppp0 \
		nexthop via 192.168.0.1 dev eth0


	With  "proto  static" the  following route  will not  be deleted
when device wan0 goes down:

	ip route add 10.0.0.0/24 table 100 proto static dev wan0
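
	One way to observe the behaviour described above is to look for
the "dead" flag that the "ip" tool prints for paths marked as dead. The
following is only a sketch: the helper name count_dead_paths is made up
for illustration, and the exact listing format depends on the iproute2
version in use.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): count the lines of an
# "ip route show" listing, read from stdin, that carry the "dead" flag.
count_dead_paths() {
	awk '/(^|[ \t])dead([ \t]|$)/ { n++ } END { print n + 0 }'
}

# Possible use on a patched box:
#	ip route show table 100 proto static | count_dead_paths
```

	A non-zero count after a device goes down, with the route still
listed, would be consistent with the static route handling above.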


	2. Alternative routes and dead gateway detection

	Many users create two or more routes, each on a different device.
They expect that when one of the devices fails the others will be used.
After reading the part about "static routes" we know that the second and
subsequent routes will be used only after the first is removed
(automatically by the kernel). We know that this can happen only when
the device is marked down or a preferred IP address is removed. Let's
assume we have two routes for one network, each with a different device,
maybe connected to different hubs. In such a setup we can see that even
when our two links are operating properly and we see the other hosts
through each device, we use only the first route for outgoing traffic.
No problem. The problem comes when the link of another host fails and
the host becomes unreachable through our first route while the second is
still working properly.

	The question is "Can we use all routes to one network according
to the neighbour reachability status we observe over these links?".
Such a solution will allow us to use one route for some of the
neighbours and at the same time other routes for other neighbours. We
answer this question with the support for alternative routes and the
passive inspection of the neighbours' status.

	Perhaps we already know some examples of alternative routes,
some of them supported only with this patch:

Alternative direct routes:

	1> One network on many devices:

	ifconfig eth0:0 192.168.0.1
	ifconfig eth1:0 192.168.0.2

	2> Similar setup with "ip":

	ip route append 10.0.0.0/24 dev eth0
	ip route append 10.0.0.0/24 dev eth1

Alternative gatewayed routes:

ip route append 10.0.0.0/24 via 192.168.0.1 dev eth0
ip route append 10.0.0.0/24 via 192.168.0.2 dev eth0
ip route append 10.0.0.0/24 via 192.168.1.1 dev eth1

	Mixed alternative routes (we have many default routes, each
has its preferred source address, some of them are multipath):

ip route append default table 100 src 192.168.0.100 \
	via 192.168.0.1 dev eth0
ip route append default table 100 src 192.168.1.100 \
	nexthop via 192.168.1.1 dev eth0 \
	nexthop via 192.168.1.2 dev eth0
ip route append default table 100 src 192.168.2.100 \
	dev wan1


	Perhaps the most used example (masquerade through many ISPs):

WORLD				WORLD
  |				  |
ISP1				ISP2
10.0.1.1			10.0.2.1
	\wan0		   wan1/
	10.0.1.2	10.0.2.2
		NAT ROUTER
		    |192.168.0.1
		----+-------------------+---
			Internal Boxes	192.168.0.3 

# First, check the link routes by destination
# No default routes in table main
ip rule add prio 10 table main

# Public IP ranges, use strict paths to universe, by source
ip rule add prio 20 from 10.0.1.0/24 table 20
ip route append default via 10.0.1.1 dev wan0 src 10.0.1.2 table 20
ip rule add prio 30 from 10.0.2.0/24 table 30
ip route append default via 10.0.2.1 dev wan1 src 10.0.2.2 table 30

# Cycle through all gateways for traffic coming from the
# reserved range. Now the NAT-ed hosts can utilize all links
ip rule add prio 100 from 192.168.0.0/16 table 100
ip route add default table 100 \
	nexthop via 10.0.1.1 dev wan0 \
	nexthop via 10.0.2.1 dev wan1

# Select proper source address through each gateway. The locally
# generated traffic can select source IP address from the following
# routes, by their order and according to the neighbour state
ip rule add prio 200 table 200
ip route append default via 10.0.1.1 dev wan0 src 10.0.1.2 table 200
ip route append default via 10.0.2.1 dev wan1 src 10.0.2.2 table 200

	The  essential requirement is  that we use  different source IPs
when  talking through  each ISP (10.0.1.2  or 10.0.2.2). We  rely on the
multipath scheduler to balance the masqueraded traffic.

	Of  course, don't forget  the NAT rules  for 192.168.0.0/16.  In
Linux 2.2 we can add NAT rules in two ways:

- by adding "nat 0" to the ip rules (in our case we modify our rule):

	ip rule add prio 100 from 192.168.0.0/16 table 100 nat 0

- by using the ipchains command

	ipchains -A forward -s 192.168.0.0/16 -j MASQ

	For  Linux 2.4 and above the iptables equivalent commands can be
used.
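
	A sketch of what the iptables form might look like for the setup
above, assuming the wan0/wan1 device names from the diagram. The helper
name masq_rules is hypothetical; it only prints the commands so that
they can be reviewed before being run as root. Each rule names its
output device.

```shell
#!/bin/sh
# Print (do not execute) one MASQUERADE rule per external device.
# masq_rules is an illustrative helper, not part of the patch.
masq_rules() {
	src=$1
	shift
	for dev in "$@"; do
		echo "iptables -t nat -A POSTROUTING -s $src -o $dev -j MASQUERADE"
	done
}

masq_rules 192.168.0.0/16 wan0 wan1
```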

	There is another solution where the router itself can use all
links, i.e. this will be possible not only for the internal hosts. For
this, we have to use one table for the routes from tables 100 and 200.
As a result, the setup could be:

ip rule add prio 10 table main

ip rule add prio 20 from 10.0.1.0/24 table 20
ip route append default via 10.0.1.1 dev wan0 src 10.0.1.2 table 20
ip rule add prio 30 from 10.0.2.0/24 table 30
ip route append default via 10.0.2.1 dev wan1 src 10.0.2.2 table 30

# Don't specify preferred source address here. The kernel after
# choosing one path (output device and gateway) will select the
# best primary IP address from the output device suitable for the
# used gateway
ip rule add prio 100 table 100
ip route add default table 100 \
	nexthop via 10.0.1.1 dev wan0 \
	nexthop via 10.0.2.1 dev wan1

	The above solution is suitable only for setups that use
gateways from different subnets in the multipath route and where the
gateway IP and our local primary IP from the gateway's net are public
IP addresses. In all other cases, where the source IP needs to differ
from the primary IP address, the previous solution should be used.

	How can we summarize these different setups? We usually
represent our routing setup as a sequence of ip rules matching FROM
and/or TO fields and selecting a route table where a list of routes
selects the output device and optionally the nexthop neighbour.
Sometimes we use this information only to select a source address for
this route - the preferred source address specified in the route.

	The current Linux versions contain most of what is required to
support alternative routes. In fact, alternative routes exist but only
for default routes, i.e. for 0/0. The other drawback is that they are
used only when the route lookup does not specify an output device. As a
result, the usage is reduced to selecting output default routes for
locally generated traffic not bound to any device, i.e. something like:

ip route append default via 192.168.0.1 dev eth0 src 192.168.0.2
ip route append default via 192.168.1.1 dev eth0 src 192.168.1.2

	At least, we have a proper selection of the source address
according to the used route and gateway. This is useful in setups where
we must use different IP addresses through each of the gateways. This is
better than the other solution, where a multipath route is used, such as:

ip route add default table 100 \
	nexthop via 192.168.0.1 dev eth0 \
	nexthop via 192.168.1.1 dev eth0

	In any case, the selection of one alternative route uses the
neighbour reachability state, the device state of each path and other
information to select the best path. In this way we can avoid failed
devices and failed gateways, or reach a neighbour through another
device. The latter is achieved only after applying the patch.

	As a result, the support for alternative routes covers the
following things:

* alternative routes are those routes that are created in the same
route table and have equal route destination, prefix length, metric and
scope. This is an essential requirement for the reverse path (rp_filter)
protection because it avoids reaching less specific routes when looking
up by specific output device.

* the alternative routes are usually created by using the "ip route
append" command as shown in the above examples

* the alternative routes are considered when an input route (incoming
packet) or an output route (locally generated packet) is selected, with
specific or non-specific output device

* the alternative routes check the neighbour state not only for gateways
but also for hosts, i.e. for any kind of neighbour. Note that in some
cases the neighbour can remain in reachable state while its nexthops
have failed. For example, it is even possible for the gateway to be a
proxy ARP server and for the gateway IP to always remain in reachable
state. In such a case we can not notice the real state of the gateway's
IP.

* the alternative routes can be a list of unipath or multipath routes,
using NOARP and ARP devices. As a result, the first alive or first
suspected (but not dead) route is selected by inspecting the state of
the gateways in each path or of the neighbours through the device used
by the path.

* as a result we take care of the state of each path in a multipath
route and we try to use only the alive paths, considering their relative
weights. That is why a kernel patched with support for alternative
routes should be accompanied by user-space tools that refresh the
neighbour state in the ARP cache. Without such a refresh the kernel can
select random nexthops from multipath routes (preferring NOARP devices)
or from alternative routes.

* as a result the masquerading can select the source address from the
proper route path by specific output device

* as a result the reverse path protection (rp_filter) restricts the
access for the input traffic only to the devices specified in the
routes, not to all devices. We note this especially for the multipath
routes.
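
	The "first alive or first suspected (but not dead)" rule from
the list above can be modelled in a few lines of shell. This is only one
reading of that rule, not the kernel code: the function name select_path
and the state labels alive, suspected and dead are made up for the
illustration, and the real decision also uses device state and path
weights.

```shell
#!/bin/sh
# Arguments are "path:state" pairs in route order; state is one of the
# illustrative labels alive, suspected or dead.
# Prints the first alive path, else the first suspected one.
# Returns 1 when every path is dead.
select_path() {
	fallback=
	for entry in "$@"; do
		name=${entry%:*}
		state=${entry##*:}
		if [ "$state" = alive ]; then
			echo "$name"
			return 0
		fi
		if [ "$state" = suspected ] && [ -z "$fallback" ]; then
			fallback=$name
		fi
	done
	if [ -n "$fallback" ]; then
		echo "$fallback"
		return 0
	fi
	return 1
}

select_path wan0:dead wan1:suspected eth1:alive	# prints eth1
```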


	3. Always  use the preferred source address as source in our ARP
probes

	To  avoid the inappropriate  announcements of one  IP address in
our  ARP probes through many devices we  change the policy to always use
the  preferred source address  obtained from the route  to the ARP probe
recipient.


	4. Use the gateway as routing key

	This helps the masquerading to select the right source address
when the packet follows a multipath input route with two paths that use
the same device. The lookup by gateway allows us to hit the same path
already attached to the packet.

	In Linux 2.2 it is used only by masquerading, for the first
packet in each connection.

	In Linux 2.4 it is used in this way:

- MASQUERADE:   in   postrouting   when  the   connection   selects  the
masquerading address

- when  ipchains  and  ipfwadm  masquerade  the  first  packet  in  each
connection


	5. Search for input route by local source IP address (Linux 2.4)

	A new argument is added to ip_route_input: lsrc. It is cached and
allows the NAT to specify the external IP address that will be used
after routing (in postrouting). In this way the MASQUERADE and DNAT
packets that leave the box with a new source address (and some SNAT
packets) are routed to the right neighbour. For SNAT the first packet
can be routed to the wrong neighbour, so a new match is required that
will select the right SNAT rule with the right external address for the
output device and the gateway already selected by the input routing.
SNAT will work when all paths in the multipath routes have different
devices. And as a result, no ICMP redirects should be sent when the
source and the destination are reachable through the same device.


	6. Masquerading  connection  rerouting and  incremental checksum
updates (Linux 2.2)

	With the ability to reroute the masqueraded packets for each
connection we try to select the right route for each packet when
multiple gateways are used. In this way we avoid problems after route
changes and after the cache is flushed.

	The patch additionally adds incremental checksum updates for the
masqueraded traffic. It is applied only when the addresses and the ports
are  translated. This includes most of the TCP/UDP traffic for which the
payload  is not  changed.  If  the masquerading  applications change the
payload  we perform full checksum calculation. Such is the case with the
FTP data connections, for example.
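
	The idea behind an incremental checksum update can be shown with
plain shell arithmetic. The sketch below uses the formula from RFC 1624
(HC' = ~(~HC + ~m + m')) on 16-bit words. It is not the kernel code from
the patch, and the function names fold_not, csum_full and csum_update
are made up for the illustration.

```shell
#!/bin/sh
# Fold a sum into 16 bits and print its one's complement.
fold_not() {
	s=$1
	while [ $(( s >> 16 )) -ne 0 ]; do
		s=$(( (s & 0xFFFF) + (s >> 16) ))
	done
	echo $(( ~s & 0xFFFF ))
}

# Full checksum over the 16-bit words given as arguments.
csum_full() {
	sum=0
	for w in "$@"; do
		sum=$(( sum + $w ))
	done
	fold_not "$sum"
}

# Incremental update: $1 is the old checksum, one word changes
# from $2 to $3. HC' = ~(~HC + ~m + m')   (RFC 1624, equation 3)
csum_update() {
	fold_not $(( (~$1 & 0xFFFF) + (~$2 & 0xFFFF) + $3 ))
}

# Replace the word 0xC0A8 with 0x0A00 without touching the other words.
old=$(csum_full 0xC0A8 0x0003 0x0A00 0x0102 0x1234)
new=$(csum_update "$old" 0xC0A8 0x0A00)
echo "$new"
```

	The value printed is the same one that csum_full returns for the
modified word list, but it is computed from the old checksum and the
changed word only.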


	7. Netfilter selects the right route (Linux 2.4)

	The changes include:

- key "gw" for the output route calls - find a route using the specified
gateway in addition to the output device

- key "lsrc" for the input route call - use this source IP

- IPTables MASQUERADE target:

		At POSTROUTING, when the first packet for the connection
	comes, we select the right masquerading address for the output
	device and the gateway bound to the packet at routing time. All
	subsequent packets for these connections are routed in the
	in->out direction from a new PRE_ROUTING hook function (with
	last priority, just before the input routing) by using the lsrc
	argument to ip_route_input. In this way we do something like an
	output route but with most of the checks for an input route.

- IPTables DNAT target:

		We  perform  the same  manipulations as  for MASQUERADE:
	the  NAT information is already  initialized at PRE_ROUTING time
	and  at routing time we know  what is scheduled for POST_ROUTING
	as  source  address  manipulation.  We  use  the  same  call  to
	ip_route_input  to select the right  route for the already bound
	masquerade (external) address in in->out direction.

- IPTables SNAT target:

		The problem here is worse: all packets leave the box
	after their source address is changed to a predefined
	masquerade address. The problem comes from routing time: we
	don't know which SNAT rule will be selected (this job is
	scheduled for POST_ROUTING time), so we don't know the
	masquerade (external) address. Without a mechanism to select the
	SNAT rule (at POST_ROUTING) by matching the gateway already
	selected at routing time (the output device is not enough) it is
	possible for us to select the wrong route when we have two paths
	using the same device (but different gateways, which is not
	noticed by the lookup for the SNAT rule). There is no problem if
	all paths from the used multipath route use different devices.

		Without adding a match by the gateway attached to the
	packet's route, using SNAT rules with multiple gateways on the
	same output device is dangerous (when multipath routes are
	used).

- IPChains/IPfwadm masquerade:

		For the first packet the ipchains and ipfwadm masquerade
	select the right masquerading address for the output device and
	the gateway bound to the packet at routing time. For the next
	packets from the connection we always make an output route call
	and check whether the route has changed. We try to reroute the
	packet if needed (if the routing selects the wrong path from a
	multipath route). This is possible considering the fact that
	the routing cache entries expire after a period of time or when
	the routing rules are changed by the user. The packet
	processing is slower but guarantees correct path selection. If
	performance is needed and the above output routing call is
	blamed for the delays, there are other methods for masquerading
	that select the right route without such a drawback (MASQUERADE
	for example).

	Note: we rely on the fact that there are no other functions in
	the FORWARD hook after us because sometimes we change the output
	device and this is not noticed by the netfilter code that
	walks the chain. Stupid hack, no more words.

- It is recommended that all SNAT/MASQUERADE/ipchains rules specify
the output device. It could be fatal for multipath routes with paths
with distinct public IP ranges if such a setting is missing. OTOH, this
is the only difference we can specify for all SNAT rules (until a match
by gateway is implemented) to match each of the paths.
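
	As an illustration of the recommendation above, the following
sketch prints one SNAT rule per output device, each with its own
external address. The device names and addresses are taken from the
two-ISP example in section 2; the helper name snat_rules and its
"device=address" argument format are hypothetical. The commands are
only printed, not executed.

```shell
#!/bin/sh
# Print (do not execute) one SNAT rule per "device=address" pair.
# snat_rules is an illustrative helper, not part of the patch.
snat_rules() {
	src=$1
	shift
	for pair in "$@"; do
		echo "iptables -t nat -A POSTROUTING -s $src -o ${pair%%=*}" \
			"-j SNAT --to-source ${pair#*=}"
	done
}

snat_rules 192.168.0.0/16 wan0=10.0.1.2 wan1=10.0.2.2
```

	Because each rule is bound to one output device, this form is
safe only when the paths of the multipath route use different devices,
as noted for the SNAT target above.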