Networking

Bridging setup

Linux can do bridging if you compile support for it into the kernel. On most distributions you will also need to download and compile the bridge control tools from http://bridge.sourceforge.net/.
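
As a rough sketch of what the bridge control tools are used for, the snippet below builds the usual sequence of brctl and ip commands for creating a bridge and attaching ports to it. The bridge name br0 and the ports eth0 and eth1 are assumptions for illustration, and the helper bridge_commands is hypothetical. It only prints the commands; running them needs root and a kernel with bridging support.

```python
def bridge_commands(bridge, ports):
    """Return the commands that would create `bridge` and attach `ports` to it.

    Hypothetical helper: it only builds the command lists, it does not run them.
    """
    cmds = [["brctl", "addbr", bridge]]                   # create the bridge device
    for port in ports:
        cmds.append(["brctl", "addif", bridge, port])     # attach a port to the bridge
        cmds.append(["ip", "link", "set", "dev", port, "up"])
    cmds.append(["ip", "link", "set", "dev", bridge, "up"])
    return cmds


if __name__ == "__main__":
    # br0, eth0 and eth1 are assumed names; substitute your own interfaces.
    for cmd in bridge_commands("br0", ["eth0", "eth1"]):
        print(" ".join(cmd))
```

Printing instead of executing makes it easy to review the sequence before pasting it into a root shell or a boot script.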

Gigabit

So, you've just set up your brand new, shiny gigabit network, turned on jumbo frames, and are about to impress your friends with your massive network throughput. Instead you get pages of scary kernel messages that look like

Oct 25 23:35:21 matrix kernel: swapper: page allocation failure. order:2, mode:0x20
Oct 25 23:35:21 matrix kernel:
Oct 25 23:35:21 matrix kernel: Call Trace:
Oct 25 23:35:21 matrix kernel:   [] __alloc_pages+0x288/0x2a1
Oct 25 23:35:21 matrix kernel: [] tcp_v4_rcv+0x826/0x89b
Oct 25 23:35:21 matrix kernel: [] cache_alloc_refill+0x23f/0x45e
Oct 25 23:35:21 matrix kernel: [] __kmalloc+0x50/0x57
Oct 25 23:35:21 matrix kernel: [] __alloc_skb+0x5a/0x133
Oct 25 23:35:21 matrix kernel: [] :forcedeth:nv_alloc_rx+0x62/0x182
Oct 25 23:35:21 matrix kernel: [] :forcedeth:nv_nic_irq+0x12c/0x1d7
Oct 25 23:35:21 matrix kernel: [] handle_IRQ_event+0x25/0x53
Oct 25 23:35:21 matrix kernel: [] __do_softirq+0x46/0x90
Oct 25 23:35:21 matrix kernel: [] handle_fasteoi_irq+0x54/0x81
Oct 25 23:35:21 matrix kernel: [] do_IRQ+0x60/0xb4
Oct 25 23:35:21 matrix kernel: [] default_idle+0x0/0x3a
Oct 25 23:35:21 matrix kernel: [] ret_from_intr+0x0/0xa
Oct 25 23:35:21 matrix kernel:   [] default_idle+0x26/0x3a
Oct 25 23:35:21 matrix kernel: [] cpu_idle+0x3d/0x5c
Oct 25 23:35:21 matrix kernel: [] start_kernel+0x220/0x225
Oct 25 23:35:21 matrix kernel: [] _sinittext+0x140/0x144
Oct 25 23:35:21 matrix kernel:
Oct 25 23:35:21 matrix kernel: Mem-info:
Oct 25 23:35:21 matrix kernel: DMA per-cpu:
Oct 25 23:35:21 matrix kernel: CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Oct 25 23:35:21 matrix kernel: DMA32 per-cpu:
Oct 25 23:35:21 matrix kernel: CPU    0: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:  60
Oct 25 23:35:21 matrix kernel: Active:226994 inactive:186494 dirty:53335 writeback:2013 unstable:0 free:2610 slab:89074 mapped:7146 pagetables:2252
Oct 25 23:35:21 matrix kernel: DMA free:8032kB min:16kB low:20kB high:24kB active:4kB inactive:120kB present:6640kB pages_scanned:73 all_unreclaimable? no
Oct 25 23:35:21 matrix kernel: lowmem_reserve[]: 0 2004 2004
Oct 25 23:35:21 matrix kernel: DMA32 free:2408kB min:5716kB low:7144kB high:8572kB active:907972kB inactive:745856kB present:2052260kB pages_scanned:0 all_unreclaimable? no
Oct 25 23:35:21 matrix kernel: lowmem_reserve[]: 0 0 0
Oct 25 23:35:21 matrix kernel: DMA: 4*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8032kB
Oct 25 23:35:21 matrix kernel: DMA32: 38*4kB 214*8kB 0*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2408kB
Oct 25 23:35:21 matrix kernel: Swap cache: add 2483168, delete 2452242, find 12954858/13171050, race 0+0
Oct 25 23:35:21 matrix kernel: Free swap  = 1627108kB
Oct 25 23:35:21 matrix kernel: Total swap = 1951800kB
Oct 25 23:35:21 matrix kernel: Free swap:       1627108kB
Oct 25 23:35:21 matrix kernel: 524272 pages of RAM
Oct 25 23:35:21 matrix kernel: 8870 reserved pages
Oct 25 23:35:21 matrix kernel: 186226 pages shared
Oct 25 23:35:21 matrix kernel: 30926 pages swap cached

First, don't panic. These are non-fatal errors. Linux networking still sucks. Basically, without too much effort, you can transmit more data than you can receive, and this is your computer running out of memory for incoming packets.
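
To see what "running out of memory" means here, it can help to read the free-list line for the DMA32 zone in the trace above. The sketch below parses that line and counts how many free blocks are large enough for an order:2 request, assuming 4 kB pages (so order:2 means 16 kB of contiguous memory). The helper parse_buddy_line is hypothetical, not part of any tool.

```python
import re


def parse_buddy_line(line):
    """Parse a 'ZONE: N*SkB N*SkB ... = TOTALkB' line into {block_size_kB: count}."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


# Copied from the kernel message above.
DMA32 = ("DMA32: 38*4kB 214*8kB 0*16kB 1*32kB 0*64kB 0*128kB 0*256kB "
         "1*512kB 0*1024kB 0*2048kB 0*4096kB = 2408kB")

blocks = parse_buddy_line(DMA32)

# Sanity check: the per-size counts should add up to the reported total.
total_kb = sum(size * count for size, count in blocks.items())

# order:2 = 2**2 contiguous pages; with 4 kB pages (an assumption) that is 16 kB.
order2_need_kb = 4 * 4
usable = sum(count for size, count in blocks.items() if size >= order2_need_kb)

print(total_kb, "kB free in total")            # 2408 kB free in total
print(usable, "free blocks of 16 kB or more")  # 2 free blocks of 16 kB or more
```

Almost all of the free memory is in 4 kB and 8 kB fragments, and the zone's free:2408kB is already under its min:5716kB watermark, which is presumably why the allocation was refused.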

For a thorough explanation, see this document, which explains the whys and gives tuning recommendations for their application, and also Enabling High Performance Data Transfers. For my own setup, I'm using

/etc/sysctl.conf:
vm.min_free_kbytes=32768
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.core.rmem_max=1048576

/etc/rc2.d/S99local:
ethtool -G eth1 rx 4096

/etc/modutils/local:
options e1000 RxDescriptors=512

/etc/network/interfaces:
mtu 9000

The other parameters mentioned in the paper were already larger than recommended. The default RxDescriptors is 256, and each one gets a data buffer whose size is a power of 2 large enough for the MTU.
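
A quick back-of-the-envelope calculation shows what that power-of-2 rule costs with jumbo frames. This is only a sketch under stated assumptions: 4 kB pages, and an `overhead` parameter (hypothetical, defaulting to 0) standing in for whatever header bytes the driver adds on top of the MTU. The function rx_ring_footprint is made up for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    size = 1
    while size < n:
        size *= 2
    return size


def rx_ring_footprint(mtu, descriptors, overhead=0):
    """Return (buffer_bytes, page_order, total_ring_bytes) for an RX ring.

    `overhead` is a placeholder for per-frame header bytes; the real value
    depends on the driver and is not taken from the text.
    """
    buf = next_power_of_two(mtu + overhead)
    pages = max(1, buf // PAGE_SIZE)      # buf is a power of two, so pages is too
    order = pages.bit_length() - 1        # log2(pages)
    return buf, order, buf * descriptors


print(rx_ring_footprint(1500, 256))   # (2048, 0, 524288)
print(rx_ring_footprint(9000, 512))   # (16384, 2, 8388608)
```

With an MTU of 9000 every receive buffer becomes a 16 kB, order-2 allocation, which would match the order:2 in the failure message, and 512 descriptors pin about 8 MB of such buffers. A standard 1500-byte MTU fits in a single page.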

Later I did some further tuning to increase throughput (the sysctl settings listed further down). I also turned sack and timestamps back on to make sure window sizing works, but when two hosts are sending data to matrix, it still gets those errors. Is this from a regression on matrix or from the other hosts getting faster?

It's the former. Reverting to my original tuning made the problems go away. Also, when I started running Xen on this box, I had to add swiotlb=128 to the kernel boot parameters to get rid of constant 'Out of SW-IOMMU space' messages for each frame received.

# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# set max to at least 4MB, or higher if you use very high BDP paths
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
# recommended to increase this for 1000 BT or higher
net.core.netdev_max_backlog = 2500
# for 10 GigE, use this
# net.core.netdev_max_backlog = 30000   
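
The "high BDP paths" comment above refers to the bandwidth-delay product: the number of bytes that can be in flight on a path, which is what the TCP buffer has to hold to keep the link full. The sketch below works out the BDP for a gigabit link and the longest round-trip time a 16777216-byte buffer can cover. The function names and the example RTTs are assumptions for illustration.

```python
GIGABIT = 1_000_000_000  # link speed in bits per second


def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes for a given round-trip time in ms."""
    return bandwidth_bits_per_s * rtt_ms // 8000   # /8 bits per byte, /1000 ms per s


def max_rtt_ms(buffer_bytes, bandwidth_bits_per_s):
    """Longest RTT (in ms) at which `buffer_bytes` can still fill the link."""
    return buffer_bytes * 8000 / bandwidth_bits_per_s


print(bdp_bytes(GIGABIT, 1))      # 125000    (LAN-like 1 ms RTT)
print(bdp_bytes(GIGABIT, 100))    # 12500000  (long-haul 100 ms RTT)
print(round(max_rtt_ms(16777216, GIGABIT), 1))   # 134.2
```

On a local gigabit network with an RTT around a millisecond, a few hundred kilobytes of buffer is already plenty. The 16 MB maximum only matters on long, fast paths.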

Tony Lill,                         Tony.Lill@AJLC.Waterloo.ON.CA
President, A. J. Lill Consultants                 (519) 241 2461
539 Grand Valley Dr., Cambridge, Ont.    fax/data (519) 650 3571

"Welcome to All Things UNIX, where if it's not UNIX, it's CRAP!"