Linux can do bridging if you compile support for it into the kernel. Most distributions will require you to download and compile the bridge control tools from http://bridge.sourceforge.net/.
So, you've just set up your brand new, shiny gigabit network, turned on jumbo frames, and are about to impress your friends with your massive network throughput. Instead you get pages of scary kernel messages that look like
Oct 25 23:35:21 matrix kernel: swapper: page allocation failure. order:2, mode:0x20
Oct 25 23:35:21 matrix kernel:
Oct 25 23:35:21 matrix kernel: Call Trace:
Oct 25 23:35:21 matrix kernel: [ ] __alloc_pages+0x288/0x2a1
Oct 25 23:35:21 matrix kernel: [ ] tcp_v4_rcv+0x826/0x89b
Oct 25 23:35:21 matrix kernel: [ ] cache_alloc_refill+0x23f/0x45e
Oct 25 23:35:21 matrix kernel: [ ] __kmalloc+0x50/0x57
Oct 25 23:35:21 matrix kernel: [ ] __alloc_skb+0x5a/0x133
Oct 25 23:35:21 matrix kernel: [ ] :forcedeth:nv_alloc_rx+0x62/0x182
Oct 25 23:35:21 matrix kernel: [ ] :forcedeth:nv_nic_irq+0x12c/0x1d7
Oct 25 23:35:21 matrix kernel: [ ] handle_IRQ_event+0x25/0x53
Oct 25 23:35:21 matrix kernel: [ ] __do_softirq+0x46/0x90
Oct 25 23:35:21 matrix kernel: [ ] handle_fasteoi_irq+0x54/0x81
Oct 25 23:35:21 matrix kernel: [ ] do_IRQ+0x60/0xb4
Oct 25 23:35:21 matrix kernel: [ ] default_idle+0x0/0x3a
Oct 25 23:35:21 matrix kernel: [ ] ret_from_intr+0x0/0xa
Oct 25 23:35:21 matrix kernel: [ ] default_idle+0x26/0x3a
Oct 25 23:35:21 matrix kernel: [ ] cpu_idle+0x3d/0x5c
Oct 25 23:35:21 matrix kernel: [ ] start_kernel+0x220/0x225
Oct 25 23:35:21 matrix kernel: [ ] _sinittext+0x140/0x144
Oct 25 23:35:21 matrix kernel:
Oct 25 23:35:21 matrix kernel: Mem-info:
Oct 25 23:35:21 matrix kernel: DMA per-cpu:
Oct 25 23:35:21 matrix kernel: CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Oct 25 23:35:21 matrix kernel: DMA32 per-cpu:
Oct 25 23:35:21 matrix kernel: CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 60
Oct 25 23:35:21 matrix kernel: Active:226994 inactive:186494 dirty:53335 writeback:2013 unstable:0 free:2610 slab:89074 mapped:7146 pagetables:2252
Oct 25 23:35:21 matrix kernel: DMA free:8032kB min:16kB low:20kB high:24kB active:4kB inactive:120kB present:6640kB pages_scanned:73 all_unreclaimable? no
Oct 25 23:35:21 matrix kernel: lowmem_reserve[]: 0 2004 2004
Oct 25 23:35:21 matrix kernel: DMA32 free:2408kB min:5716kB low:7144kB high:8572kB active:907972kB inactive:745856kB present:2052260kB pages_scanned:0 all_unreclaimable? no
Oct 25 23:35:21 matrix kernel: lowmem_reserve[]: 0 0 0
Oct 25 23:35:21 matrix kernel: DMA: 4*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8032kB
Oct 25 23:35:21 matrix kernel: DMA32: 38*4kB 214*8kB 0*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2408kB
Oct 25 23:35:21 matrix kernel: Swap cache: add 2483168, delete 2452242, find 12954858/13171050, race 0+0
Oct 25 23:35:21 matrix kernel: Free swap = 1627108kB
Oct 25 23:35:21 matrix kernel: Total swap = 1951800kB
Oct 25 23:35:21 matrix kernel: Free swap: 1627108kB
Oct 25 23:35:21 matrix kernel: 524272 pages of RAM
Oct 25 23:35:21 matrix kernel: 8870 reserved pages
Oct 25 23:35:21 matrix kernel: 186226 pages shared
Oct 25 23:35:21 matrix kernel: 30926 pages swap cached
First, don't panic. These are non-fatal errors. Linux networking still sucks. Basically, without too much effort, you can transmit more data than you can receive, and these messages are your computer running out of memory for incoming packets.
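To see how tight memory was at the moment of the failure, the free-list line for the DMA32 zone in the log can be picked apart. The following is only a rough sketch: it assumes 4 kB pages (so that order:2 means a request for 2^2 = 4 contiguous pages, i.e. 16 kB), and the helper names parse_free_list and blocks_at_least are made up for this illustration.

```python
import re

# One free-list line copied from the log above (DMA32 zone).
LINE = ("DMA32: 38*4kB 214*8kB 0*16kB 1*32kB 0*64kB 0*128kB 0*256kB "
        "1*512kB 0*1024kB 0*2048kB 0*4096kB = 2408kB")

PAGE_KB = 4  # assumption: 4 kB pages


def parse_free_list(line):
    """Return (zone, {block_size_kb: count}, total_kb_reported_by_kernel)."""
    zone, rest = line.split(":", 1)
    blocks = {int(size): int(count)
              for count, size in re.findall(r"(\d+)\*(\d+)kB", rest)}
    total = int(re.search(r"=\s*(\d+)kB", rest).group(1))
    return zone.strip(), blocks, total


def blocks_at_least(blocks, order):
    """Count free blocks big enough for a 2**order page allocation."""
    need_kb = PAGE_KB << order
    return sum(count for size, count in blocks.items() if size >= need_kb)


zone, blocks, total = parse_free_list(LINE)
computed = sum(size * count for size, count in blocks.items())
print(zone, "free:", computed, "kB, log says", total, "kB")
print("free blocks usable for an order:2 request:", blocks_at_least(blocks, 2))
```

Almost all of the free memory in that zone is in 4 kB and 8 kB pieces, which are too small for a 16 kB request, and the zone's free total (2408kB) is already below its min watermark (5716kB) in the same log.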
For a thorough explanation, see this document, which explains the whys and gives tuning recommendations for their application, and also Enabling High Performance Data Transfers. For my own setup, I'm using
/etc/sysctl.conf:
    vm.min_free_kbytes=32768
    net.ipv4.tcp_timestamps=0
    net.ipv4.tcp_sack=0
    net.core.rmem_max=1048576

/etc/rc2.d/S99local:
    ethtool -G eth1 rx 4096

/etc/modutils/local:
    options e1000 RxDescriptors=512

/etc/network/interfaces:
    mtu 9000
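Settings in /etc/sysctl.conf only take effect once they are loaded, so it can be worth comparing the file against what the kernel is actually running with. Here is a minimal sketch, assuming the usual mapping of a sysctl key onto a path under /proc/sys (dots become slashes); the function names are invented for this example, and the sample text is just the sysctl.conf lines listed above.

```python
SAMPLE = """\
vm.min_free_kbytes=32768
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.core.rmem_max=1048576
"""


def parse_sysctl(text):
    """Parse sysctl.conf-style text into a {key: value} dict of strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key):
    """Map a sysctl key to its file under /proc/sys."""
    return "/proc/sys/" + key.replace(".", "/")


def current_value(key):
    """Return the running value, or None if it cannot be read here."""
    try:
        with open(proc_path(key)) as f:
            return f.read().strip()
    except OSError:
        return None


wanted = parse_sysctl(SAMPLE)
for key, value in wanted.items():
    print(f"{key}: wanted {value}, running {current_value(key)}")
```

On a machine without /proc/sys the running value simply shows up as None. Keys that themselves contain dots (some per-interface settings) would need more careful handling than this sketch gives them.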
The other parameters mentioned in the paper were already larger than recommended. The default RxDescriptors is 256, and each one gets a data buffer whose size is a power of 2 large enough for the MTU.
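The arithmetic behind that is worth spelling out, since it shows why jumbo frames change the picture. This is a back-of-the-envelope sketch only: it assumes 4096-byte pages and ignores whatever headers, padding, or alignment a real driver adds on top of the MTU, and the function names are invented here.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def rx_buffer_size(mtu):
    """Smallest power of two that is at least mtu bytes."""
    size = 1
    while size < mtu:
        size *= 2
    return size


def alloc_order(nbytes):
    """Smallest order such that 2**order pages hold nbytes."""
    pages = -(-nbytes // PAGE_SIZE)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


for mtu in (1500, 9000):
    size = rx_buffer_size(mtu)
    for descriptors in (256, 512):
        total_kib = descriptors * size // 1024
        print(f"mtu {mtu}: {size}-byte buffers (order {alloc_order(size)}), "
              f"{descriptors} descriptors = {total_kib} KiB")
```

Under these assumptions an MTU of 1500 fits in a 2048-byte buffer inside a single page, while an MTU of 9000 needs a 16384-byte buffer, which is four contiguous pages and matches the order:2 in the allocation failure message.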
Further tuning to increase throughput. I also turned sack and timestamps back on to make sure window sizing works, but when two hosts are sending data to matrix, it still gets those errors. Is this from a regression on matrix or the other hosts getting faster?
It's the former. Reverting to my original tuning made the problems go away. Also, when I started running Xen on this box, I had to add swiotlb=128 to get rid of constant 'Out of SW-IOMMU space' messages for each frame received.
# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# set max to at least 4MB, or higher if you use very high BDP paths
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
# recommended to increase this for 1000 BT or higher
net.core.netdev_max_backlog = 2500
# for 10 GigE, use this
# net.core.netdev_max_backlog = 30000
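The "very high BDP paths" comment refers to the bandwidth-delay product: the number of bytes that can be in flight on a path, which is roughly the link rate times the round-trip time. A quick sketch of the arithmetic follows; the two round-trip times are made-up examples, not measurements from this network.

```python
TCP_RMEM_MAX = 16777216  # max value from the tcp_rmem line above


def bdp_bytes(bits_per_second, rtt_microseconds):
    """Bandwidth-delay product in bytes, using integer arithmetic."""
    return bits_per_second * rtt_microseconds // 8_000_000


# Hypothetical paths: (label, link rate in bit/s, round-trip time in us)
PATHS = [
    ("gigabit LAN, 0.2 ms RTT", 1_000_000_000, 200),
    ("gigabit WAN, 100 ms RTT", 1_000_000_000, 100_000),
]

for label, rate, rtt_us in PATHS:
    bdp = bdp_bytes(rate, rtt_us)
    verdict = "fits within" if bdp <= TCP_RMEM_MAX else "exceeds"
    print(f"{label}: BDP {bdp} bytes, {verdict} a {TCP_RMEM_MAX}-byte max buffer")
```

For the hypothetical LAN case the product is only about 25 kB, so the large maximums matter mainly on long or very fast paths.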
Tony Lill, Tony.Lill@AJLC.Waterloo.ON.CA President, A. J. Lill Consultants (519) 241 2461 539 Grand Valley Dr., Cambridge, Ont. fax/data (519) 650 3571 "Welcome to All Things UNIX, where if it's not UNIX, it's CRAP!"