The background and the drawback

So TCP window scaling is causing the problem. Or rather, the communication between the client and the server has a problem with TCP window scaling. It may be related to the broken-router problem described by Corbet in 2004, and there was an issue with the 2.6.17 kernel in 2006, described in Window Scaling on the Internet. But this is 2010.

Investigation shows that the SYN packet sent by the Linux box, which carries the TCP window scaling option with a factor of 128 (2^7), is indeed answered by a SYN,ACK from www.sciencedirect.com that also carries the option, but with a factor of 1 (2^0):

Example 1.  'Normal' behaviour: www.oulu.fi accepts and returns tcp window scaling factor of 128 (2^7)

00:00:00.000000 00:23:7d:1b:f7:c1 > 00:00:0c:07:ac:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 40897, offset 0, flags [DF], proto TCP (6), length 60)
129.125.51.98.44732 > 130.231.10.10.80: Flags [S], cksum 0x7d89 (correct), seq 3999414352, win 5840, options [mss 1460,sackOK,TS val 7710609 ecr 0,nop,wscale 7], length 0
00:00:00.049967 00:d0:00:97:4c:00 > 00:23:7d:1b:f7:c1, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 52, id 0, offset 0, flags [DF], proto TCP (6), length 60)
130.231.10.10.80 > 129.125.51.98.44732: Flags [S.], cksum 0xa180 (correct), seq 2854207395, ack 3999414353, win 5792, options [mss 1460,sackOK,TS val 1535383264 ecr 7710609,nop,wscale 7], length 0


www.student.utwente.nl behaves like this too. Both lower and higher values are returned as well: Google sets the scaling factor to 64 (2^6), which causes no retransmits, and packages.gentoo.org returns a scaling factor of 512 (2^9).
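The relation between the "wscale" value that tcpdump prints and the factors quoted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code used in the investigation. The function names are made up for this example, the window value 46 in the usage lines is hypothetical, and rejecting shift counts above 14 with an exception is a simplification chosen here (14 is the maximum that RFC 7323 allows).

```python
MAX_SHIFT = 14  # largest window scale shift count allowed by RFC 7323


def scale_factor(shift):
    """Return the multiplier that corresponds to a 'wscale' shift count."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count out of range: %r" % (shift,))
    return 1 << shift


def effective_window(win_field, shift, syn=False):
    """Return the window in bytes that a 16-bit 'win' field stands for.

    'shift' is the shift count announced by the host that sent the segment.
    The window field of a segment with the SYN flag set is never scaled.
    """
    if syn:
        return win_field
    return win_field * scale_factor(shift)


# Shift counts seen in this section: sciencedirect, Windows 7, Google,
# the Linux box and www.oulu.fi, packages.gentoo.org.
for shift in (0, 2, 6, 7, 9):
    print("wscale %d -> factor %d" % (shift, scale_factor(shift)))

# The SYN from Example 1 advertises win 5840, which is taken literally.
print(effective_window(5840, 7, syn=True))
# A later segment with a (hypothetical) win field of 46 and wscale 7.
print(effective_window(46, 7))
```

Each side announces its own shift count, and that count applies to the windows it advertises itself, so a connection where one side uses 7 and the other 0 is valid as far as the protocol is concerned.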

Example 2.  'Strange' behaviour: www.sciencedirect.com returns window scaling with factor 1 (2^0) but option still on, then resends

129.125.51.98.42132 > 198.81.200.2.80: Flags [S], cksum 0x712a (correct), seq 1890434725, win 5840, options [mss 1460,sackOK,TS val 7779348 ecr 0,nop,wscale 7], length 0
00:11:27.520515 00:d0:00:97:4c:00 > 00:23:7d:1b:f7:c1, ethertype IPv4 (0x0800), length 78: (tos 0x0, ttl 242, id 8832, offset 0, flags [DF], proto TCP (6), length 64)
198.81.200.2.80 > 129.125.51.98.42132: Flags [S.], cksum 0xe5c3 (correct), seq 1867760777, ack 1890434726, win 4380, options [mss 1460,nop,wscale 0,nop,nop,TS val 3605689667 ecr 7779348,sackOK,eol], length 0
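When comparing many captures like the two above, it can help to pull the shift count out of the tcpdump text automatically. The following is a minimal sketch that assumes the verbose tcpdump output format shown in the examples, with an "options [...]" list containing "wscale N". The function name and the regular expressions are made up for this illustration, and the sample lines are shortened copies of the SYN,ACK lines from Example 1 and Example 2.

```python
import re

# Matches the bracketed TCP option list printed by tcpdump.
OPTIONS_RE = re.compile(r"options \[([^\]]*)\]")
# Matches the window scale option inside that list.
WSCALE_RE = re.compile(r"\bwscale (\d+)")


def wscale_from_tcpdump(line):
    """Return the window scale shift count in a tcpdump line, or None."""
    options = OPTIONS_RE.search(line)
    if options is None:
        return None
    wscale = WSCALE_RE.search(options.group(1))
    if wscale is None:
        return None
    return int(wscale.group(1))


OULU_SYNACK = (
    "130.231.10.10.80 > 129.125.51.98.44732: Flags [S.], win 5792, "
    "options [mss 1460,sackOK,TS val 1535383264 ecr 7710609,nop,wscale 7], "
    "length 0"
)
SCIENCEDIRECT_SYNACK = (
    "198.81.200.2.80 > 129.125.51.98.42132: Flags [S.], win 4380, "
    "options [mss 1460,nop,wscale 0,nop,nop,TS val 3605689667 ecr 7779348,"
    "sackOK,eol], length 0"
)

for line in (OULU_SYNACK, SCIENCEDIRECT_SYNACK):
    shift = wscale_from_tcpdump(line)
    print("shift count %d, factor %d" % (shift, 1 << shift))
```

A line without an option list, or with an option list that lacks "wscale", yields None, which distinguishes "option absent" from "option present with shift count 0" as seen from www.sciencedirect.com.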


Figure 1.  routes to www.sciencedirect.com

Four routes to www.sciencedirect.com

In Figure 1, we show four routes from the test machines to www.sciencedirect.com. Or rather, from two hops away from the test machines (to give them some privacy) to one hop away from www.sciencedirect.com (as confirmed with tcptraceroute later). Routes that get a SYN,ACK with wscale 7 have the leaf (end) node in green; those with wscale 0 are red.

As can be seen, the one system that gets a proper response has a unique route until one hop before the server, but so has the rightmost system, which gets a wrong response. In addition, I'm not certain that the test clients are all talking to the same server: several servers may be responding to the same IP address through some load-balancing construction.

The problem is not in our network, since we see the same problem from a host in Sweden. It isn't in the OS either, since Windows 7 suffers from the same problem (only with less severe symptoms, since it scales only by a factor of 4 (2^2)). But it might be in the route to the server, as another machine elsewhere in The Netherlands, running Ubuntu Linux (and a newer kernel), talks to the server using a TCP window scaling factor of 128 with no trouble. Upgrading our test machine to the same Ubuntu version doesn't fix the problem though, so the problem may still be in the route, or in the server itself, whose IP address seems to be answered by multiple hosts. If we are indeed talking to multiple servers, the problem might be in the particular server that happens to be assigned to us.