tcp_init_wnd_chk

Not everything in the world is brand-new, and sometimes your brand-new server has to communicate with a really old TCP/IP-enabled device. As such devices are often severely limited in resources, they sometimes don’t obey some of the best practices. This article describes an issue arising out of this and how to circumvent it.

Description of the problem

It all started with a customer calling: “Hey, in Solaris 10 a connection between an LDOM and such a TCP/IP device works like a charm. In a Solaris 11 LDOM it stopped working”. At first I thought “WTF”, but the solution is quite trivial. The customer had the following tcpdump (which I have obfuscated for obvious reasons):

 
  9 00:01:18.40820 client -> server TCP D=12345 S=12345 Syn Seq=282650168 Len=0 Win=0 Options=<mss 1460>
 10 00:01:18.41183 server -> client TCP D=12345 S=12345 Syn Ack=282650169 Seq=1301349503 Len=0 Win=64240 Options=<mss 1460>
 11 00:01:18.41275 client -> server TCP D=12345 S=12345 Ack=1301349504 Seq=282650169 Len=0 Win=560
 12 00:01:18.86166 client -> server TCP D=12345 S=12345 Push Ack=1301349504 Seq=282650169 Len=400 Win=560
 13 00:01:18.87116 client -> server TCP D=12345 S=12345 Push Ack=1301349504 Seq=282650569 Len=160 Win=560
 14 00:01:19.54995 server -> client TCP D=12345 S=12345 Syn Ack=282650169 Seq=1301349503 Len=0 Win=64240 Options=<mss 1460>

The client wants to establish a TCP connection with the server. The application and the server software aren’t relevant here, so I won’t talk about them.

Explanation

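There are two pieces of information in this tcpdump that are relevant for the explanation: the maximum segment size of 1460 that the client announces in its SYN (packet 9), and the window size of 560 that it advertises in the ACK completing the handshake (packet 11).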

Why are these two pieces of information relevant? In Solaris 11 a number of security mechanisms are activated by default. One of them is a check of the TCP window size. In order to protect the system against certain kinds of denial of service attacks, the networking stack looks at the setup of the connection for a pattern that is usually only seen in such attacks. Almost all TCP/IP stacks that really want to do some work, and not just eat away your resources, will request a TCP receive window significantly larger than the maximum segment size (like four times as large), because you get a steadier throughput with such a configuration. It’s a good “suggestion” (as in: Do it this way!) to size the receive window at an integer multiple of the maximum segment size. It’s really a “best practice” and all sane TCP/IP stacks do it this way. However, denial of service attacks rely on the attacked target running out of resources faster than the attacker. So the attacker works with small receive windows, because it has to allocate memory in the size of the receive window for each TCP/IP connection. Hence there is a simple check: you look for a situation that you are unlikely to see in normal operation, like a client connecting with a receive window smaller than the maximum segment size, as this would have several negative impacts on performance. So: when the maximum segment size is larger than the window size in the third step of the handshake, the system will just ignore that step. There is no connection from the perspective of the server, so it doesn’t have to allocate resources. In addition to that, a timer is set that removes such a connection attempt from the respective tables in the operating system earlier than usual.
On the other side, the client believes there is a connection (the server has answered the initial SYN with a SYN/ACK) and doesn’t know that the server doesn’t consider this connection fully established. So it’s normal that the client starts to send data, but it will never receive an ACK for it. The described mechanism helps a lot against denial of service attacks. However, it’s based on the assumption that the communication partner obeys the best practices, and for 99.99% of all devices this is a correct assumption. But not all TCP/IP stacks do this … some have to deviate from these best practices in order to enable a device to communicate via TCP/IP despite having only minimal resources. The example above is such a case. The client is a device with minimal resources: as shown by the tcpdump above, the client specifies a maximum segment size of 1460, but a receive window size of only 560. Obviously the connection attempt fails the test described before, and thus the connection is not established. Solaris 10 didn’t know the described check. So it’s quite obvious why the TCP connection worked with Solaris 10 but not with Solaris 11.
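
The two values can also be pulled out of the trace mechanically. What follows is only a minimal Python sketch under a few assumptions: the regular expressions and the helper names (announced_mss, advertised_window, handshake_looks_suspicious) are made up for this illustration, the two sample lines are packets 9 and 11 of the customer's dump, and the comparison models the check as described in this article, not the actual Solaris code.

```python
import re

# Packets 9 (SYN) and 11 (final ACK of the handshake) from the customer's dump.
SYN_LINE = ("9 00:01:18.40820 client -> server TCP D=12345 S=12345 "
            "Syn Seq=282650168 Len=0 Win=0 Options=<mss 1460>")
ACK_LINE = ("11 00:01:18.41275 client -> server TCP D=12345 S=12345 "
            "Ack=1301349504 Seq=282650169 Len=0 Win=560")


def announced_mss(line):
    """Return the MSS option found in a trace line, or None if there is none."""
    match = re.search(r"mss (\d+)", line)
    return int(match.group(1)) if match else None


def advertised_window(line):
    """Return the Win= value found in a trace line, or None if there is none."""
    match = re.search(r"Win=(\d+)", line)
    return int(match.group(1)) if match else None


def handshake_looks_suspicious(syn_line, ack_line):
    """True when the window in the final ACK is smaller than the announced MSS."""
    mss = announced_mss(syn_line)
    window = advertised_window(ack_line)
    if mss is None or window is None:
        raise ValueError("could not find MSS or window in the given lines")
    return window < mss


if __name__ == "__main__":
    print(announced_mss(SYN_LINE), advertised_window(ACK_LINE))
    print(handshake_looks_suspicious(SYN_LINE, ACK_LINE))
```

For the customer's device the sketch finds an MSS of 1460 and a window of 560, so the handshake is flagged.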

Workaround

So you are in a difficult situation: your device can’t communicate. But especially with older hardware there is often no way to change the client, and even when you can change it, you often don’t want to, because you don’t want to introduce even a single small software change. For this reason there is a way to get around the check on the server side. It’s connected to a second check made by the TCP/IP stack. With larger segment sizes, many TCP stacks will not set a receive window larger than the maximum segment size; this is based on the fact that every TCP connection requires an allocation of memory for the received data in the size of the receive window. These connection requests would fail the check as well. To get around this, Solaris 11 checks an additional value called tcp_init_wnd_chk. The ACK is considered invalid when the advertised window is smaller than both the maximum segment size and tcp_init_wnd_chk. By default the value of this variable is 4096. So a connection attempt with a window size larger than 4096 bytes will be accepted, even when the maximum segment size is larger (and thus would fail the described check without this second check). The value is dynamic, so you can change it on a running system. So the solution for this customer was quite easy. Right after executing the command echo tcp_init_wnd_chk/W 0x0 | mdb -kw the connections were established without any problems again. With a set ip:tcp_init_wnd_chk = 0 in /etc/system the change was made boot persistent. Keep in mind before doing so that you have just deactivated a security check. Your system will be more susceptible to this special kind of DoS attack. On the other side, you normally have such embedded systems in networks that are tightly controlled, and you would have a worse problem than this DoS when someone is able to kick off such an attack in your network (as in: the attacker has broken into your physical security perimeter).
Nevertheless I wouldn’t use 0, but a value 1 byte below the smallest receive window you expect to see from devices whose maximum segment size is larger than their receive window. So far I have seen the necessity to set this parameter only with really, really old TCP/IP stacks (old in the sense of somewhere in the last century) or TCP/IP stacks of embedded systems.
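
The rule and the choice of a less drastic value can be written down as a small model. This is a hedged Python sketch, not the kernel logic: the function names are hypothetical, the exact boundary behaviour (strictly smaller than) is taken from the wording above, and the MSS of 8960 in the last example is just an assumed value for a large-MTU network.

```python
DEFAULT_INIT_WND_CHK = 4096  # default value of tcp_init_wnd_chk named in the text


def ack_is_rejected(window, mss, init_wnd_chk=DEFAULT_INIT_WND_CHK):
    """Model of the check: the ACK is invalid when the window is below BOTH limits."""
    return window < mss and window < init_wnd_chk


def conservative_init_wnd_chk(smallest_expected_window):
    """Smallest expected window minus 1 byte, but never below 0."""
    return max(smallest_expected_window - 1, 0)


if __name__ == "__main__":
    # The customer's device: MSS 1460, window 560.
    print(ack_is_rejected(560, 1460))                  # default of 4096
    print(ack_is_rejected(560, 1460, init_wnd_chk=0))  # check deactivated
    chk = conservative_init_wnd_chk(560)
    print(chk, ack_is_rejected(560, 1460, init_wnd_chk=chk))
    # An even smaller window is still caught with the conservative value.
    print(ack_is_rejected(100, 1460, init_wnd_chk=chk))
    # Assumed large MSS: window below the MSS, but above 4096.
    print(ack_is_rejected(8192, 8960))
```

With a value of 559 the device with its window of 560 gets through, while a connection attempt advertising a window of 100 would still be ignored. With 0 nothing is ignored any longer.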

Demonstration

I wanted to check the solution after the customer told me “Everything is working again” and see the issue with my own eyes. However, I didn’t have the opportunity to look at the real hardware. Thus I simulated the issue by writing a small script. As I needed tight control over what the client was sending to the server, I generated the TCP packets by hand. For a different project I’m playing around with Scapy. Scapy is a Python toolset for generating packets the raw way, circumventing the TCP/IP stack. As I wrote on Facebook a few days ago: working with it to create TCP communication feels like bitbanging an I2C bus with GPIO pins. The script is called simulation.py. It’s really a raw hack …

#! /usr/bin/env python

import random
import logging
# Silence the warnings of Scapy before importing it.
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)
from time import sleep
import sys
from scapy.all import *

ip=IP(dst="192.168.1.104")
sourceport      = random.randrange(1024,65535)
client_seq      = random.randrange(1,4294967295)
# The window size to advertise is the only command line parameter.
windowsize=int(sys.argv[1])

# Handshake: send the SYN and wait for the SYN/ACK of the server.
HS_SYN=ip/TCP(sport=sourceport, dport=7, flags="S", seq=client_seq, window=windowsize,options=[('MSS', 1460)])
HS_SYNACK=sr1(HS_SYN)

# The SYN of the server and our own SYN consume one sequence number each.
server_seq = HS_SYNACK.seq
server_seq = server_seq+1
client_seq = client_seq+1

HS_ACK= ip/TCP(sport=sourceport, dport=7, flags="A", seq=client_seq, ack=server_seq, window=windowsize,options=[('MSS', 1460)])
send(HS_ACK)

# Send 'NARF\n' three times to the echo service and acknowledge the echoed bytes.
for x in range(0, 3):
    DATA=ip/TCP(sport=sourceport, dport=7, flags="PA", seq=client_seq, ack=server_seq, window=windowsize,options=[('MSS', 1460)])
    get='NARF\n'
    send(DATA/get)
    sleep(0.5)
    # The 5 bytes of payload are assumed to be echoed back completely.
    server_seq = server_seq+5
    client_seq = client_seq+5
    DATA_ACK= ip/TCP(sport=sourceport, dport=7, flags="A", seq=client_seq, ack=server_seq, window=windowsize,options=[('MSS', 1460)])
    send(DATA_ACK)

# Close the connection; the FIN of the server is acknowledged blindly.
FIN=ip/TCP(sport=sourceport, dport=7, flags="FA", seq=client_seq, ack=server_seq, window=windowsize,options=[('MSS', 1460)])
send(FIN)
sleep(0.5)
ACK_OF_FIN=ip/TCP(sport=sourceport, dport=7, flags="A", seq=client_seq+1, ack=server_seq+1, window=windowsize,options=[('MSS', 1460)])
send(ACK_OF_FIN)
 

The script takes a single command line parameter: the size of the window the script will use for the connection. It essentially implements a client for the ECHO service provided by the inetd of a Solaris system. Before you can use this script, there is an important prerequisite. The script totally circumvents the TCP/IP stack of the client. From the perspective of the client’s OS this TCP/IP connection doesn’t exist. When a SYN/ACK packet for a TCP/IP connection that doesn’t exist arrives at the client, a security mechanism of the OS may kick in: the client sends a RST packet to the server to terminate this connection, and nothing will work. You have to suppress these RST packets. I was using a notebook with Ubuntu as the client, thus I used the following iptables command.

 iptables -A OUTPUT -p tcp --tcp-flags RST RST -s 192.168.1.106 -j DROP 
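
Before looking at the output, it helps to know which sequence numbers to expect from the client. The following Python sketch only redoes the bookkeeping of simulation.py with relative numbers, starting at 0 the way the traces in this article display them. The function name relative_sequence_plan is made up for this illustration, and the sketch sends no packets.

```python
def relative_sequence_plan(payload_len=5, rounds=3):
    """Return (packet, relative client seq) pairs in the order the script sends them."""
    plan = []
    seq = 0
    plan.append(("SYN", seq))
    seq += 1                    # the SYN consumes one sequence number
    plan.append(("ACK", seq))   # third step of the handshake
    for _ in range(rounds):
        plan.append(("DATA", seq))
        seq += payload_len      # 'NARF\n' is five bytes long
        plan.append(("ACK", seq))
    plan.append(("FIN", seq))
    seq += 1                    # the FIN consumes one sequence number as well
    plan.append(("ACK", seq))
    return plan


if __name__ == "__main__":
    for packet, seq in relative_sequence_plan():
        print(packet, seq)
```

The client packets should thus show up with Seq=1, 6, 11, 16 and finally 17, provided the server plays along.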

So, when I’m running this script as ./simulation.py 8192 on my client system, the output looks perfect.

 275.609157 192.168.1.106 -> 192.168.1.104 TCP 60 mao > echo [SYN] Seq=0 Win=560 Len=0 MSS=1460
275.609193 192.168.1.104 -> 192.168.1.106 TCP 58 echo > mao [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
275.662496 192.168.1.106 -> 192.168.1.104 TCP 60 mao > echo [ACK] Seq=1 Ack=1 Win=8192 Len=0 MSS=1460
275.702554 192.168.1.106 -> 192.168.1.104 ECHO 63 Request
275.702645 192.168.1.104 -> 192.168.1.106 TCP 54 echo > mao [ACK] Seq=1 Ack=6 Win=64240 Len=0
275.702728 192.168.1.104 -> 192.168.1.106 ECHO 59 Response
276.239900 192.168.1.106 -> 192.168.1.104 TCP 60 mao > echo [ACK] Seq=6 Ack=6 Win=8192 Len=0 MSS=1460
276.285099 192.168.1.106 -> 192.168.1.104 ECHO 63 Request
276.285128 192.168.1.104 -> 192.168.1.106 TCP 54 echo > mao [ACK] Seq=6 Ack=11 Win=64240 Len=0
276.285238 192.168.1.104 -> 192.168.1.106 ECHO 59 Response
276.828283 192.168.1.106 -> 192.168.1.104 TCP 60 mao > echo [ACK] Seq=11 Ack=11 Win=8192 Len=0 MSS=1460
276.868573 192.168.1.106 -> 192.168.1.104 ECHO 63 Request
276.868637 192.168.1.104 -> 192.168.1.106 ECHO 59 Response
277.417182 192.168.1.106 -> 192.168.1.104 TCP 60 mao > echo [ACK] Seq=16 Ack=16 Win=8192 Len=0 MSS=1460
277.467406 192.168.1.106 -> 192.168.1.104 TCP 60 mao > echo [FIN, ACK] Seq=16 Ack=16 Win=8192 Len=0 MSS=1460
277.467480 192.168.1.104 -> 192.168.1.106 TCP 54 echo > mao [ACK] Seq=16 Ack=17 Win=64240 Len=0
277.467700 192.168.1.104 -> 192.168.1.106 TCP 54 echo > mao [FIN, ACK] Seq=16 Ack=17 Win=64240 Len=0
278.019460 192.168.1.106 -> 192.168.1.104 TCP 60 mao > echo [ACK] Seq=17 Ack=17 Win=8192 Len=0 MSS=1460

However, if you start it as ./simulation.py 560, the situation looks a little bit different: the connection isn’t working at all … for example, there is no ECHO response.

 494.973579 192.168.1.106 -> 192.168.1.104 TCP 60 63339 > echo [SYN] Seq=0 Win=560 Len=0 MSS=1460
494.973635 192.168.1.104 -> 192.168.1.106 TCP 58 echo > 63339 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
495.033481 192.168.1.106 -> 192.168.1.104 TCP 60 63339 > echo [ACK] Seq=1 Ack=1 Win=560 Len=0 MSS=1460
495.069936 192.168.1.106 -> 192.168.1.104 ECHO 63 Request
495.606109 192.168.1.106 -> 192.168.1.104 TCP 60 [TCP ACKed unseen segment] 63339 > echo [ACK] Seq=6 Ack=6 Win=560 Len=0 MSS=1460
495.606126 192.168.1.104 -> 192.168.1.106 TCP 54 echo > 63339 [RST] Seq=6 Win=64240 Len=0
495.646050 192.168.1.106 -> 192.168.1.104 ECHO 63 [TCP ACKed unseen segment] Request
495.646079 192.168.1.104 -> 192.168.1.106 TCP 54 echo > 63339 [RST] Seq=6 Win=64240 Len=0
496.116470 192.168.1.104 -> 192.168.1.106 TCP 58 [TCP Retransmission] echo > 63339 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460

It gets a little bit more obvious if you comment out all lines in the script after the first sleep(0.5) and run it with ./simulation.py 560. You will see the retransmission of the SYN/ACK packet, like we saw it in the tcpdump of the customer.

 
  0.031040 192.168.1.106 -> 192.168.1.104 TCP 60 53442 > echo [SYN] Seq=0 Win=560 Len=0 MSS=1460
  0.031197 192.168.1.104 -> 192.168.1.106 TCP 58 echo > 53442 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
  0.070388 192.168.1.106 -> 192.168.1.104 TCP 60 53442 > echo [ACK] Seq=1 Ack=1 Win=560 Len=0 MSS=1460
  0.118710 192.168.1.106 -> 192.168.1.104 ECHO 63 Request
  1.166558 192.168.1.104 -> 192.168.1.106 TCP 58 [TCP Retransmission] echo > 53442 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
  3.437331 192.168.1.104 -> 192.168.1.106 TCP 58 [TCP Retransmission] echo > 53442 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460

The connection stays in SYN_RCVD as shown by netstat -a:

 
TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
template.5999              *.*                0      0 128000      0 LISTEN
template.4999              *.*                0      0 128000      0 LISTEN
192.168.1.104.ssh    MBP-of-c0t0d0s0.fritz.box.54913 131024      0 128872      0 ESTABLISHED
template.42623             *.*                0      0 128000      0 LISTEN
192.168.1.104.echo   linuxdesktop.fritz.box.53442     0      0 64240      0 SYN_RCVD

The reason is obvious when you call the state diagram of TCP to mind. The connection switches to SYN_RCVD when the initial SYN is received and the SYN/ACK is sent. However, as the following ACK of the client is ignored, it never transitions to the ESTABLISHED state. However, when you use the mdb statement I mentioned before, the connection is established despite the small window, as the check I described in the beginning is essentially deactivated. When I start the script with ./simulation.py 560, the tcpdump looks like this:

 0.029327 192.168.1.106 -> 192.168.1.104 TCP 60 34945 > echo [SYN] Seq=0 Win=560 Len=0 MSS=1460
  0.029367 192.168.1.104 -> 192.168.1.106 TCP 58 echo > 34945 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
  0.082751 192.168.1.106 -> 192.168.1.104 TCP 60 34945 > echo [ACK] Seq=1 Ack=1 Win=560 Len=0 MSS=1460
  0.118407 192.168.1.106 -> 192.168.1.104 ECHO 63 Request
  0.118437 192.168.1.104 -> 192.168.1.106 TCP 54 echo > 34945 [ACK] Seq=1 Ack=6 Win=64240 Len=0
  0.118546 192.168.1.104 -> 192.168.1.106 ECHO 59 Response
  0.656412 192.168.1.106 -> 192.168.1.104 TCP 60 34945 > echo [ACK] Seq=6 Ack=6 Win=560 Len=0 MSS=1460
  0.699617 192.168.1.106 -> 192.168.1.104 ECHO 63 Request
  0.699648 192.168.1.104 -> 192.168.1.106 TCP 54 echo > 34945 [ACK] Seq=6 Ack=11 Win=64240 Len=0
  0.699747 192.168.1.104 -> 192.168.1.106 ECHO 59 Response
  1.344686 192.168.1.106 -> 192.168.1.104 TCP 60 34945 > echo [ACK] Seq=11 Ack=11 Win=560 Len=0 MSS=1460
  1.345173 192.168.1.106 -> 192.168.1.104 ECHO 63 Request
  1.345401 192.168.1.104 -> 192.168.1.106 ECHO 59 Response
  1.960432 192.168.1.106 -> 192.168.1.104 TCP 60 34945 > echo [ACK] Seq=16 Ack=16 Win=560 Len=0 MSS=1460
  1.961160 192.168.1.106 -> 192.168.1.104 TCP 60 34945 > echo [FIN, ACK] Seq=16 Ack=16 Win=560 Len=0 MSS=1460
  1.961204 192.168.1.104 -> 192.168.1.106 TCP 54 echo > 34945 [ACK] Seq=16 Ack=17 Win=64240 Len=0
  1.961449 192.168.1.104 -> 192.168.1.106 TCP 54 echo > 34945 [FIN, ACK] Seq=16 Ack=17 Win=64240 Len=0
  2.575779 192.168.1.106 -> 192.168.1.104 TCP 60 34945 > echo [ACK] Seq=17 Ack=17 Win=560 Len=0 MSS=1460

It pretty much looks like the connection made with a window of 8192. Changing the value of tcp_init_wnd_chk had the desired effect.
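
The three runs of the demonstration can be summarised with a toy model of the server side. This is just a Python sketch of the state transitions as described in this article. The class ServerSideConnection and the helper final_state are hypothetical names, and the model assumes that the server compares the window against the MSS announced by the client.

```python
class ServerSideConnection:
    """Toy model of how the server sees a single connection attempt."""

    def __init__(self, init_wnd_chk=4096):
        self.init_wnd_chk = init_wnd_chk
        self.state = "LISTEN"
        self.peer_mss = None

    def receive_syn(self, mss):
        if self.state == "LISTEN":
            self.peer_mss = mss
            self.state = "SYN_RCVD"   # the SYN/ACK goes out at this point
        return self.state

    def receive_ack(self, window):
        if self.state == "SYN_RCVD":
            too_small = window < self.peer_mss and window < self.init_wnd_chk
            if not too_small:
                self.state = "ESTABLISHED"
            # otherwise the ACK is ignored and the state doesn't change
        return self.state


def final_state(window, mss=1460, init_wnd_chk=4096):
    """Run SYN and ACK through the model and return the resulting state."""
    conn = ServerSideConnection(init_wnd_chk)
    conn.receive_syn(mss)
    return conn.receive_ack(window)


if __name__ == "__main__":
    print(final_state(8192))                 # ./simulation.py 8192
    print(final_state(560))                  # ./simulation.py 560
    print(final_state(560, init_wnd_chk=0))  # ./simulation.py 560 after the mdb command
```

The model ends up in ESTABLISHED, SYN_RCVD and ESTABLISHED, which is what the traces and the netstat output in this article show.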