The Curious Case of "tcp_tw_recycle"

TL;DR: do not enable net.ipv4.tcp_tw_recycle.

If you know your TCP states, this article may be useful to you, especially if you are debugging dropped or timed-out traffic that passes through NAT devices, whether in the cloud or on your own infrastructure. If you are not yet comfortable with TCP, read Part-1 of this article first, which covers the basics.

This incident happened during my time at Zomato (~November 2016), when we had enabled third-party integrations for wallet-based payments with FreeCharge for Zomato Order. I was looking after Payments@Zomato back then. Our setup made a lot (300+ per minute) of outbound connections to FreeCharge servers, mostly wallet balance polling API calls that updated the current balance of a user active in the online ordering flow of the Zomato app. All outbound traffic from our payment servers went through a NAT device. As I described in my last post, the active closer of a connection is left with many TIME_WAIT sockets at any given time. In our case FreeCharge was the active closer, and it was randomly dropping some of the 300+ incoming connections we made to them every minute.

There are a lot of guides and posts on the internet about tuning systems for higher throughput, and many of them involve changing sysctl configuration. A good number of those guides will tell you to set both tcp_tw_recycle and tcp_tw_reuse to 1.

Now, the Linux kernel documentation is not very helpful about what net.ipv4.tcp_tw_recycle does:

Enable fast recycling TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts.

net.ipv4.tcp_tw_reuse is a little better documented, but the language is much the same:

Allow to reuse TIME-WAIT sockets for new connections when it is safe from protocol viewpoint. Default value is 0. It should not be changed without advice/request of technical experts.

The direct result of this lack of documentation about the uses and drawbacks of these settings is that numerous tuning guides advise setting both of them to 1 to reduce the number of entries in the TIME-WAIT state. The tcp(7) manual page, however, clearly states the issue with the net.ipv4.tcp_tw_recycle option, a problem that is hard to detect and waiting to bite you:

Enable fast recycling of TIME-WAIT sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).

So let's try to understand the problem now, and try to correct other people on the Internet. :)

What is `tcp_tw_recycle` used for ?

There are mainly two reasons for using this option:

As we know from the TCP connection lifecycle, a socket in the TIME_WAIT state cannot be used for a new connection until the TIME_WAIT period expires. Too many sockets in TIME_WAIT on a busy server can eventually cause port exhaustion, preventing any new connections from being formed until some of the TIME_WAIT periods expire. The tcp_tw_recycle option significantly shortens the length of time that a socket must stay in the TIME_WAIT state.
tcp_tw_recycle is able to provide the same protection which TIME_WAIT offers by making use of TCP Timestamps, specified in RFC 1323.
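
To get a feel for the port exhaustion mentioned above, here is a back-of-the-envelope sketch. The port range and the 60 second TIME_WAIT duration are assumptions (they are common Linux defaults, but check your own system), and the function name is made up for this illustration.

```python
# Rough estimate of how fast one host can open connections to a single
# destination before sockets stuck in TIME_WAIT use up its ephemeral ports.
# All three numbers below are assumptions, not measurements.

PORT_RANGE_LOW = 32768    # assumed lower bound of net.ipv4.ip_local_port_range
PORT_RANGE_HIGH = 60999   # assumed upper bound of net.ipv4.ip_local_port_range
TIME_WAIT_SECONDS = 60    # assumed TIME_WAIT duration


def max_sustained_rate(low: int, high: int, time_wait: int) -> float:
    """Connections per second to one (destination IP, port) pair that can be
    sustained if every closed connection pins its local port for time_wait."""
    usable_ports = high - low + 1
    return usable_ports / time_wait


rate = max_sustained_rate(PORT_RANGE_LOW, PORT_RANGE_HIGH, TIME_WAIT_SECONDS)
print(f"{PORT_RANGE_HIGH - PORT_RANGE_LOW + 1} ports, "
      f"about {rate:.0f} new connections per second")
```

Under these assumptions the ceiling is a few hundred new connections per second towards one destination, which is why busy servers go looking for ways to shorten TIME_WAIT in the first place.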

Basically, the active closer remembers the last timestamp value sent from the client’s IP address. If a new connection arrives with a TCP timestamp larger than the last recorded one, we can be sure that the packet is new and not an old duplicate. If the packet carries a timestamp older than the last one noted, it is assumed to belong to an old connection and is dropped. Technically, RFC 1122 states:

When a connection is closed actively, it MUST linger in TIME_WAIT state for a time 2*MSL (Maximum Segment Lifetime). However, it MAY accept a new SYN from the remote TCP to reopen the connection directly from TIME_WAIT state, if it:

(1) assigns its initial sequence number for the new connection to be larger than the largest sequence number it used on the previous connection incarnation, and

(2) returns to TIME_WAIT state if the SYN turns out to be an old duplicate.
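
The per-IP timestamp check described above can be sketched in a few lines. This is only a toy model and not the kernel's code: the `last_seen` dictionary and the `accept_syn` function are hypothetical names, timestamp wraparound is ignored, and an equal timestamp is treated as acceptable for simplicity.

```python
from typing import Dict

# Last TCP timestamp value seen from each source IP. This is a hypothetical
# structure for illustration; the kernel keeps its own per-peer record.
last_seen: Dict[str, int] = {}


def accept_syn(source_ip: str, tsval: int) -> bool:
    """Return True if a SYN with this timestamp passes the per-IP check."""
    previous = last_seen.get(source_ip)
    if previous is not None and tsval < previous:
        return False          # older than what we remember: treat as a duplicate
    last_seen[source_ip] = tsval
    return True


print(accept_syn("198.51.100.7", 1000))  # first connection from this IP: True
print(accept_syn("198.51.100.7", 1500))  # newer timestamp: True
print(accept_syn("198.51.100.7", 1200))  # older timestamp: False
```

Note that the only key is the source IP. That works as long as one IP address means one machine with one clock.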

The feud of NAT with `tcp_tw_recycle`

As highlighted in the previous section, the last seen timestamp is recorded per IP address. When multiple client machines sit behind a NAT this becomes a serious problem, since all of them have the same source IP from the server's point of view.

Moreover, TCP timestamp values are not synchronized between hosts: each machine derives them from its own clock (and some systems add a random offset), so each device behind the NAT will have a different timestamp value. As a result, some devices behind a shared IP address pass the tcp_tw_recycle test while others fail and are unable to connect. In the end it all looks random, which is exactly what we saw: requests from either of the two machines behind our NAT device would fail at random. To explain all this in a simple manner, let us consider the following example:

Client 1 and Client 2 are behind a firewall with NAT.
Client 1 makes a successful web request to WebServer A, which has tcp_tw_recycle enabled. Note that because of NAT, WebServer A sees the request as coming from the source IP of the NAT device rather than from either client.
Let's say WebServer A actively closes the connection after sending the necessary data. The socket in question moves into the TIME_WAIT state, and WebServer A takes note of the last TCP timestamp it received from the NAT device's IP (since Client 1's IP is NAT'ed by the NAT device).
Now Client 2 tries to connect to WebServer A by sending a SYN packet. WebServer A compares the SYN's TCP timestamp with the one it previously recorded, and if Client 2's value happens to be smaller than Client 1's, the SYN is dropped.
Client 2 is unable to communicate with WebServer A until the TIME_WAIT period expires.
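
The five steps above can be replayed with a small simulation. It is a sketch under stated assumptions rather than a model of the real kernel: the `Server` class, the 60 second window, the documentation-range IP address and all timestamp values are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

RECYCLE_WINDOW = 60  # seconds a remembered timestamp stays valid (assumed value)


@dataclass
class Server:
    """Toy model of WebServer A with tcp_tw_recycle enabled."""
    # source IP -> (last timestamp value, second at which it was recorded)
    peers: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def close_connection(self, source_ip: str, tsval: int, now: int) -> None:
        """Server closes actively and remembers the peer's last timestamp."""
        self.peers[source_ip] = (tsval, now)

    def on_syn(self, source_ip: str, tsval: int, now: int) -> str:
        """Apply the per-IP timestamp check to an incoming SYN."""
        record = self.peers.get(source_ip)
        if record is not None:
            last_tsval, recorded_at = record
            if now - recorded_at < RECYCLE_WINDOW and tsval < last_tsval:
                return "dropped"
        return "accepted"


NAT_IP = "203.0.113.10"   # both clients appear to the server as this address
server = Server()

# Client 1 (timestamp clock around 5,000,000) connects, then the server closes.
print(server.on_syn(NAT_IP, 5_000_000, now=0))    # accepted
server.close_connection(NAT_IP, 5_000_100, now=1)

# Client 2 (timestamp clock around 20,000) tries a few seconds later.
print(server.on_syn(NAT_IP, 20_000, now=5))       # dropped

# Client 2 retries once the window has passed.
print(server.on_syn(NAT_IP, 90_000, now=75))      # accepted
```

Swap the two clients' clock values and it is Client 1 that gets locked out instead, which is why the failures look random from the client side.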

How to fix it?

$ sysctl net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 1

If the value is 1, set net.ipv4.tcp_tw_recycle to 0 in /etc/sysctl.conf and reload the file. You need root access to do that.

$ cat /etc/sysctl.conf | grep tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 0

$ sysctl -p /etc/sysctl.conf
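
If you want to run the same check from a script, for example across a fleet of servers, a minimal sketch could read the value straight from procfs. The function name is made up for this example. As far as I know the option was removed from Linux in kernel 4.12, so on newer kernels the file simply does not exist, and the sketch treats that case as "not available".

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of this sysctl on kernels that still have it.
DEFAULT_PATH = Path("/proc/sys/net/ipv4/tcp_tw_recycle")


def read_tw_recycle(path: Path = DEFAULT_PATH) -> Optional[int]:
    """Return the current value, or None if the kernel does not expose it."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None


value = read_tw_recycle()
if value is None:
    print("tcp_tw_recycle is not available on this kernel")
elif value == 0:
    print("tcp_tw_recycle is disabled")
else:
    print("tcp_tw_recycle is ENABLED; set it to 0")
```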

Since the packets are dropped, the client keeps waiting for a response until it times out. That is exactly what was happening to us: random requests would start timing out, and the problem would fix itself after a few seconds. When we reached out to FreeCharge's team, they were unable to trace any calls to their application during that window, because the requests never reached the application behind the WebServer; the SYN packets were simply being dropped. After careful research and help from Shrey Sinha (Head of Infrastructure @ Zomato) and FreeCharge's team, we were able to get the problem fixed.

Further Reading :

http://slashtwentyfour.net/2016-09-24-tcp_tw_recycle-dangers/ - The article also mentions a script which the author wrote to find the number of broken servers on the internet with this issue.
https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux - This article also covers the effects and use of tcp_tw_reuse on the client and server end.
