net.ipv4.tcp_tw_recycle
If you are familiar with TCP states, this article may be useful to you, especially if you are trying to debug an issue with dropped or timed-out traffic involving NAT devices, whether on cloud or your own infrastructure. If you are not yet conversant with TCP, read Part 1 of this article, which covers the basics.
This incident happened during my time (~November 2016) at Zomato, where we had enabled third-party integrations for wallet-based payments with FreeCharge for Zomato Order. I was looking after Payments@Zomato back then. Our setup made a lot (300+ per minute) of outbound connections, mostly wallet-balance polling API calls to FreeCharge's servers, to update the current balance of a user active on the Zomato app in the online ordering flow. All outbound traffic from our payment servers went through a NAT device. As I described in my last post, at any given time there are supposed to be many TIME_WAIT sockets on the active closer's system. In our case, FreeCharge was the active closer, and it was randomly dropping incoming connections from us, out of the 300+ calls we made to them every minute.
Now, there are a lot of guides and posts available on the Internet on tuning systems for higher throughput, which also involve sysctl configuration changes. Many of those guides tell you to change the values of tcp_tw_recycle and tcp_tw_reuse to 1.
Now, the Linux kernel documentation is not very helpful about what net.ipv4.tcp_tw_recycle
does:
Enable fast recycling
TIME-WAIT
sockets. Default value is 0. It should not be changed without advice/request of technical experts.
net.ipv4.tcp_tw_reuse is a little better documented, but the language is much the same:
Allow to reuse
TIME-WAIT
sockets for new connections when it is safe from protocol viewpoint. Default value is 0. It should not be changed without advice/request of technical experts.
The direct result of this lack of documentation about the uses and drawbacks of changing such configuration values is that we find numerous tuning guides advising to set both of these settings to 1 to reduce the number of sockets in the TIME-WAIT state. However, the tcp(7) manual page states the problem with the net.ipv4.tcp_tw_recycle option clearly, and it is a problem that is hard to detect and waiting to bite you:
Enable fast recycling of
TIME-WAIT
sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).
So let's try to understand the problem now, and get to correct other people on the Internet. :)
What is tcp_tw_recycle used for?
There are two main reasons for using this option:
The tcp_tw_recycle option significantly shortens the length of time that a socket must stay in the TIME_WAIT state. As we know from the TCP connection lifecycle, a socket in the TIME_WAIT state is unable to accept a new connection until the TIME_WAIT period expires. Too many sockets in TIME_WAIT on a busy server can eventually cause port exhaustion, preventing any new connections from being formed until some of the TIME_WAIT periods expire. tcp_tw_recycle aims to provide the same protection that TIME_WAIT offers by making use of TCP timestamps, specified in RFC 1323.
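To get a feel for why port exhaustion matters, here is a rough back-of-envelope calculation. The numbers are assumptions on my part: the default Linux ephemeral port range (32768-60999, per ip_local_port_range) and a 60-second TIME_WAIT period:

```python
# Rough estimate of the outbound connection rate a single source IP can
# sustain toward one destination (ip, port) pair before running out of
# ephemeral ports, assuming every connection ends in TIME_WAIT.

ephemeral_ports = 60999 - 32768 + 1   # default Linux range: 32768-60999
time_wait_seconds = 60                # typical TIME_WAIT duration on Linux

# Each port is unusable for time_wait_seconds after its connection closes,
# so the sustainable rate is roughly ports / TIME_WAIT lifetime.
max_conn_per_second = ephemeral_ports / time_wait_seconds
print(f"~{max_conn_per_second:.0f} new connections/second before port exhaustion")
```

At ~470 connections per second this is rarely a problem for modest clients, but a busy proxy or NAT device funneling traffic to a single backend can hit the ceiling quickly, which is why these tuning guides exist in the first place.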
Basically, the active closer remembers the last timestamp value sent by each client's IP address. If a new connection arrives whose TCP timestamp is larger than the last recorded timestamp, then we can be sure that the packet is new, not an old duplicate. If the packet carries a timestamp older than the last one noted, then we can safely assume that the packet is from an old connection and drop it. Technically, RFC 1122 states:
When a connection is closed actively, it MUST linger in TIME-WAIT state for a time 2*MSL (Maximum Segment Lifetime). However, it MAY accept a new SYN from the remote TCP to reopen the connection directly from TIME-WAIT state, if it:
(1) assigns its initial sequence number for the new connection to be larger than the largest sequence number it used on the previous connection incarnation, and
(2) returns to TIME-WAIT state if the SYN turns out to be an old duplicate.
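The per-source-IP timestamp check described above can be sketched roughly as follows. This is a toy model, not the kernel's actual code; the name `accept_syn` and the plain dictionary are my own simplifications:

```python
# A toy model of the per-peer-IP timestamp check behind tcp_tw_recycle.
# The server remembers the last TCP timestamp seen from each source IP;
# a SYN carrying an older timestamp is treated as a stale duplicate and dropped.

last_timestamp = {}  # source IP -> last TCP timestamp value seen

def accept_syn(src_ip, tcp_timestamp):
    """Return True if the SYN should be accepted, False if dropped."""
    last = last_timestamp.get(src_ip)
    if last is not None and tcp_timestamp <= last:
        return False  # older than what we last saw: assumed old duplicate
    last_timestamp[src_ip] = tcp_timestamp
    return True

print(accept_syn("203.0.113.7", 5000))  # True: first packet from this IP
print(accept_syn("203.0.113.7", 5100))  # True: timestamp moved forward
print(accept_syn("203.0.113.7", 4900))  # False: older timestamp, dropped
```

Note that the check is keyed purely on the source IP, which is exactly the assumption NAT breaks, as the next section shows.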
The feud of NAT with tcp_tw_recycle
As highlighted in the previous section, the last seen timestamp is recorded per IP address. When multiple client machines are behind a NAT, this can pose serious trouble, since all of them will have the same source IP from the server's point of view. Moreover, each device generates its TCP timestamps from its own clock with a pseudo-random offset, so each device on the network will have a different timestamp value. This can lead to some devices behind a shared IP address passing the tcp_tw_recycle test while others fail and are unable to connect. In the end it all looks random, which is exactly what happened in our case: of the two machines behind the NAT device, requests from either of them would fail in random fashion. To explain all this in a simple manner, let us consider the following example:
- Client 1 and Client 2 are behind a firewall with NAT.
- Client 1 makes a successful web request to WebServer A, which has tcp_tw_recycle enabled. Note that since NAT is in play, from WebServer A's point of view the request appears to come from the NAT device's source IP, not from either client's own IP.
- Let's say the WebServer actively closes the connection after sending the necessary data. With tcp_tw_recycle enabled, the socket in question is recycled out of the TIME_WAIT state quickly, and the server takes note of the last TCP timestamp it received from the NAT device's IP (since Client 1's IP is NAT'ed by the NAT device).
- Now, if Client 2 tries to connect to WebServer A by sending a SYN packet, and its TCP timestamp happens to be smaller than Client 1's, WebServer A compares that timestamp to the one it previously recorded, sees that the new timestamp is smaller, and drops the SYN.
- Client 2 is unable to communicate with WebServer A until the TIME_WAIT period expires.
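The steps above can be simulated with a toy sketch. Everything here is illustrative: the IP addresses, the timestamp values, and the function name `server_accepts` are all made up, and the real kernel tracks more state than a single dictionary:

```python
# Toy simulation of the NAT scenario: both clients appear to the server
# as the NAT device's single IP, so their timestamps are compared against
# each other even though they come from different machines.

nat_ip = "198.51.100.1"   # shared public IP of the NAT device
last_timestamp = {}       # server's per-source-IP timestamp memory

def server_accepts(src_ip, ts):
    """Model of a tcp_tw_recycle-style check: drop SYNs with stale timestamps."""
    last = last_timestamp.get(src_ip)
    if last is not None and ts <= last:
        return False      # stale timestamp: SYN silently dropped
    last_timestamp[src_ip] = ts
    return True

client1_ts = 900_000      # each client's timestamp clock starts independently
client2_ts = 100_000      # Client 2's clock happens to be "behind" Client 1's

print(server_accepts(nat_ip, client1_ts))  # True: Client 1 connects fine
print(server_accepts(nat_ip, client2_ts))  # False: Client 2's SYN is dropped
```

From Client 2's perspective nothing comes back at all, which is why the failure shows up as a timeout rather than a connection refusal.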
How to fix it?
$ sysctl net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 1
Set the value for this setting to 0 in /etc/sysctl.conf (you need root access to do that), then reload the configuration:
$ cat /etc/sysctl.conf | grep tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 0
$ sysctl -p /etc/sysctl.conf
Since the packets are dropped, the client actually keeps waiting for a response until it times out. That is exactly what was happening to us: random requests would start timing out, and the problem would fix itself after a few seconds. When we reached out to FreeCharge's team, they were unable to trace any calls to their application during that window, because the requests never reached the application behind the WebServer; the SYN packets were simply being dropped. After careful research and help from Shrey Sinha (Head of Infrastructure @ Zomato) and FreeCharge's team, we were able to get the problem fixed.
Further Reading :
- http://slashtwentyfour.net/2016-09-24-tcp_tw_recycle-dangers/ - This article also mentions a script the author wrote to find the number of broken servers on the Internet with this issue.
- https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux - This article also covers the effects and use of tcp_tw_reuse on both the client and the server end.