For 3-4 days in a row, our servers went down almost exactly around our peak load period (around 2 pm IST). Debugging it was turning out to be hard, because our application has a fairly high (above average?) level of complexity. On top of that, we had so far resisted fiddling with Linux server internals and configuration.*
But this time, it turned out to be unavoidable. To tell the story in the order it unfolded: our first response, on looking at our dashboard of all possible signals, was "hey look, jserr also seems to have gone up around the same time".
On top of that, our mailboxes were flooded with "critical error in websocket server" messages. Looking at the websocket server logs, we saw messages like “”.
We also saw a clear spike in the auth login failure graph. So the first preventive action, after the first day of downtime, was to spin up more login servers.
We also decided that the websocket servers were getting flooded during peak load, so we would spin up more of those too.
The result on the next day: we still had the same spike and downtime, only 1 hr later.
OK, clearly the extra login servers helped, but they were neither the core cause nor a complete fix. Time to set up deeper investigation tools. The first attempt was to stay up until around the peak load time and run sar -A close to it.
While that showed a high number of page faults and context switches, it was not clear what was causing them.
Besides, staying up and actively monitoring led us to proactively restart some of the servers/processes, which seemed to head off the peak-load downtime.
But 4-5 hrs later than usual, the servers did go down again.
About a week or so before this, we had decided that mongod was causing too much paging, that for our usage it is not a good choice as an in-memory backend, and that we should move most of our data to redis, especially chat rooms and messages.
So we had already started work on it and were testing the migration to redis around this time. We decided to accelerate the testing and release, and pooled our efforts.
As a result, we had a good, solid working copy of the code that uses redis to create new rooms and to send and distribute messages.
So we went ahead and deployed it, but discovered that our code to migrate old and existing rooms was taking quite a long time, because of the large number of old private rooms ** present.
Still, this deploy seemed stable in production, and for a change we went about 30 hrs without server downtime. But then, early Saturday morning, it went down again.
This time, a colleague noticed a large number of open TCP sockets/connections in the SYN_RECV state, i.e. the output of netstat -atn showed a whole lot of connections in that state.
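A quick way to see how many connections sit in each state is to tally the last column of the netstat output. The following is only a minimal sketch: the function name count_tcp_states is made up for illustration, and it assumes the usual Linux netstat -atn layout, where the state is the sixth column.

```shell
#!/bin/sh
# Hypothetical helper: tally TCP connections per state.
# Reads `netstat -atn` style output on stdin; assumes the state is column 6.
count_tcp_states() {
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Usage on a live Linux host (uncomment to run):
# netstat -atn | count_tcp_states
```

Running something like this every minute or so around peak time should show whether the SYN_RECV count keeps growing ahead of an outage.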
Based on my understanding of http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html, our current TCP keepalive settings on kings’ landing consider a connection dead only after 2 hrs 11 minutes (7200 + 75 * 9 seconds).
Each of those three values can be configured by writing to the corresponding file under /proc/sys/net/ipv4/, for example:
# echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
# echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl
# echo 20 > /proc/sys/net/ipv4/tcp_keepalive_probes
I am changing only the first one, tcp_keepalive_time, to 600, and leaving the interval and probe count at their current values (75 and 9). tcp_keepalive_time is the number of seconds without packet activity on a socket before a TCP keepalive packet is sent to check whether the connection is still alive.
With this change, dead connections will be detected after about 21 minutes of inactivity (600 + 75 * 9 seconds).
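The two figures above follow from the same formula: idle time before the first probe, plus the probe interval multiplied by the number of probes. Here is a small sketch of that arithmetic; keepalive_timeout is a made-up helper name, and the 75 and 9 are the interval and probe count used in the calculations above.

```shell
#!/bin/sh
# Hypothetical helper: worst-case seconds before an idle, dead peer is detected.
# Arguments: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.
keepalive_timeout() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_timeout 7200 75 9   # current settings: 7875 s (2 h 11 min 15 s)
keepalive_timeout 600 75 9    # with tcp_keepalive_time=600: 1275 s (21 min 15 s)

# On a Linux host, the same calculation from the live settings.
p=/proc/sys/net/ipv4
if [ -r "$p/tcp_keepalive_time" ]; then
    keepalive_timeout "$(cat "$p/tcp_keepalive_time")" \
                      "$(cat "$p/tcp_keepalive_intvl")" \
                      "$(cat "$p/tcp_keepalive_probes")"
fi
```

Keep in mind that these kernel settings only apply to sockets on which the application has enabled SO_KEEPALIVE.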
One downside is that if someone opens a chat room and starts typing but hits send only after 21 minutes, the client-side app will have to reacquire a connection.
The reasoning: dead connections and inactive app clients are currently cleaned up only after 2 hrs and 11 minutes, but there is a limit on how many sockets our server can keep active. So during peak load it runs out of sockets to hand out.
Assumptions that may fail:
1. The app client is sensible enough to reconnect/reacquire a connection before sending (or at least when there has been an inactivity gap before the send).
With this change, we should not have a server failure at peak load. In fact, we currently see a failure/sudden drop at 25 messages/5s (or whatever the x-axis in graphite shows).
After this change, we should see our peak messaging rate go up, without the sudden drop.
To undo this change, just run:
# echo 7200 > /proc/sys/net/ipv4/tcp_keepalive_time
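Rather than hard-coding 7200 for the rollback, one option is to save whatever value was in place before the change and restore from that. This is only a sketch: set_with_backup and restore_from_backup are hypothetical helper names, and the backup path in the usage comment is an arbitrary choice. The parameter file is passed as an argument so the helpers can be tried on an ordinary file before touching /proc.

```shell
#!/bin/sh
# Hypothetical helpers for changing a kernel parameter with an easy rollback.

# usage: set_with_backup PARAM_FILE NEW_VALUE BACKUP_FILE
set_with_backup() {
    # Save the current value first; only write the new one if the save worked.
    cat "$1" > "$3" && echo "$2" > "$1"
}

# usage: restore_from_backup PARAM_FILE BACKUP_FILE
restore_from_backup() {
    cat "$2" > "$1"
}

# Usage as root on the server (uncomment to run; backup path is an assumption):
# set_with_backup /proc/sys/net/ipv4/tcp_keepalive_time 600 /root/tcp_keepalive_time.bak
# restore_from_backup /proc/sys/net/ipv4/tcp_keepalive_time /root/tcp_keepalive_time.bak
```

Values written under /proc do not survive a reboot, so a freshly started instance comes up with the defaults again unless the setting is also placed somewhere persistent, such as a net.ipv4.tcp_keepalive_time = 600 line in /etc/sysctl.conf.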
* - The core reason is simply that we would like to scale our app to 100x or more of its current size, and poking around kernel parameters and configuration makes it harder to leverage the power of the cloud, instant instance creation, etc. (though with ec2’s tagging etc. it may not be all that hard).
** - Private messages to another user or to a group chat.