

Spent an hour debugging server-crashing errors

Discussion in 'Tech Talk & Web Things' started by Floris, Oct 9, 2017.

    Earlier this morning a flood of text messages hit my phone. It turned out a server array was having trouble staying online. One failover would succeed, the primary would come back, data recovery would complete and changes would sync, only for the failover to fail again and the whole process to repeat itself.

    It took quite a bit of time to log in and stay connected long enough, with a few forked shell sessions (for easier reconnecting, etc.), to gather up hardware diagnostics, syslog files, and so on.
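    For context, "forked sessions for easier reconnecting" can be approximated with a tiny retry wrapper like this. This is just a sketch: the helper name and the host are made up, and the real setup probably layered tmux or screen on top so sessions survive a drop.

```shell
#!/bin/sh
# retry_connect CMD MAX [DELAY]: run CMD, retrying up to MAX times with
# DELAY seconds between attempts. Returns 0 on the first success, 1 if
# every attempt fails. Handy for wrapping ssh on a flaky connection.
retry_connect() {
    cmd=$1; max=$2; delay=${3:-1}
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if $cmd; then
            return 0          # connected / command succeeded
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1                  # gave up after MAX attempts
}

# Example (host is a placeholder):
#   retry_connect "ssh admin@example-array-1" 9999 2
```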

    Someone else and I read through thousands of lines of log files, and the company owner rented a temporary server to add another stable ww1 DNS alternative, giving us room to work on things without any actual downtime.

    For the visitor, when one server dies the failover kicks in and they get a reconnected socket, so there's no downtime; at worst there's a weird little "why isn't this picture loading.." moment for a second, but the site remains reachable.

    Once we'd figured out there was no attack going on, no flooding or hacking, it became a matter of narrowing down what was actually happening.

    The log files kept pointing to a range of processes that seemed to be colliding and failing to store their data properly, which triggered the fallback/failover scripts, and data recovery kicked in before anything had completed.

    Thankfully, I never wrote those shell scripts; they were too complex for me, and apparently too complex for that developer as well. But I saw the error, fixed a few if-conditions, tried to simplify the setup, and switched to async checks of the network connections against the array data.
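    I don't have their actual scripts, but the "colliding procs" pattern usually comes down to missing locking: two monitors both decide to fire recovery at once. A hypothetical sketch of the kind of fix described above (all paths are made up):

```shell
#!/bin/sh
# Serialise the failover trigger so two monitoring processes can't both
# fire recovery at the same time, and check hosts asynchronously.
LOCKFILE=/tmp/failover.lock
LOGFILE=/tmp/failover.log

trigger_failover() {
    # flock -n fails immediately if another instance already holds the
    # lock, so a second, colliding invocation becomes a no-op instead
    # of a second recovery run stomping on the first.
    flock -n "$LOCKFILE" -c "echo 'starting failover' >> $LOGFILE" \
        || echo "failover already in progress, skipping"
}

# Async health checks: ping each host in the background, then wait for
# all of them, instead of checking one by one and blocking.
check_hosts() {
    for host in "$@"; do
        ( ping -c 1 -W 2 "$host" >/dev/null 2>&1 || echo "$host down" ) &
    done
    wait
}
```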

    Things stabilized and the RAM and CPU spikes disappeared. First, of course, the data had to sync up between all connected arrays. Secondly, the customers visiting the site needed a guarantee that their preferences, account data, purchases and all that stuff had actually gone through, so the owner of the company got someone else to show up early on Monday and double-check the purchases made, pending shopping carts, etc. Thankfully everything was fine. It was just software failing, causing the hardware to be overly busy.

    I am sharing this, very simplified and probably 30% incorrect (I'm not too familiar with the architecture of their network and hardware), because it's important to realise and understand how much it means to someone who owns a site that all the little things they have running to keep things smooth can also work against them. They can cause unforeseen situations and domino effects, and without proper alert systems this could have gone on for days, until sales might have been lost, etc.

    We talked about the consequences of future hardware or software failures, and the guy is going to hire a developer to convert the little setups into an easier-to-maintain, more modern solution, so debugging what's up is easier.

    At least the failover kicked in, and at least the recovery kicked in. And thankfully, when I last recommended adding a live data sync feature as well, they built one, and it made sure the end users browsing the site and spending their money didn't get wrongly charged or marked as 'never paid', etc. And I love how my shell scripts, which semi-frequently check whether certain data is there, trigger the alerts rather quickly; we got there in time, before it became terrible and the owner of the company had to go to the hosting company to restart and debug on site.
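    Those "is the data there" checks can be as simple as a cron job like the sketch below. The function name, file paths, and thresholds are all placeholders, not the actual scripts; the real ones would hook into whatever alerting the site uses instead of echoing.

```shell
#!/bin/sh
# Hypothetical freshness check, run from cron every few minutes.
# Alerts if FILE is missing, or hasn't been touched within MAX_AGE
# seconds (i.e. the sync that should be writing it has stalled).
check_data() {
    file=$1; max_age=$2
    if [ ! -f "$file" ]; then
        echo "ALERT: $file is missing"
        return 1
    fi
    now=$(date +%s)
    # GNU stat first, BSD stat as a fallback.
    mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file")
    if [ $((now - mtime)) -gt "$max_age" ]; then
        echo "ALERT: $file is stale"
        return 1
    fi
    return 0
}

# Assumed crontab entry:
#   */5 * * * * /usr/local/bin/check_data.sh /var/data/array-sync.state 600
```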

    Sometimes I am glad I just have a domain with a small website and if it's down: so be it..

    But imagine making thousands of bucks a day, with thousands of bucks in costs every day, and then running into a situation where you have no business for days while trying to understand why your site keeps going offline.

    The guy invested a bit of extra money in an array of servers and services for redundancy, and has been online with it ever since, saving him money in the long run. And yes, this kind of thing doesn't happen a lot when companies take their business seriously, instead of only wanting the things they hear their competitors are doing.