Website downtime 6 July
6 July 2012
On the afternoon and evening of 6 July, websites on our main server became inaccessible due to server load problems, which are being investigated. We apologise for any inconvenience and will post more information as it comes to hand.
The web server was overloaded for some hours, either by a form of DoS or by a runaway script. Access was possible for a time, but it was very slow, leading to web browser timeouts.
Update: the web server used up all of its memory around 13:35 AEST, causing the database and then other processes to slow down and, by 16:00, become unresponsive.
We were able to access the server console around 17:00, but because of the memory overload it took more than two hours to execute the commands required to correct the situation.
The cause hasn't yet been identified, and as this is unlikely to happen under normal conditions, we have installed some extra monitoring programs to analyse memory usage over the next 24 hours.
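For illustration only, here is a minimal sketch of the kind of memory check such monitoring might perform. It is not the monitoring software referred to above. It assumes a Linux server exposing /proc/meminfo, and the function names and the 90% threshold are hypothetical.

```python
# Illustrative sketch only: assumes a Linux host with /proc/meminfo.
# Function names and the default threshold are hypothetical.

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_fraction(info):
    """Return the fraction of RAM in use, between 0.0 and 1.0."""
    total = info["MemTotal"]
    available = info.get("MemAvailable")
    if available is None:
        # Older kernels do not report MemAvailable, so approximate it.
        available = (info.get("MemFree", 0)
                     + info.get("Buffers", 0)
                     + info.get("Cached", 0))
    return 1.0 - available / total


def check_memory(text, threshold=0.9):
    """Return (used fraction, True if usage is at or above threshold)."""
    fraction = used_fraction(parse_meminfo(text))
    return fraction, fraction >= threshold


def read_meminfo(path="/proc/meminfo"):
    """Read the live memory report (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Run from a scheduler every minute or so, check_memory(read_meminfo()) would give a simple log of how close the server is to exhausting its RAM.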
The good news is that the server didn't crash or shut down and no data was damaged. There are also no signs of foul play.
Update 2: the conclusion now is that the server memory usage was pushed over the edge by a new backup script. In the last week we've implemented a system of incremental, PGP-encrypted backups between the web server and other locations. This went live at the end of last week.
The daily backup has now been re-scheduled to run later in the day and, as mentioned before, we are carefully monitoring the server metrics to see if there is a RAM bottleneck that needs to be addressed.
Update 3: the backup scripts have run without problems today (7 July). We've also tweaked some settings to rein in memory usage by Apache, which was previously able to reach critical levels during peak traffic times.
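This post does not say which Apache settings were changed. As a hedged illustration of the usual reasoning, the sketch below estimates a cap on concurrent Apache worker processes (the MaxClients directive in Apache 2.2, renamed MaxRequestWorkers in 2.4) from a RAM budget. The function name and all figures are hypothetical examples, not measurements from this server.

```python
# Illustrative sketch only: the figures used here are made-up examples,
# not values taken from the server described in this post.

def max_workers(total_ram_mb, reserved_mb, per_worker_mb):
    """Estimate how many Apache worker processes fit in RAM.

    total_ram_mb  -- physical memory on the machine
    reserved_mb   -- memory kept aside for the database, backups and OS
    per_worker_mb -- typical resident size of one Apache process
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    budget = total_ram_mb - reserved_mb
    if budget <= 0:
        return 0  # nothing left over for Apache
    return int(budget // per_worker_mb)


# Example: 4096 MB of RAM, 1024 MB reserved, 40 MB per worker.
print(max_workers(4096, 1024, 40))  # prints 76
```

Capping the worker count at or below such an estimate keeps Apache from consuming the memory that the database and backup jobs need during peak traffic.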