Learnings from a 5-hour production downtime!
As with all incidents, it happened on a Friday evening!
In this article, I'll delve into the causes and prolonged recovery process behind a recent 5-hour downtime in one of our critical production services.
The affected service, a Node.js application, manages data transactions with PostgreSQL, sustaining peak loads of 250K requests per minute. Our server infrastructure is orchestrated via Kubernetes, with AWS RDS serving as the backend database.
The Beginning (5 PM):
The problem started around 5 PM, when our servers began receiving unusually high traffic, 3-4 times the normal volume. Under this load the database server started degrading, and within 15 minutes it was barely able to process any queries.
First Response (5:20 PM)
We investigated possible causes for the traffic surge, such as a marketing campaign, but found nothing conclusive and the traffic was still increasing. To manage the traffic and allow the database to recover, we implemented temporary rate-limiting rules on our firewall. This resulted in a decrease in traffic and signs of database recovery.
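To illustrate the idea behind rate limiting, here is a minimal token-bucket sketch. It is not the firewall rule we used (that was configured on the firewall itself), and the class name, `rate`, and `capacity` are hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    rate: float      # tokens refilled per second (illustrative value)
    capacity: float  # maximum burst size (illustrative value)
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self):
        # Start with a full bucket so an initial burst is allowed.
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def allow(self, now=None):
        """Return True if one request may pass at time `now` (seconds)."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should reject, e.g. with HTTP 429


bucket = TokenBucket(rate=100.0, capacity=200.0)
if not bucket.allow():
    print("request rejected")
```

Requests that arrive while the bucket is empty are rejected instead of being passed on to the database, which is what gave our database room to recover.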
Second Attack (5:45 PM)
Just as we believed the incident had concluded, the RDS console flashed 'Storage Full.' The database had exhausted its storage capacity, rendering it unable to process any new requests. Knowing that AWS allows easy storage expansion, we promptly tried to increase the storage capacity. To our surprise, the request failed with an error saying that the storage could not be increased. After multiple unsuccessful attempts, we found that AWS does not allow the storage of an RDS server to be modified again within 6 hours of a previous change (AWS reference):
"Storage optimization can take several hours. You can't make further storage modifications for either six (6) hours or until storage optimization has completed on the instance, whichever is longer."
But we hadn't increased the storage in the last 6 hours. So who had?
Hidden Attack (5:30 PM)
In AWS, you can configure auto-scaling for storage, allowing an automatic increase in storage when it reaches near capacity. Our database had auto-scaling configured. By 5:30 PM, the surge in traffic had already pushed the database storage to its scale-up threshold, triggering an automatic scale-up. This meant that we would not be able to increase the storage for the next 6 hours!
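The lock-out window follows directly from the rule quoted earlier. The sketch below works out the earliest time a manual increase would have been accepted. The function name and the calendar date are placeholders for illustration; only the 5:30 PM auto-scale time and the six-hour rule come from the incident.

```python
from datetime import datetime, timedelta

# Cooldown stated in the AWS documentation quoted earlier in this article.
STORAGE_MODIFICATION_COOLDOWN = timedelta(hours=6)


def next_storage_change_allowed(last_modified_at, optimization_done_at=None):
    """Earliest time another storage modification can be made.

    The rule is "six hours or until storage optimization has completed,
    whichever is longer". If the optimization end time is unknown, the
    six-hour mark is only a lower bound.
    """
    cooldown_ends = last_modified_at + STORAGE_MODIFICATION_COOLDOWN
    if optimization_done_at is None:
        return cooldown_ends
    return max(cooldown_ends, optimization_done_at)


# Auto-scaling fired at 5:30 PM; the date itself is a placeholder.
autoscaled_at = datetime(2024, 1, 5, 17, 30)
print(next_storage_change_allowed(autoscaled_at))  # 2024-01-05 23:30:00
```

In other words, the earliest we could have added storage by hand was 11:30 PM, well past our peak traffic window.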
No way to escape
We couldn't afford to wait six hours to increase storage, because the period between 5 and 10 PM sees the highest traffic. Given the critical nature of this service, any delay would severely impact user experience and business operations. We considered restoring a backup on a new RDS server and decommissioning the current one. However, since the last backup had been taken 3 hours earlier, this solution would mean losing 3 hours of data.
Ray of Hope (6:30 PM)
After consulting with the service owners, we concluded that losing 3 hours of data was acceptable. The nature of the service is such that once it is back online, any lost data gets recreated. So we started preparing for the point-in-time recovery of the database. We provisioned a new RDS server mirroring the current configuration, but with expanded storage, and initiated the backup restoration process. Based on previous experience, we estimated that the restoration would take approximately 20-30 minutes.
Darkness once again (7:15 PM)
Even after 45 minutes, the restoration process was not complete. We started checking why it was taking so long (there is no progress bar during a restore, so we didn't know whether it would finish in 10 minutes or take 10 more hours). We discovered that the server's CPU usage was almost 100%, which was likely slowing the restoration down. However, increasing CPU capacity wasn't feasible, as it required changing the RDS instance type, something that couldn't be done while the restoration was in progress.
Back to square one (8:00 PM)
After waiting 45 more minutes, we decided to increase the CPU. The only way to do so was to create a new server and start the backup restoration again. We kept the current server, on which the CPU was the bottleneck, and simultaneously started a restoration on a new server with 3 times the CPU, planning to use whichever finished first. On the new server the CPU was no longer a bottleneck, stabilising at 50-60%.
Still Not Done (9:00 PM)
Even after an hour, the restoration was still running on both servers. Concerned about other potential bottlenecks, we began checking the metrics of the new server. It turned out that this time IOPS was the bottleneck (IOPS measures the number of disk I/O operations that can be performed per second; requests beyond the provisioned limit are throttled). We paled at the thought of having to restart the recovery process from scratch once again!
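The metric checks we did by hand during both restore attempts boil down to comparing each resource's usage against its limit. Here is a minimal sketch of that comparison. The function name, the 90% threshold, and the IOPS figures are assumptions for illustration; only the CPU percentages reflect what we saw.

```python
def saturated_resources(usage, limits, threshold=0.9):
    """Return resources running at or above `threshold` of their limit,
    most heavily utilised first."""
    utilization = {
        name: usage[name] / limits[name] for name in limits if name in usage
    }
    hot = [name for name, frac in utilization.items() if frac >= threshold]
    return sorted(hot, key=lambda name: utilization[name], reverse=True)


# First attempt: CPU pinned near 100% (IOPS values are made up).
first_attempt = saturated_resources(
    usage={"cpu_percent": 99, "iops": 1500},
    limits={"cpu_percent": 100, "iops": 3000},
)
# Second attempt: CPU at roughly 55%, but IOPS at its provisioned limit.
second_attempt = saturated_resources(
    usage={"cpu_percent": 55, "iops": 3000},
    limits={"cpu_percent": 100, "iops": 3000},
)
print(first_attempt)   # ['cpu_percent']
print(second_attempt)  # ['iops']
```

Running a check like this across all resources early in a restore would have surfaced the second bottleneck sooner than looking at one metric at a time.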
The Last Stand:
Fortunately, AWS allows increasing the IOPS during the backup restore process. Doubling the IOPS resolved the bottleneck. Finally, by 10 PM, the backup restoration completed and we updated the service configuration to connect to the new server. By 10:15 PM the service had stabilised and was handling traffic as usual. The next day we scaled down all the resources that we had over-provisioned during the restoration.
*****
This was a 5-hour fire-fighting effort to restore the service, navigating through many unknowns. There are a few things we could have done to avoid this prolonged downtime, which are discussed in the next section.
Learnings:
- Maintain adequate free storage on database servers. Before the incident, our database server was already running at 90% storage utilisation. We could have proactively increased the storage to avoid a storage bottleneck during the incident.
- While restoring a backup, database server resources should be over-provisioned to avoid bottlenecks.
- The rate limiting implemented reactively during the incident should have been implemented proactively to manage sudden traffic spikes before they impact the servers.
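For the first learning, a simple headroom check makes the risk concrete. The sketch below is illustrative: the function names, the 75% threshold, the absolute sizes, and the growth rate are all assumptions. Only the 90% utilisation figure comes from the incident.

```python
def hours_until_full(allocated_gib, used_gib, growth_gib_per_hour):
    """Estimate the hours left before storage fills at a constant growth rate."""
    if growth_gib_per_hour <= 0:
        return float("inf")  # not growing, so it never fills
    free_gib = max(0.0, allocated_gib - used_gib)
    return free_gib / growth_gib_per_hour


def needs_proactive_increase(allocated_gib, used_gib, max_used_fraction=0.75):
    """True when utilisation exceeds the chosen threshold (an assumed 75%)."""
    return used_gib / allocated_gib > max_used_fraction


# A server at 90% utilisation, as ours was (sizes and growth are made up).
print(needs_proactive_increase(1000, 900))  # True
print(hours_until_full(1000, 900, 25))      # 4.0
```

With little headroom, a traffic surge that multiplies the write rate can shrink the time to full from days to hours, which is why the threshold should leave room for more than normal growth.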
Thanks for reading!