Learnings from a 5-hour production downtime!
As with all incidents, it happened on a Friday evening! In this article, I'll delve into the causes of a recent 5-hour downtime in one of our critical production services and the prolonged recovery that followed. The affected service, a Node.js application, manages data transactions with PostgreSQL and sustains peak loads of 250K requests per minute. Our server infrastructure is orchestrated via Kubernetes, with AWS RDS serving as the backend database.

The Beginning (5 PM)

The problem started around 5 PM, when we began receiving unusually high traffic on the servers, 3-4 times the normal load. Under this increased load, the database server started degrading, and within 15 minutes it had deteriorated to the point where it could barely process any queries.

First Response (5:20 PM)

We investigated possible causes for the traffic surge, such as a marketing campaign, but found nothing conclusive, and the traffic was still increasing. To manage the traffic and allow the database to recover