Learnings from a 5-hour production downtime!

As with all incidents, it happened on a Friday evening! In this article, I'll delve into the causes of, and the prolonged recovery from, a recent 5-hour downtime in one of our critical production services. The affected service, a Node.js application, manages data transactions with PostgreSQL and sustains peak loads of 250K requests per minute. Our server infrastructure is orchestrated via Kubernetes, with AWS RDS serving as the backend database.

The Beginning (5 PM): The problem started around 5 PM, when the servers began receiving unusually high traffic, 3-4 times the normal volume. Under this load the database server started degrading, and within 15 minutes it was barely able to process any queries.

First Response (5:20 PM): We investigated possible causes for the traffic surge, such as a marketing campaign, but found nothing conclusive, and the traffic was still increasing. To manage the traffic and allow the database to reco

Index-Only Scan in PostgreSQL is not always Index “Only”!

An index-only scan is supposed to return query results just by accessing the index, but in PostgreSQL an index-only scan can end up accessing table rows (the heap) as well, which can make the query take more time (or other resources) than anticipated. In this blog, I will discuss how we discovered this behavior of PostgreSQL and how we solved it for our use case.

The Problem: We optimized a high-IO read query some time back (detailed blog). The optimization was to create appropriate indexes so that the query could be resolved with an index-only scan, with no need to read table rows, thereby reducing the IOPS (input/output operations per second) consumed by the query. But a few weeks down the line, we again started observing a gradual increase in the IOPS consumed by the query. On checking the query plan, we found that it was still using an index-only scan, but the query was also doing a lot of disk access and was reading the heap as well. It was not intuit
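
One way to spot this kind of heap access is the "Heap Fetches" counter that PostgreSQL prints under an Index Only Scan node in EXPLAIN (ANALYZE) output. The snippet below is a minimal sketch, not the tooling we actually used: the helper name `total_heap_fetches`, the table and index names, and the sample plan text are all made up for illustration.

```python
import re
from typing import Optional

# Matches the "Heap Fetches: N" line that EXPLAIN (ANALYZE) prints
# under an Index Only Scan node in text-format plans.
HEAP_FETCHES_RE = re.compile(r"Heap Fetches:\s*(\d+)")


def total_heap_fetches(plan_text: str) -> Optional[int]:
    """Sum every Heap Fetches counter in the plan text.

    Returns None when the plan has no such line, for example when
    the query did not use an index-only scan at all.
    """
    counts = [int(n) for n in HEAP_FETCHES_RE.findall(plan_text)]
    return sum(counts) if counts else None


# Hypothetical plan text, invented for this example (not from our system).
SAMPLE_PLAN = """\
Index Only Scan using orders_user_id_idx on orders  (cost=0.43..8.45 rows=1 width=8) (actual time=0.030..0.900 rows=120 loops=1)
  Index Cond: (user_id = 42)
  Heap Fetches: 118
"""

print(total_heap_fetches(SAMPLE_PLAN))
```

A count close to the number of returned rows suggests the scan is visiting the table for most rows, so it is not index "only" in practice.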