RCA & COE for Latency Spike and Timeouts on 30th Dec’25
Incident Summary
| Field | Value |
|---|---|
| Date | 30th December 2025 |
| Duration | ~2 minutes (17:04 to 17:06) |
| Impact | Increased latency and 504s on Juspay Order Create, Order Status and Transaction APIs |
| Detection | Internal monitoring alerts |
| Cause of Incident | Multiple database instances restarted at the same time |
| Severity Level | High |
On 30 December 2025, Juspay's Order Create, Order Status and Transaction APIs experienced a spike in latency and 504 responses for approximately two minutes. The issue affected all merchants and was caused by a transient disruption in one of our Aurora database clusters.
Root Cause Analysis
The logs showed that all instances of our Aurora database cluster restarted, and that our database queries had elevated latency for around two minutes.
We do not rely on the writer instance being available to serve merchant traffic, and when one or more reader instances fail, our application retries queries against the remaining healthy readers. In this incident, however, all the reader instances restarted at the same time, leaving no healthy instance to retry against, which caused the increase in latency in our APIs.
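The reader-failover behaviour described above can be sketched as follows; the endpoint names and the `execute` callable are hypothetical, not Juspay's actual code:

```python
# Hypothetical sketch of retrying a query across reader replicas.
# Endpoint names and the execute() helper are illustrative only.

class AllReadersDown(Exception):
    """Raised when no healthy reader instance could serve the query."""

def query_with_reader_failover(readers, sql, execute):
    """Try each reader in turn; return the first successful result.

    readers  - list of reader endpoints, e.g. ["reader-1", "reader-2"]
    execute  - callable (endpoint, sql) -> rows, raising on failure
    """
    last_error = None
    for endpoint in readers:
        try:
            return execute(endpoint, sql)
        except Exception as err:  # a real system would catch specific DB errors
            last_error = err      # this reader is unhealthy; try the next one
    # During the incident every reader restarted, so this path was hit.
    raise AllReadersDown(f"all readers failed: {last_error}")
```

When every reader raises, as happened here, the failover loop is exhausted and the request fails upstream, which is what surfaced as 504s.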
We raised a ticket with AWS to understand the root cause behind the restarts of the instances in our database cluster. AWS responded that the restarts happened because the VDL (Volume Durable LSN) was stuck, which can occur due to a spike in the write workload. However, our write workload is constant, and there was no change in traffic or query pattern before or during the issue period. We are still working with AWS to investigate the root cause of the stuck VDL.
Why did the incident occur?
All the instances (readers and writer) in one of our Aurora database clusters restarted at the same time, which caused latency spikes in our API.
Why do we depend on the writer/reader database instances to serve traffic?
Our writes go through a KV cache layer, and hence we do not need the writer instance to be available to serve traffic. However, we still need to read some data from our DB if it's not available in our KV layer.
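The read path described above can be sketched as a read-through cache: serve from the KV layer when possible and fall back to the database only on a cache miss. The function and parameter names are illustrative assumptions, not Juspay's actual code:

```python
# Hypothetical sketch of the read path: prefer the KV cache, fall back
# to a database reader only on a cache miss. All names are illustrative.

def read_order(order_id, kv_get, db_get, kv_put):
    """Return order data, preferring the KV cache over the database."""
    cached = kv_get(order_id)
    if cached is not None:
        return cached            # hot path: no database query needed
    row = db_get(order_id)       # cache miss: query a reader instance
    if row is not None:
        kv_put(order_id, row)    # populate the cache for future reads
    return row
```

Only the cache-miss branch touches the database, which is why reader availability still matters even though writes are absorbed by the KV layer.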
Why did the instances restart?
According to the AWS response, the VDL was stuck, which triggered restarts in the instances. This is default behavior in AWS Aurora to ensure consistency in the cluster.
Why was the VDL stuck?
AWS attributed this to a high write workload; however, there was no change in our traffic or query volume before or during the incident.
Resolution & Corrective Actions
First, to rule out a high write workload as a contributing cause, we have decreased the rate of writes to our database as an immediate fix. Our KV cache layer makes this possible by absorbing write workload spikes.
Second, we will further reduce write pressure on the database by batching updates.
Third, since most of our traffic can be served with recent data and our architecture already minimizes database queries through the KV cache layer, we will make changes to ensure that we can continue serving traffic even if our database cluster is completely unavailable for short durations.
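The batching approach in the second corrective action can be sketched as follows; the class and parameter names are hypothetical, intended only to show the buffer-and-flush pattern:

```python
# Hypothetical sketch of batching database updates: individual writes
# accumulate in a buffer and are flushed together, so the database sees
# one round trip per batch instead of one per update.

class BatchedWriter:
    def __init__(self, flush_fn, max_batch=100):
        self.flush_fn = flush_fn      # callable taking a list of updates
        self.max_batch = max_batch
        self.buffer = []

    def write(self, update):
        self.buffer.append(update)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one bulk write to the database
            self.buffer = []
```

A real implementation would also flush on a timer so updates are not delayed indefinitely under low traffic, and would handle flush failures with retries.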
| Category | Corrective Action | ETA |
|---|---|---|
| Immediate | Decrease the rate of writes to our database | Done |
| Medium Term | Optimize writes to the database by batching updates | Jan 30, 2026 |
| Medium Term | Serve traffic for recent/new orders even when the database cluster is completely unavailable for short durations | Feb 13, 2026 |

