Service degradation due to API timeout errors

Incident Report for Fluid Attacks

Postmortem

Impact

At least one user experienced difficulties reviewing vulnerabilities because the platform failed to load correctly. The issue started on 2025-11-28 at 15:23 (UTC-5) and was proactively discovered 21 minutes later (time to detect, TTD) by a staff member who reported through our help desk [1] that the vulnerabilities view remained stuck in a loading state. A subsequent customer report confirmed that the problem was affecting multiple users. The problem was resolved in 10 minutes (time to fix, TTF), resulting in a total window of exposure (WOE) of 31 minutes.

Cause

A large batch of approximately 3,900 automated tasks was executed against the platform. Each task performed several operations that required intensive use of the system, and up to 200 of them were running simultaneously. This created a sudden and unusually high amount of activity that the platform was not able to handle quickly enough.
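One way to prevent this kind of burst is to cap how many tasks run against the platform at once. Below is a minimal sketch using an `asyncio` semaphore; the cap of 20, the `run_task` placeholder, and the batch size are all illustrative, not taken from the actual pipeline.

```python
import asyncio

MAX_CONCURRENCY = 20  # hypothetical cap, far below the ~200 seen in the incident

async def run_task(task_id: int) -> int:
    # Placeholder for one automated task; a real task would call the platform API.
    await asyncio.sleep(0.001)
    return task_id

async def run_batch(task_ids):
    # The semaphore bounds how many tasks hit the platform simultaneously;
    # the rest wait their turn instead of piling on during a spike.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def limited(tid):
        async with sem:
            return await run_task(tid)

    return await asyncio.gather(*(limited(t) for t in task_ids))

results = asyncio.run(run_batch(range(100)))
print(len(results))  # 100
```

With a cap like this, a batch of thousands of tasks trickles through at a steady rate the platform can absorb, rather than arriving as a single surge.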

Because the platform needs several minutes to increase its capacity when activity spikes, it continued receiving more and more requests before it was ready to support them. This led to delays, timeouts, and error responses for about half an hour, affecting both the automated tasks and regular users who were trying to interact with the platform during that period.
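When the platform is still scaling up, clients that retry immediately only add to the backlog. A common mitigation is exponential backoff with jitter on timeouts, sketched below; the function names, delays, and the simulated flaky call are all hypothetical, for illustration only.

```python
import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.5):
    # Retry a call that may time out, doubling the wait between attempts
    # and adding jitter so many clients do not retry in lockstep while
    # the platform is still adding capacity.
    for attempt in range(max_attempts):
        try:
            return request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Simulated call that times out twice before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return "ok"

result = call_with_backoff(flaky, base_delay=0.01)
print(result)  # prints "ok" after two retries
```

Backoff does not fix slow autoscaling, but it spreads retries out over the minutes the platform needs to add capacity, reducing the error rate both clients and regular users see.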

The incident was not caused by a recent update, but rather by a combination of a huge volume of simultaneous work and the current limitations of the platform’s ability to adapt to sudden increases in demand.

Solution

Stopping the ongoing tasks was not possible because, by the time the cause was clearly identified, most of them had already been submitted and were nearly complete. The situation was monitored closely until the workload naturally decreased. Once the activity level went down, the platform gradually recovered and returned to normal operation.

Conclusion

To prevent similar incidents, we will avoid generating large, highly concurrent workloads against the platform until it is better prepared to support them. We will also adjust internal workflows so the platform receives fewer unnecessary requests, and continue improving its ability to scale quickly during periods of high demand, ensuring greater stability and a smoother experience for all users.

Posted Dec 01, 2025 - 18:01 GMT-05:00

Resolved

The incident has been resolved, and all API services are now operating as expected.
Posted Nov 28, 2025 - 16:00 GMT-05:00

Identified

It has been identified that a timeout error in one of the core API services caused requests to exceed the expected response time. This issue led to noticeable degradation across the platform.
Posted Nov 28, 2025 - 15:45 GMT-05:00
This incident affected: Platform.