Impact
An unknown number of internal users found the Fluid Attacks Platform unavailable. The issue started on 25-10-21 at 12:51 (UTC-5) and was immediately detected by one of our monitoring tools, which flagged the platform outage. The problem was resolved in 14.4 minutes (TTF), resulting in a total window of exposure of 14.4 minutes (WOE).
Cause
A large volume of simultaneous internal processes triggered an unexpected overload in one of our core services, which temporarily degraded its performance and caused a platform outage. The root cause was that the number of concurrent requests exceeded the system's ability to scale up quickly enough.
Solution
We optimized the internal scripts responsible for launching these processes to limit the number of concurrent requests and reduce system load. These changes have already been applied, and further improvements are planned to prevent similar issues in the future.
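As a rough illustration of the kind of change described above, the sketch below bounds how many jobs are launched at once by routing them through a fixed-size worker pool. All names (launch_job, launch_all, MAX_CONCURRENT) and the concurrency cap are hypothetical assumptions for this example and are not taken from the actual internal scripts.

```python
# Minimal sketch: cap concurrent job launches with a bounded worker pool.
# Names and limits here are illustrative assumptions, not the real scripts.
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 10  # assumed cap; tuned to what the downstream service can absorb


def launch_job(job_id: str) -> str:
    """Placeholder for the call that triggers one internal process."""
    # In the real scripts this would call the core service that was overloaded.
    return f"launched {job_id}"


def launch_all(job_ids: list[str]) -> list[str]:
    # The pool never runs more than MAX_CONCURRENT jobs at a time, so the
    # downstream service sees a bounded request rate instead of an unbounded burst.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        return list(pool.map(launch_job, job_ids))


if __name__ == "__main__":
    print(launch_all([f"job-{i}" for i in range(100)]))
```

The design choice is simply to throttle at the client side: even if many processes are queued, the service only ever receives a fixed number of simultaneous requests, giving it time to scale.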
Conclusion
This incident revealed a scalability limitation in one of our core services. We are working on improvements to make the system more resilient and capable of handling high-demand scenarios without affecting availability.

INFRASTRUCTURE_ERROR < PERFORMANCE_DEGRADATION