Our users experienced degradation and, in some cases, unavailability of our Platform, API, and related services. The issue started on 2025-01-09 at 17:31 (UTC-5) and was proactively discovered by our engineering team 13.6 hours later (TTD) while reviewing failed synthetic monitoring alerts, error logs, and performance metrics [1]. The problem was resolved in 1.2 hours (TTF), for a total window of exposure of 14.8 hours (WOE) [2].
Cause
As described in the AWS documentation, the volume of requests to SQS drove enough DNS lookups to hit the DNS rate limit. Once the resolver was throttled, DNS resolution for other services, such as DynamoDB and S3, began to fail as well. The issue was traced to a change deployed the day before that instrumented read events for the Root entity. This entity is read frequently, and some clients have a large number of roots, so the change generated a large additional volume of SQS requests, and with them DNS lookups [3].
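To make the failure mode concrete, the sketch below shows what per-read instrumentation of this kind can look like; the queue URL, function, and field names are hypothetical, not our actual code. Every Root read publishes a message to SQS, so read traffic translates directly into SQS calls and the DNS lookups behind them.

```python
# Hypothetical sketch of the instrumentation change (names such as
# ROOT_READ_EVENTS_QUEUE_URL and publish_read_event are illustrative).
import json
import time

import boto3

ROOT_READ_EVENTS_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/root-read-events"

sqs = boto3.client("sqs")  # resolving the regional SQS endpoint requires DNS


def publish_read_event(root_id: str, client_id: str) -> None:
    """Emit one event per Root read; this sits on a hot path."""
    sqs.send_message(
        QueueUrl=ROOT_READ_EVENTS_QUEUE_URL,
        MessageBody=json.dumps({
            "entity": "Root",
            "action": "read",
            "root_id": root_id,
            "client_id": client_id,
            "ts": time.time(),
        }),
    )
```

Because the function runs on every read, SQS request volume scales with read traffic, and without a DNS cache each new connection adds another query against the resolver.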
Resolution
The instrumentation of read events for the Root entity was removed [4].
Our team also enabled DNS caching for the entire cluster, significantly reducing the load on the upstream DNS resolver and making it much less likely to become overloaded again.
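The mitigation itself is a cluster-level DNS cache rather than application code, but the principle can be illustrated with a minimal in-process sketch (the TTL value and helper names are assumptions): each hostname is resolved against the upstream resolver at most once per TTL window, and subsequent lookups are served from the cache.

```python
# Minimal sketch of TTL-based DNS caching; the real fix operates at the
# cluster level, this only demonstrates the idea of reducing resolver load.
import socket
import time
from functools import lru_cache

CACHE_TTL_SECONDS = 30  # assumed TTL; a real cache honors record TTLs


def _ttl_bucket() -> int:
    # Changes every CACHE_TTL_SECONDS, which indirectly expires cache entries.
    return int(time.time() // CACHE_TTL_SECONDS)


@lru_cache(maxsize=1024)
def _cached_getaddrinfo(host: str, port: int, bucket: int):
    # Only reached on a cache miss, i.e. once per (host, port) per TTL window.
    return socket.getaddrinfo(host, port)


def resolve(host: str, port: int = 443):
    """Return cached address info, hitting the upstream resolver at most once per TTL."""
    return _cached_getaddrinfo(host, port, _ttl_bucket())


if __name__ == "__main__":
    # Two back-to-back calls: only the first reaches the upstream DNS server.
    print(resolve("sqs.us-east-1.amazonaws.com")[0][4])
    print(resolve("sqs.us-east-1.amazonaws.com")[0][4])
```

With caching in place, repeated lookups for the same AWS endpoints are absorbed locally instead of counting against the resolver's rate limit.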