PLATFORM has failed
Incident Report for Fluid Attacks
Postmortem

Impact

Our users experienced degraded performance and, in some cases, unavailability of our Platform, API, and related services. The issue started on 2025-01-09 at 17:31 (UTC-5) and was discovered reactively 14 hours later (TTD) by our engineering team while reviewing failed synthetic monitoring alerts, error logs, and performance metrics [1]. The problem was fixed within 1 hour (TTF), resulting in a total impact of 15 hours (TTR) [2].
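For clarity on how these metrics relate: the total impact is the sum of the detection and fix times, i.e. TTR = TTD + TTF. Below is a minimal sketch of that arithmetic using the report's rounded figures; the exact detection and fix timestamps are approximations derived from this report.

    from datetime import datetime, timedelta, timezone

    UTC_MINUS_5 = timezone(timedelta(hours=-5))

    started = datetime(2025, 1, 9, 17, 31, tzinfo=UTC_MINUS_5)  # failure begins
    ttd = timedelta(hours=14)  # time to detect, per this report
    ttf = timedelta(hours=1)   # time to fix, per this report

    ttr = ttd + ttf            # total impact: 15 hours (TTR)
    print(ttr)                 # 15:00:00
    print(started + ttr)       # approximate end of impact, ~2025-01-10 08:31 UTC-5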

Cause

A high volume of requests to SQS generated enough DNS queries to hit the per-network-interface rate limit of the Amazon VPC DNS resolver, as described in the AWS documentation. Once the resolver began dropping queries, name resolution failed for other services as well, so requests to DynamoDB, S3, and others began to fail. The issue was traced to a change made the day before [3], which instrumented read events for the Root entity. This entity is read frequently, and some clients have a large number of roots.
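As an illustration only, the pattern below shows how per-event instrumentation can multiply DNS traffic; all names (queue URL, region, function names) are hypothetical, since the actual instrumentation code is not part of this report. Creating a new client per event creates a new connection pool and, typically, a fresh DNS resolution of the SQS endpoint, while a shared client reuses pooled connections.

    import boto3

    _QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/audit-events"  # hypothetical

    def emit_read_event(root_id: str) -> None:
        # Anti-pattern: a new client per event means a new connection pool,
        # and usually a fresh DNS lookup of the SQS endpoint on every Root read.
        sqs = boto3.client("sqs", region_name="us-east-1")
        sqs.send_message(
            QueueUrl=_QUEUE_URL,
            MessageBody=f'{{"entity": "root", "action": "read", "id": "{root_id}"}}',
        )

    # Safer variant: one shared client reuses pooled connections, so bursts of
    # Root reads do not turn into bursts of DNS queries.
    _SQS = boto3.client("sqs", region_name="us-east-1")

    def emit_read_event_shared(root_id: str) -> None:
        _SQS.send_message(
            QueueUrl=_QUEUE_URL,
            MessageBody=f'{{"entity": "root", "action": "read", "id": "{root_id}"}}',
        )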

Solution

The instrumentation of read events for the Root entity was removed [4].

Conclusion

Our team is continuing to investigate our current DNS configuration and exploring alternatives to avoid reaching the rate limit.
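One class of alternative is client-side DNS caching. The sketch below is offered purely as an illustration under stated assumptions, not as the measure we have settled on: it wraps Python's resolver in a short-lived in-process cache so repeated lookups of the same AWS endpoint are served locally instead of hitting the VPC resolver each time (the 30-second TTL is an arbitrary value for the example).

    import socket
    import time

    _TTL_SECONDS = 30.0  # arbitrary for this sketch
    _cache: dict = {}
    _real_getaddrinfo = socket.getaddrinfo

    def _cached_getaddrinfo(host, port, *args, **kwargs):
        # Serve repeated lookups of the same endpoint from memory so that
        # bursts of AWS API calls do not become bursts of DNS queries.
        key = (host, port, args, tuple(sorted(kwargs.items())))
        now = time.monotonic()
        entry = _cache.get(key)
        if entry is not None and now - entry[0] < _TTL_SECONDS:
            return entry[1]
        result = _real_getaddrinfo(host, port, *args, **kwargs)
        _cache[key] = (now, result)
        return result

    socket.getaddrinfo = _cached_getaddrinfo  # process-wide patch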

Posted Jan 21, 2025 - 07:03 GMT-05:00

Resolved
The platform has remained stable for several hours, and we are now closing this incident. A detailed postmortem report will be published to provide insights into the incident and the measures taken to address it.
Posted Jan 10, 2025 - 11:44 GMT-05:00
Monitoring
Since this morning, we have been experiencing delays and downtime in our web application. The issue has been traced to a problem with DNS resolution within our Kubernetes cluster. Our team is actively working to resolve it and is implementing measures to prevent future occurrences.
Posted Jan 10, 2025 - 11:05 GMT-05:00
Identified
An issue was found in PLATFORM.
Details: https://availability.fluidattacks.com
Posted Jan 10, 2025 - 07:15 GMT-05:00
This incident affected: Platform.