Our users experienced degradation and, in some cases, unavailability of our Platform, API, and related services. The issue started on 2025-01-09 at 17:31 (UTC-5) and was proactively discovered by our engineering team 13.6 hours later (TTD) while reviewing failed synthetic monitoring alerts, error logs, and performance metrics [1]. The problem was resolved in 1.2 hours (TTF), for a total window of exposure of 14.8 hours (WOE) [2].
Cause
As described in the AWS documentation, the volume of requests to SQS drove enough DNS lookups to hit the DNS rate limit. Once the resolver was throttled, DNS resolution for other services, such as DynamoDB and S3, began to fail as well. The issue was traced to a change deployed the day before that instrumented read events for the Root entity. This entity is read frequently, and some clients have a large number of roots, so the change generated a large additional volume of SQS requests, and with them DNS lookups [3].
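To make the failure mode concrete, the sketch below shows what per-read instrumentation of this kind can look like; the queue URL, function, and field names are hypothetical, not our actual code. Every Root read publishes a message to SQS, so read traffic translates directly into SQS calls and the DNS lookups behind them.

```python
# Hypothetical sketch of the instrumentation change (names such as
# ROOT_READ_EVENTS_QUEUE_URL and publish_read_event are illustrative).
import json
import time

import boto3

ROOT_READ_EVENTS_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/root-read-events"

sqs = boto3.client("sqs")  # resolving the regional SQS endpoint requires DNS


def publish_read_event(root_id: str, client_id: str) -> None:
    """Emit one event per Root read; this sits on a hot path."""
    sqs.send_message(
        QueueUrl=ROOT_READ_EVENTS_QUEUE_URL,
        MessageBody=json.dumps({
            "entity": "Root",
            "action": "read",
            "root_id": root_id,
            "client_id": client_id,
            "ts": time.time(),
        }),
    )
```

Because the function runs on every read, SQS request volume scales with read traffic, and without a DNS cache each new connection adds another query against the resolver.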
Resolution
The instrumentation of read events for the Root entity was removed [4].
Our team also enabled DNS caching for the entire cluster, significantly reducing the load on the upstream DNS resolver and making it much less likely to become overloaded again.
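The mitigation itself is a cluster-level DNS cache rather than application code, but the principle can be illustrated with a minimal in-process sketch (the TTL value and helper names are assumptions): each hostname is resolved against the upstream resolver at most once per TTL window, and subsequent lookups are served from the cache.

```python
# Minimal sketch of TTL-based DNS caching; the real fix operates at the
# cluster level, this only demonstrates the idea of reducing resolver load.
import socket
import time
from functools import lru_cache

CACHE_TTL_SECONDS = 30  # assumed TTL; a real cache honors record TTLs


def _ttl_bucket() -> int:
    # Changes every CACHE_TTL_SECONDS, which indirectly expires cache entries.
    return int(time.time() // CACHE_TTL_SECONDS)


@lru_cache(maxsize=1024)
def _cached_getaddrinfo(host: str, port: int, bucket: int):
    # Only reached on a cache miss, i.e. once per (host, port) per TTL window.
    return socket.getaddrinfo(host, port)


def resolve(host: str, port: int = 443):
    """Return cached address info, hitting the upstream resolver at most once per TTL."""
    return _cached_getaddrinfo(host, port, _ttl_bucket())


if __name__ == "__main__":
    # Two back-to-back calls: only the first reaches the upstream DNS server.
    print(resolve("sqs.us-east-1.amazonaws.com")[0][4])
    print(resolve("sqs.us-east-1.amazonaws.com")[0][4])
```

With caching in place, repeated lookups for the same AWS endpoints are absorbed locally instead of counting against the resolver's rate limit.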