Our users experienced degraded performance and, in some cases, unavailability of our Platform, API, and related services. The issue started on 25-01-09 at 17:31 (UTC-5) and was discovered reactively 14 hours later (TTD) by our engineering team while reviewing failed synthetic monitoring alerts, error logs, and performance metrics [1]. The problem was resolved within 1 hour of detection (TTF), for a total impact of 15 hours (TTR) [2].
Cause
A high volume of requests to SQS generated enough DNS queries to reach the DNS rate limit described in the AWS documentation. Consequently, requests to other services, such as DynamoDB and S3, also began to fail. The issue was traced to a change made the day before [3] that instrumented read events for the Root entity. This entity is read frequently, and some clients have a large number of roots.
To resolve the incident, the instrumentation of read events for the Root entity was removed [4].
Our team continues to investigate our current DNS configuration and is exploring alternatives to avoid reaching the rate limit.
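One alternative commonly considered for this kind of failure is caching DNS answers in the client process, so that many requests to the same endpoint share a single upstream lookup. The following is a minimal sketch of that idea, not the approach the team has chosen. The class name `TTLDnsCache`, the 30-second TTL, and the host name in the usage note are hypothetical, and a fixed TTL ignores the TTL of the actual DNS record.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple

Resolver = Callable[[str], List[str]]


def system_resolver(host: str) -> List[str]:
    """Resolve a host name to a sorted list of IP addresses via the OS resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


class TTLDnsCache:
    """Hypothetical in-process DNS cache with a fixed time-to-live per entry."""

    def __init__(
        self,
        resolver: Resolver = system_resolver,
        ttl_seconds: float = 30.0,  # assumed value, not taken from the incident
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        # host -> (expiry time, cached addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.upstream_lookups = 0  # counts lookups that reached the resolver

    def resolve(self, host: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and cached[0] > now:
            # Entry is still fresh: answer locally, no DNS query is sent.
            return cached[1]
        # Entry is missing or expired: query the resolver once and store the answer.
        self.upstream_lookups += 1
        addresses = self._resolver(host)
        self._entries[host] = (now + self._ttl, addresses)
        return addresses
```

With a cache like this, 1,000 calls to `cache.resolve("sqs.example.internal")` inside one TTL window would send one query to the resolver instead of 1,000. The trade-off is that a changed IP address is picked up only after the cached entry expires.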