Impact
An unknown number of users found the Fluid Attacks Platform, the API, and the Agent unavailable. The issue started on 2024-03-05 at 19:47 (UTC-5) and was proactively discovered 6 minutes later (TTD) by one of our monitoring tools, which reported a total outage of these components. The problem was resolved in 14.4 minutes (TTF), resulting in a total window of exposure (WOE) of 20.4 minutes.
Cause
During infrastructure preparation for IPv6 adoption, a routing change was made to our cloud environment, documented in this issue [1] and implemented through this MR [2]. However, AWS did not reflect the change immediately; during this synchronization delay, the network subnets lost internet access, making the affected components unavailable.
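For illustration only, below is a minimal sketch of the kind of routing change involved, assuming it added an IPv6 default route to a subnet's route table. The resource IDs and the use of boto3 are placeholders for this write-up and do not reflect the actual MR, which is linked above.

```python
# Illustrative sketch of adding an IPv6 default route to a route table so a
# subnet keeps internet access. All resource IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",  # placeholder route table
    DestinationIpv6CidrBlock="::/0",       # IPv6 default route
    GatewayId="igw-0123456789abcdef0",     # placeholder internet gateway
)
```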
Solution
The team redeployed the change, forcing it to synchronize in the cloud and allowing the instances to regain internet access.
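As a hedged sketch of how a redeploy like this could be verified, the snippet below polls the route table until its default routes are reported as active, instead of assuming AWS applies the change instantaneously. The route table ID, polling interval, and timeout are arbitrary placeholders, not part of our actual pipeline.

```python
# Illustrative check that a routing change has actually been reflected by AWS
# before declaring the deployment done. Resource IDs are placeholders.
import time

import boto3

ec2 = boto3.client("ec2")
ROUTE_TABLE_ID = "rtb-0123456789abcdef0"  # placeholder


def default_routes_are_active(route_table_id: str) -> bool:
    """Return True if the IPv4/IPv6 default routes exist and are 'active'."""
    response = ec2.describe_route_tables(RouteTableIds=[route_table_id])
    routes = response["RouteTables"][0]["Routes"]
    defaults = [
        route
        for route in routes
        if route.get("DestinationCidrBlock") == "0.0.0.0/0"
        or route.get("DestinationIpv6CidrBlock") == "::/0"
    ]
    return bool(defaults) and all(route["State"] == "active" for route in defaults)


# Poll for a bounded time instead of assuming the change propagates immediately.
for _ in range(30):
    if default_routes_are_active(ROUTE_TABLE_ID):
        break
    time.sleep(10)
else:
    raise RuntimeError(f"Default routes on {ROUTE_TABLE_ID} never became active")
```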
Conclusion
Given that the issue stemmed from an error on AWS's side, it was impossible to test for it beforehand. Moving forward, however, we will account for potential propagation latencies when applying routing changes within the AWS network. While it may be difficult to completely prevent outages of this kind, the priority of addressing such incidents is relatively low, considering these components are rarely modified.

INFRASTRUCTURE_ERROR < IMPOSSIBLE_TO_TEST