Impact
An unknown number of users found the Fluid Attacks Platform, the API, and the Agent unavailable. The issue started on 2024-03-05 at 19:47 (UTC-5) and was proactively discovered 6 minutes later (TTD) by one of our monitoring tools, which reported a total outage of these components. The problem was resolved in 14.4 minutes (TTF), resulting in a total window of exposure (WOE) of 20.4 minutes.
Cause
During infrastructure preparation for IPv6 adoption, a routing change was made to our cloud environment, documented in this issue [1] and implemented through this MR [2]. However, AWS did not reflect the change immediately; during this synchronization delay, the network subnets lost internet access, making the affected components unavailable.
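For illustration only, below is a minimal sketch of the kind of routing change involved, assuming it added an IPv6 default route to a subnet's route table. The resource IDs and the use of boto3 are placeholders for this write-up and do not reflect the actual MR, which is linked above.

```python
# Illustrative sketch of adding an IPv6 default route to a route table so a
# subnet keeps internet access. All resource IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",  # placeholder route table
    DestinationIpv6CidrBlock="::/0",       # IPv6 default route
    GatewayId="igw-0123456789abcdef0",     # placeholder internet gateway
)
```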
Solution
The team redeployed the change, forcing it to synchronize in the cloud and allowing the instances to regain internet access.
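As a hedged sketch of how a redeploy like this could be verified, the snippet below polls the route table until its default routes are reported as active, instead of assuming AWS applies the change instantaneously. The route table ID, polling interval, and timeout are arbitrary placeholders, not part of our actual pipeline.

```python
# Illustrative check that a routing change has actually been reflected by AWS
# before declaring the deployment done. Resource IDs are placeholders.
import time

import boto3

ec2 = boto3.client("ec2")
ROUTE_TABLE_ID = "rtb-0123456789abcdef0"  # placeholder


def default_routes_are_active(route_table_id: str) -> bool:
    """Return True if the IPv4/IPv6 default routes exist and are 'active'."""
    response = ec2.describe_route_tables(RouteTableIds=[route_table_id])
    routes = response["RouteTables"][0]["Routes"]
    defaults = [
        route
        for route in routes
        if route.get("DestinationCidrBlock") == "0.0.0.0/0"
        or route.get("DestinationIpv6CidrBlock") == "::/0"
    ]
    return bool(defaults) and all(route["State"] == "active" for route in defaults)


# Poll for a bounded time instead of assuming the change propagates immediately.
for _ in range(30):
    if default_routes_are_active(ROUTE_TABLE_ID):
        break
    time.sleep(10)
else:
    raise RuntimeError(f"Default routes on {ROUTE_TABLE_ID} never became active")
```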
Conclusion
Given that the issue stemmed from an error on AWS's side, it was impossible to test for it beforehand. Moving forward, however, we will account for potential propagation latencies when applying routing changes within the AWS network. While it may be difficult to completely prevent outages of this kind, the priority of addressing such incidents is relatively low, considering these components are rarely modified.

INFRASTRUCTURE_ERROR < IMPOSSIBLE_TO_TEST