Service availability issues

Incident Report for Fluid Attacks

Postmortem

Impact

At least one user experienced a brief service interruption. The issue started on 2025-12-04 at 14:20 (UTC-5) and was proactively discovered 1 minute later (TTD) by a staff member who noticed that the platform had stopped responding during an update. The problem was resolved in 3 minutes (TTF), resulting in a total window of exposure (WOE) of 4 minutes [1].

Cause

While we were improving the machines that run the service, the system replaced the old ones with new ones. However, it turned off all the old machines before the new ones were fully ready, leaving the platform with no way to handle incoming requests for a short period of time [2].

Solution

A safeguard was added so the system cannot turn off any machine until at least one new one is fully ready. This ensures the service always stays available during updates [3].
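The safeguard described above can be illustrated with a minimal sketch. This is not the actual platform code: the `Machine` type and the `can_turn_off` and `replace_machines` functions are hypothetical names chosen for illustration, and the sketch assumes that each machine exposes a simple readiness flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Machine:
    # Hypothetical model of a machine serving the platform.
    name: str
    ready: bool = False    # True once the machine can handle requests
    running: bool = True   # False once the machine has been turned off


def can_turn_off(new_machines: List[Machine]) -> bool:
    """Safeguard: allow shutdowns only if at least one new machine is ready."""
    return any(m.running and m.ready for m in new_machines)


def replace_machines(old_machines: List[Machine],
                     new_machines: List[Machine]) -> List[str]:
    """Turn off old machines only when the safeguard allows it.

    Returns the names of the machines that were turned off.
    """
    turned_off = []
    for old in old_machines:
        if can_turn_off(new_machines):
            old.running = False
            turned_off.append(old.name)
    return turned_off


if __name__ == "__main__":
    old = [Machine("old-1", ready=True), Machine("old-2", ready=True)]
    new = [Machine("new-1"), Machine("new-2")]

    # No new machine is ready yet, so nothing is turned off.
    print(replace_machines(old, new))

    # Once one new machine is ready, the old ones can be removed.
    new[0].ready = True
    print(replace_machines(old, new))
```

In this sketch, the incident corresponds to calling the shutdown step without the `can_turn_off` check, which would leave no running machine that is ready to serve requests.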

Conclusion

The applied fix ensures that new machines are ready before old ones are removed. We are also working on performance improvements and will continue reviewing how the system behaves in these situations to determine whether more protections are needed. INFRASTRUCTURE_ERROR < PERFORMANCE_DEGRADATION

Posted Dec 12, 2025 - 15:45 GMT-05:00

Resolved

The incident has been resolved, and the platform is now operating normally.

Posted Dec 04, 2025 - 14:24 GMT-05:00

Identified

We have identified that the Platform is experiencing a temporary accessibility degradation.

Posted Dec 04, 2025 - 12:21 GMT-05:00

This incident affected: Platform.