Impact
At least one user experienced a brief service interruption. The issue started on 2025-12-04 at 14:20 (UTC-5) and was proactively discovered 1 minute later (TTD) by a staff member who noticed that the platform had stopped responding during an update. The problem was resolved 3 minutes after detection (TTF), resulting in a total window of exposure (WOE) of 4 minutes [1].
Cause
While we were upgrading the machines that run the service, the system began replacing the old machines with new ones. However, it turned off all the old machines before the new ones were fully ready, leaving the platform unable to handle incoming requests for a short period [2].
Solution
A safeguard was added so the system cannot turn off any old machine until at least one new machine is fully ready. This keeps at least one machine available to handle requests throughout an update [3].
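The safeguard described above can be sketched roughly as follows. This is only an illustration of the rule, not the actual configuration used on the platform: the report does not name the system that manages the machines, so the `Machine` type, the `generation` labels, and the `machines_safe_to_stop` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    generation: str  # hypothetical label: "old" or "new"
    ready: bool      # True once the machine can handle requests


def machines_safe_to_stop(fleet, min_ready_new=1):
    """Return the old machines that may be turned off right now.

    Assumed rule (taken from the safeguard in this report): no old
    machine is turned off until at least `min_ready_new` new machines
    are fully ready.
    """
    ready_new = [m for m in fleet if m.generation == "new" and m.ready]
    if len(ready_new) < min_ready_new:
        return []  # safeguard: keep every old machine running
    return [m for m in fleet if m.generation == "old"]


fleet = [
    Machine("old-1", "old", ready=True),
    Machine("old-2", "old", ready=True),
    Machine("new-1", "new", ready=False),  # still starting up
]

# New machine not ready yet: nothing may be turned off.
before = machines_safe_to_stop(fleet)

# Once the new machine reports ready, the old ones can be removed.
fleet[2].ready = True
after = [m.name for m in machines_safe_to_stop(fleet)]
```

In this sketch, the incident corresponds to skipping the readiness check and stopping `old-1` and `old-2` while `new-1` was still starting up, which leaves no machine able to serve requests.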
Conclusion
The applied fix ensures that new machines are ready before old ones are removed. We are also working on performance improvements and will continue reviewing how the system behaves in these situations to determine whether more protections are needed. INFRASTRUCTURE_ERROR < PERFORMANCE_DEGRADATION