Impact
At least one user observed inconsistencies in the platform's indicators. The issue began on 2025-07-10 at 10:21 (UTC-5) and was discovered proactively one day later (time to detect, TTD) by a staff member who noticed that the system responsible for keeping these indicators up to date was failing, producing incorrect calculations. The problem was fixed in 2.4 hours (time to fix, TTF), for a total window of exposure of 1.1 days (WOE) [1].
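For reference, the exposure figure is consistent with the sum of the detection and fix times, assuming WOE is defined as TTD plus TTF:

\[
\text{WOE} = \text{TTD} + \text{TTF} = 24\,\text{h} + 2.4\,\text{h} = 26.4\,\text{h} \approx 1.1\ \text{days}
\]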
Cause
A bulk update modified many records in the system at once. Each change triggered automated processes that recalculate and update the indicators. With too many of these processes running simultaneously, the system became overloaded: the calculations could not finish and kept failing, leaving the system stuck in a continuous error loop [2].
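To illustrate, here is a minimal Python sketch of the failure mode; the names (recalculate_indicator, on_bulk_update) are hypothetical, not the platform's actual code. One recalculation worker is started per updated record with no cap on concurrency, so a large bulk update launches an unbounded number of simultaneous workers:

import threading

def recalculate_indicator(record_id: int) -> None:
    # Placeholder for the real indicator recalculation (assumed).
    ...

def on_bulk_update(updated_record_ids: list[int]) -> None:
    # One worker per updated record: a bulk update touching thousands of
    # records spawns thousands of concurrent workers, overloading the
    # system so that none of the calculations can finish.
    workers = [
        threading.Thread(target=recalculate_indicator, args=(rid,))
        for rid in updated_record_ids
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()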
Solution
The number of processes allowed to run simultaneously was reduced, and each process now handles a larger batch of updates before finishing. This balance let the system recover and resume normal operation [3].
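A minimal sketch of the fix under the same assumptions (the names and the specific values of MAX_WORKERS and BATCH_SIZE are hypothetical): concurrency is capped with a small fixed-size pool, and each worker processes a larger batch of updates, so the number of in-flight tasks stays bounded regardless of the size of the bulk update:

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4    # fewer simultaneous processes (assumed value)
BATCH_SIZE = 500   # more updates handled per process (assumed value)

def recalculate_batch(record_ids: list[int]) -> None:
    # Placeholder for recalculating indicators for a whole batch (assumed).
    ...

def on_bulk_update(updated_record_ids: list[int]) -> None:
    # Split the update into large batches and process them with a small,
    # fixed-size worker pool instead of one task per record.
    batches = [
        updated_record_ids[i:i + BATCH_SIZE]
        for i in range(0, len(updated_record_ids), BATCH_SIZE)
    ]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # Iterating the results surfaces any exception raised by a batch.
        list(pool.map(recalculate_batch, batches))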
Conclusion
By allowing fewer simultaneous processes and giving each one more work per run, we prevented the system from overloading. We also set up alerts to detect quickly if this issue recurs. These measures were enough to stabilize the platform.
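A minimal sketch of the kind of alert added, with hypothetical metric names and thresholds; the report does not describe the actual alerting stack:

def send_alert(message: str) -> None:
    # Placeholder for the on-call notification channel (assumed).
    print(f"ALERT: {message}")

def check_indicator_pipeline(failed_tasks: int, queue_backlog: int) -> None:
    MAX_FAILED = 10       # assumed threshold
    MAX_BACKLOG = 1_000   # assumed threshold
    if failed_tasks > MAX_FAILED:
        send_alert(f"indicator recalculation failures: {failed_tasks}")
    if queue_backlog > MAX_BACKLOG:
        send_alert(f"indicator update backlog: {queue_backlog}")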