Impact
An unknown number of analysts experienced issues with their progress reports. The issue started on 24-07-10 at 09:29 (UTC-5) and was proactively discovered 21.8 hours later (TTD) by a staff member, who reported through our help desk [1] that the report they received contained inconsistencies. The problem was resolved in 1.2 days (TTF), for a total window of exposure (WOE) of 2.1 days [2].
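The figures above can be cross-checked with a short calculation. This is a minimal sketch that assumes the window of exposure is the time to detect plus the time to fix; the function name is hypothetical and not part of any tooling mentioned in this report.

```python
def window_of_exposure_days(ttd_hours: float, ttf_days: float) -> float:
    """Return the window of exposure in days.

    Assumption: WOE = TTD + TTF, with TTD converted from hours to days.
    """
    return ttd_hours / 24 + ttf_days


# Values taken from the Impact section: TTD = 21.8 hours, TTF = 1.2 days.
woe = window_of_exposure_days(21.8, 1.2)
print(round(woe, 1))  # 2.1
```

Under that assumption, 21.8 hours is about 0.9 days, and 0.9 + 1.2 gives the reported 2.1 days.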
Cause
A new software tool meant to speed up processing was set up incorrectly: it placed no limit on how many tasks could query the database at once. With so many simultaneous requests, the tasks could not reliably retrieve the data needed for the analysts' progress reports. As a result, the reports showed empty values that did not reflect the analysts' actual work [3].
Solution
We limited how many tasks can access group information simultaneously during processing [4].
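One common way to apply this kind of limit is a semaphore around the database call. The sketch below is illustrative only and is not the actual fix: `fetch_group_info`, `process_groups`, `run_demo`, and the limit of 4 are hypothetical names and values, and the database query is simulated with a short sleep.

```python
import asyncio

# Hypothetical cap on simultaneous database queries (not the real setting).
MAX_CONCURRENT_QUERIES = 4


async def fetch_group_info(group: str, semaphore: asyncio.Semaphore, stats: dict) -> dict:
    """Fetch data for one group while holding a slot in the semaphore."""
    async with semaphore:  # waits here when all slots are in use
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.001)  # stand-in for the real database query
        stats["active"] -= 1
        return {"group": group}


async def process_groups(groups: list, limit: int = MAX_CONCURRENT_QUERIES):
    """Process all groups with at most `limit` queries in flight."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    tasks = [fetch_group_info(g, semaphore, stats) for g in groups]
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


def run_demo(n_groups: int = 50, limit: int = MAX_CONCURRENT_QUERIES):
    """Run the sketch and return the results plus the peak concurrency seen."""
    groups = [f"group-{i}" for i in range(n_groups)]
    return asyncio.run(process_groups(groups, limit))
```

All tasks are still scheduled at once, but only `limit` of them hold a database slot at any moment, so the load on the database stays bounded regardless of how many groups exist.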
Conclusion
Local testing did not account for the large number of groups in production, so we missed the error before deployment. To prevent this in the future, whenever we change processing settings or memory use, we will first run the tasks in production without generating reports to confirm that they handle the full volume of data successfully. INFRASTRUCTURE_ERROR
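The preventive measure described in the conclusion can be pictured as a dry-run switch on the processing job. This is a rough sketch under stated assumptions: `run_processing`, its `fetch` and `generate_report` callables, and the `dry_run` flag are hypothetical and do not describe the real pipeline.

```python
def run_processing(groups, fetch, generate_report, dry_run=False):
    """Process every group; skip report generation when dry_run is True.

    `fetch` and `generate_report` are hypothetical callables supplied by the
    caller. A dry run still exercises data retrieval at full volume.
    """
    processed = 0
    reports = []
    for group in groups:
        data = fetch(group)
        if data is None:
            # Fail loudly instead of producing a report with empty values.
            raise RuntimeError(f"no data returned for {group}")
        processed += 1
        if not dry_run:
            reports.append(generate_report(group, data))
    return processed, reports
```

With `dry_run=True`, the job touches every group at production scale but produces no reports, so a retrieval failure surfaces before any analyst receives incorrect data.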