Impact
An unknown number of analysts received progress reports with inconsistent data: fields showed zeroes where they should have displayed other values. The issue started on UTC-5 24-07-10 09:29 and was proactively discovered 21.8 hours (TTD) later, when a staff member reported through our help desk [1] that the report they received contained inconsistencies. The problem was resolved in 1.2 days (TTF), resulting in a total impact of 2.1 days (TTR) [2].
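The reported figures can be checked with a quick calculation. The sketch below assumes that the total impact (TTR) is the sum of the time to detect (TTD) and the time to fix (TTF), which is consistent with the numbers above; the variable names are illustrative only.

```python
from datetime import timedelta

# Figures taken from the Impact section.
ttd = timedelta(hours=21.8)  # time to detect
ttf = timedelta(days=1.2)    # time to fix

# Assumption: total impact is detection time plus fix time.
ttr = ttd + ttf
ttr_days = ttr / timedelta(days=1)

print(f"TTR = {ttr_days:.1f} days")
```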
Cause
A new software tool intended to speed up processing was configured incorrectly: it did not limit how many tasks could query the database at the same time. As a result, the tasks could not reliably retrieve the data needed for the analysts' progress reports, and the reports showed zeroes that did not reflect the analysts' actual work [3].
Solution
The fix was to limit how many tasks can access group information simultaneously during processing [4].
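One common way to enforce this kind of limit is a semaphore shared by all tasks. The following is a minimal sketch under stated assumptions, not the actual fix referenced in [4]: the names `fetch_group` and `fetch_all_groups`, the limit of 4, and the simulated query are all hypothetical.

```python
import asyncio

MAX_CONCURRENT_QUERIES = 4  # hypothetical limit, not the production value


async def fetch_group(group_id, limiter, stats):
    """Fetch one group's data while holding a slot in the shared limiter."""
    async with limiter:  # waits here once all slots are taken
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.001)  # stand-in for the real database query
        stats["active"] -= 1
        return group_id, {"progress": group_id * 10}  # dummy payload


async def fetch_all_groups(group_ids, limit=MAX_CONCURRENT_QUERIES):
    """Start a task per group, but let at most `limit` query at once."""
    limiter = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_group(g, limiter, stats) for g in group_ids)
    )
    return dict(results), stats["peak"]


if __name__ == "__main__":
    data, peak = asyncio.run(fetch_all_groups(range(100)))
    print(f"fetched {len(data)} groups, peak concurrency {peak}")
```

Every task is still scheduled up front, so throughput stays high, but the semaphore caps how many are querying at any moment. The `peak` counter is only there to make the cap observable in a test.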
Conclusion
Local testing did not account for the large number of groups in production, so the error went undetected before deployment. To prevent this in the future, whenever we change processing settings or memory use, we will run the tasks in production without generating reports to confirm that they handle the full data volume successfully. INFRASTRUCTURE_ERROR