Impact
At least three groups experienced problems with the CSPM vulnerability reporting of our machine services. The issue started on 2024-11-28 at 23:27 (UTC-5) and was proactively discovered 13.3 days later (TTD) by a staff member, who reported to our product team that a group had no CSPM reports for a recently added cloud environment. The problem was resolved in 2.8 hours (TTF), resulting in a total impact of 13.4 days (TTR).
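As a quick consistency check on the figures above, the total impact should match the time to detect plus the time to fix. The following is a minimal sketch, assuming TTR is simply the sum of TTD and TTF; the variable names are illustrative only.

```python
from datetime import timedelta

# Figures reported in this section.
ttd = timedelta(days=13.3)   # time to detect
ttf = timedelta(hours=2.8)   # time to fix

# Assumption: total impact (TTR) = TTD + TTF.
ttr = ttd + ttf
ttr_days = ttr / timedelta(days=1)  # express the total as fractional days
print(f"TTR = {ttr_days:.1f} days")  # prints "TTR = 13.4 days"
```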
Cause
The CSPM analysis runs daily on a scheduler. Due to an error introduced in a refactor, the scheduler failed to run the analysis for groups with the affected configuration [1].
Solution
The scheduler was fixed, and an execution was immediately queued for all groups [2].
Conclusion
The lack of adequate testing and monitoring for our machine services allowed this error to go unnoticed by the team. Tests were added to the code, and the team will implement extra monitoring tools for our machine services scanner. MISSING_TEST
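One possible shape for the extra monitoring is a staleness check that flags any group whose last CSPM execution is missing or too old. The sketch below is hypothetical and is not the team's actual tooling: the function name, the group names, the timestamps, and the two-day threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def find_stale_groups(
    last_runs: dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta = timedelta(days=2),  # assumed threshold for a daily job
) -> list[str]:
    """Return groups whose last CSPM run is missing or older than max_age."""
    stale = []
    for group, last_run in last_runs.items():
        if last_run is None or now - last_run > max_age:
            stale.append(group)
    return sorted(stale)


# Hypothetical data: group names and timestamps are made up for illustration.
now = datetime(2024, 12, 12, 8, 0)
last_runs = {
    "group-a": datetime(2024, 12, 12, 1, 0),  # ran within the last day
    "group-b": datetime(2024, 11, 28, 1, 0),  # last ran two weeks earlier
    "group-c": None,                          # never ran
}
print(find_stale_groups(last_runs, now))  # prints ['group-b', 'group-c']
```

A check like this, run on its own schedule and wired to an alert, would likely have surfaced the skipped groups within a couple of days rather than after 13.3 days.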