Impact
At least one user experienced service interruptions affecting the agent and other features that depend on the API. The issue started on UTC-5 25-12-29 22:25 and was proactively discovered 3.7 days (TTD) later by one of our monitoring tools, which reported failures in several system components. While investigating these alerts, the team identified that background processes required for the service to operate were not running correctly. The problem was resolved in 1.6 hours (TTF), resulting in a total window of exposure of 3.7 days (WOE) [1].
Cause
A required access key was missing from the production configuration. The key had been removed by mistake, which caused background processes to fail, and as a result, any functionality that depended on them stopped working [2].
Solution
The missing access key was restored in the production configuration file, allowing the system to authenticate properly and resume normal operations [3].
Conclusion
Once the key was restored, all affected components recovered, and the service returned to normal. This incident emphasizes the importance of safeguarding critical configuration values to prevent widespread service disruptions. INFRASTRUCTURE_ERROR