Impact
At least one user observed disruptions across multiple features. The issue started on 2025-09-04 at 21:52 (UTC-5) and was proactively discovered 43.2 minutes (TTD) later by a staff member who noticed that GitLab pipeline jobs had begun to fail. The failure escalated quickly: the Fluid Agent and the API became unavailable, and eventually the entire platform went down. As a result, data processing tasks (ETLs) were delayed, affecting internal analytics and billing. The scanning system could not update its database or process new scans. Platform queues froze, blocking reports, reattacks, and cloning. Additionally, downloadable technical and executive reports could not be generated. Overall, the Agent, API, Scanner, and Platform were all impacted. However, because the incident occurred outside business hours, no clients were affected and no external reports were generated. The problem was resolved in 7.9 hours (TTF), for a total window of exposure of 8.6 hours (WOE).
Cause
During an infrastructure update, existing permissions were removed before their replacements were applied. This left the system without the authorizations it needed to complete the update, which caused critical services to stop working [1].
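The report does not say which cloud provider or tooling was involved. As a hypothetical illustration only, the sketch below assumes AWS IAM managed through boto3, with made-up role and policy names, and contrasts the failure-prone ordering (remove the old permissions first) with a create-before-remove ordering that avoids the authorization gap.

```python
# Hypothetical sketch: assumes AWS IAM via boto3; the report does not name the
# provider. Role and policy names below are illustrative, not real.
import json
import boto3

iam = boto3.client("iam")

ROLE = "pipeline-role"  # hypothetical role used by the update jobs
OLD_POLICY_ARN = "arn:aws:iam::123456789012:policy/old-pipeline-policy"  # hypothetical
NEW_POLICY_DOC = {  # hypothetical replacement permissions
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"}],
}

def unsafe_update():
    # Failure-prone ordering: the role loses its permissions before the
    # replacement policy exists, leaving a window with no authorizations.
    iam.detach_role_policy(RoleName=ROLE, PolicyArn=OLD_POLICY_ARN)
    new = iam.create_policy(PolicyName="new-pipeline-policy",
                            PolicyDocument=json.dumps(NEW_POLICY_DOC))
    iam.attach_role_policy(RoleName=ROLE, PolicyArn=new["Policy"]["Arn"])

def safe_update():
    # Create-before-remove: the replacement is attached first, so the role
    # never runs without the permissions it needs.
    new = iam.create_policy(PolicyName="new-pipeline-policy",
                            PolicyDocument=json.dumps(NEW_POLICY_DOC))
    iam.attach_role_policy(RoleName=ROLE, PolicyArn=new["Policy"]["Arn"])
    iam.detach_role_policy(RoleName=ROLE, PolicyArn=OLD_POLICY_ARN)
```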
Solution
The missing permissions were restored and the update was re-applied manually, which corrected most of the issues. Some configurations were out of sync with the real environment, so they were cleaned up and recreated to bring the two back into alignment. After these actions, all affected services were back in operation.
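For illustration only, assuming the infrastructure is described with Terraform (not stated in this report) and using hypothetical resource addresses, a cleanup-and-recreate step could look roughly like this: drifted entries are dropped from state and a fresh apply recreates them.

```python
# Hypothetical sketch (assumes Terraform, which this report does not state).
# Drops stale state entries and re-applies so declared and real environments match.
import subprocess

# Hypothetical addresses of resources whose recorded state no longer matched reality.
DRIFTED = [
    "aws_iam_role_policy_attachment.pipeline",
    "aws_sqs_queue.reports",
]

def reconcile(addresses: list[str]) -> None:
    for address in addresses:
        # Remove only the stale record; the real resource (if any) is left untouched.
        subprocess.run(["terraform", "state", "rm", address], check=True)
    # Re-applying the configuration recreates whatever is now missing from state.
    subprocess.run(["terraform", "apply", "-auto-approve"], check=True)

if __name__ == "__main__":
    reconcile(DRIFTED)
```

Depending on the kind of drift, `terraform import` or `terraform apply -replace=<address>` could serve the same purpose without a full remove-and-recreate cycle.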
Conclusion
We are evaluating safeguards so that future infrastructure updates can detect these kinds of conflicts in advance and raise a clear error, preventing critical permissions from being removed and avoiding system downtime.

INFRASTRUCTURE_ERROR < INCOMPLETE_PERSPECTIVE
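As a hedged illustration of the kind of safeguard being evaluated, the sketch below assumes Terraform with AWS IAM resources (neither is confirmed by this report). It inspects the machine-readable plan before apply and fails with a clear error if any IAM permission would be destroyed without a replacement being created.

```python
# Hypothetical pre-apply check: assumes Terraform and AWS IAM resource types.
# Reads the JSON produced by `terraform show -json plan.out` and refuses to
# proceed if permissions would be deleted without being recreated.
import json
import sys

def check_plan(plan_path: str) -> None:
    with open(plan_path) as handle:
        plan = json.load(handle)

    violations = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # Flag IAM-related resources that are deleted but not recreated.
        if change.get("type", "").startswith("aws_iam") and \
                "delete" in actions and "create" not in actions:
            violations.append(change["address"])

    if violations:
        sys.exit(
            "Refusing to apply: these permissions would be removed without a "
            "replacement:\n  " + "\n  ".join(violations)
        )

if __name__ == "__main__":
    check_plan(sys.argv[1])
```

A check like this could run as a GitLab pipeline job between the plan and apply stages, stopping the update before any permission is actually removed.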