Impact
An unknown number of groups experienced failures in the repository cloning process. The issue started on UTC-5 23-11-02 17:02 and was proactively discovered 4.6 days (TTD) later by a member of the Fluid Attacks team [1], who encountered a "No space left on device" message in several groups on the Platform. The problem was resolved in 4 hours (TTF), resulting in a total impact of 4.8 days (TTR).
Cause
The type of virtual server instances that Fluid Attacks uses to execute some tasks was changed [2]. The new instances that processed repository cloning had less storage space than the previous ones. Because of how the cloning process was architected, the disk on those instances filled up, the process failed, and the error message was displayed.
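A "No space left on device" failure of this kind can, in principle, be detected before a clone starts. The following is a minimal sketch only, assuming Python and a per-repository size estimate that the report does not mention; the function names, the working directory, and the safety factor are all hypothetical and do not describe the actual cloning service.

```python
import shutil


def fits_on_disk(free_bytes: int, estimated_repo_bytes: int,
                 safety_factor: float = 1.5) -> bool:
    """Return True if the estimated clone size, padded by a safety
    factor (hypothetical value), fits in the free space given."""
    return free_bytes >= estimated_repo_bytes * safety_factor


def has_room_for_clone(workdir: str, estimated_repo_bytes: int,
                       safety_factor: float = 1.5) -> bool:
    """Check the free space of the filesystem that holds `workdir`
    before starting a clone there."""
    # shutil.disk_usage returns a named tuple (total, used, free) in bytes.
    free_bytes = shutil.disk_usage(workdir).free
    return fits_on_disk(free_bytes, estimated_repo_bytes, safety_factor)
```

A caller could skip or defer a clone when `has_room_for_clone` returns False, instead of failing partway through with a full disk.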
Solution
The architecture of the cloning process was modified to reduce the number of tasks executed concurrently per instance [3].
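The idea of capping concurrent tasks per instance can be illustrated with a short sketch. It assumes Python, a fixed disk budget per clone, and a thread pool as the concurrency mechanism; none of these are confirmed by the report, and the names `max_concurrent_clones`, `run_clones`, and `clone_fn` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def max_concurrent_clones(disk_bytes: int, per_clone_bytes: int,
                          reserved_bytes: int = 0) -> int:
    """Derive how many clones may run at once so that their combined
    disk budget stays within the instance's usable storage."""
    if per_clone_bytes <= 0:
        raise ValueError("per_clone_bytes must be positive")
    usable = disk_bytes - reserved_bytes
    # Always allow at least one task so that work can still progress.
    return max(1, usable // per_clone_bytes)


def run_clones(repos: Iterable[str], clone_fn: Callable[[str], str],
               workers: int) -> List[str]:
    """Run clone_fn over repos with at most `workers` tasks in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the results.
        return list(pool.map(clone_fn, repos))
```

With this approach, moving to an instance type with less storage lowers the concurrency limit automatically rather than overfilling the disk, at the cost of lower throughput per instance.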
Conclusion
There is currently no test for this part of the infrastructure because it cannot be run locally, and this kind of change cannot be tested before it reaches production [4]. The product team is now working on changes to improve the cluster's robustness, reproducibility, and observability [5][6].
INFRASTRUCTURE_ERROR < IMPOSSIBLE_TO_TEST