Repository cloning failures
Incident Report for Fluid Attacks
Postmortem

Impact

An unknown number of groups experienced failures in the repository cloning process. The issue started on 2023-11-02 at 17:02 (UTC-5) and was proactively discovered 4.6 days later (TTD) by a member of the Fluid Attacks team [1], who encountered a "No space left on device" message in various groups on the Platform. The problem was resolved in 4 hours (TTF), resulting in a total impact of 4.8 days (TTR).
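As a quick check of the figures above, the total impact is the time to detect plus the time to fix. The sketch below assumes TTR = TTD + TTF and uses only the rounded values quoted in this report; the variable names are illustrative.

```python
from datetime import timedelta

# Rounded values quoted in the Impact section.
ttd = timedelta(days=4.6)   # time to detect
ttf = timedelta(hours=4)    # time to fix

# Assumption: total impact (TTR) is the sum of detection and fix time.
ttr = ttd + ttf
ttr_days = round(ttr.total_seconds() / 86400, 1)

print(f"TTR = {ttr_days} days")
```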

Cause

The type of virtual server instances that Fluid Attacks uses to execute some tasks was changed [2]. The new instances that processed repository cloning had less storage space, and because of how the cloning process was architected, they ran out of disk space, which caused the failure and the error message.

Solution

The architecture of the cloning process was modified to reduce the number of tasks executed concurrently per instance [3].
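The idea of limiting concurrency per instance can be illustrated with a minimal sketch. This is not the actual Fluid Attacks implementation: the function names, the per-clone disk budget, and the hard cap are all hypothetical, and the real change may have been made in scheduler or infrastructure configuration rather than in application code. The sketch derives a worker limit from the free disk space so that concurrent clones are unlikely to exhaust the device.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor


def max_concurrent_clones(free_bytes: int, per_clone_bytes: int, hard_cap: int) -> int:
    """Return how many clones may run at once given the free disk space.

    per_clone_bytes and hard_cap are hypothetical tuning values.
    """
    if per_clone_bytes <= 0:
        raise ValueError("per_clone_bytes must be positive")
    fits = free_bytes // per_clone_bytes
    # Always allow at least one task so work keeps progressing,
    # and never exceed the configured cap.
    return max(1, min(hard_cap, fits))


def run_clones(repos, clone_fn, workdir=".", per_clone_bytes=2 * 1024**3, hard_cap=8):
    """Run clone_fn over repos with a disk-aware concurrency limit."""
    free = shutil.disk_usage(workdir).free
    workers = max_concurrent_clones(free, per_clone_bytes, hard_cap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of repos in its results.
        return list(pool.map(clone_fn, repos))
```

With an assumed budget of 2 GiB per clone, an instance with 10 GiB free would run at most 5 clones at a time, while an instance with less than 2 GiB free would fall back to a single clone.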

Conclusion

Currently, there is no test for this part of the infrastructure because it cannot be run locally, and this kind of change cannot be tested before it reaches production [4]. The product team is now working on changes to improve the cluster's robustness, reproducibility, and observability [5][6]. INFRASTRUCTURE_ERROR < IMPOSSIBLE_TO_TEST

Posted Nov 07, 2023 - 20:26 GMT-05:00

Resolved
The incident has been resolved and the repository cloning is working normally.
Posted Nov 07, 2023 - 16:35 GMT-05:00
Update
We are continuing to work on a fix for this issue.
Posted Nov 07, 2023 - 14:13 GMT-05:00
Identified
Some inconveniences have been identified in the process of repository cloning.
Posted Nov 07, 2023 - 09:24 GMT-05:00
This incident affected: Cloning.