Repository cloning failures

Incident Report for Fluid Attacks

Postmortem

Impact

An unknown number of groups experienced failures in the repository cloning process. The issue started on 2023-11-02 at 17:02 (UTC-5) and was proactively discovered 4.6 days later (TTD) by a staff member, who reported through our help desk [1] a "No space left on device" message in various groups on the Platform. The problem was resolved in 4 hours (TTF), resulting in a total window of exposure (WOE) of 4.8 days.
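The relationship between the three figures above can be checked with a short calculation. This is only an illustrative sketch: it assumes the window of exposure is the sum of the time to detect and the time to fix, and the variable names are hypothetical.

```python
from datetime import timedelta

# Figures taken from the Impact paragraph.
ttd = timedelta(days=4.6)   # time to detect
ttf = timedelta(hours=4)    # time to fix

# Assumption: window of exposure = time to detect + time to fix.
woe = ttd + ttf
woe_days = woe.total_seconds() / 86400  # seconds per day

print(f"WOE is roughly {woe_days:.1f} days")
```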

Cause

Fluid Attacks changed the type of virtual server instances used to execute some tasks. The new instances that processed repository cloning had less storage space. Because of how this process was architected, cloning exhausted the available disk space on these instances, and the error message was displayed [2].

Solution

The cloning process architecture was modified to reduce the number of tasks executed concurrently on each instance [3].
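As a rough illustration of the idea of capping concurrent tasks per instance, the sketch below limits how many clone jobs run at once and skips a job when free disk space is low. It is not Fluid Attacks' actual implementation: the names `MAX_CONCURRENT_CLONES`, `MIN_FREE_BYTES`, `has_enough_space`, and `clone_all`, as well as the threshold values, are assumptions made for this example.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor

# Hypothetical limits, chosen only for illustration.
MAX_CONCURRENT_CLONES = 2
MIN_FREE_BYTES = 1 * 1024**3  # 1 GiB


def has_enough_space(path, min_free_bytes=MIN_FREE_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= min_free_bytes


def clone_all(repos, clone_fn, workdir=".",
              max_workers=MAX_CONCURRENT_CLONES,
              min_free_bytes=MIN_FREE_BYTES):
    """Run clone_fn(repo) for each repo, at most `max_workers` at a time."""

    def guarded(repo):
        # Check disk space right before each clone instead of failing midway.
        if not has_enough_space(workdir, min_free_bytes):
            return (repo, "skipped: low disk space")
        clone_fn(repo)
        return (repo, "ok")

    # The pool size is the concurrency cap for this instance.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(guarded, repos))
```

Here `clone_fn` stands in for whatever actually clones a repository. Lowering `max_workers` trades throughput for a smaller peak disk footprint on each instance.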

Conclusion

Currently, there is no test for this part of the infrastructure because it cannot be run locally, and this kind of change cannot be tested before it reaches production [4]. The product team is now working on changes to improve the cluster's robustness, reproducibility, and observability [5][6]. INFRASTRUCTURE_ERROR < IMPOSSIBLE_TO_TEST

Posted Nov 07, 2023 - 20:26 GMT-05:00

Resolved

The incident has been resolved, and repository cloning is working normally.
Posted Nov 07, 2023 - 16:35 GMT-05:00

Update

We are continuing to work on a fix for this issue.
Posted Nov 07, 2023 - 14:13 GMT-05:00

Identified

Problems have been identified in the repository cloning process.
Posted Nov 07, 2023 - 09:24 GMT-05:00
This incident affected: Cloning.