Interruption of asynchronous processes in AWS Batch
Incident Report for Fluid Attacks
Postmortem

Impact

Several AWS Batch asynchronous tasks began to fail due to an issue with creating the instances required to run them. The issue started on 2024-09-11 at 16:11 (UTC-5) and was discovered proactively 15.8 hours later (TTD) by a staff member reviewing the Auto Scaling configuration in Batch, which showed multiple affected environments. The problem was fixed 1.4 hours after detection (TTF), for a total impact of roughly 17 hours (TTR) [1].

Cause

A change in the cloud-init configuration file for our instances attempted to install an inaccessible dependency, disrupting instance creation and preventing AWS Batch jobs from running [2].
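For illustration only, the failure mode resembles a cloud-config user-data file that pulls a dependency from a source the instance cannot reach; the package name and URL below are hypothetical and are not the actual change:

    #cloud-config
    # Hypothetical example: if the package or index URL below is unreachable,
    # cloud-init reports a failure during boot, and the instance may never
    # become ready for AWS Batch to place jobs on it.
    packages:
      - some-internal-tool   # hypothetical dependency
    runcmd:
      - pip install --index-url https://internal.example.com/simple some-lib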

Solution

The team reverted the commit that introduced the problem [3].
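In general terms, the rollback amounts to reverting the offending commit and redeploying the corrected user data through the usual pipeline; the commit reference below is a placeholder, since the actual hash is only in the linked evidence [3]:

    # Revert the commit that changed the cloud-init file (placeholder hash),
    # then push so the regular deployment process ships the corrected user data.
    git revert <offending-commit-sha>
    git push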

Conclusion

Establishing a testing and validation process for changes to cloud-init files is necessary. Currently, there is no reliable method to ensure these changes are functional before deployment, and Terraform does not offer sufficient validation as it treats cloud-init files merely as resources. Implementing a robust testing approach, even on a local level, will help prevent similar issues in the future. INFRASTRUCTURE_ERROR < MISSING_TEST < INCOMPLETE_PERSPECTIVE
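As a starting point, and depending on the cloud-init version available, something along these lines could run locally or in CI before a cloud-init change is merged; the file name is illustrative:

    # Validate the cloud-config syntax and schema before deploying.
    # Recent cloud-init releases expose this as `cloud-init schema`;
    # older releases expose it as `cloud-init devel schema`.
    cloud-init schema --config-file user-data.yaml --annotate

Schema validation alone would not have caught an unreachable dependency, so booting a throwaway instance or container with the new user data remains the more thorough check.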

Posted Sep 12, 2024 - 17:47 GMT-05:00

Resolved
The incident has been resolved, and asynchronous processes in AWS Batch are now running normally.
Posted Sep 12, 2024 - 09:18 GMT-05:00
Identified
It has been detected that some asynchronous processes in AWS Batch encountered errors and failed to run properly. This disruption affected some computing environments.
Posted Sep 11, 2024 - 16:11 GMT-05:00
This incident affected: Cloning and Scanning.