Impact
Several asynchronous AWS Batch tasks began to fail because the instances required to run them could not be created. The issue started on 24-09-11 at 16:11 (UTC-5) and was proactively discovered 15.8 hours later (TTD) by a staff member through the Auto Scaling configuration in Batch, which showed multiple affected environments. The problem was resolved in 1.4 hours (TTF), for a total impact of 17 hours (TTR) [1].
Cause
A change to the cloud-init configuration file for our instances made them attempt to install a dependency that was not accessible. This broke instance creation and prevented AWS Batch jobs from running [2].
Solution
The team reverted the commit that introduced the problem [3].
Conclusion
We need a testing and validation process for changes to cloud-init files. There is currently no reliable way to confirm that such changes work before deployment, and Terraform does not offer sufficient validation because it treats cloud-init files merely as resources. A robust testing approach, even one that only runs locally, will help prevent similar issues in the future.
INFRASTRUCTURE_ERROR < MISSING_TEST < INCOMPLETE_PERSPECTIVE
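As one possible starting point for the local validation mentioned above, the following is a minimal sketch of a pre-deployment check that scans a cloud-init file for HTTP(S) URLs and reports the ones that cannot be reached. All function names here (`extract_urls`, `is_reachable`, `find_unreachable_urls`) are hypothetical and not part of our tooling, and the sketch assumes the inaccessible dependency is referenced by URL; dependencies pulled in through a package manager would need a different check. It could complement, not replace, structural validation such as the `cloud-init schema` subcommand available in recent cloud-init releases.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, List

# Matches http(s) URLs up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s'\"<>]+")


def extract_urls(cloud_config_text: str) -> List[str]:
    """Return unique URLs in order of first appearance, skipping comment lines."""
    urls: List[str] = []
    for line in cloud_config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # also skips the "#cloud-config" header
        for url in URL_PATTERN.findall(line):
            url = url.rstrip(".,;)")  # drop trailing punctuation
            if url not in urls:
                urls.append(url)
    return urls


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Default probe: a HEAD request; HTTP errors and network failures count as unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False


def find_unreachable_urls(
    cloud_config_text: str,
    probe: Callable[[str], bool] = is_reachable,
) -> List[str]:
    """Return every URL in the file for which the probe reports failure."""
    return [url for url in extract_urls(cloud_config_text) if not probe(url)]
```

The probe is injectable so the check can be unit-tested without network access. In a pipeline, a non-empty result from `find_unreachable_urls` would fail the job before the change reaches the instances. Note that the probe only reflects reachability from the machine running the check, which may differ from what the instances themselves can reach.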