Impact
At least three users found that the News button on the Platform was not working correctly. The issue started on UTC-5 24-02-15 10:19 and was proactively discovered 1.9 hours later (TTD) by a staff member, who reported through our help desk [1] that the News button was not working. The problem was resolved in 3.1 hours (TTF), resulting in a total window of exposure (WOE) of 5 hours [2].
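The relationship between the three metrics can be checked with a short sketch. It assumes that TTF is counted from the moment of detection (which is consistent with TTD + TTF = WOE in the figures above) and that "24-02-15" means 2024-02-15; the derived timestamps are approximate because the reported durations are rounded to one decimal place.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "24-02-15" is read as 2024-02-15 (YY-MM-DD), in UTC-5.
UTC_MINUS_5 = timezone(timedelta(hours=-5))
start = datetime(2024, 2, 15, 10, 19, tzinfo=UTC_MINUS_5)

ttd = timedelta(hours=1.9)  # time to detect, as reported
ttf = timedelta(hours=3.1)  # time to fix, assumed to start at detection
woe = ttd + ttf             # window of exposure

detected = start + ttd      # approximate detection time
resolved = detected + ttf   # approximate resolution time

print("Detected:", detected.strftime("%H:%M"))  # 12:13
print("Resolved:", resolved.strftime("%H:%M"))  # 15:19
print("WOE:", woe)                              # 5:00:00
```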
Cause
A Cloudflare resource was unexpectedly deleted. Because the News button depended on specific resources managed by Cloudflare, the button could no longer work as expected once that resource was gone [3].
Solution
We had to recreate the domain in Cloudflare and rebuild its entire configuration from the code. However, the fix was not immediate: Cloudflare required 1 to 2 hours to reconfigure the domain [4].
Conclusion
The incident highlighted the unpredictability of infrastructure deployment, especially with Terraform. We’ve learned to anticipate potential issues and to ensure that our recovery processes are robust. Moving forward, we’re enhancing our monitoring and reviewing our configuration to prevent similar incidents. INFRASTRUCTURE_ERROR < INCOMPLETE_PERSPECTIVE
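As an illustration of the kind of monitoring mentioned in the conclusion, the following is a minimal sketch of an availability probe, not the check actually deployed. The function name `endpoint_is_up`, the `opener` parameter, and the example URL are hypothetical; the sketch only assumes that the affected resource can be reached over HTTP(S).

```python
import urllib.error
import urllib.request


def endpoint_is_up(url, opener=urllib.request.urlopen, timeout=5.0):
    """Return True if `url` answers with an HTTP 2xx status, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP
        # error statuses (HTTPError is a subclass of URLError).
        return False


if __name__ == "__main__":
    # Hypothetical URL; replace with the endpoint that backs the feature.
    if not endpoint_is_up("https://news.example.invalid/"):
        print("ALERT: endpoint is unreachable")
```

Run on a schedule, a probe like this would flag a deleted domain as soon as name resolution or the HTTP request starts failing, rather than relying on a user or staff report.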