Impact
At least three users encountered issues with the News button on the Platform, which was not working properly. The issue started on UTC-5 24-02-15 10:19 and was proactively discovered 1.9 hours (TTD) later by a staff member, who reported through our help desk [1] that the News button was not working. The problem was resolved in 3.1 hours (TTF), resulting in a total impact of 5 hours (TTR) [2].
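The timeline figures above can be cross-checked with a short calculation. This is a minimal sketch, assuming the date 24-02-15 is in YY-MM-DD form (2024-02-15) and that TTR is the sum of TTD and TTF; the variable names are illustrative, and the derived timestamps are approximate because the reported durations are rounded to one decimal place.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "UTC-5 24-02-15 10:19" means 2024-02-15 10:19 at UTC-5.
UTC_MINUS_5 = timezone(timedelta(hours=-5))
start = datetime(2024, 2, 15, 10, 19, tzinfo=UTC_MINUS_5)

ttd = timedelta(hours=1.9)  # time to detect, as reported
ttf = timedelta(hours=3.1)  # time to fix, as reported
ttr = ttd + ttf             # total impact, assumed to be TTD + TTF

# Approximate timestamps derived from the rounded durations.
detected_at = start + ttd
resolved_at = start + ttr

print(f"TTR: {ttr.total_seconds() / 3600:.1f} hours")
print("Detected around:", detected_at.isoformat(timespec="minutes"))
print("Resolved around:", resolved_at.isoformat(timespec="minutes"))
```

Under these assumptions, the detection falls at roughly 12:13 and the resolution at roughly 15:19 (UTC-5) on the same day, consistent with the 5-hour total impact.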
Cause
The unexpected deletion of a Cloudflare resource made the News button unavailable on the Platform. The button depended on specific resources managed by Cloudflare, and once they were deleted it could no longer work as expected [3].
Solution
It was necessary to recreate the domain in Cloudflare and rebuild its entire configuration from the code. In addition, Cloudflare required a waiting period of 1 to 2 hours to reconfigure the domain [4].
Conclusion
The incident highlighted the unpredictability of infrastructure deployment, especially with Terraform. We’ve learned to anticipate potential issues and ensure robust recovery processes. Moving forward, we’re enhancing monitoring and reviewing configuration to prevent similar incidents. INFRASTRUCTURE_ERROR < INCOMPLETE_PERSPECTIVE