Description
An incident has corrupted the DNS configuration of our production application, placing our systems in maintenance mode and causing all APIs calls to be rejected with a HTTP status code 503.
Impact
All the users connecting to our web application during the incident were redirected to a blank page or to the login page of a non-production environment, that they could not access. All API calls and webhook deliveries were rejected.
Timeline
Root cause
The issue was caused by a bug in our CI/CD deployment pipeline that was overriding a particular DNS configuration incorrectly when concurrent deployments were taking place across different environments at the exact same time.
Mitigation and recovery
We were able to manually restore the DNS configuration and were able to recover the majority of web hooks delivery events, processing them with a delay. We have since identified the bug and are updating our deployment scripts to prevent this from happening in the future.