DNS issue
Incident Report for Clayton
Postmortem

Description

An incident corrupted the DNS configuration of our production application, placing our systems in maintenance mode and causing all API calls to be rejected with HTTP status code 503.

Impact

All users connecting to our web application during the incident were redirected to a blank page or to the login page of a non-production environment that they could not access. All API calls and webhook deliveries were rejected.

Timeline

  • The incident started at 2022-05-25 08:40:31 UTC+0200
  • The incident was resolved at 2022-05-25 09:09:02 UTC+0200

Root cause

The issue was caused by a bug in our CI/CD deployment pipeline, which incorrectly overwrote a particular DNS configuration when deployments to different environments ran at exactly the same time.
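
As an illustration of the class of bug described above, the following is a minimal sketch of one common safeguard: each environment may only modify the DNS records it owns, and writes to the shared configuration are serialized. It is not the actual fix applied to our pipeline. The record names, the ownership table and the function names are hypothetical, and the in-process lock stands in for whatever cross-job mutex a real CI/CD system would provide.

```python
import threading
from typing import Dict, Set

# Hypothetical record names and targets, used only for illustration.
_records: Dict[str, str] = {
    "app.example.com": "prod-v1.example.net",
    "staging.example.com": "staging-v1.example.net",
}

# Hypothetical ownership table: which environment may change which record.
_OWNED_BY: Dict[str, Set[str]] = {
    "production": {"app.example.com"},
    "staging": {"staging.example.com"},
}

# Stand-in for a mutex shared by all deployment jobs (assumption).
_lock = threading.Lock()


def apply_dns_update(environment: str, updates: Dict[str, str]) -> None:
    """Apply updates for one environment, refusing records it does not own."""
    foreign = set(updates) - _OWNED_BY[environment]
    if foreign:
        raise PermissionError(f"{environment} may not modify {sorted(foreign)}")
    # Serialize writes so concurrent deployments cannot interleave.
    with _lock:
        _records.update(updates)


def current_records() -> Dict[str, str]:
    """Return a snapshot of the shared configuration."""
    with _lock:
        return dict(_records)
```

With this guard, a staging deployment that tries to touch the production record fails immediately instead of silently overwriting it, while legitimate concurrent deployments to different environments both succeed.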

Mitigation and recovery

We manually restored the DNS configuration and recovered the majority of webhook delivery events, which were processed with a delay. We have since identified the bug and are updating our deployment scripts to prevent a recurrence.
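
To make the delayed processing of webhook events more concrete, here is a minimal sketch of how queued events might be replayed oldest first after an outage. It is an assumption about the general technique, not a description of our delivery system: the `WebhookEvent` type, the `replay_events` function and the `send` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class WebhookEvent:
    event_id: str
    created_at: float  # Unix timestamp of the original event
    payload: dict


def replay_events(
    events: Iterable[WebhookEvent],
    send: Callable[[WebhookEvent], bool],
) -> Tuple[List[str], List[str]]:
    """Replay events in creation order; return (delivered ids, failed ids)."""
    delivered: List[str] = []
    failed: List[str] = []
    for event in sorted(events, key=lambda e: e.created_at):
        try:
            ok = send(event)  # hypothetical delivery callback
        except Exception:
            ok = False  # treat any delivery error as a failed attempt
        (delivered if ok else failed).append(event.event_id)
    return delivered, failed
```

Returning the failed identifiers separately lets an operator see which events could not be recovered and retry or report them.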

Posted May 26, 2022 - 18:41 BST

Resolved
The web application is not reachable due to a DNS configuration issue
Posted May 25, 2022 - 08:40 BST