This incident began at approximately 13:50 on Monday 15/02/2021 and ended at approximately 21:00 the same day (all times are GMT).
Description of the impact
A slow query in the new Developer Experience affected customers migrating to that platform. As more customers were migrated, more workload was generated, and customers not yet on the new Developer Experience also started to experience delays in review processing.
Timeline and Root Cause
From 11:45 GMT on 15/02/2021, p99 CPU utilisation rose to 99%, so the cluster was scaled up for the first time.
From 13:50 GMT, p50 CPU utilisation also rose to 99%, marking the start of the actual incident. At 14:00 GMT the product engineering team began investigating the issue, first deploying a new, more detailed monitoring configuration to gain more visibility into the database cluster.
At 16:30 GMT we identified a query that was causing congestion on the database and, as a result, delays in the review processing queue.
At 18:19 GMT we deployed the patch and also scaled up the cluster to speed up recovery time.
At 21:16 GMT the incident was resolved, as review processing times had consistently returned to normal levels.
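The timeline distinguishes between p99 and p50 CPU utilisation reaching 99%. The following is a minimal sketch of why the second signal is more serious, not a description of our monitoring stack: the `percentile` helper and both sample sets are hypothetical and use a simple nearest-rank method. A high p99 with a low p50 indicates occasional peaks, while a high p50 means the cluster is saturated at least half of the time.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical CPU utilisation samples (percent), 100 readings each.
spiky = [35] * 98 + [99] * 2        # mostly idle, with rare peaks
saturated = [97] * 40 + [99] * 60   # sustained high load

for name, samples in (("spiky", spiky), ("saturated", saturated)):
    print(name, "p50 =", percentile(samples, 50), "p99 =", percentile(samples, 99))
```

With these made-up samples, both sets report a p99 of 99, but only the saturated set also reports a p50 of 99, which corresponds to the condition that marked the start of the incident at 13:50 GMT.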
Detection, Remediation and Prevention
In the short term, we are focusing on additional database monitoring improvements and query optimisations.
In the long term, our engineering team is planning to introduce deeper performance testing early in the development phase.
Posted Feb 16, 2021 - 14:18 GMT
Resolved
This incident has been resolved.
Posted Feb 15, 2021 - 21:16 GMT
Update
Processing is gradually returning to normal levels. We continue to monitor the situation.
Posted Feb 15, 2021 - 21:00 GMT
Update
We have released a fix and continue to monitor our instances as we slowly clear our backlog of requests.
Posted Feb 15, 2021 - 19:53 GMT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 15, 2021 - 18:19 GMT
Investigating
We are currently investigating this issue.
Posted Feb 15, 2021 - 17:13 GMT
Update
We are continuing to monitor for any further issues.
Posted Feb 15, 2021 - 16:59 GMT
Monitoring
We are performing unplanned maintenance to address today's issues. We'll continue to post updates on the situation.
Posted Feb 15, 2021 - 16:30 GMT
Investigating
Our code review processing infrastructure is running behind, which is causing latency in reporting and slower-than-normal response times.
Posted Feb 15, 2021 - 14:00 GMT
This incident affected: Clayton (Code Review Workers, Web).