Review Processing Delays
Incident Report for Clayton
Postmortem

This incident began at approximately 13:50 on Monday 15/02/2021 and ended at approximately 21:00 the same day (all times are GMT).

Description of the impact

A slow query in the new Developer Experience affected customers being migrated to that platform. As more customers were migrated, the workload grew, and customers not yet on the new Developer Experience also started experiencing delays in review processing.

Timeline and Root Cause

  • At 11:45 GMT on 15/02/2021, p99 CPU utilisation reached 99%, so the cluster was scaled up for the first time.
  • At 13:50 GMT, p50 CPU utilisation also reached 99%, marking the start of the actual incident. At 14:00 GMT the product engineering team started investigating the issue and first deployed a new, more detailed monitoring configuration to gain more visibility into the database cluster.
  • At 16:30 GMT we identified a query that was congesting the database and therefore delaying the review processing queue.
  • At 18:19 GMT we deployed the patch and also scaled up the cluster to speed up recovery time.
  • At 21:16 GMT the incident was resolved, as review processing times were consistently back to normal levels.
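
The timeline distinguishes p99 CPU utilisation reaching 99% (the early warning at 11:45) from p50 reaching 99% (the start of the incident at 13:50). The sketch below illustrates that distinction only. The function names, the nearest-rank percentile method, the 99% threshold and the sample values are assumptions made for illustration, not our actual monitoring configuration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def classify(samples, threshold=99.0):
    """Label a window of CPU utilisation samples (in percent).

    Hypothetical labels:
      "tail saturation"  - only p99 has reached the threshold
      "broad saturation" - p50 has reached it too
    """
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    if p50 >= threshold:
        return "broad saturation"
    if p99 >= threshold:
        return "tail saturation"
    return "healthy"


# Illustrative samples only, not measurements from this incident.
early_warning = [40] * 98 + [99, 100]   # a few samples pinned near 100%
incident = [50] * 40 + [99] * 60        # most samples pinned near 100%
print(classify(early_warning))  # tail saturation
print(classify(incident))       # broad saturation
```

Under these assumptions, a saturated p99 means only the busiest samples are at the limit, while a saturated p50 means at least half of them are, which matches the point where delays became visible to customers.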

Detection, Remediation and Prevention

  • In the short term, we are focusing on additional database monitoring improvements and query optimisations.
  • In the long term, our engineering team is planning to introduce deeper performance testing early in the development phase.
Posted Feb 16, 2021 - 14:18 GMT

Resolved
This incident has been resolved.
Posted Feb 15, 2021 - 21:16 GMT
Update
Processing is gradually returning to normal levels. We continue to monitor the situation.
Posted Feb 15, 2021 - 21:00 GMT
Update
We have released a fix and continue to monitor our instances as we slowly clear our backlog of requests.
Posted Feb 15, 2021 - 19:53 GMT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 15, 2021 - 18:19 GMT
Investigating
We are currently investigating this issue.
Posted Feb 15, 2021 - 17:13 GMT
Update
We are continuing to monitor for any further issues.
Posted Feb 15, 2021 - 16:59 GMT
Monitoring
We are performing some unplanned maintenance to address today's issues and will continue to post updates on the situation.
Posted Feb 15, 2021 - 16:30 GMT
Investigating
Our code review processing infrastructure is running behind, which is causing latency in reporting and slower-than-normal response times.
Posted Feb 15, 2021 - 14:00 GMT
This incident affected: Clayton (Code Review Workers, Web).