What Happened?
At 20:10 EST, the Customer Support and Engineering teams were alerted to ongoing delays in push notifications specifically affecting users in Canada. The incident coincided with a non-routine manual backup of Canadian production databases, which was being performed in preparation for upcoming deployments.
The manual backup consumed substantial system resources, creating a bottleneck that significantly delayed our notification services. Once the team identified a potential correlation between the backup and the latency, it immediately halted all manual backup operations. After these tasks were terminated, system performance stabilized, and services returned to normal operation by 20:35 EST.
Impact
Users located in Canada likely experienced significant latency in notification delivery across several Hypercare services between 19:00 and 20:30 EST. The scope of the impact included:
Resolution and Next Steps
The incident was resolved by stopping the manual database backup. Because this was a non-routine event, existing automated monitors did not initially flag the resource exhaustion as a service-level threat. To prevent a recurrence and improve detection, the following actions have been implemented: