[CA region only] Unplanned interruption to services

Incident Report for Hypercare

Postmortem

What Happened?

At approximately 11:15 am EST on Monday, March 2, 2026, core services in the Canadian region experienced a system-wide outage. The incident was caused by database connection exhaustion. The team suspects that a background process responsible for resetting user statuses (transitioning Hypercare users from “Unavailable” or “Busy” back to “Available” after a set expiry date and time) generated a high volume of sustained, long-lived connections. The database has a fixed limit on concurrent connections, and these hanging connections saturated the pool, preventing any new requests from being processed.
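
To illustrate the failure mode described above, here is a minimal sketch in Python. It is a toy model, not Hypercare's actual code: the `FixedPool` class, the limit of 5, and the "background job" loop are all hypothetical stand-ins. It shows that once every slot under a fixed connection limit is held by connections that are never released, an ordinary request cannot obtain one.

```python
import threading


class FixedPool:
    """Toy stand-in for a database with a hard cap on concurrent connections."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self._slots = threading.BoundedSemaphore(max_connections)

    def acquire(self, timeout: float = 0.01) -> bool:
        """Try to take a connection; return False if none frees up in time."""
        return self._slots.acquire(timeout=timeout)

    def release(self) -> None:
        """Return a connection to the pool."""
        self._slots.release()


# Hypothetical limit; the real limit is not stated in this report.
pool = FixedPool(max_connections=5)

# Hypothetical background job: takes connections and never returns them.
held_by_job = sum(pool.acquire() for _ in range(5))

# An ordinary user request now times out waiting for a connection.
request_served = pool.acquire()

print(held_by_job, request_served)  # 5 False
```

Releasing even one held connection (`pool.release()`) lets the next request through, which is why terminating the stalled connections restored service.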

Impact

All Hypercare services were inaccessible for Canadian users from approximately 11:15 am EST until 12:22 pm EST on Monday, March 2, 2026.

Resolution and Next Steps

The Engineering team restored services by manually terminating stalled connections and increasing the capacity of the database connection pool. The team disabled the automated status reset feature to stabilize the environment and allow core services to resume normal operation. The feature has been running intermittently in a controlled environment to ensure users’ statuses are reset appropriately.

To reduce the chances of a recurrence and improve our response time, the following actions are being taken:

  • Enhanced Monitoring: We are implementing additional early detection alerts for database connection utilization. This will allow us to intervene before the limit is reached.
  • Infrastructure Updates: We are increasing the number of permissible connections to the database to facilitate faster recovery during database restarts.
  • Rapid Recovery Protocol: While we finalize the permanent root-cause fix, we have implemented a rapid recovery protocol that allows the team to clear hanging connections quickly, reducing potential recovery time from several minutes to under 60 seconds.
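
As a rough illustration of the monitoring and recovery actions listed above, the sketch below shows one way such checks could be expressed. It is an assumption-laden example, not the tooling the team deployed: the `Connection` record, the function names, the 70% and 90% thresholds, and the 300-second age cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    conn_id: int
    age_seconds: float  # how long the connection has been open


def utilization_level(in_use: int, limit: int,
                      warn_at: float = 0.70, page_at: float = 0.90) -> str:
    """Classify pool usage so an alert can fire before the limit is hit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = in_use / limit
    if ratio >= page_at:
        return "page"
    if ratio >= warn_at:
        return "warn"
    return "ok"


def stale_connections(conns: list[Connection],
                      max_age_seconds: float) -> list[int]:
    """Return the ids of connections open longer than the allowed age."""
    return [c.conn_id for c in conns if c.age_seconds > max_age_seconds]


# Hypothetical snapshot of the pool during an incident.
conns = [Connection(1, 2.0), Connection(2, 900.0), Connection(3, 1200.0)]
level = utilization_level(in_use=95, limit=100)
to_terminate = stale_connections(conns, max_age_seconds=300)

print(level, to_terminate)  # page [2, 3]
```

The two thresholds separate an early warning from an urgent page, so the team can act while there is still headroom. The age filter identifies only long-lived connections as candidates for termination, leaving short, healthy ones alone.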

We apologize for the disruption caused by the unplanned downtime on March 2. We thank everyone for their patience and continued support.

Posted Mar 03, 2026 - 01:18 EST

Resolved

This incident has been resolved. A postmortem will be shared shortly.
Posted Mar 02, 2026 - 13:35 EST

Monitoring

A fix has been released and all core services have been restored. We're continuing to monitor the incident.
Posted Mar 02, 2026 - 12:23 EST

Identified

The issue has been identified and we are working on a fix.
Posted Mar 02, 2026 - 11:46 EST

Investigating

We are currently investigating an unplanned downtime of all core services.
Posted Mar 02, 2026 - 11:25 EST
This incident affected: Canadian Region (Login and Single Sign On (Canadian Region), Messaging (Canadian Region), Notifications and Real-Time Syncing (Canadian Region), File Attachments (Canadian Region), Viewing Who is On-Call (Canadian Region), Code Teams (Canadian Region), Self-serve Scheduling (Canadian Region), Administration and Scheduling (Canadian Region), API & Integrations (Canadian Region), Virtual Pager (Canadian Region)).