Major Incident Category
Service Degradation
Post Mortem Owner
Ant Hurlock
Date Post Mortem Completed (UTC)
04 Mar 2026, 17:30
Incident Summary
On 26 February 2026, a small percentage of ConnectMe users in the EU West region experienced service degradation between 08:19 and 09:22 UTC. Affected users were connected to a single platform node and encountered a range of symptoms, including blank screens after login and disruption to calling and related functionality. Although the overall number of impacted users was limited, the disruption for those affected was significant. Engineers applied mitigation steps, and all services were fully restored and confirmed stable by 09:22 UTC.
Root Cause
The incident was triggered by a software fault in a core platform component. A planned update containing fixes for two related issues had not yet been deployed, so those fixes were not yet present in the production environment.
The first issue caused the affected platform node to become unresponsive, and the diagnostic information available at the time was insufficient to identify the cause quickly. The upcoming update will introduce enhanced logging to provide the visibility needed to detect and analyse node deadlock scenarios in real time.
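For illustration of the kind of real-time visibility such logging can provide, the sketch below shows a minimal heartbeat watchdog that logs and dumps thread stacks when a worker stalls. This is a generic Python sketch, not ConnectMe's implementation; all names and thresholds are hypothetical.

    import faulthandler
    import logging
    import sys
    import threading
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("deadlock-watchdog")

    # Hypothetical threshold: how long without a heartbeat counts as a stall.
    STALL_THRESHOLD_S = 30
    last_heartbeat = time.monotonic()

    def heartbeat():
        """Called from the worker's main loop on every iteration."""
        global last_heartbeat
        last_heartbeat = time.monotonic()

    def watchdog():
        """Side thread: logs in real time when the worker stops heartbeating."""
        while True:
            time.sleep(5)
            stalled_for = time.monotonic() - last_heartbeat
            if stalled_for > STALL_THRESHOLD_S:
                # Dump all thread stacks so a deadlock can be analysed from logs.
                log.error("worker stalled for %.0fs; dumping thread stacks", stalled_for)
                faulthandler.dump_traceback(file=sys.stderr)

    threading.Thread(target=watchdog, daemon=True).start()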
The second issue involved the node's health-monitoring probe. A software defect in the probe meant it could take significantly longer than intended to detect and recover from failures, extending the automated recovery window to as much as two hours. Although the platform is designed to self-recover within this timeframe, in this case recovery did not occur until engineers intervened manually to restore service.
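To make the two-hour figure concrete, the sketch below shows how probe settings can compound into a long worst-case automated recovery window: detection time (probe interval multiplied by a consecutive-failure threshold) plus exponential backoff between restart attempts. The parameter values are hypothetical, chosen only to show how such a window can reach roughly two hours; they are not the platform's real settings.

    def worst_case_recovery_s(interval_s, failure_threshold, attempts, base_backoff_s, cap_s):
        """Detection time plus capped exponential backoff between restart attempts."""
        detection_s = interval_s * failure_threshold
        backoff_s = sum(min(base_backoff_s * 2 ** i, cap_s) for i in range(attempts))
        return detection_s + backoff_s

    # Hypothetical probe parameters, for illustration only.
    total_s = worst_case_recovery_s(
        interval_s=60,          # probe runs once a minute
        failure_threshold=10,   # consecutive failures before recovery starts
        attempts=7,             # restart attempts before escalation
        base_backoff_s=60,      # first backoff interval
        cap_s=3600,             # backoff ceiling
    )
    print(f"worst-case automated recovery: ~{total_s / 3600:.1f} h")  # ~2.2 h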
Incident Resolution
The issue was first detected by automated monitoring at 08:19 UTC, prompting immediate investigation by the engineering team. Initial recovery was delayed by an incorrect procedure: it terminated the unhealthy pod but left the health check waiting on internal conditions that were irrelevant to this type of failure, starting a recovery cycle that could have taken up to two hours to complete. Engineers then intervened manually, restarting the affected component and restoring normal service for all impacted users at 09:22 UTC. Service stability was confirmed following this intervention.
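The failure mode in the incorrect procedure can be illustrated as follows. This is a simplified Python sketch, not the actual runbook: the flawed path terminates the pod and then blocks on a condition a deadlocked node never signals, while the manual path restarts the component directly and verifies health. The Node interface and its methods are hypothetical.

    import time

    class Node:
        """Hypothetical stand-in for the platform node's control interface."""
        def terminate_pod(self): ...
        def restart_component(self): ...
        def drained_cleanly(self): return False  # a deadlocked node never drains cleanly
        def is_healthy(self): return True

    def wait_for(condition, timeout_s, poll_s=30):
        """Poll a condition until it holds or the timeout expires."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if condition():
                return True
            time.sleep(poll_s)
        return False

    def flawed_recovery(node):
        # The procedure that delayed recovery: terminate the pod, then wait
        # on an internal condition that is irrelevant to a deadlock, so the
        # cycle can run for up to the full two-hour window.
        node.terminate_pod()
        return wait_for(node.drained_cleanly, timeout_s=2 * 60 * 60)

    def manual_recovery(node):
        # What engineers ultimately did: restart the component directly and
        # confirm health, without waiting on the irrelevant condition.
        node.restart_component()
        return wait_for(node.is_healthy, timeout_s=10 * 60)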
Mitigative Actions