ConnectMe – Login Issues and Inbound/Outbound Call Failures

Incident Report for Dstny

Postmortem

Major Incident Category
Service Degradation
Postmortem Owner
Ant Hurlock
Date Postmortem Completed (UTC)
04 Mar 2026, 17:30

Incident Summary
On 26th February 2026, a small percentage of ConnectMe users in the EU West region experienced service degradation between 08:19 and 09:22 UTC. Affected users were connected to a single platform node and encountered varied symptoms, including blank screens after login and disruptions to calling and related functionality. Although the overall number of impacted users was limited, the disruption for those affected was significant. Engineers applied mitigation steps, and all services were fully restored and confirmed stable by 09:22 UTC.

Root Cause
The incident was triggered by a software fault within a core platform component. A planned update containing fixes for two related issues had not yet been deployed, so those fixes were not present in the production environment at the time of the incident.

The first issue caused the affected platform node to become unresponsive, and the diagnostic information required to quickly identify the cause was insufficient. The upcoming update will introduce enhanced logging to provide the visibility needed to detect and analyse node deadlock scenarios in real time.

The second issue involved the node’s health‑monitoring probe. A related software defect meant the probe could take significantly longer than expected to detect and recover from failures, extending the automated recovery window to as much as two hours. Although the platform is designed to self‑recover within this timeframe, in this case recovery did not occur until engineers intervened manually to restore service.
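To illustrate how health-probe timing parameters can compound into a recovery window of this length, the sketch below uses hypothetical values in the style of a Kubernetes-like liveness probe; the actual probe configuration is not part of this report.

```python
# Hypothetical probe parameters -- illustrative only; the real
# production values are not disclosed in this report.
period_seconds = 60       # how often the probe runs
timeout_seconds = 30      # how long each probe attempt waits before failing
failure_threshold = 80    # consecutive failures required before a restart

# Worst case: every probe attempt runs to its full timeout, and the node
# is only restarted after the entire failure threshold is exhausted.
worst_case_seconds = failure_threshold * (period_seconds + timeout_seconds)
print(worst_case_seconds / 3600)  # -> 2.0 (hours)
```

With parameters in this range, a defect that inflates the per-probe delay or the failure threshold can stretch automated recovery to roughly two hours, which matches the recovery window described above.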

Incident Resolution
The issue was first detected by automated monitoring at 08:19 UTC, prompting immediate investigation by our engineering team. Initial recovery efforts were delayed because an incorrect procedure was followed: it terminated the unhealthy pod but left the health‑check waiting on internal conditions that did not apply to this failure mode. As a result, the health‑check entered a recovery cycle that could have taken up to two hours to complete. Engineers then intervened manually, restarting the affected component and restoring normal service for all impacted users at 09:22 UTC. Service stability was confirmed following this intervention.

Mitigative Actions

  • Accelerating deployment of the planned software update.
  • Improving the speed and reliability of automatic recovery by resolving the issue that previously caused the health‑monitoring probe to delay self‑recovery by up to two hours.
  • Introducing enhanced logging to support faster diagnosis of node deadlock scenarios and progress towards a permanent fix.
  • Updating runbooks to provide clearer guidance for rapid recovery and validation of service restoration.
Posted Mar 05, 2026 - 10:43 UTC

Resolved

This incident has been resolved.
Posted Mar 03, 2026 - 16:41 UTC

Monitoring

Our Platform team has identified the root cause of the issue and implemented corrective measures to restore application services.

We will continue to monitor service availability for the next 24 hours and do not anticipate any further impact at this time.

Thank you.

Dstny Support.
Posted Feb 26, 2026 - 10:23 UTC

Investigating

We are currently investigating an issue affecting ConnectMe in EU West. This is impacting a subset of users and includes login issues as well as inbound and outbound call failures in the affected areas.

Our teams are working to identify the root cause and implement a resolution.

Updates will be provided every 60 minutes as we learn more.

We apologise for any inconvenience caused and appreciate your patience during this time.

Dstny Support
Posted Feb 26, 2026 - 10:21 UTC
This incident affected: ConnectMe (EU).