Stealth outages are subtle. Compared to outages that cause a service to go completely offline or otherwise render it inaccessible to users, stealth outages may present as intermittent issues, or issues so imperceptible that many users may not even notice there’s a problem.


A complete system outage often leads to a surge of user complaints and even media coverage, but a stealth outage, while less noticeable, can still be quite harmful. In many ways it is harder to resolve than a full outage, because a team must first recognize that a stealth outage is happening, then determine its root cause, pinpoint its location, and establish who is responsible. Additionally, because stealth outages can easily go unnoticed, they may persist for longer periods, resulting in more significant usability problems over time.


Stealth outages often stem from the way modern applications and services are designed: they are optimized for performance and built on highly distributed architectures. In today’s AI era, as more AI agents and automation scripts are added to the backend, a new type of stealth outage is emerging.


The unnoticeable stealth outage?


The stealth outage is a modern twist on a classic philosophical thought experiment: If an application experiences a functional failure but everything appears normal and no one notices anything amiss, did an outage actually occur?


This was illustrated by a recent global disruption that affected the ability of Slack users to send messages, load channels or threads, or use integrated applications. Despite the functional failures, the outage’s outward-facing symptoms didn’t include any glaring indicators that something was wrong, such as blank screens or obvious error messages. Messages that appeared to be sent simply never went out, and the sender had no way of knowing. Unless users were actively sending messages or were expecting a message that never arrived, they likely remained oblivious to the outage. This is reinforced by the fact that the incident garnered little public attention, despite widespread use of the Slack application.
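One way to catch this kind of silent failure is a synthetic probe that does not trust the acknowledgment and instead checks that a message can actually be read back. The sketch below is a minimal illustration under stated assumptions: `FakeChatService`, its `send` and `history` methods, and the `silently_drop` flag are hypothetical stand-ins invented for this example, not Slack's API or a description of how the incident unfolded.

```python
import uuid


class FakeChatService:
    """Hypothetical stand-in for a messaging backend; not a real API."""

    def __init__(self, silently_drop=False):
        self.silently_drop = silently_drop
        self._channels = {}

    def send(self, channel, text):
        # The call always reports success, even when the message is lost.
        if not self.silently_drop:
            self._channels.setdefault(channel, []).append(text)
        return {"ok": True}

    def history(self, channel):
        return list(self._channels.get(channel, []))


def probe_delivery(service, channel="synthetic-probe"):
    """Send a uniquely tagged message and confirm it can be read back."""
    marker = f"probe-{uuid.uuid4()}"
    ack = service.send(channel, marker)
    delivered = marker in service.history(channel)
    return {"acknowledged": ack["ok"], "delivered": delivered}


healthy = probe_delivery(FakeChatService())
degraded = probe_delivery(FakeChatService(silently_drop=True))
print(healthy)   # acknowledged and delivered
print(degraded)  # acknowledged but not delivered: a stealth failure
```

The gap between "acknowledged" and "delivered" is the signal worth alerting on, since each step looks healthy when checked on its own.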


There have long been debates on the threshold for reporting or disclosing an outage. As we’ve seen over the years, it’s not unusual for official outage notifications to lag the first signs of a degraded user experience, often by hours.


In the AI era, network architectures grow more complex, and so do the end-to-end interdependencies, which stand and fall like dominoes when an issue arises anywhere along the chain that powers applications and online services.





The AI-on-AI stealth outage


Increasing automation to reduce operational and administrative overhead in network and application architectures has been a long-standing goal of operations teams. As we move deeper into the Agentic AI world, multi-agent systems – where the activities of agents are orchestrated to execute individual components of an end-to-end transaction or process in a coordinated manner – will become the rule rather than the exception. This will be constrained only by the consistency of execution that these multi-agent systems can achieve.


As organizations move toward agentic automation, it’s common for multiple teams to contribute to a single application or product while implementing their own AI or automations within their specific domains. Without central governance, the potential for one AI or automation to react unpredictably to the actions of another, creating an outage scenario, is very real.


This situation could lead to a stealth outage because an application developer or infrastructure owner who lacks end-to-end visibility of the AI agents and automated scripts running across their environment will struggle to diagnose an issue when one of them intervenes in a process, let alone when several interact in quick succession.


An organization’s own automated mitigations today can have a ‘butterfly effect’ – where a single anomalous detection triggers an automated action that, in turn, triggers other automated interventions. In trying to automatically mitigate a small issue, the organization ends up producing a much bigger set of them.
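The chain reaction described above can be sketched as a small simulation. Everything here is an assumption made for illustration: the event names, the owning teams, the `RULES` table, and the `max_depth` guard are hypothetical and do not describe any particular vendor's automation.

```python
from collections import deque

# Hypothetical rules: each automation reacts to one event and emits another.
RULES = {
    "latency_anomaly": ("traffic-team", "reroute_traffic"),
    "reroute_traffic": ("capacity-team", "scale_down_idle_region"),
    "scale_down_idle_region": ("security-team", "rate_limit_clients"),
}


def run_automations(initial_event, max_depth=None):
    """Follow the chain of automated reactions started by one event.

    Returns a list of (depth, owner, triggering_event, action) steps.
    If max_depth is set, the chain halts there and asks for human review.
    """
    trace = []
    queue = deque([(initial_event, 1)])
    while queue:
        event, depth = queue.popleft()
        if event not in RULES:
            continue  # nothing reacts to this event, so the chain ends
        if max_depth is not None and depth > max_depth:
            trace.append((depth, "guard", event, "halt_for_human_review"))
            break
        owner, action = RULES[event]
        trace.append((depth, owner, event, action))
        queue.append((action, depth + 1))
    return trace


unbounded = run_automations("latency_anomaly")
guarded = run_automations("latency_anomaly", max_depth=2)
for step in unbounded:
    print(step)
```

In the unbounded run, one anomaly leads to three interventions owned by three different teams. Recording every step in a single shared trace, and capping how far a chain may run without review, is one possible design for making such cascades visible. It is a sketch of the idea, not a prescription.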





Without visibility across all these dependencies, including nested ones, an organization may overlook scripts being triggered individually across various component subdomains, only becoming aware once multiple components degrade and a customer-facing impact is felt.
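To make the point about nested dependencies concrete, the sketch below walks a dependency map to show which customer-facing services sit above components where automations have fired. The service names, the `DEPENDS_ON` map, and the helper functions are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical dependency map: each service lists what it depends on.
DEPENDS_ON = {
    "checkout-ui": ["payments-api", "catalog-api"],
    "payments-api": ["auth-service", "ledger-db"],
    "catalog-api": ["search-index"],
    "auth-service": ["token-cache"],
}
CUSTOMER_FACING = {"checkout-ui"}


def transitive_dependencies(service, graph):
    """Return every direct and nested dependency of a service."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def exposed_services(triggered_components, graph, customer_facing):
    """Map each customer-facing service to the triggered components beneath it."""
    exposure = {}
    for service in customer_facing:
        hits = transitive_dependencies(service, graph) & set(triggered_components)
        if hits:
            exposure[service] = sorted(hits)
    return exposure


# Automations fired separately in two subdomains that do not share a team.
result = exposed_services({"token-cache", "search-index"}, DEPENDS_ON, CUSTOMER_FACING)
print(result)  # {'checkout-ui': ['search-index', 'token-cache']}
```

Neither trigger looks alarming within its own subdomain. Only the end-to-end view shows that both sit beneath the same customer-facing service.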


We have yet to see – or, more accurately, learn of a public disclosure of – a stealth outage that’s completely AI- or automation-driven, but the signs suggest one is not far off. Last year, a routine DDoS mitigation by Microsoft in its Azure cloud turned out to be anything but routine, when “an unrelated latent network configuration” caused healthy external traffic to be accidentally routed through the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The incident showed how systems can interact in the background, independent of human oversight, to make an incident worse, and how hard it is to diagnose root cause in complex environments.


The transformative power of AI is truly remarkable, similar to previous major technological disruptions such as cloud computing, the Internet, and mobile technology. The rapid interactions between agents, services, and APIs are creating new traffic patterns and introducing more dynamic, cross-domain dependencies than ever before.


Just as enterprises have implemented technologies to monitor and manage their cloud environments, identifying and preventing even the most subtle outages, they must now adapt to the era of AI. This requires new approaches to monitoring operations both within and outside their owned perimeter. Businesses need to drive operational insights by focusing on contextual information and determining how to measure what matters, and only that.


The views expressed in this article belong solely to the author and do not represent The Fast Mode. While information provided in this post is obtained from sources believed by The Fast Mode to be reliable, The Fast Mode is not liable for any losses or damages arising from any information limitations, changes, inaccuracies, misrepresentations, omissions or errors contained therein. The heading is for ease of reference and shall not be deemed to influence the information presented.