By Vijay Verma, Persistent Systems

Enterprises pivot to the cloud with goals such as cost reduction, competitive agility, and scaling AI. But these expectations often fall flat because enterprises have yet to align their cloud operating model with business goals.

For instance, they source monitoring tools that flag drifts in application behavior, performance, and responsiveness against predetermined thresholds, just as they would in an on-premises setup. Cloud environments, however, are dynamic, especially hybrid deployments, where containers and microservices are rapidly created and torn down.

Static thresholds applied to this high-dimensional, fast-changing data only add to the complexity. Without risk-based prioritization, the data deluge produces a flood of alerts, leading to alert fatigue, false positives and negatives, and noise. A BCG survey revealed that more than 14% of organizations deploy 50 or more monitoring tools, and more than 70% of the data those tools collect is unnecessary, leading to:

  • Inflated costs: The global market for commercial IT monitoring grew at a compound annual rate of 11.2% between 2018 and 2020, while the total IT market grew by less than 3%.

  • Risks of downtime: Failing to predict issues, or to remediate them in time, increases the likelihood of downtime. Global 2000 companies lose an estimated $400 billion annually to system failures and slowdowns, compounded by the loss of customer trust and reputational damage.

  • Complexity loops: With numerous tools, teams, and dynamic environments, 74% of CIOs state that the growing complexity makes efficient management extremely challenging.


Proactively identifying and remediating operational issues requires AI-infused observability systems driven by machine learning (ML), natural language processing (NLP), and generative AI (GenAI). Built on foundational visibility, these systems can help triage issues, improve performance, reduce expenses, and sustainably achieve cloud objectives.

The Missing Piece: Data Analysis & Actionable Insights

As cloud monitoring evolves into observability, the health, performance, and behavior of applications, systems, and infrastructure are gauged through telemetry data, such as events, logs, and traces, correlated by timestamps, resource IDs, and error codes. Observability platforms give IT teams a holistic view of signals from multiple cloud environments, mapping how different elements interact to influence performance, security, and management.
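To make the correlation step concrete, here is a minimal sketch that groups log lines and trace spans sharing a resource ID within a short time window. The record fields, timestamps, and window size are illustrative assumptions, not any particular platform's schema.

```python
from collections import defaultdict

# Illustrative telemetry records; the field names (resource_id, ts,
# error_code) are assumptions, not a specific platform's schema.
logs = [
    {"resource_id": "pod-7", "ts": 1710000005, "error_code": 500, "msg": "upstream timeout"},
    {"resource_id": "pod-7", "ts": 1710000007, "error_code": 500, "msg": "retry exhausted"},
    {"resource_id": "pod-9", "ts": 1710000030, "error_code": 200, "msg": "ok"},
]
traces = [
    {"resource_id": "pod-7", "ts": 1710000006, "span": "checkout", "latency_ms": 4200},
    {"resource_id": "pod-9", "ts": 1710000031, "span": "search", "latency_ms": 35},
]

WINDOW = 10  # seconds: records this close together are treated as one incident

def correlate(logs, traces, window=WINDOW):
    """Group logs and trace spans that share a resource ID and occur
    within `window` seconds of each other."""
    incidents = defaultdict(list)
    for log in logs:
        for span in traces:
            if (log["resource_id"] == span["resource_id"]
                    and abs(log["ts"] - span["ts"]) <= window):
                incidents[log["resource_id"]].append((log, span))
    return incidents

for resource, pairs in correlate(logs, traces).items():
    print(resource, "->", len(pairs), "correlated log/trace pairs")
```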


Although observability platforms consolidate signals, they do not provide a path to remediation, predict failures, or detect anomalies on their own, so cloud monitoring remains reactive: a human still has to analyze the signals. These systems inherently cannot map how a change in one part of the cloud estate impacts the larger ecosystem, or how, for instance, a utilization strategy can help meet financial and sustainability goals.

Fixing the ‘Observability Gap’ with AI

The classic dilemma with having a lot of data is knowing what to do with it. While observability consolidates telemetry in a single-pane view, removing data silos and surfacing real-time trends, it still does not answer how to pre-empt or triage risks. That know-how exists as tribal knowledge or depends on trial and error, creating bottlenecks to timely resolution. An IBM study found that more than half of organizations globally had severe or high-level staffing shortages and, as a result, incurred an average of $1.76 million in additional security breach costs.

Solutions now exist that simplify cloud observability with conversational AI that interacts with observability data directly, eliminating the need to manually sift through alerts and dashboards. The underlying AI correlates across all four telemetry dimensions (metrics, events, logs, and traces) and leverages OpenTelemetry to create a unified view of signals across cloud environments.
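As context for the OpenTelemetry piece, the sketch below shows minimal tracing instrumentation with the OpenTelemetry Python SDK. The service, span, and attribute names are illustrative, and a real deployment would export spans to a collector rather than the console.

```python
# Minimal tracing setup with the OpenTelemetry Python SDK
# (pip install opentelemetry-sdk).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # service name is illustrative

# Each unit of work becomes a span; attributes supply the context
# (IDs, routes) an AI layer needs to correlate traces with logs and metrics.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "A-1001")
    span.set_attribute("http.route", "/orders")
```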


Embedding AI into cloud observability can help:

  • Identify issues from weak drifts: ML models learn baseline behavior (e.g., daily traffic patterns, resource consumption, number of inference requests) and flag minor deviations that point to a brewing issue. They analyze historical data alongside real-time telemetry to pick up anomalies, even from weak signals, triggering mitigation plans that pre-empt latency, scalability, security, or cost-overrun issues. AI helps reduce false positives (alerts for non-issues) and false negatives (missed critical issues) with up to 98% accuracy, potentially saving 40 hours of manual triage per week (see the baseline sketch after this list).

  • Significantly improve resolution times: By cross-referencing large volumes of telemetry (logs, traces, metrics) to identify hidden dependencies, AI can rank alerts by severity, allowing teams to cut noise and focus on the ones that point to a real incident. Organizations that deploy AI in observability are 2.3 times more likely than non-AI users to detect and mitigate issues within minutes or hours, rather than weeks or even months. AI can also help teams collaborate more effectively by translating technical issues into actionable steps and can help prevent similar issues from recurring.

  • Tap into competitive agility: AI observability systems provide a deeper overlay of the infrastructure, application, and data layers, using time-series forecasting to identify issues such as gradual resource leaks and spikes in latency, and to predict failures before critical thresholds are breached (see the forecasting sketch after this list). AI can automatically trigger corrective measures or alert the teams concerned before an outage causes downtime. This frees developer bandwidth for core work instead of troubleshooting incidents, with organizations with mature AI observability generating 2.6x more code on demand than their non-AI counterparts.

  • Improve scalability and flexibility: AI can map service dependencies to identify cascading failures across distributed systems, issues that traditional monitoring often misses because it evaluates components in isolation. By handling petabytes of data across distributed systems, AI can analyze the dynamic usage patterns of containers, Kubernetes workloads, and serverless architectures in the cloud, adapting to changing contexts and providing real-time insight into how components behave.

  • Create context-aware alerting: AI can prioritize alerts by business impact (e.g., payment gateway errors over non-critical service delays) and suppress low-priority alerts during peak business hours, ensuring the team's resources and bandwidth go to business-critical exceptions (see the prioritization sketch after this list).

  • Ensure continuous learning: Models retrain with new data, adapting to infrastructure changes (e.g., dynamically scaling Kubernetes clusters), leading to self-healing, self-learning cloud operations that are proactive and fault-resistant.
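The baseline sketch referenced in the first bullet is, at its simplest, a z-score detector: learn a mean and spread from recent history and flag a reading that deviates beyond a threshold. The metric, numbers, and threshold below are made up; production systems would use seasonal baselines and far richer models.

```python
import statistics

def detect_drift(history, latest, z_threshold=3.0):
    """Flag `latest` as anomalous if it deviates from the learned
    baseline by more than `z_threshold` standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history) or 1e-9  # guard against a flat baseline
    z = (latest - mean) / stdev
    return abs(z) >= z_threshold, z

# Baseline: requests per minute over recent history (synthetic numbers).
baseline = [980, 1010, 995, 1003, 990, 1008, 997, 1001, 992, 1005]
anomalous, z = detect_drift(baseline, latest=1450)
print(f"anomalous={anomalous}, z-score={z:.1f}")
```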
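The forecasting sketch is a toy linear extrapolation: fit a trend to recent utilization samples and estimate when a critical threshold will be crossed. The samples and threshold are invented, and real systems would use seasonal, probabilistic models rather than a straight line.

```python
def minutes_until_breach(samples, threshold):
    """Fit a linear trend to utilization samples (one per minute) and
    extrapolate when `threshold` will be crossed. Returns None if the
    trend is flat or improving."""
    n = len(samples)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples)) / \
            sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return None
    return (threshold - samples[-1]) / slope

# Disk utilization (%) over the last 8 minutes: a slow resource leak.
usage = [71.0, 71.6, 72.1, 72.9, 73.4, 74.2, 74.8, 75.5]
eta = minutes_until_breach(usage, threshold=90.0)
print(f"predicted breach in ~{eta:.0f} minutes" if eta else "no breach predicted")
```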
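Finally, the prioritization sketch: weight each alert's detector severity by an assumed business-impact score and damp low-impact noise during peak hours. The service names, weights, and damping factor are illustrative assumptions, not any vendor's scoring model.

```python
from dataclasses import dataclass

# Illustrative business-impact weights; real weights would come from a
# service catalog or SLO definitions, not hard-coded values.
SERVICE_IMPACT = {"payment-gateway": 1.0, "search": 0.6, "reporting": 0.2}

@dataclass
class Alert:
    service: str
    severity: float  # 0..1 score from the anomaly detector
    peak_hours: bool

def priority(alert: Alert) -> float:
    """Score an alert by business impact, suppressing low-impact noise
    during peak hours so critical services get attention first."""
    impact = SERVICE_IMPACT.get(alert.service, 0.5)
    score = alert.severity * impact
    if alert.peak_hours and impact < 0.5:
        score *= 0.3  # damp low-priority alerts at peak traffic
    return score

alerts = [
    Alert("payment-gateway", 0.7, peak_hours=True),
    Alert("reporting", 0.9, peak_hours=True),
]
for a in sorted(alerts, key=priority, reverse=True):
    print(a.service, round(priority(a), 2))
```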

Way Forward for AI-Shy Enterprises

AI-based observability is still maturing. Despite the benefits, organizations worry about safeguarding privacy and sensitive information when implementing AI for observability. CIOs acknowledge that AI-enhanced observability is still developing, and that understanding how AI models process data and derive insights, while ensuring security, can be challenging.

To address these challenges, organizations can implement mitigating measures, even if they involve trade-offs. Governance, security, and compliance concerns, for instance, can be mitigated through techniques such as data redaction and private hosting, which restrict the dissemination of sensitive data. Achieving complete transparency into how the models work remains an ongoing effort, though continuous training on new data has steadily improved the accuracy of root-cause predictions and the analysis that accompanies them.
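As a rough illustration of the redaction idea, the sketch below masks sensitive values in log lines before telemetry leaves the environment. The regex patterns are ad hoc examples; a real deployment would rely on a vetted DLP library and organization-specific rules.

```python
import re

# Illustrative redaction patterns; real rules would come from a vetted
# DLP library, not these ad hoc regexes.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(line: str) -> str:
    """Mask sensitive values before the log line is shipped."""
    for label, pattern in PATTERNS.items():
        line = pattern.sub(f"<{label}-redacted>", line)
    return line

print(redact("payment failed for jane@example.com card 4111 1111 1111 1111"))
```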

Enabling AI to take autonomous actions based on the proactive identification of anomalies or metric drifts can create self-healing, cost-optimized, and performance-tuned cloud environments, significantly reducing the need for manual intervention and allowing teams to focus on strategic initiatives.

About the author:

Vijay Verma is Chief Revenue Officer – Service Lines at Persistent Systems.