AI-Driven Incident Management for Distributed Cloud Systems: Detection, Mitigation, and Root Cause Automation

Harpreet Paramjeet Singh

Published: Jan 5, 2026

DOI: https://doi.org/10.52783/jisem.v11i1s.14216

Keywords:

AIOps, Anomaly Detection, Automated Root Cause Analysis, Predictive Mitigation, Incident Management

Abstract

Artificial intelligence for IT operations (AIOps) marks a paradigm shift in managing hyperscale distributed cloud systems, where manual incident management fails to scale. This failure arises from increasingly complex service dependencies, growing alert volumes, and failure propagation patterns. Learning-based anomaly detection strategies use self-adjusting thresholds and combined signal analysis to isolate abnormal behavior on the operational timeline. Intelligent alert consolidation groups related alerts by their shared characteristics. Automated root cause diagnosis combines causal analysis with large language models to synthesize incident information from heterogeneous telemetry data. Predictive mitigation uses time-series forecasting and reinforcement learning to select remediation actions proactively, before operational impact materializes. Similarly, incident lookup speeds the restoration of operations by drawing on organizational memory. Together, these approaches enable a closed-loop automated system in which detection and repair proceed without human intervention for well-modeled failure modes. The evaluation framework covers detection accuracy, operational efficiency gains, and reduction of cognitive load, and shows significant improvement in mean time to detection and mean time to repair while maintaining human oversight thresholds for high-risk operational scenarios.

Issue

Vol. 11 No. 1s (2026)

Section

Articles

Journal of Information Systems Engineering and Management
