AI-Augmented Self-Healing Infrastructure: Combining Health Probes with Remediation Playbooks
Main Article Content
Abstract
Self-healing infrastructure has evolved as a foundational component in resilient, cloud-native systems. This paper introduces an advanced framework that enhances traditional health probe-driven remediation with artificial intelligence and machine learning. By integrating AI-powered anomaly detection, adaptive remediation strategies, and generative playbook synthesis, the proposed architecture transforms reactive fault response into a proactive, predictive, and autonomous paradigm. Utilizing native observability tools like Kubernetes, AWS CloudWatch, and Prometheus, combined with LLM-based pattern inference, we build a multi-tiered AI-driven monitoring system. Event-driven automation via AWS Lambda and EventBridge is extended with intelligent decision engines and reinforcement learning loops. Remediation workflows are executed through Ansible and AWS Systems Manager and enhanced by AI-generated playbooks tailored to novel incidents. Empirical validation shows dramatic reductions in MTTR, enhanced failure prevention rates, and lower operational overhead. This research redefines self-healing as an intelligent, continuously evolving capability vital for multi-cloud resilience. To our knowledge, this is the first framework to integrate LLMs and RL for playbook synthesis in self-healing cloud environments.