LLM for Automated Root Cause Analysis in Microservices Architectures
Abstract
The adoption of microservices architectures has made it significantly harder to diagnose and resolve system issues efficiently: many services handling millions of requests generate vast volumes of exceptions, ranging from business errors to critical runtime failures. Traditional manual approaches to error analysis and Root Cause Analysis (RCA) are time-consuming, error-prone and do not scale. Existing tools often aggregate exceptions but fail to classify them effectively or diagnose their root causes, leading to prolonged system downtime and reduced developer productivity. This research proposes an automated solution that combines fine-tuned Large Language Models (LLMs) with an Exception Classifier built on Natural Language Processing (NLP) and machine learning techniques. The system adopts a layered architecture in which exceptions are aggregated through a Kafka cluster. The Exception Classifier preprocesses error messages, extracts contextual information and categorizes exceptions as business or runtime errors. Classified runtime errors are forwarded to the RCA service, where fine-tuned LLMs perform detailed diagnostics by analyzing the exception stack trace together with the tokenized code repository. For exception classification, the solution targets over 90% precision on business exceptions and 89.6% recall on runtime exceptions; for RCA, it targets a diagnostic time of less than one second per exception and achieves over 85% accuracy in human qualitative evaluation. By automating error classification and RCA, the proposed system promises faster fault resolution, improved RCA accuracy and enhanced developer productivity, contributing to more resilient and efficient microservices environments.
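To make the classify-then-forward flow described above more concrete, the following is a minimal Python sketch of that stage. It is an illustration under stated assumptions, not the system's implementation: the keyword lists, the `ExceptionEvent` fields and the function names are hypothetical, a simple keyword score stands in for the NLP and machine learning classifier, and an in-memory list stands in for the Kafka consumer and the RCA service queue.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword hints. The abstract does not list the classifier's
# features; a trained NLP/ML model would replace this lookup.
RUNTIME_HINTS = ("nullpointerexception", "outofmemoryerror", "timeout",
                 "connection refused", "stackoverflowerror")
BUSINESS_HINTS = ("insufficient funds", "validation failed",
                  "not found", "already exists")


@dataclass
class ExceptionEvent:
    """One aggregated exception record (field names are assumptions)."""
    service: str
    message: str
    stack_trace: str = ""


def preprocess(message: str) -> str:
    """Lowercase the message and mask volatile tokens so similar errors align."""
    text = message.lower()
    text = re.sub(r"0x[0-9a-f]+", "<hex>", text)  # mask hex addresses first
    text = re.sub(r"\d+", "<num>", text)          # then mask remaining numbers
    return re.sub(r"\s+", " ", text).strip()


def classify(event: ExceptionEvent) -> str:
    """Return 'business' or 'runtime' from a simple keyword score."""
    text = preprocess(event.message)
    runtime_score = sum(hint in text for hint in RUNTIME_HINTS)
    business_score = sum(hint in text for hint in BUSINESS_HINTS)
    # Assumption: ties (including no match) go to 'runtime', so that
    # unknown failures still reach the RCA stage.
    return "business" if business_score > runtime_score else "runtime"


def route_for_rca(events: list[ExceptionEvent]) -> list[ExceptionEvent]:
    """Keep only runtime exceptions, i.e. those forwarded to the RCA service."""
    return [event for event in events if classify(event) == "runtime"]


if __name__ == "__main__":
    batch = [
        ExceptionEvent("payments", "Validation failed: amount 0 for order 1234"),
        ExceptionEvent("orders", "java.lang.NullPointerException at OrderService.java:42"),
    ]
    for event in route_for_rca(batch):
        print(event.service, "->", classify(event))
```

Sending ties to the runtime class is a design choice of this sketch that favors recall on runtime failures; the source does not state how the actual classifier resolves ambiguous cases.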