Reliability Engineering for AI-Optimized GPU Platforms in Mission-Critical Systems
Abstract
AI-optimized GPU platforms are increasingly deployed in mission-critical applications, including autonomous driving, medical diagnostics, and defense. However, these platforms exhibit distinct failure modes compared to traditional computing systems, including errors in memory-bound kernels, sensitivity to mixed-precision arithmetic, and workload distortion. This study introduces a multi-layer reliability engineering methodology that spans hardware, firmware, orchestration, models, and data pipelines to address these issues. It employs classical reliability modeling (reliability block diagrams and Markov models), accelerated testing, and survival analysis, while also incorporating site reliability engineering (SRE) practices and chaos engineering to improve AI workload reliability. The principal techniques include failure-injection campaigns, fleet-scale telemetry, and predictive maintenance, all tied to service-level objectives (SLOs) and aligned with safety goals. The findings indicate that availability improved significantly, with recovery times under 60 seconds and p99 latency below 50 ms in most instances. Moreover, predictive maintenance achieved an AUC of 0.83 and reduced unpredicted node failures by 34%. The research provides a practical reliability framework, a measurement handbook, and validation guidelines that can be replicated in safety-critical settings deploying GPU-accelerated AI. These contributions will make it easier to harmonize industry-level standards and to ensure that AI systems supporting mission objectives meet stringent reliability and safety requirements.
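As a minimal illustration of the kind of Markov availability modeling tied to SLOs that the abstract references (not the paper's actual model), the sketch below computes steady-state availability for a single GPU node as a two-state Markov chain and checks it against an assumed availability target; the MTBF, MTTR, and SLO values are illustrative assumptions, not results from the study.

```python
# Minimal sketch (assumed parameters, not the paper's model):
# steady-state availability of a single GPU node modeled as a
# two-state Markov chain (Up <-> Down), checked against an SLO.

MTBF_HOURS = 2000.0          # assumed mean time between failures
MTTR_HOURS = 60.0 / 3600.0   # assumed mean time to repair (~60 s)

failure_rate = 1.0 / MTBF_HOURS   # lambda (failures per hour)
repair_rate = 1.0 / MTTR_HOURS    # mu (repairs per hour)

# Steady-state availability of the two-state chain: A = mu / (lambda + mu)
availability = repair_rate / (failure_rate + repair_rate)

SLO_AVAILABILITY = 0.99999  # illustrative SLO target, not from the study
print(f"Steady-state availability: {availability:.6f}")
print(f"Meets SLO ({SLO_AVAILABILITY}): {availability >= SLO_AVAILABILITY}")
```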