Robust Explainable AI via Adversarial Latent Diffusion Models: Mitigating Gradient Obfuscation with Interpretable Feature Attribution

Tejaskumar Dattatray Pujari, Deepak Kumar Kejriwal, Anshul Goel

Abstract

This study introduces the Adversarial Latent Diffusion Explanations (ALDE) framework, a novel approach for improving the robustness and interpretability of explainable AI (XAI) methods under adversarial conditions. Using an experimental research design, the framework integrates diffusion models with adversarial training for deep image classification. It was evaluated on two benchmark datasets, ImageNet and CIFAR-10, with two pre-trained deep learning models, ResNet-50 and WideResNet-28-10.
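As a rough illustration of this experimental setup, the sketch below loads a pre-trained ResNet-50 and a CIFAR-10 test loader in PyTorch/torchvision. The weight tag, normalization statistics, and batch size are illustrative assumptions, not values taken from the paper.

```python
# Illustrative setup only: model/dataset choices follow the abstract, but all
# concrete parameters here are assumptions.
import torch
import torchvision
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained ImageNet classifier (ResNet-50). A WideResNet-28-10 for CIFAR-10
# would typically come from a robustness library or custom training code.
resnet50 = torchvision.models.resnet50(weights="IMAGENET1K_V2").to(device).eval()

# CIFAR-10 test split with commonly used normalization statistics.
cifar_tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
cifar_test = torchvision.datasets.CIFAR10(root="./data", train=False,
                                          download=True, transform=cifar_tf)
test_loader = torch.utils.data.DataLoader(cifar_test, batch_size=128, shuffle=False)
```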


The ALDE framework combines a Denoising Diffusion Probabilistic Model (DDPM) for input purification with Projected Gradient Descent (PGD) for adversarial training, and uses Integrated Gradients to generate interpretable feature attributions. The models were evaluated on adversarial robustness, explanation stability (measured by the Structural Similarity Index Measure, SSIM), and interpretability (measured by Intersection over Union, IoU, against reference saliency maps).
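The following is a minimal sketch of these pipeline stages, assuming a PyTorch classifier with inputs in [0, 1]: an L-infinity PGD attack, a hand-rolled Integrated Gradients approximation, and a placeholder `ddpm_purify` hook standing in for the DDPM purification step. The hyperparameters (epsilon, step size, number of steps) are illustrative, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: ascend the loss gradient, then project back into the eps-ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0.0, 1.0)
    return x_adv.detach()

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Integrated Gradients: Riemann-sum approximation of the path integral of
    gradients from a baseline (default: all-zeros image) to the input."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        x_interp = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        score = model(x_interp).gather(1, target.unsqueeze(1)).sum()
        total_grad += torch.autograd.grad(score, x_interp)[0]
    return (x - baseline) * total_grad / steps

# Typical use: purify the attacked input with the diffusion model before explaining it;
# `ddpm_purify` is a hypothetical stand-in for the DDPM noise-and-denoise step.
# x_adv = pgd_attack(model, x, y)
# attributions = integrated_gradients(model, ddpm_purify(x_adv), y)
```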


Results show that ALDE significantly outperforms existing XAI methods such as SHAP and LIME. On ImageNet, ResNet-50’s adversarial accuracy increased from 41.2% (SHAP) to 55.3% with ALDE, SSIM improved from 0.56 to 0.82, and IoU rose from 0.47 to 0.63. WideResNet-28-10 showed comparable gains. These improvements confirm ALDE’s effectiveness in strengthening model defenses while producing more stable and semantically accurate explanations.
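A sketch of how the two explanation metrics could be computed, assuming 2D attribution maps as NumPy arrays: SSIM (via scikit-image) between attributions for clean and attacked inputs, and IoU between a thresholded attribution map and a binary reference saliency mask. The top-percentile thresholding rule is an assumption, not the paper's protocol.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def explanation_ssim(attr_clean: np.ndarray, attr_adv: np.ndarray) -> float:
    """Stability: SSIM between attribution maps for the clean and attacked input."""
    rng = float(max(attr_clean.max(), attr_adv.max()) - min(attr_clean.min(), attr_adv.min()))
    return float(ssim(attr_clean, attr_adv, data_range=max(rng, 1e-8)))

def explanation_iou(attr: np.ndarray, saliency_mask: np.ndarray, top_pct: float = 90) -> float:
    """Interpretability: IoU of the top-attribution region with a binary saliency mask."""
    attr_mask = attr >= np.percentile(attr, top_pct)   # keep the most-attributed pixels
    sal_mask = saliency_mask.astype(bool)
    inter = np.logical_and(attr_mask, sal_mask).sum()
    union = np.logical_or(attr_mask, sal_mask).sum()
    return float(inter / union) if union > 0 else 0.0
```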


In summary, ALDE demonstrates a strong ability to defend against gradient-based adversarial attacks and deliver reliable, interpretable attributions. This research contributes toward building trustworthy AI systems by addressing the key challenge of explanation degradation under adversarial influence.
