22nd AIAI 2026, 16 - 19 July 2026, Chania, Crete, Greece

Adversarial Probing of Fragility in Integrated Gradients Attributions

BANO ARJU, Pamula Rajendra

Abstract:

  Integrated Gradients (IG) is among the most widely adopted gradient-based attribution methods for explaining deep neural network predictions, yet the stability of its explanations under imperceptible input perturbations remains poorly understood. We introduce Adversarial Fragility Probing (AFP), a diagnostic framework that quantifies per-image IG explanation sensitivity via a scalar fragility score F(x) under a classification-preserving ℓ∞ constraint, requiring no ground-truth annotations or model modification. Experiments across 1000 ImageNet images on ResNet50 and VGG16-BN, and 500 CIFAR-10 images on ResNet56, establish three findings: explanation fragility is an image-intrinsic property (r_s = 0.768 cross-architecture); model confidence is a dataset-dependent predictor of fragility, positively correlated with it on ImageNet and negatively on CIFAR-10; and on CIFAR-10, fragility exhibits strong class-level structure driven by inter-class visual similarity. Method-agnostic validation on GradientShap (r_s = 0.887 cross-method) and transferability analysis confirm AFP's diagnostic reliability. Fragility scores can serve as per-image reliability flags for safety-critical XAI deployments.
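
  The abstract does not give the exact definition of F(x) or the search procedure, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a toy linear-softmax classifier, a zero baseline, a midpoint Riemann approximation of IG, random search inside the ℓ∞ ball as a stand-in for an adversarial optimiser, and a hypothetical score defined as the largest relative L1 change in the attribution among perturbations that keep the predicted class. All function names and parameters (`predict`, `integrated_gradients`, `fragility_score`, `eps`, `trials`) are made up for this example.

```python
import math
import random


def predict(W, x):
    """Class probabilities of a toy linear-softmax model (an assumption)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_prob(W, x, c):
    """Analytic gradient of p_c with respect to x for the toy model."""
    p = predict(W, x)
    return [
        p[c] * (W[c][j] - sum(p[k] * W[k][j] for k in range(len(W))))
        for j in range(len(x))
    ]


def integrated_gradients(W, x, c, baseline=None, steps=50):
    """Midpoint Riemann approximation of IG along the straight path
    from the baseline (zeros by default) to x."""
    if baseline is None:
        baseline = [0.0] * len(x)
    acc = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        acc = [a + g for a, g in zip(acc, grad_prob(W, point, c))]
    return [(xi - b) * a / steps for xi, b, a in zip(x, baseline, acc)]


def fragility_score(W, x, eps=0.05, trials=200, seed=0):
    """Hypothetical fragility score: worst relative L1 change of the IG
    attribution over random l-infinity perturbations of radius eps that
    leave the predicted class unchanged."""
    rng = random.Random(seed)
    p = predict(W, x)
    c = max(range(len(p)), key=p.__getitem__)
    ref = integrated_gradients(W, x, c)
    ref_norm = sum(abs(a) for a in ref) + 1e-12
    worst = 0.0
    for _ in range(trials):
        x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
        p_adv = predict(W, x_adv)
        if max(range(len(p_adv)), key=p_adv.__getitem__) != c:
            continue  # enforce the classification-preserving constraint
        attr = integrated_gradients(W, x_adv, c)
        dist = sum(abs(a - r) for a, r in zip(attr, ref)) / ref_norm
        worst = max(worst, dist)
    return worst


if __name__ == "__main__":
    W = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1]]  # made-up weights
    x = [0.9, 0.1, 0.4]                       # made-up input
    print("fragility:", fragility_score(W, x, eps=0.05))
```

  A gradient-based attack on the attribution distance would normally find larger changes than random search, so a score computed this way should be read as a lower bound under these assumptions.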
