22nd AIAI 2026, 16 - 19 July 2026, Chania, Crete, Greece

Residual Adapter Learning With Attention Pooling for Medical Visual Question Answering

Nuseir Aya, Ebner Marc

Abstract:

  Medical Visual Question Answering (Med-VQA) aims to answer textual questions about medical images accurately, thereby facilitating the interpretation of clinical information. Despite recent advances, important challenges remain, especially in integrating heterogeneous multimodal semantic representations learned from limited datasets for answer prediction. In addition, biases in Med-VQA datasets often lead to suboptimal model performance. We propose an approach that enhances multimodal reasoning through several modules, including a residual adaptation module that refines image representations and improves compatibility between pretrained biomedical encoders. We also use a pooling-based attention mechanism to generate more informative representations. The proposed method is evaluated on three public Med-VQA benchmarks: SLAKE, PathVQA, and VQA-Med 2019. Results, reported as mean ± standard deviation over five runs, show consistent improvements in accuracy over baseline approaches. Ablation studies confirm that the residual adaptation module and the pooling-based attention mechanism improve cross-modal feature alignment and answer prediction performance. We also study the effect of Cross-Entropy versus Label Smoothing losses on the model’s performance.
