21st AIAI 2025, 26 - 29 June 2025, Limassol, Cyprus

VisionCore: Adapting Efficient Visual Answering Through Enhanced Spatial Reasoning

Ufuk OZKUL, Levent KARACAN, Cemil ZALLUHOGLU

Abstract:

  Integrating spatially aware instruction tuning with Low-Rank Adaptation (LoRA) in pre-trained small-scale models unlocks enhanced performance in vision-language tasks. Although such models excel on general-domain data, they frequently lack the precise spatial reasoning and computational efficiency observed in large-scale architectures. Despite generating detailed textual descriptions, they often struggle with tasks that require fundamental spatial reasoning abilities—such as distinguishing left from right or accurately discerning the relative positions of objects within a given context. This work investigates how integrating image-space coordinate annotations from the GQA dataset enhances spatial awareness in the DeepSeek 1.3B Vision-Language Model (VLM). By freezing the pre-trained weights and introducing specialized, trainable low-rank matrices through LoRA, only an additional 15 million parameters (1.15% of the overall model) and 60 MB of memory are required, yielding a 16% performance improvement. Experimental evaluations show that this combined approach achieves an accuracy of 60.3%, surpassing comparable small-scale models and approaching the performance of large-scale architectures (63.5% accuracy) despite using only 1.3B parameters—five times fewer than its larger counterparts. This efficiency-performance trade-off demonstrates that lightweight adaptation can narrow the capability gap between small- and large-scale architectures in vision-language tasks.
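
  As a rough illustration of the adaptation scheme described in the abstract (frozen backbone plus trainable low-rank matrices), the sketch below attaches LoRA adapters to a 1.3B-parameter causal language-model backbone using the Hugging Face PEFT library. The checkpoint path, rank, scaling factor, and target projection names are illustrative assumptions, not the authors' reported configuration.

  # Minimal sketch, assuming a Hugging Face-compatible 1.3B backbone checkpoint;
  # the path and all hyperparameters below are placeholders for illustration.
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  base = AutoModelForCausalLM.from_pretrained("path/to/1.3b-vlm-backbone")  # hypothetical checkpoint

  # Freeze every pre-trained weight; only the LoRA matrices will receive gradients.
  for p in base.parameters():
      p.requires_grad = False

  lora_cfg = LoraConfig(
      r=16,                                  # low-rank dimension (assumed)
      lora_alpha=32,                         # LoRA scaling factor (assumed)
      lora_dropout=0.05,                     # dropout on the LoRA branch (assumed)
      target_modules=["q_proj", "v_proj"],   # attention projections, an assumed choice
      task_type="CAUSAL_LM",
  )

  model = get_peft_model(base, lora_cfg)
  model.print_trainable_parameters()  # trainable parameters should be a small fraction (~1%) of the total

  Training such a wrapped model updates only the injected low-rank matrices, which is what keeps the additional parameter count and memory footprint small relative to the frozen backbone.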
