Integrating spatially aware instruction tuning with Low-Rank Adaptation (LoRA) in pre-trained small-scale models unlocks enhanced performance in vision-language tasks. Although such models excel on general-domain data, they frequently lack the precise spatial reasoning and computational efficiency observed in large-scale architectures. Despite generating detailed textual descriptions, these models often struggle with tasks that require fundamental spatial reasoning abilities, such as distinguishing left from right or accurately discerning the relative positions of objects within a given context. This work investigates how integrating image-space coordinate annotations from the GQA dataset enhances spatial awareness in the DeepSeek 1.3B Visual Language Model (VLM). By freezing the pre-trained weights and introducing specialized, trainable low-rank matrices through LoRA, only an additional 15 million parameters (1.15% of the overall model) and 60 MB of memory are required, yielding a 16% performance improvement. Experimental evaluations reveal that this combined approach achieves an accuracy of 60.3%, surpassing comparable small-scale models and approaching the performance of large-scale architectures (63.5% accuracy) despite using only 1.3B parameters, five times fewer than its larger counterparts. This efficiency-performance trade-off shows that lightweight adaptation can narrow the capability gap between small- and large-scale architectures in vision-language tasks.
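To make the parameter-efficiency argument concrete, the sketch below shows the general LoRA mechanism the abstract refers to: the pre-trained weight matrix is frozen and a trainable low-rank update is added alongside it. This is a minimal illustration in plain PyTorch, not the paper's actual configuration; the rank, scaling factor, and the 2048-dimensional toy layer are assumptions chosen for demonstration only.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A and B are the only trainable parameters."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        # Low-rank factors: A is small-random, B starts at zero so the
        # wrapped layer initially behaves exactly like the original one.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Toy usage: wrap one projection layer and report the trainable fraction,
# mirroring the kind of parameter accounting reported in the abstract.
layer = LoRALinear(nn.Linear(2048, 2048), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} ({100 * trainable / total:.2f}% of this layer)")
```

Applied across a model's attention and projection layers, this style of adapter is what keeps the added footprint to a small fraction of the total parameters, as in the roughly 1.15% figure reported above.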