21st AIAI 2025, 26 - 29 June 2025, Limassol, Cyprus

Large Vision Model (LVM): An Exploration of Vision Models with Visual Sentences of Images and Videos

Zafar Muhammad Hamza, Sharma Vijeta

Abstract:

  The rapid advancement of computer vision has been significantly influenced by Large Vision Models (LVMs), which leverage transformer-based architectures and self-supervised learning techniques to handle multiple perception tasks within a unified framework. This study explores the capabilities of LVMs for video processing, focusing on their ability to perform object detection, segmentation, captioning, and optical character recognition (OCR) without fine-tuning. Unlike previous research that primarily benchmarks performance, this work evaluates the adaptability of LVMs for these tasks. To the best of our knowledge, this is the first formal study applying Florence-2 to video data, assessing its effectiveness across these four distinct tasks. A dedicated video processing pipeline was developed to extract frames, process them using LVMs, and visualize outputs with masks, bounding boxes, and captions. Using the diverse UCF101 action recognition dataset, this study examines how well LVMs generalize to video-based tasks and evaluates inference times to determine feasibility for real-time applications, particularly in robotics. Key contributions include a multi-task video processing framework, an analysis of LVM generalization for video data, and an assessment of task-specific computational efficiency. Preliminary results indicate that while object detection and OCR perform well, segmentation presents computational challenges. By enabling a single model to handle multiple perception tasks, LVMs offer a promising alternative to task-specific models, improving adaptability and efficiency in robotic systems. This research provides valuable insights into the future of general-purpose LVMs for multi-task video analysis.  
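  To make the described pipeline concrete, the sketch below shows one plausible way to extract frames from a UCF101 clip and run a Florence-2 task prompt on each sampled frame, following the publicly documented Hugging Face usage of the model. The checkpoint name, the "<OD>" task prompt, the frame-sampling stride, and the example file name are illustrative assumptions, not details taken from the paper; the authors' actual implementation may differ.

  import cv2
  import torch
  from PIL import Image
  from transformers import AutoModelForCausalLM, AutoProcessor

  # Illustrative choices, not from the paper: checkpoint, task prompt, sampling stride.
  CHECKPOINT = "microsoft/Florence-2-large"
  TASK_PROMPT = "<OD>"   # object detection; "<CAPTION>" and "<OCR>" follow the same pattern
  FRAME_STRIDE = 30      # roughly one processed frame per second for 30 fps video

  device = "cuda" if torch.cuda.is_available() else "cpu"
  processor = AutoProcessor.from_pretrained(CHECKPOINT, trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, trust_remote_code=True).to(device)

  def run_task(frame_bgr):
      """Run one Florence-2 task prompt on a single video frame and return the parsed output."""
      image = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
      inputs = processor(text=TASK_PROMPT, images=image, return_tensors="pt").to(device)
      generated_ids = model.generate(
          input_ids=inputs["input_ids"],
          pixel_values=inputs["pixel_values"],
          max_new_tokens=1024,
          num_beams=3,
      )
      text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
      # post_process_generation turns the raw token sequence into boxes, labels, or text
      return processor.post_process_generation(
          text, task=TASK_PROMPT, image_size=(image.width, image.height)
      )

  cap = cv2.VideoCapture("v_Basketball_g01_c01.avi")  # hypothetical UCF101 clip name
  frame_idx, results = 0, []
  while True:
      ok, frame = cap.read()
      if not ok:
          break
      if frame_idx % FRAME_STRIDE == 0:
          results.append(run_task(frame))
      frame_idx += 1
  cap.release()

  The parsed detections (or captions/OCR strings, depending on the task prompt) can then be drawn back onto the frames as bounding boxes, masks, and text overlays, which is the visualization step the abstract refers to.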
