21st AIAI 2025, 26-29 June 2025, Limassol, Cyprus

Urban Land Use Classification of High-Resolution Aerial Imagery Using RemoteCLIP: A Case Study of Sofia

Spasov Angel, Hristov Emil, Shirinyan Evgeny, Petrova-Antonova Dessislava

Abstract:

  Advancements in vision-language foundation models have enabled new applications in remote sensing, particularly for land use and land cover (LULC) classification. This study examines the applicability of the Vision Language Foundation Model for Remote Sensing (RemoteCLIP), a fine-tuned multi-modal model, for retrieving semantic information from high-resolution aerial imagery of Sofia, Bulgaria, a dataset not included in the model’s training data. The study evaluates RemoteCLIP’s performance in zero-shot classification, semantic clustering, and text-based image retrieval, highlighting both its strengths and its limitations. Findings suggest that RemoteCLIP effectively clusters semantically similar images, surpassing the general Contrastive Language-Image Pre-Training (CLIP) model in distinguishing urban features. However, classification accuracy is influenced by spatial resolution and by the choice of descriptive labels. A comparative analysis with CLIP demonstrates that RemoteCLIP groups images based on content rather than image resolution. Classification performance is evaluated at multiple resolutions (15 cm–3 m per pixel), revealing that higher resolutions capture finer details, while lower resolutions provide better generalization for land use mapping. A QGIS-based visualization pipeline integrates classification results into a geospatial format for further analysis. The study concludes that RemoteCLIP can serve as a valuable tool for automated land use analysis, urban planning, and point-of-interest detection, particularly in regions lacking detailed vector datasets. Challenges such as computational requirements and classification biases are discussed, along with recommendations for improving prompt selection and spatial reasoning in future models.
