OSM-based Domain Adaptation for Remote Sensing VLMs
Abstract
A self-contained domain adaptation framework for vision-language models in remote sensing uses OpenStreetMap data and optical character recognition to generate captions without requiring external teachers or manual labeling.
Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the teacher's ceiling. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage the model's optical character recognition and chart comprehension capabilities to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires neither manual labeling nor a stronger external model. We conduct extensive evaluations spanning 10 image-text-to-text benchmarks, comparing against 9 competitive baselines. When the generated corpus is mixed equally with real data, our method achieves state-of-the-art results while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. The dataset and model weights will be made publicly available.
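To make the self-annotation step concrete, the sketch below pairs an aerial photo with the rendered OSM tile covering the same point and asks the base VLM to read the map's labels into a caption. Only the slippy-map tile math and the public tile.openstreetmap.org endpoint are standard; the prompt text, `fetch_osm_tile`, `caption_pair`, and the `vlm.generate(images, prompt)` interface are illustrative assumptions, not the paper's released pipeline.

```python
# Minimal sketch of the OSMDA-style self-annotation step (assumptions noted below).
import math
import requests

TILE_URL = "https://tile.openstreetmap.org/{z}/{x}/{y}.png"  # standard OSM raster tiles

def latlon_to_tile(lat: float, lon: float, zoom: int) -> tuple[int, int]:
    """Standard slippy-map conversion from WGS84 coordinates to tile indices."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def fetch_osm_tile(lat: float, lon: float, zoom: int = 16) -> bytes:
    """Download the rendered OSM tile covering the given point (PNG bytes)."""
    x, y = latlon_to_tile(lat, lon, zoom)
    resp = requests.get(
        TILE_URL.format(z=zoom, x=x, y=y),
        headers={"User-Agent": "osmda-sketch/0.1"},  # OSM tile policy requires a UA
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content

# Hypothetical prompt: the paper's actual instruction text is not published.
CAPTION_PROMPT = (
    "Image 1 is an aerial photo; image 2 is a rendered map of the same area. "
    "Read the street names, place labels, and land-use colors from the map, "
    "then write a detailed caption describing what is visible in the photo."
)

def caption_pair(vlm, aerial_png: bytes, lat: float, lon: float) -> str:
    """Ask the base VLM to caption the aerial image using its OCR/chart reading
    over the paired OSM tile. `vlm` is assumed to expose a generate(images, prompt)
    interface; adapt this call to whatever image-text-to-text model you use."""
    osm_png = fetch_osm_tile(lat, lon)
    return vlm.generate(images=[aerial_png, osm_png], prompt=CAPTION_PROMPT)
```

A production run would render tiles locally from planet extracts and batch the VLM calls rather than hitting the public tile server, which has strict usage limits.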
Community
OSMDA-VLM offers a clever way to scale remote sensing models without the massive headache of manual labeling or the high cost of GPT-4V distillation. By pairing imagery with rendered OpenStreetMap tiles, the model basically acts as its own annotation engine, leveraging its built-in OCR and chart-reading abilities to label aerial imagery automatically. This setup hits SOTA across ten benchmarks and shows you can get high-end performance from open-source geographic metadata instead of expensive human or teacher-model labels.
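The abstract's "mixed equally with real data" recipe is easy to reproduce in spirit. Below is a minimal sketch, assuming `real_pairs` and `osm_pairs` are lists of (image_path, caption) records; the paper's actual sampling strategy is not published, so this only illustrates a 50/50 mixture.

```python
# Minimal sketch of the 50/50 fine-tuning mixture (assumed record format).
import random

def equal_mix(real_pairs: list, osm_pairs: list, seed: int = 0) -> list:
    """Build a training set with an equal share of real and OSM-derived
    captions by truncating the larger pool to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(real_pairs), len(osm_pairs))
    mixed = rng.sample(real_pairs, n) + rng.sample(osm_pairs, n)
    rng.shuffle(mixed)
    return mixed
```

Truncating the larger pool keeps the two sources balanced regardless of how many OSM-derived captions are generated.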
The following similar papers were recommended by the Semantic Scholar API (via the automated Librarian Bot):
- GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data (2026)
- Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery (2026)
- Bi-modal Textual Prompt Learning for Vision-language Models in Remote Sensing (2026)
- AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network (2026)
- GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning (2026)
- Enabling Training-Free Text-Based Remote Sensing Segmentation (2026)
- TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation (2026)