arXiv Open Access 2025

MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, +5 others

Abstract

Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
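The abstract states that supervision avoids lexical class names by using anonymized identifiers. A minimal sketch of that idea follows, under stated assumptions: the identifier format (`C01`, `C02`, ...), the seeded shuffle, the function names `anonymize_taxonomy` and `relabel`, and the example class names are all hypothetical and are not taken from the paper.

```python
import random

def anonymize_taxonomy(class_names, seed=0):
    """Map each lexical class name to an opaque identifier such as 'C03'.

    Assumption: the ID format and the seeded shuffle are illustrative choices,
    not the scheme used by the MVT authors.
    """
    names = sorted(set(class_names))      # deterministic starting order
    rng = random.Random(seed)
    rng.shuffle(names)                    # remove any alphabetical cue from the IDs
    width = max(2, len(str(len(names))))  # zero-pad so all IDs have equal length
    return {name: f"C{i:0{width}d}" for i, name in enumerate(names, start=1)}

def relabel(region_labels, mapping):
    """Replace the class name attached to each region with its anonymized ID."""
    return [mapping[label] for label in region_labels]

# Hypothetical class names, for illustration only.
taxonomy = ["water", "forest", "building", "road"]
mapping = anonymize_taxonomy(taxonomy)
print(relabel(["road", "water", "road"], mapping))
```

With a mapping like this, training targets carry only opaque IDs, and the mapping is applied in reverse at interpretation time to recover taxonomy names.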

Topics & Keywords

cs.CV

Authors (10)

Siyi Chen

Kai Wang

Weicong Pang

Ruiming Yang

Ziru Chen

Renjun Gao

Alexis Kai Hon Lau

Dasa Gu

Chenchen Zhang

Cheng Li

Citation

Chen, S., Wang, K., Pang, W., Yang, R., Chen, Z., Gao, R. et al. (2025). MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging. https://arxiv.org/abs/2509.18693

Journal Information

Year of Publication: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓