arXiv Open Access 2025

MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, +5 others

Abstract

Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
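
The abstract mentions that supervision avoids lexical class names by using anonymized identifiers. The sketch below shows one way such a mapping could be built. It is not the authors' implementation: the function name `anonymize_labels`, the `C01`-style identifier format, the seed, and the example class list are all assumptions made for illustration.

```python
import random


def anonymize_labels(class_names, seed=0):
    """Map each lexical class name to an opaque identifier such as 'C03'.

    Hypothetical helper: the ID format and the shuffling are assumptions,
    chosen so that an identifier carries no hint of the class name or of
    its alphabetical position.
    """
    names = sorted(set(class_names))       # deterministic starting order
    rng = random.Random(seed)              # local RNG, reproducible
    rng.shuffle(names)                     # decouple ID order from name order
    width = max(2, len(str(len(names))))   # zero-pad to at least two digits
    return {name: f"C{idx:0{width}d}" for idx, name in enumerate(names, start=1)}


# Illustrative land-cover class list (an assumption, not taken from the paper).
taxonomy = [
    "bareland", "rangeland", "developed", "road",
    "tree", "water", "agriculture", "building",
]

name_to_id = anonymize_labels(taxonomy, seed=42)
id_to_name = {ident: name for name, ident in name_to_id.items()}

# Tags attached to candidate region masks are stored as anonymized IDs,
# and mapped back to names only when a human needs to read them.
region_tags = ["water", "tree", "road"]
anonymized = [name_to_id[tag] for tag in region_tags]
restored = [id_to_name[ident] for ident in anonymized]
```

Keeping the inverse mapping `id_to_name` outside the training data is the design point here: the model only ever sees the opaque identifiers, while evaluation and reporting can still recover the taxonomy names.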

Authors (10)

Siyi Chen
Kai Wang
Weicong Pang
Ruiming Yang
Ziru Chen
Renjun Gao
Alexis Kai Hon Lau
Dasa Gu
Chenchen Zhang
Cheng Li

Citation Format

Chen, S., Wang, K., Pang, W., Yang, R., Chen, Z., Gao, R. et al. (2025). MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging. https://arxiv.org/abs/2509.18693

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access