Visual state space models with spiral selective scan for referring remote-sensing image segmentation
Abstract
Referring remote-sensing image segmentation (RRSIS) aims to accurately localize and delineate ground targets in remote-sensing imagery conditioned on natural language expressions. The task fundamentally relies on effective fusion of the visual and language modalities, typically implemented through multimodal encoders and task-specific decoders. Existing frameworks struggle to precisely align remote-sensing visual features with natural language referring expressions: the targets of referring segmentation are often tiny objects in local areas of interest, the spatial scale of remote-sensing images is complex and variable, and the referring expressions are complex natural language, all of which make effective vision-language alignment difficult. We propose the spiral selective scan visual state space model (RSRefMa), which strengthens the alignment of visual and language features by enhancing global contextual multimodal understanding of remote-sensing images and natural language referring expressions. Specifically, we propose a state space model based on spiral scanning that effectively captures globally representative visual features, thereby improving the model's capability for global visual context modeling. To enrich and diversify the referring expressions, we leverage the advanced contextual understanding and language reasoning capabilities of large language models; this augmentation yields semantically rich and diverse expressions, strengthening the model's comprehension of both modalities and ensuring more precise cross-modal alignment. Furthermore, to address the difficulty of fine-grained segmentation of tiny, weak geo-objects in remote-sensing images, we integrate dual multi-scale visual prompts that enable a comprehensive representation of foreground and background features for accurate boundary segmentation.
This design enhances the alignment between visual and language features, thereby enabling more precise target localization. Extensive experiments on two referring remote-sensing image segmentation datasets demonstrate that the proposed RSRefMa achieves superior understanding and segmentation performance compared with previous state-of-the-art methods.
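The abstract describes serializing image features with a spiral scan so the state space model sees a globally representative ordering of patches. As a minimal illustrative sketch (the paper's actual scan pattern and implementation are not specified here, and the function name and outer-to-inner direction are assumptions), a spiral ordering of an H x W patch grid can be generated as follows:

```python
def spiral_order(h, w):
    """Hypothetical sketch: return (row, col) patch coordinates of an
    h x w grid in outer-to-inner spiral order, one possible serialization
    for a spiral selective scan. Details are illustrative only."""
    top, bottom, left, right = 0, h - 1, 0, w - 1
    order = []
    while top <= bottom and left <= right:
        for c in range(left, right + 1):              # top edge, left -> right
            order.append((top, c))
        for r in range(top + 1, bottom + 1):          # right edge, top -> bottom
            order.append((r, right))
        if top < bottom and left < right:
            for c in range(right - 1, left - 1, -1):  # bottom edge, right -> left
                order.append((bottom, c))
            for r in range(bottom - 1, top, -1):      # left edge, bottom -> top
                order.append((r, left))
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order

# For a 3x3 grid the boundary is traversed first and the center
# patch (1, 1) is visited last.
print(spiral_order(3, 3))
```

Under this ordering, flattened patch sequences fed to the selective scan progress from the image periphery toward the center, which is one plausible way such a scan could aggregate global context before reaching central regions.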
Topics & Keywords
Authors (5)
Weihao Shen
Ailong Ma
Zhuo Zheng
Junjue Wang
Yanfei Zhong
Quick Access
- PDF not directly available; check the original source
- Year of Publication: 2026
- Source Database: DOAJ
- DOI: 10.1080/10095020.2026.2642509
- Access: Open Access ✓