DOAJ Open Access 2026

Visual state space models with spiral selective scan for referring remote-sensing image segmentation

Weihao Shen Ailong Ma Zhuo Zheng Junjue Wang Yanfei Zhong

Abstract

Referring remote-sensing image segmentation (RRSIS) aims to accurately localize and delineate ground targets within remote-sensing imagery conditioned on natural language expressions. This task fundamentally relies on effective fusion of the visual and language modalities, typically implemented through multimodal encoders and task-specific decoders. Existing frameworks struggle to precisely align remote-sensing visual features with natural language referring expressions: the targets of referring segmentation are often tiny objects within local regions of interest, the spatial scales of remote-sensing images are complex and variable, and referring expressions are complex natural language, all of which hinder effective vision-language alignment. We propose RSRefMa, a visual state space model with spiral selective scan, which strengthens the alignment of visual and language features by enhancing global contextual multimodal understanding of remote-sensing images and natural language referring expressions. Specifically, we propose a state space model based on spiral scanning that effectively captures globally representative visual features, enhancing the model's capability for global visual context modeling. To enrich and diversify referring expressions, we leverage the advanced contextual understanding and language-reasoning capabilities of large language models; this augmentation yields semantically rich and diverse expressions, strengthening the model's comprehension of both modalities and ensuring more precise cross-modal alignment. Furthermore, to address the difficulty of fine-grained segmentation of tiny, weak geo-objects in remote-sensing imagery, we integrate dual multi-scale visual prompts that comprehensively represent foreground and background features for accurate boundary segmentation.
This design enhances the alignment between visual and language features, thereby enabling more precise target localization. Extensive experiments on two referring remote-sensing image segmentation datasets demonstrate that the proposed RSRefMa achieves superior understanding and segmentation performance compared with previous state-of-the-art methods.
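The record does not specify the exact spiral scanning pattern used in RSRefMa. As a hypothetical illustration of the general idea of a spiral selective scan, the sketch below flattens an H×W feature map into a 1-D token sequence along a clockwise, outside-in spiral, so a sequential state space model can traverse the full spatial extent of the image rather than a single raster direction. All names here are illustrative, not the authors' implementation.

```python
def spiral_order(h, w):
    """Return the (row, col) positions of an h x w grid visited in a
    clockwise, outside-in spiral (top row -> right column -> ...)."""
    top, bottom, left, right = 0, h - 1, 0, w - 1
    order = []
    while top <= bottom and left <= right:
        for c in range(left, right + 1):              # top row, left -> right
            order.append((top, c))
        for r in range(top + 1, bottom + 1):          # right column, downward
            order.append((r, right))
        if top < bottom:
            for c in range(right - 1, left - 1, -1):  # bottom row, right -> left
                order.append((bottom, c))
        if left < right:
            for r in range(bottom - 1, top, -1):      # left column, upward
                order.append((r, left))
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def spiral_flatten(feature_map):
    """Flatten a 2-D feature map (list of rows) into a 1-D sequence
    along the spiral path, ready to feed to a sequential scan."""
    h, w = len(feature_map), len(feature_map[0])
    return [feature_map[r][c] for r, c in spiral_order(h, w)]
```

For a 3×3 map `[[1,2,3],[4,5,6],[7,8,9]]`, `spiral_flatten` yields `[1, 2, 3, 6, 9, 8, 7, 4, 5]`; the scan ends at the image center, where referred targets often lie.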

Authors (5)

Weihao Shen

Ailong Ma

Zhuo Zheng

Junjue Wang

Yanfei Zhong

Citation Format

Shen, W., Ma, A., Zheng, Z., Wang, J., & Zhong, Y. (2026). Visual state space models with spiral selective scan for referring remote-sensing image segmentation. https://doi.org/10.1080/10095020.2026.2642509

Quick Access

PDF not directly available

Check the original source →
View at source: doi.org/10.1080/10095020.2026.2642509
Journal Information
Publication Year
2026
Source Database
DOAJ
DOI
10.1080/10095020.2026.2642509
Access
Open Access ✓