CVCPSG: Discovering Composite Visual Clues for Panoptic Scene Graph Generation
Abstrak
Abstract Panoptic Scene Graph Generation (PSG) aims to segment objects and predict the relation triplets <subject, relation, object> within an image. Despite the impressive achievements in PSG, current methods still struggle to capture fine-grained visual context, eschewing spatial and situational information in favor of visual features related to object identity. This limitation naturally impedes the model’s ability to distinguish subtle visual differences between relation triplets, such as “cat-on-person” and “cat-lying on-person”. To address this challenge, we propose CVCPSG, a novel DETR-based method that uncovers composite visual clues for PSG. Specifically, drawing inspiration from how humans capture visual context using diverse visual clues, we first construct a composite visual clues bank based on three key aspects: object, spatial, and situational. Then, we introduce a multi-level visual extractor to align visual features from objects, interactions, and image levels with the composite visual clues bank. Additionally, we incorporate a cross-modal learning module with a multitower architecture to seamlessly integrate visual clues into the relation decoder, thereby improving PSG detection. Extensive experiments on two PSG benchmarks confirm the effectiveness and interpretability of CVCPSG.
Topik & Kata Kunci
Penulis (4)
Nanhao Liang
Xiaoyuan Yang
Yingwei Xia
Yong Liu
Akses Cepat
- Tahun Terbit
- 2025
- Sumber Database
- DOAJ
- DOI
- 10.1007/s44443-025-00063-w
- Akses
- Open Access ✓