AG-CLIP: Attribute-Guided CLIP for Zero-Shot Fine-Grained Recognition
Abstract
Zero-shot fine-grained recognition is challenging because classes are visually very similar and embedding models encode fine-grained features poorly. In this work, we present an attribute-guided Contrastive Language-Image Pre-training (AG-CLIP) model with an additional attribute encoder. Our approach first identifies relevant visual attributes from the textual class descriptions using an attribute mining module built on a large language model (LLM), GPT-4o. The attributes are then used to construct prompts for an open-vocabulary object/region detector, which extracts the corresponding image regions. The attribute text, together with these focused regions of the input, then guides the CLIP model toward the discriminative attributes during fine-tuning through a context-attribute fusion module. Our attribute-guided attention mechanism allows CLIP to disambiguate fine-grained classes by highlighting their distinctive attributes, without requiring fine-tuning or additional training data on unseen classes. We evaluate our approach on the CUB-200-2011 and plant disease datasets, achieving 73.3% and 84.6% accuracy, respectively. Our method achieves state-of-the-art zero-shot performance, outperforming prior methods that rely on external knowledge bases or complex meta-learning strategies. The strong results demonstrate the effectiveness of injecting generic attribute awareness into powerful vision-language models like CLIP for tackling fine-grained recognition in a zero-shot manner.
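The abstract does not specify how the context-attribute fusion module works, so the following is only a toy sketch of the general idea: attribute-level embeddings are blended into both the image side and the class-text side before zero-shot matching by cosine similarity. Everything here is an assumption made for illustration: the function names (`fuse`, `classify`), the weighted-average fusion, the `alpha` weight, and the hand-written 3-dimensional vectors that stand in for real CLIP embeddings. It is not the authors' implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse(context, attributes, alpha=0.5):
    """Hypothetical fusion: blend a context embedding with the mean
    attribute embedding. A simple stand-in, not the paper's module."""
    ctx = normalize(context)
    if not attributes:
        return ctx
    attr = normalize(mean([normalize(a) for a in attributes]))
    return normalize([(1 - alpha) * c + alpha * a for c, a in zip(ctx, attr)])


def cosine(u, v):
    """Cosine similarity; assumes both inputs are already unit length."""
    return sum(a * b for a, b in zip(u, v))


def classify(image_emb, region_embs, class_text, class_attr_text, alpha=0.5):
    """Score each class by similarity between the fused image query and
    the fused class representation; return (best class, all scores)."""
    query = fuse(image_emb, region_embs, alpha)
    scores = {
        name: cosine(query, fuse(class_text[name], class_attr_text[name], alpha))
        for name in class_text
    }
    return max(scores, key=scores.get), scores
```

In this toy setting, two classes whose global text embeddings are identical become separable only once the attribute embeddings are blended in, which is the intuition the abstract describes for visually similar fine-grained classes.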
Authors (6)
Jamil Ahmad
Mustaqeem Khan
Wail Guiaeab
Abdulmotaleb Elsaddik
Giulia De Masi
Fakhri Karray
Quick Access
- Publication Year: 2026
- Source Database: DOAJ
- DOI: 10.1109/OJCS.2026.3654171
- Access: Open Access