arXiv Open Access 2024

Annotation Free Semantic Segmentation with Vision Foundation Models

Soroush Seifi, Daniel Olmeda Reino, Fabien Despinoy, Rahaf Aljundi

Abstract

Semantic segmentation is one of the most challenging vision tasks, usually requiring large amounts of training data with expensive pixel-level annotations. With the success of foundation models, and especially vision-language models, recent works attempt to achieve zero-shot semantic segmentation, but they require either large-scale training or additional image- or pixel-level annotations. In this work, we generate free annotations for any semantic segmentation dataset using existing foundation models. We use CLIP to detect objects and SAM to generate high-quality object masks. Next, we build a lightweight module on top of a self-supervised vision encoder, DinoV2, to align the patch features with a pretrained text encoder for zero-shot semantic segmentation. Our approach can bring language-based semantics to any pretrained vision encoder with minimal training, uses foundation models as the sole source of supervision, and generalizes from little training data with no annotation.
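The final step described in the abstract, aligning patch features with text embeddings and labeling each patch by its closest class, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not specify the alignment module, so a single linear projection `proj` stands in for it, and random arrays stand in for the DinoV2 patch features and the text-encoder class embeddings. The function names `l2_normalize` and `segment_patches` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def segment_patches(patch_feats, text_embeds, proj):
    """Assign each patch the class whose text embedding is most similar.

    patch_feats: (H, W, D_v) patch features from a vision encoder
    text_embeds: (C, D_t) one embedding per class name
    proj:        (D_v, D_t) assumed linear alignment module (hypothetical)
    Returns an (H, W) array of class indices.
    """
    aligned = l2_normalize(patch_feats @ proj)  # (H, W, D_t)
    text = l2_normalize(text_embeds)            # (C, D_t)
    sims = aligned @ text.T                     # cosine similarity, (H, W, C)
    return sims.argmax(axis=-1)


# Placeholder inputs: random arrays, not real encoder outputs.
rng = np.random.default_rng(0)
H, W, D_v, D_t, C = 4, 4, 16, 8, 3
patch_feats = rng.normal(size=(H, W, D_v))
text_embeds = rng.normal(size=(C, D_t))
proj = rng.normal(size=(D_v, D_t))

labels = segment_patches(patch_feats, text_embeds, proj)
```

In a trained system, `proj` would be learned so that patches covered by a pseudo-label mask land near the matching class embedding; with the random placeholder used here, the resulting label map carries no meaning beyond demonstrating the shapes involved.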

Topics & Keywords

cs.CV

Authors (4)

Soroush Seifi

Daniel Olmeda Reino

Fabien Despinoy

Rahaf Aljundi

Citation Format

Seifi, S., Reino, D. O., Despinoy, F., & Aljundi, R. (2024). Annotation Free Semantic Segmentation with Vision Foundation Models. https://arxiv.org/abs/2403.09307

Journal Information

Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access