Vision Transformers for Dense Prediction
Abstract
We introduce dense prediction transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense prediction transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense prediction transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
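The reassembly step described in the abstract lends itself to a short illustration. Below is a minimal PyTorch sketch of turning a sequence of transformer tokens back into an image-like feature map at a chosen scale, assuming a ViT-Base-style backbone (16×16 patches, embedding dimension 768, one leading readout token). The module name `Reassemble`, the 1×1 projection, and the strided (de)convolution resampling are illustrative choices, not the authors' released implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

class Reassemble(nn.Module):
    """Illustrative sketch (hypothetical names, not the authors' code):
    convert a sequence of ViT tokens into an image-like feature map
    at a chosen fraction of the input resolution."""
    def __init__(self, embed_dim, out_channels, scale):
        super().__init__()
        # 1x1 convolution to map the token embedding dim to the decoder dim
        self.project = nn.Conv2d(embed_dim, out_channels, kernel_size=1)
        # Resample the spatial grid: transposed conv to upsample,
        # strided conv to downsample, identity to keep the grid size.
        if scale > 1:
            self.resample = nn.ConvTranspose2d(
                out_channels, out_channels, kernel_size=scale, stride=scale)
        elif scale < 1:
            k = int(round(1 / scale))
            self.resample = nn.Conv2d(
                out_channels, out_channels, kernel_size=k, stride=k)
        else:
            self.resample = nn.Identity()

    def forward(self, tokens, grid_hw):
        # tokens: (batch, 1 + H*W, C), with a leading readout/class token
        b, n, c = tokens.shape
        h, w = grid_hw
        x = tokens[:, 1:, :]                        # drop readout token (simplest variant)
        x = x.transpose(1, 2).reshape(b, c, h, w)   # sequence -> spatial grid
        return self.resample(self.project(x))

# Example: a 384x384 input with 16x16 patches yields a 24x24 token grid.
tokens = torch.randn(1, 1 + 24 * 24, 768)
# Recover a feature map at 1/8 of the input resolution (48x48).
reassemble = Reassemble(embed_dim=768, out_channels=256, scale=2)
feat = reassemble(tokens, grid_hw=(24, 24))
print(feat.shape)  # torch.Size([1, 256, 48, 48])
```

In the full architecture, several such maps extracted at different transformer stages and scales are progressively fused by the convolutional decoder into a full-resolution prediction.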
Topics & Keywords
Authors (3)
René Ranftl
Alexey Bochkovskiy
V. Koltun
Quick Access
- Publication Year: 2021
- Language: en
- Total Citations: 2,596
- Database Source: Semantic Scholar
- DOI: 10.1109/ICCV48922.2021.01196
- Access: Open Access ✓