Semantic Scholar · Open Access · 2022 · 1830 citations

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima +3 more

Abstract

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention that each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves the new state-of-the-art 56.9% in terms of NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
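The abstract describes grid-shaped BEV queries that aggregate multi-camera features through spatial cross-attention. BEVFormer's actual mechanism is deformable (each query samples only its regions of interest in the views it projects into); as a rough, simplified illustration only, the NumPy sketch below uses dense cross-attention between a flattened BEV query grid and flattened camera feature tokens. All shapes, names, and the dense-attention simplification are illustrative assumptions, not BEVFormer's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bev_cross_attention(bev_queries, cam_feats):
    """Toy spatial cross-attention: each BEV query attends over
    all camera feature tokens (dense stand-in for BEVFormer's
    deformable sampling at projected reference points).

    bev_queries: (H*W, C) grid-shaped BEV queries, flattened
    cam_feats:   (num_cams, L, C) per-camera feature tokens
    returns:     (H*W, C) updated BEV representation
    """
    kv = cam_feats.reshape(-1, cam_feats.shape[-1])        # (num_cams*L, C)
    scores = bev_queries @ kv.T / np.sqrt(kv.shape[-1])    # (H*W, num_cams*L)
    attn = softmax(scores, axis=-1)                        # rows sum to 1
    return attn @ kv                                       # weighted feature mix

H, W, C = 4, 4, 8                       # tiny BEV grid for illustration
rng = np.random.default_rng(0)
bev = rng.normal(size=(H * W, C))       # predefined grid-shaped BEV queries
feats = rng.normal(size=(6, 10, C))     # 6 cameras, 10 tokens each
out = bev_cross_attention(bev, feats)
print(out.shape)  # (16, 8)
```

The paper's temporal self-attention would then recurrently fuse `out` with the previous frame's BEV grid in the same attention style; that step is omitted here for brevity.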


Authors (8)

Zhiqi Li

Wenhai Wang

Hongyang Li

Enze Xie

Chonghao Sima

Tong Lu

Yu Qiao

Jifeng Dai

Citation Format

Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T. et al. (2022). BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers. https://doi.org/10.48550/arXiv.2203.17270

Quick Access

View at Source: doi.org/10.48550/arXiv.2203.17270
Journal Information
Publication Year
2022
Language
en
Total Citations
1830×
Source Database
Semantic Scholar
DOI
10.48550/arXiv.2203.17270
Access
Open Access ✓