arXiv Open Access 2023

Towards Visual Foundational Models of Physical Scenes

Chethan Parameshwara, Alessandro Achille, Matthew Trager, Xiaolong Li, Jiawei Mo, +6 others

Abstract

We describe a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion. To do so, we first define "physical scene" and show that, even though different agents may maintain different representations of the same scene, the underlying physical scene that can be inferred is unique. Then, we show that NeRFs cannot represent the physical scene, as they lack extrapolation mechanisms. Those, however, could be provided by Diffusion Models, at least in theory. To test this hypothesis empirically, NeRFs can be combined with Diffusion Models, a process we refer to as NeRF Diffusion, used as unsupervised representations of the physical scene. Our analysis is limited to visual data, without external grounding mechanisms that can be provided by independent sensory modalities.

Topics & Keywords

cs.CV cs.AI cs.LG cs.RO

Authors (11)

Chethan Parameshwara

Alessandro Achille

Matthew Trager

Xiaolong Li

Jiawei Mo

Matthew Trager

Ashwin Swaminathan

CJ Taylor

Dheera Venkatraman

Xiaohan Fei

Stefano Soatto

Citation Format

Parameshwara, C., Achille, A., Trager, M., Li, X., Mo, J., Trager, M. et al. (2023). Towards Visual Foundational Models of Physical Scenes. https://arxiv.org/abs/2306.03727

Journal Information

Year of Publication: 2023
Language: en
Database Source: arXiv
Access: Open Access ✓