arXiv Open Access 2023

Multi-entity Video Transformers for Fine-Grained Video Representation Learning

Matthew Walmer, Rose Kanjirathinkal, Kai Sheng Tai, Keyur Muzumdar, Taipeng Tian, +1 more

Abstract

The area of temporally fine-grained video representation learning focuses on generating frame-by-frame representations for temporally dense tasks, such as fine-grained action phase classification and frame retrieval. In this work, we advance the state-of-the-art for self-supervised models in this area by re-examining the design of transformer architectures for video representation learning. A key aspect of our approach is the improved sharing of scene information in the temporal pipeline by representing multiple salient entities per frame. Prior works use late-fusion architectures that reduce frames to a single-dimensional vector before modeling any cross-frame dynamics. In contrast, our Multi-entity Video Transformer (MV-Former) processes the frames as groups of entities represented as tokens linked across time. To achieve this, we propose a Learnable Spatial Token Pooling strategy to identify and extract features for multiple salient regions per frame. Through our experiments, we show that MV-Former outperforms previous self-supervised methods, and also surpasses some prior works that use additional supervision or training data. When combined with additional pre-training data from Kinetics-400, MV-Former achieves a further performance boost. Overall, our MV-Former achieves state-of-the-art results on multiple fine-grained video benchmarks and shows that parsing video scenes as collections of entities can enhance performance in video tasks.
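
The contrast the abstract draws between late fusion (one vector per frame) and multi-entity tokens (several vectors per frame, linked across time) can be illustrated with a rough sketch. The abstract does not say how Learnable Spatial Token Pooling is implemented, so the code below swaps in generic attention pooling with a set of query vectors; the function name `pool_entities`, the `queries` array, and all shapes are hypothetical and chosen only for illustration. In a real model the queries would be trained parameters rather than random values.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_entities(patch_tokens, queries):
    """Pool each frame's patch tokens into K entity tokens.

    patch_tokens: (T, N, D) array, N spatial patch features per frame.
    queries:      (K, D) array, one query per entity slot (assumed to be
                  learned in practice; this is an illustrative stand-in,
                  not the paper's implementation).
    Returns a (T, K, D) array of entity tokens.
    """
    d = patch_tokens.shape[-1]
    # Similarity of every query to every patch, per frame: (T, K, N).
    scores = np.einsum("kd,tnd->tkn", queries, patch_tokens) / np.sqrt(d)
    # Each entity slot gets a spatial attention map over the N patches.
    weights = softmax(scores, axis=-1)
    # Weighted average of patch features per entity slot: (T, K, D).
    return np.einsum("tkn,tnd->tkd", weights, patch_tokens)


rng = np.random.default_rng(0)
T, N, D, K = 8, 196, 64, 3  # frames, patches, feature dim, entities (assumed)
patch_tokens = rng.normal(size=(T, N, D))
queries = rng.normal(size=(K, D))

# Late fusion: every frame collapses to a single vector.
late_fusion = patch_tokens.mean(axis=1)           # (T, D)

# Multi-entity: every frame keeps K tokens for the temporal model.
entities = pool_entities(patch_tokens, queries)   # (T, K, D)
temporal_tokens = entities.reshape(T * K, D)      # sequence of T*K tokens
```

Under these assumptions a temporal transformer would receive `T * K` tokens instead of `T`, which is one way several salient regions per frame could be tracked across time.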

Topics & Keywords

cs.CV

Authors (6)

Matthew Walmer

Rose Kanjirathinkal

Kai Sheng Tai

Keyur Muzumdar

Taipeng Tian

Abhinav Shrivastava

Citation Format

Walmer, M., Kanjirathinkal, R., Tai, K. S., Muzumdar, K., Tian, T., & Shrivastava, A. (2023). Multi-entity Video Transformers for Fine-Grained Video Representation Learning. https://arxiv.org/abs/2311.10873

Journal Information

Publication Year: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓