arXiv Open Access 2025

DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, +3 others

Abstract

Sports video understanding requires perceiving high-speed dynamics, complex rules, and long temporal contexts. Yet, current Multimodal Large Language Models (MLLMs) remain narrowly focused on single sports, specific tasks, or training-free paradigms. We introduce DeepSport, the first end-to-end trained MLLM for multi-task, multi-sport video understanding. DeepSport shifts from passive frame processing to active, iterative reasoning, dynamically extracting frames to "think with videos." To train our model, we curate a unified 78k-sample dataset via a rigorous three-step text-and-vision distillation pipeline. We then employ a progressive two-stage training strategy: a Sports Curriculum Supervised Fine-Tuning phase to build foundational perception, followed by Agentic Reinforcement Learning with a novel tool-use reward. Extensive experiments on a comprehensive 6.7k benchmark demonstrate that DeepSport achieves state-of-the-art performance, outperforming powerful proprietary and open-source models, while utilizing significantly fewer frames. Furthermore, it exhibits strong zero-shot transferability to unseen sports and broad motion recognition tasks, establishing a highly efficient and generalized foundation for complex video reasoning.

Topics & Keywords

cs.CV cs.AI

Authors (8)

Junbo Zou

Haotian Xia

Zhen Ye

Shengjie Zhang

Christopher Lai

Vicente Ordonez

Weining Shen

Hanjie Chen

Citation Format

Zou, J., Xia, H., Ye, Z., Zhang, S., Lai, C., Ordonez, V. et al. (2025). DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning. https://arxiv.org/abs/2511.12908

Journal Information

Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access ✓