arXiv Open Access 2026

TVWorld: Foundations for Remote-Control TV Agents

Zhantao Ma, Quanfeng Lu, Shuai Zhong, Dahai Yu, Ping Luo, +1 more

Abstract

Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce TVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop TVTheseus, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of 68.3% on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.
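
To make the idea of an offline, graph-based abstraction of remote-control navigation concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's data format or method: the state names, key names, `TV_GRAPH`, `shortest_key_sequence`, and `replay` are all hypothetical. Each node stands for a screen state with a focused element, and each edge for one key press.

```python
from collections import deque

# Hypothetical toy graph (not from the paper): each node is a screen state
# with a focused item, each edge is a single remote-control key press.
TV_GRAPH = {
    "home:apps": {"RIGHT": "home:settings", "OK": "apps:grid"},
    "home:settings": {"LEFT": "home:apps", "OK": "settings:menu"},
    "apps:grid": {"BACK": "home:apps", "DOWN": "apps:video"},
    "apps:video": {"UP": "apps:grid", "BACK": "home:apps"},
    "settings:menu": {"BACK": "home:settings"},
}


def shortest_key_sequence(graph, start, goal):
    """Breadth-first search for the fewest key presses from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, keys = queue.popleft()
        if state == goal:
            return keys
        for key, next_state in graph.get(state, {}).items():
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, keys + [key]))
    return None  # goal is unreachable from start


def replay(graph, start, keys):
    """Apply a key sequence offline; an undefined key leaves the state unchanged."""
    state = start
    for key in keys:
        state = graph[state].get(key, state)
    return state


if __name__ == "__main__":
    plan = shortest_key_sequence(TV_GRAPH, "home:settings", "apps:video")
    print(plan, "->", replay(TV_GRAPH, "home:settings", plan))
```

Because the graph is a fixed, recorded structure, an agent's key sequence can be replayed and scored without a physical TV, which is the sense in which such an evaluation can be reproducible and deployment-free. The shortest path also gives a natural reference against which a long-horizon trajectory could be compared.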

Topics & Keywords

cs.CV cs.AI cs.CL

Authors (6)

Zhantao Ma

Quanfeng Lu

Shuai Zhong

Dahai Yu

Ping Luo

Michael K. Ng

Citation Format

Ma, Z., Lu, Q., Zhong, S., Yu, D., Luo, P., & Ng, M. K. (2026). TVWorld: Foundations for Remote-Control TV Agents. https://arxiv.org/abs/2601.13142

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓