arXiv Open Access 2026

TVWorld: Foundations for Remote-Control TV Agents

Zhantao Ma, Quanfeng Lu, Shuai Zhong, Dahai Yu, Ping Luo, Michael K. Ng

Abstract

Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has focused primarily on point-and-click (PnC) interaction, while remote-control (RC) interaction, which is common in everyday TV use, remains largely underexplored. To fill this gap, we introduce TVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible, deployment-free evaluation. On this basis, we derive two complementary benchmarks that together assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop TVTheseus, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of 68.3% on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Further analyses provide insights into the development of effective TV-use agents.
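To make the idea of an offline graph-based abstraction of TV navigation more concrete, here is a minimal sketch under loudly stated assumptions; it is not the paper's data format or code. It assumes that each navigation state is a (screen, focused item) pair and that every remote-control key press is a directed edge to another state. The graph `TV_GRAPH` and the function `shortest_key_sequence` are hypothetical names introduced only for illustration. Because such a graph is stored offline, a key-press sequence can be replayed and checked without a physical TV, and a breadth-first search recovers the shortest sequence between two states.

```python
from collections import deque

# HYPOTHETICAL toy graph, not TVWorld's actual format: each node is a
# (screen, focused_item) state; each edge maps a remote-control key to
# the state reached by pressing it.
TV_GRAPH: dict[tuple[str, str], dict[str, tuple[str, str]]] = {
    ("home", "apps"): {"right": ("home", "settings"), "ok": ("apps", "netflix")},
    ("home", "settings"): {"left": ("home", "apps"), "ok": ("settings", "network")},
    ("settings", "network"): {"down": ("settings", "display"), "back": ("home", "settings")},
    ("settings", "display"): {
        "up": ("settings", "network"),
        "ok": ("display", "brightness"),
        "back": ("home", "settings"),
    },
    ("apps", "netflix"): {"back": ("home", "apps")},
    ("display", "brightness"): {"back": ("settings", "display")},
}


def shortest_key_sequence(graph, start, goal):
    """Breadth-first search over the offline graph.

    Returns the shortest list of key presses that moves focus from
    `start` to `goal`, or None if `goal` is unreachable.
    """
    queue = deque([start])
    parent = {start: None}  # state -> (previous_state, key pressed to reach it)
    while queue:
        state = queue.popleft()
        if state == goal:
            # Walk the parent links back to the start, then reverse.
            keys = []
            while parent[state] is not None:
                prev, key = parent[state]
                keys.append(key)
                state = prev
            return keys[::-1]
        for key, nxt in graph.get(state, {}).items():
            if nxt not in parent:  # visit each state at most once
                parent[nxt] = (state, key)
                queue.append(nxt)
    return None
```

For example, `shortest_key_sequence(TV_GRAPH, ("home", "apps"), ("display", "brightness"))` returns `["right", "ok", "down", "ok"]`. Even in this toy graph, reaching a target several screens away depends on knowing how focus moves between items, which loosely illustrates why topology awareness matters for long-horizon, focus-based navigation.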

Authors (6)

Zhantao Ma
Quanfeng Lu
Shuai Zhong
Dahai Yu
Ping Luo
Michael K. Ng

Citation

Ma, Z., Lu, Q., Zhong, S., Yu, D., Luo, P., Ng, M.K. (2026). TVWorld: Foundations for Remote-Control TV Agents. https://arxiv.org/abs/2601.13142

Publication Information
Year: 2026
Language: English
Database: arXiv
Access: Open Access