arXiv Open Access 2026

StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

Daniel Oliveira, David Martins de Matos

Abstract

Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as a judge shows that Storyteller3 achieves an 89.9% win rate against the base Qwen2.5-VL 7B model on subtitle alignment. Compared to Storyteller, which was trained without script grounding, Storyteller3 achieves a 48.5% win rate versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
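The alignment step described above — matching screenplay dialogue to subtitle lines so that each utterance inherits a character name from the script and a timestamp from the subtitles — can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: it uses Python's `difflib.SequenceMatcher` (whose block matching is LCS-style) as a stand-in for the LCS matching the abstract mentions, and all function names, thresholds, and data shapes here are assumptions for demonstration.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and drop punctuation so superficial formatting
    # differences between script and subtitle text do not hurt matching.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ")


def align_dialogue(script_lines, subtitles, threshold=0.6):
    """Link each script dialogue line to its best-matching subtitle.

    script_lines: list of (character, dialogue) pairs from the screenplay.
    subtitles:    list of (start_seconds, text) pairs from the subtitle file.
    Returns a list of (character, dialogue, start_seconds) triples for
    matches whose similarity exceeds `threshold` (a hypothetical cutoff).
    """
    aligned = []
    for character, dialogue in script_lines:
        best_score, best_start = 0.0, None
        for start, text in subtitles:
            # ratio() measures similarity via longest-matching-block
            # alignment, approximating an LCS-based comparison.
            score = SequenceMatcher(
                None, normalize(dialogue), normalize(text)
            ).ratio()
            if score > best_score:
                best_score, best_start = score, start
        if best_score >= threshold:
            aligned.append((character, dialogue, best_start))
    return aligned
```

For example, given script lines `[("RICK", "Here's looking at you, kid.")]` and subtitles `[(42.5, "Here's looking at you, kid.")]`, the sketch attributes the line to RICK at 42.5 seconds; the real pipeline would additionally exploit subtitle ordering and timing rather than scoring every pair independently.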

Topics & Keywords

Authors (2)

Daniel Oliveira

David Martins de Matos

Citation Format

Oliveira, D., Matos, D.M.d. (2026). StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles. https://arxiv.org/abs/2602.21829

Journal Information
Publication Year
2026
Language
en
Database Source
arXiv
Access
Open Access ✓