arXiv Open Access 2026

Limited Linguistic Diversity in Embodied AI Datasets

Selma Wanna, Agnes Luhtaru, Jonathan Salfity, Ryan Barron, Juston Moore, +2 others

Abstract

Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
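As a rough illustration of two of the dimensions named in the abstract (lexical variety and duplication), the sketch below computes a type-token ratio and an exact-duplicate rate over a list of instructions. It is not the authors' method: the function name `lexical_stats`, the whitespace tokenization, the lowercase normalization, and the sample instructions are all assumptions made for this example.

```python
from collections import Counter


def lexical_stats(instructions):
    """Return simple diversity statistics for a list of instruction strings."""
    # Light normalization (an assumption): lowercase and collapse whitespace.
    # A real audit would likely use a proper tokenizer.
    normalized = [" ".join(text.lower().split()) for text in instructions]
    tokens = [tok for text in normalized for tok in text.split()]

    counts = Counter(normalized)
    total = len(normalized)
    return {
        "num_instructions": total,
        "unique_instructions": len(counts),
        # Share of instructions that exactly repeat an earlier string.
        "duplicate_rate": 1 - len(counts) / total if total else 0.0,
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
    }


# Hypothetical instructions, invented for illustration only.
sample = [
    "pick up the red block",
    "pick up the blue block",
    "Pick up the red block",
    "open the drawer",
]
stats = lexical_stats(sample)
print(stats)
```

A low type-token ratio together with a high duplicate rate would be one simple signal of the template-like repetition the abstract describes. Semantic similarity and syntactic complexity would need additional tooling (for example sentence embeddings or a parser) and are not covered by this sketch.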

Topics & Keywords

cs.CL cs.AI cs.RO

Authors (7)

Selma Wanna

Agnes Luhtaru

Jonathan Salfity

Ryan Barron

Juston Moore

Cynthia Matuszek

Mitch Pryor

Citation

Wanna, S., Luhtaru, A., Salfity, J., Barron, R., Moore, J., Matuszek, C. et al. (2026). Limited Linguistic Diversity in Embodied AI Datasets. https://arxiv.org/abs/2601.03136

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓