arXiv Open Access 2025

The TUB Sign Language Corpus Collection

Eleftherios Avramidis, Vera Czehmann, Fabian Deckert, Lorenz Hufe, Aljoscha Lipski, +6 others

Abstract

We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3 M subtitles containing 14 M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.

Topics & Keywords

cs.CL

Authors (11)

Eleftherios Avramidis

Vera Czehmann

Fabian Deckert

Lorenz Hufe

Aljoscha Lipski

Yuni Amaloa Quintero Villalobos

Tae Kwon Rhee

Mengqian Shi

Lennart Stölting

Fabrizio Nunnari

Sebastian Möller

Citation Format

Avramidis, E., Czehmann, V., Deckert, F., Hufe, L., Lipski, A., Villalobos, Y.A.Q. et al. (2025). The TUB Sign Language Corpus Collection. https://arxiv.org/abs/2508.05374

Journal Information

Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access ✓