Semantic Scholar, Open Access, 2019, 482 citations

MuST-C: a Multilingual Speech Translation Corpus

Mattia Antonino Di Gangi, R. Cattoni, L. Bentivogli, Matteo Negri, Marco Turchi

Abstract

Current research on spoken language translation (SLT) has to contend with the scarcity of sizeable and publicly available training corpora. This problem hinders the adoption of neural end-to-end approaches, which represent the state of the art in the two parent tasks of SLT: automatic speech recognition and machine translation. To fill this gap, we created MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages. For each target language, MuST-C comprises at least 385 hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations. Together with a description of the corpus creation methodology (scalable to add new data and cover new languages), we provide an empirical verification of its quality and SLT results computed with a state-of-the-art approach on each language direction.

Topics & Keywords

Computer Science

Authors (5)

Mattia Antonino Di Gangi

R. Cattoni

L. Bentivogli

Matteo Negri

Marco Turchi

Citation Format

Di Gangi, M. A., Cattoni, R., Bentivogli, L., Negri, M., & Turchi, M. (2019). MuST-C: a Multilingual Speech Translation Corpus. https://doi.org/10.18653/v1/N19-1202

Journal Information

Publication Year: 2019
Language: en
Total Citations: 482
Source Database: Semantic Scholar
DOI: 10.18653/v1/N19-1202
Access: Open Access