Semantic Scholar · Open Access · 2019 · 482 citations

MuST-C: a Multilingual Speech Translation Corpus

Mattia Antonino Di Gangi, R. Cattoni, L. Bentivogli, Matteo Negri, Marco Turchi

Abstract

Current research on spoken language translation (SLT) has to confront the scarcity of sizeable and publicly available training corpora. This problem hinders the adoption of neural end-to-end approaches, which represent the state of the art in the two parent tasks of SLT: automatic speech recognition and machine translation. To fill this gap, we created MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages. For each target language, MuST-C comprises at least 385 hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations. Together with a description of the corpus creation methodology (which scales to new data and new languages), we provide an empirical verification of its quality and SLT results computed with a state-of-the-art approach on each language direction.

Authors (5)

Mattia Antonino Di Gangi

R. Cattoni

L. Bentivogli

Matteo Negri

Marco Turchi

Citation Format

Di Gangi, M. A., Cattoni, R., Bentivogli, L., Negri, M., Turchi, M. (2019). MuST-C: a Multilingual Speech Translation Corpus. https://doi.org/10.18653/v1/N19-1202

Journal Information
Year Published: 2019
Language: English
Total Citations: 482
Source Database: Semantic Scholar
DOI: 10.18653/v1/N19-1202
Access: Open Access