XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra
et al.
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.
982 citations
en
Computer Science, Engineering
What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Di Jin, Eileen Pan, Nassim Oufattole
et al.
Open domain question answering (OpenQA) tasks have recently been attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.
1525 citations
en
Computer Science
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila, Megan Branson, Kelly Davis
et al.
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
2253 citations
en
Computer Science
XNLI: Evaluating Cross-lingual Sentence Representations
Alexis Conneau, Guillaume Lample, Ruty Rinott
et al.
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 14 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
1574 citations
en
Computer Science
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
Adina Williams, Nikita Nangia, Samuel R. Bowman
This paper introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding. At 433k examples, this resource is one of the largest corpora available for natural language inference (a.k.a. recognizing textual entailment), improving upon available resources in both its coverage and difficulty. MultiNLI accomplishes this by offering data from ten distinct genres of written and spoken English, making it possible to evaluate systems on nearly the full complexity of the language, while supplying an explicit setting for evaluating cross-genre domain adaptation. In addition, an evaluation using existing machine learning models designed for the Stanford NLI corpus shows that it represents a substantially more difficult task than does that corpus, despite the two showing similar levels of inter-annotator agreement.
4956 citations
en
Computer Science
Word Translation Without Parallel Data
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato
et al.
State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent studies showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally describe experiments on the English-Esperanto low-resource language pair, on which there only exists a limited amount of parallel data, to show the potential impact of our method in fully unsupervised machine translation. Our code, embeddings and dictionaries are publicly available.
1754 citations
en
Computer Science
Unsupervised Machine Translation Using Monolingual Corpora Only
Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato
Machine translation has recently achieved impressive performance thanks to advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet these still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using even a single parallel sentence at training time.
1143 citations
en
Computer Science
A SICK cure for the evaluation of compositional distributional semantic models
M. Marelli, Stefano Menini, Marco Baroni
et al.
931 citations
en
Computer Science
A method for linguistic metaphor identification: from MIP to MIPVU
G. Steen
1320 citations
en
Computer Science
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
Marco Baroni, Silvia Bernardini, Adriano Ferraresi
et al.
1268 citations
en
Computer Science
The Alchemy of English: The Spread, Functions, and Models of Non-native Englishes
Braj B. Kachru
Guidelines for the process of cross-cultural adaptation of self-report measures.
D. Beaton, C. Bombardier, Francis Guillemin
et al.
Personality trait structure as a human universal.
R. McCrae, P. Costa
3999 citations
en
Psychology, Medicine
The production of "new" and "similar" phones in a foreign language: evidence for the effect of equivalence classification
J. Flege
1141 citations
en
Political Science, Mathematics
Reexamining English Only in the ESL Classroom
E. Auerbach
Language discrimination by newborns: toward an understanding of the role of rhythm.
T. Nazzi, J. Bertoncini, J. Mehler
933 citations
en
Medicine, Psychology
Direction and impact of language bias in meta-analyses of controlled trials: empirical study.
P. Jüni, F. Holenstein, J. Sterne
et al.
The Sociocultural Turn and Its Challenges for Second Language Teacher Education.
K. Johnson
English-medium teaching in European higher education
James A. Coleman
In the global debates on English as international lingua franca or as ‘killer language’, the adoption of English as medium of instruction in Higher Education is raising increasing concern. Plurilingualism and multilingualism are embedded in the official policies of the European Union and Council of Europe, and the Bologna Process for harmonizing Higher Education promises ‘proper provision for linguistic diversity’. But even enthusiasts acknowledge the problems of implementing such policies in the face of an inexorable increase in the use of English. This survey draws on the most recent and sometimes disparate sources in an attempt to paint a comprehensive and up-to-date picture of the spread of English-medium teaching in Europe's universities. The article sets the changes in the context of accelerating globalization and marketization, and analyses the forces which are driving the adoption of English, and some of the problems which accelerating ‘Englishization’ of European Higher Education might create.
Research Perspectives on Teaching English as a Lingua Franca
B. Seidlhofer
751 citations
en
Computer Science