XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra
et al.
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.
982 citations
en
Computer Science, Engineering
What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Di Jin, Eileen Pan, Nassim Oufattole
et al.
Open domain question answering (OpenQA) tasks have recently been attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.
1525 citations
en
Computer Science
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila, Megan Branson, Kelly Davis
et al.
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
2253 citations
en
Computer Science
XNLI: Evaluating Cross-lingual Sentence Representations
Alexis Conneau, Guillaume Lample, Ruty Rinott
et al.
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 14 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
1574 citations
en
Computer Science
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
Adina Williams, Nikita Nangia, Samuel R. Bowman
This paper introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding. At 433k examples, this resource is one of the largest corpora available for natural language inference (a.k.a. recognizing textual entailment), improving upon available resources in both its coverage and difficulty. MultiNLI accomplishes this by offering data from ten distinct genres of written and spoken English, making it possible to evaluate systems on nearly the full complexity of the language, while supplying an explicit setting for evaluating cross-genre domain adaptation. In addition, an evaluation using existing machine learning models designed for the Stanford NLI corpus shows that it represents a substantially more difficult task than does that corpus, despite the two showing similar levels of inter-annotator agreement.
4956 citations
en
Computer Science
Word Translation Without Parallel Data
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato
et al.
State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent studies showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally describe experiments on the English-Esperanto low-resource language pair, on which there only exists a limited amount of parallel data, to show the potential impact of our method in fully unsupervised machine translation. Our code, embeddings and dictionaries are publicly available.
1754 citations
en
Computer Science
Unsupervised Machine Translation Using Monolingual Corpora Only
Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato
Machine translation has recently achieved impressive performance thanks to advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet these still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using even a single parallel sentence at training time.
1143 citations
en
Computer Science
A SICK cure for the evaluation of compositional distributional semantic models
M. Marelli, Stefano Menini, Marco Baroni
et al.
931 citations
en
Computer Science
A method for linguistic metaphor identification: from MIP to MIPVU
G. Steen
1320 citations
en
Computer Science
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
Marco Baroni, Silvia Bernardini, Adriano Ferraresi
et al.
1268 citations
en
Computer Science
The Alchemy of English: The Spread, Functions, and Models of Non-native Englishes
Braj B. Kachru
Guidelines for the process of cross-cultural adaptation of self-report measures.
D. Beaton, C. Bombardier, Francis Guillemin
et al.
Personality trait structure as a human universal.
R. McCrae, P. Costa
3999 citations
en
Psychology, Medicine
The production of "new" and "similar" phones in a foreign language: evidence for the effect of equivalence classification
J. Flege
1141 citations
en
Political Science, Mathematics
Reexamining English Only in the ESL Classroom
E. Auerbach
Language discrimination by newborns: toward an understanding of the role of rhythm.
T. Nazzi, J. Bertoncini, J. Mehler
933 citations
en
Medicine, Psychology
Direction and impact of language bias in meta-analyses of controlled trials: empirical study.
P. Jüni, F. Holenstein, J. Sterne
et al.
The Sociocultural Turn and Its Challenges for Second Language Teacher Education.
K. Johnson
English-medium teaching in European higher education
James A. Coleman
In the global debates on English as international lingua franca or as ‘killer language’, the adoption of English as medium of instruction in Higher Education is raising increasing concern. Plurilingualism and multilingualism are embedded in the official policies of the European Union and Council of Europe, and the Bologna Process for harmonizing Higher Education promises ‘proper provision for linguistic diversity’. But even enthusiasts acknowledge the problems of implementing such policies in the face of an inexorable increase in the use of English. This survey draws on the most recent and sometimes disparate sources in an attempt to paint a comprehensive and up-to-date picture of the spread of English-medium teaching in Europe's universities. The article sets the changes in the context of accelerating globalization and marketization, and analyses the forces which are driving the adoption of English, and some of the problems which accelerating ‘Englishization’ of European Higher Education might create.
Research Perspectives on Teaching English as a Lingua Franca
B. Seidlhofer
751 citations
en
Computer Science