MLS: A Large-Scale Multilingual Dataset for Speech Research
Vineel Pratap, Qiantong Xu, Anuroop Sriram
et al.
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at this http URL.
733 citations
en
Engineering, Computer Science
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
J. Clark, Eunsol Choi, Michael Collins
et al.
Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA, a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology (the set of linguistic features each language expresses), such that we expect models performing well on this set to generalize across a large number of the world’s languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.
718 citations
en
Computer Science
Phrase-Based & Neural Unsupervised Machine Translation
Guillaume Lample, Myle Ott, Alexis Conneau
et al.
Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access to only large monolingual corpora in each language. We propose two model variants, a neural and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models and automatic generation of parallel data by iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters. On the widely used WMT’14 English-French and WMT’16 German-English benchmarks, our models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence, outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results than semi-supervised and supervised approaches leveraging the paucity of available bitexts. Our code for NMT and PBSMT is publicly available.
709 citations
en
Computer Science
MLQA: Evaluating Cross-lingual Extractive Question Answering
Patrick Lewis, Barlas Oğuz, Ruty Rinott
et al.
Question answering (QA) models have shown rapid progress enabled by the availability of large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to collect, and rarely exist in languages other than English, making building QA systems that work well in other languages challenging. In order to develop such systems, it is crucial to invest in high-quality multilingual evaluation benchmarks to measure progress. We present MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area. MLQA contains QA instances in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA has over 12K instances in English and 5K in each other language, with each instance parallel between 4 languages on average. We evaluate state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA. In all cases, transfer results are shown to be significantly behind training-language performance.
589 citations
en
Computer Science
Integrating Educational Technology Into Teaching
M. Roblyer, A. Doering
1454 citations
en
Computer Science
The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation
G. Doddington, Alexis Mitchell, Mark A. Przybocki
et al.
1411 citations
en
Computer Science
Nation And Narration
Homi K. Bhabha
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
1512 citations
en
Computer Science
Writing Science: Literacy And Discursive Power
M. Halliday, Jeannett Martin
1862 citations
en
Political Science
The Sketch Engine
A. Kilgarriff, P. Rychlý, Pavel Smrz
et al.
1227 citations
en
Computer Science
Relating events in narrative: a crosslinguistic developmental study
A. McCabe
1422 citations
en
Medicine, Psychology
NATURAL SEQUENCES IN CHILD SECOND LANGUAGE ACQUISITION
H. Dulay, Marina K. Burt
1166 citations
en
Psychology
Evolution of language.
W. Strange
Reflexes of grammar in patterns of language change
A. Kroch
1059 citations
en
Psychology
The Oxford English Dictionary
J. Simpson, E. Weiner
Cross-lingual Name Tagging and Linking for 282 Languages
Xiaoman Pan, Boliang Zhang, Jonathan May
et al.
The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating “silver-standard” annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and non-Wikipedia data.
562 citations
en
Computer Science
Recurrent Neural Network Grammars
Chris Dyer, A. Kuncoro, Miguel Ballesteros
et al.
We introduce recurrent neural network grammars, probabilistic models of sentences with explicit phrase structure. We explain efficient inference procedures that allow application to both parsing and language modeling. Experiments show that they provide better parsing in English than any single previously published supervised generative model and better language modeling than state-of-the-art sequential RNNs in English and Chinese.
542 citations
en
Computer Science
The everyday language of white racism
Jane H. Hill
Developing Literacy in Second-Language Learners: Report of the National Literacy Panel on Language-Minority Children and Youth. Executive Summary.
D. August, T. Shanahan
767 citations
en
Psychology, Computer Science
Properties of dual language exposure that influence 2-year-olds' bilingual proficiency.
S. Place, E. Hoff
546 citations
en
Psychology, Medicine