Semantic Scholar · Open Access · 2020 · 64 citations

Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, K. Steimel, Han Liu, Lane Schwartz

Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.

Authors (6)

Hyunji Hayley Park
Katherine J. Zhang
Coleman Haley
K. Steimel
Han Liu
Lane Schwartz

Citation Format

Park, H.H., Zhang, K.J., Haley, C., Steimel, K., Liu, H., Schwartz, L. (2020). Morphology Matters: A Multilingual Language Modeling Analysis. https://doi.org/10.1162/tacl_a_00365

Quick Access

View at source: doi.org/10.1162/tacl_a_00365

Journal Information

Publication year: 2020
Language: en
Total citations: 64
Source database: Semantic Scholar
DOI: 10.1162/tacl_a_00365
Access: Open Access