arXiv Open Access 2025

Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters

Takashi Morita Timothy J. O'Donnell

Lihat Sumber

Abstrak

Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure, double-object datives, is predominantly associated with Germanic verbs rather than Latinate verbs. From the perspective of language acquisition, however, such etymology-based generalizations raise learnability concerns, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed an unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our model also uncovered previously unrecognized features of the quasi-etymological clusters. Taken together with prior results from Japanese, our findings indicate that the proposed method provides a general, cross-linguistic approach to discovering etymological structure from phonotactic cues in the lexicon.

Topik & Kata Kunci

cs.CL

Penulis (2)

Takashi Morita

Timothy J. O'Donnell

Format Sitasi

APA MLA BibTeX

Morita, T., O'Donnell, T.J. (2025). Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters. https://arxiv.org/abs/2504.11770

Akses Cepat

Lihat di Sumber

Informasi Jurnal

Tahun Terbit: 2025
Bahasa: en
Sumber Database: arXiv
Akses: Open Access ✓