arXiv Open Access 2025

Morpheme Induction for Emergent Language

Brendon Boldt, David Mortensen

Abstract

We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
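The Count, Select, Ablate, Repeat loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a corpus of (token sequence, set of atomic meanings) pairs, restricts candidate forms to contiguous n-grams and candidate meanings to single atoms, and uses the positive co-occurrence term of mutual information, p(f, m) * log(p(f, m) / (p(f) p(m))), as the weight. The names `csar_sketch`, `ngrams`, `remove_first`, and `toy_corpus` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """Return the set of contiguous n-grams (as tuples) of length 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def remove_first(tokens, form):
    """Return a copy of `tokens` with the first occurrence of `form` deleted."""
    n = len(form)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == form:
            return tokens[:i] + tokens[i + n:]
    return tokens


def csar_sketch(corpus, max_n=2, max_morphemes=10):
    """Greedy form-meaning induction (illustrative assumption, not the paper's code)."""
    records = [(list(tokens), set(atoms)) for tokens, atoms in corpus]
    total = len(records)
    inventory = []
    for _ in range(max_morphemes):
        # Count: occurrences of forms, meanings, and co-occurring pairs per record.
        form_counts, meaning_counts, joint_counts = Counter(), Counter(), Counter()
        for tokens, atoms in records:
            forms = ngrams(tokens, max_n)
            form_counts.update(forms)
            meaning_counts.update(atoms)
            joint_counts.update((f, a) for f in forms for a in atoms)
        if not joint_counts:
            break

        def weight(pair):
            # Assumed weight: p(f, m) * log(p(f, m) / (p(f) * p(m))).
            form, atom = pair
            p_joint = joint_counts[pair] / total
            ratio = p_joint * total * total / (form_counts[form] * meaning_counts[atom])
            return p_joint * math.log(ratio)

        # Select: highest-weighted pair; sorting makes tie-breaking deterministic.
        best = max(sorted(joint_counts), key=weight)
        if weight(best) <= 0:
            break
        inventory.append(best)

        # Ablate: remove the form and the meaning wherever they co-occur.
        form, atom = best
        for idx, (tokens, atoms) in enumerate(records):
            if atom in atoms and form in ngrams(tokens, max_n):
                records[idx] = (remove_first(tokens, form), atoms - {atom})
        # Repeat: the loop recounts on the ablated corpus.
    return inventory


# Hypothetical toy corpus: first token marks color, second marks shape.
toy_corpus = [
    (("ba", "ki"), {"red", "circle"}),
    (("ba", "to"), {"red", "square"}),
    (("mu", "ki"), {"blue", "circle"}),
    (("mu", "to"), {"blue", "square"}),
]
toy_inventory = csar_sketch(toy_corpus, max_n=1)
```

On this toy corpus the sketch pairs each token with the meaning atom it consistently co-occurs with. Ablating a selected pair before recounting is what keeps an already-explained form from being credited with additional meanings in later rounds.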

Authors (2)

Brendon Boldt

David Mortensen

Citation Format

Boldt, B., Mortensen, D. (2025). Morpheme Induction for Emergent Language. https://arxiv.org/abs/2510.03439

Journal Information
Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access