
Unraveling Syntax: How Language Models Learn Context-Free Grammars

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

Abstract

While large language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest, such as natural language syntax, programming languages, and arithmetic, are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely "subgrammars". We first define subgrammars and prove a set of fundamental theorems regarding language modeling and subgrammars. We show that the language modeling loss (or, equivalently, the Kullback-Leibler divergence) recurses linearly over a grammar's top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. We also prove that the constant in this linear recurrence is a function of the expected recursion, a notion we introduce. We show that, under additional assumptions, parametrized models learn subgrammars in parallel. Empirically, we confirm that small transformers learn subgrammars in parallel, unlike children, who first master simple substructures. We also briefly explore several other questions regarding subgrammars. We find that subgrammar pretraining can improve final performance, but only for models that are tiny relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure in all cases; we also observe persistent difficulty with deeper recursion, a limitation that appears even in large language models.
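The abstract states the decomposition only informally. As a reading aid, a schematic of what a linear recurrence of the loss over top-level subgrammars could look like, with all notation (the weights, the constant) assumed here rather than taken from the paper:

\[
\mathcal{L}(G) \;=\; c(G) \;+\; \sum_{i=1}^{k} \lambda_i \, \mathcal{L}(G_i)
\]

Here \(G_1, \dots, G_k\) would be the top-level subgrammars of \(G\), the weights \(\lambda_i\) would reflect how often each subgrammar is reached during generation, and the constant \(c(G)\) would, per the abstract, be a function of the grammar's expected recursion. Applying the same identity to each \(\mathcal{L}(G_i)\) in turn unrolls the loss into a weighted sum over irreducible subgrammars.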


Authors (3)

Laura Ying Schulz

Daniel Mitropolsky

Tomaso Poggio

Citation Format

Schulz, L.Y., Mitropolsky, D., Poggio, T. (2025). Unraveling Syntax: How Language Models Learn Context-Free Grammars. https://arxiv.org/abs/2510.02524

Journal Information
Publication Year
2025
Language
en
Source Database
arXiv
Access
Open Access ✓