arXiv Open Access 2026

Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras

Abstract

Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates the acquisition of that competence, while maintaining competitive performance on general reasoning tasks.

Topics & Keywords

cs.CL

Authors (3)

Atsuki Yamaguchi

Maggie Mi

Nikolaos Aletras

Citation Format

Yamaguchi, A., Mi, M., & Aletras, N. (2026). Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks. https://arxiv.org/abs/2601.03448

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓