arXiv Open Access 2026

Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras

Abstract

Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework that integrates Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates the acquisition of linguistic competence, while maintaining competitive performance on general reasoning tasks.
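
The abstract describes the framework only at a high level, so the following Python snippet is a rough illustration of the stated idea, not the authors' implementation: it converts raw sentences into structured input-output pairs via a hypothetical cloze-style language learning task and mixes them with untouched raw text. The function names, the choice of a cloze task, and the 30% mixing fraction are all illustrative assumptions.

import random

# Sketch of the L2T idea as described in the abstract; the task below
# (cloze-style fill-in-the-blank) and the mixing ratio are assumptions,
# not details from the paper.

def make_cloze_pair(sentence: str, rng: random.Random) -> tuple[str, str]:
    """Turn a raw sentence into a structured (input, output) pair by
    masking one word; a hypothetical example of a language learning task."""
    words = sentence.split()
    idx = rng.randrange(len(words))
    target = words[idx]
    words[idx] = "____"
    return f"Fill in the blank: {' '.join(words)}", target

def build_pretraining_mixture(raw_texts: list[str],
                              l2t_fraction: float = 0.3,
                              seed: int = 0) -> list[str]:
    """Mix raw text (plain next-token prediction) with L2T-style
    input-output pairs rendered back into text. The 0.3 fraction is an
    arbitrary placeholder."""
    rng = random.Random(seed)
    mixture = []
    for text in raw_texts:
        if rng.random() < l2t_fraction and len(text.split()) > 1:
            prompt, answer = make_cloze_pair(text, rng)
            mixture.append(f"{prompt}\nAnswer: {answer}")
        else:
            mixture.append(text)
    rng.shuffle(mixture)
    return mixture

if __name__ == "__main__":
    corpus = ["The cat sat on the mat.", "Language models predict tokens."]
    for example in build_pretraining_mixture(corpus, seed=42):
        print(example, end="\n---\n")

Rendering each input-output pair back into plain text, as done here, would let the same next-token-prediction objective cover both data types, which seems consistent with the abstract's description of pre-training on a mixture of raw text and L2T data; the actual task formats used in the paper may differ.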


Authors (3)

Atsuki Yamaguchi
Maggie Mi
Nikolaos Aletras

Citation Format

Yamaguchi, A., Mi, M., & Aletras, N. (2026). Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks. arXiv. https://arxiv.org/abs/2601.03448

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓