arXiv Open Access 2025

New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models

Julia Wunderle, Anton Ehrmanntraut, Jan Pfister, Fotis Jannidis, Andreas Hotho

Abstract

Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LLäMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing a context extension to 8,192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERT Large (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. All in all, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate. When a pre-trained decoder exists and compute is limited, conversion offers an effective alternative. ModernGBERT and LLäMmleinVec, including all code, data, and intermediary checkpoints, are published under a research-only RAIL license.
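The abstract names masked next-token prediction as the conversion objective but does not spell out its mechanics. As a rough illustration only, the sketch below builds training targets the way the objective is usually described for LLM2Vec: a token masked at position i is predicted from the representation at position i - 1, so targets are shifted one step to the left. The function name `mntp_example`, the `[MASK]` string, the `IGNORE` marker, and the masking rate are all assumptions made for this example and are not taken from the paper.

```python
import random

MASK = "[MASK]"   # assumed mask symbol, for illustration only
IGNORE = None     # assumed marker for positions that contribute no loss


def mntp_example(tokens, mask_prob=0.2, seed=0):
    """Build (inputs, targets) for a masked next-token prediction sketch.

    targets[i] holds the original token at position i + 1 whenever that
    token was masked, so the loss would be read from position i.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    targets = [IGNORE] * len(tokens)
    # Position 0 is never masked here because it has no preceding position.
    for pos in range(1, len(tokens)):
        if rng.random() < mask_prob:
            inputs[pos] = MASK
            targets[pos - 1] = tokens[pos]
    return inputs, targets


if __name__ == "__main__":
    words = ["Das", "ist", "ein", "Test"]
    print(mntp_example(words, mask_prob=0.5))
```

In a standard masked language model the target would sit at the masked position itself; the one-step shift is what lets a model pre-trained for next-token prediction reuse its existing output head.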

Topics & Keywords

cs.CL cs.AI cs.LG

Citation Format

Wunderle, J., Ehrmanntraut, A., Pfister, J., Jannidis, F., & Hotho, A. (2025). New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models. https://arxiv.org/abs/2505.13136

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓