
New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models

Julia Wunderle Anton Ehrmanntraut Jan Pfister Fotis Jannidis Andreas Hotho

Abstract

Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: (1) training from scratch and (2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LLäMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing context extension to 8,192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg. 0.808), surpassing GBERT Large (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. All in all, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate. When a pre-trained decoder exists and compute is limited, conversion offers an effective alternative. ModernGBERT and LLäMmleinVec, including all code, data, and intermediary checkpoints, are published under a research-only RAIL license.
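To make the conversion route concrete: in LLM2Vec-style training, the decoder's causal attention is replaced with bidirectional attention and the model is then trained with masked next-token prediction (MNTP), where a masked token is predicted from the hidden state at the preceding position. Below is a minimal PyTorch sketch of the MNTP objective; the toy bidirectional Transformer, dimensions, masking rate, and mask token id are illustrative assumptions, not the paper's actual training setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for a decoder whose causal mask has been removed:
# a small bidirectional TransformerEncoder plus an LM head.
vocab_size, d_model, seq_len, batch = 100, 32, 16, 4
embed = torch.nn.Embedding(vocab_size, d_model)
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = torch.nn.Linear(d_model, vocab_size)

input_ids = torch.randint(0, vocab_size, (batch, seq_len))
labels = input_ids.clone()

# Mask ~15% of positions (an assumed rate); position 0 is excluded
# because it has no predecessor to predict it from.
mask_token_id = 0
probs = torch.rand(batch, seq_len)
probs[:, 0] = 1.0
masked = probs < 0.15
input_ids[masked] = mask_token_id

# Forward pass with full (bidirectional) attention over the sequence.
hidden = encoder(embed(input_ids))
logits = lm_head(hidden)

# MNTP: the token at a masked position i is predicted from the
# hidden state at position i - 1, not at i itself.
shifted_logits = logits[:, :-1, :]   # predictions from positions 0..L-2
shifted_targets = labels[:, 1:]      # original tokens at positions 1..L-1
shifted_mask = masked[:, 1:]         # which of those targets were masked

loss = F.cross_entropy(
    shifted_logits[shifted_mask], shifted_targets[shifted_mask]
)
print(f"MNTP loss: {loss.item():.3f}")
```

The one-position shift is the only difference from standard masked language modeling: it keeps the prediction head aligned with the next-token convention the decoder was originally trained under.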


Authors (5)

Julia Wunderle
Anton Ehrmanntraut
Jan Pfister
Fotis Jannidis
Andreas Hotho

Citation Format

Wunderle, J., Ehrmanntraut, A., Pfister, J., Jannidis, F., & Hotho, A. (2025). New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models. arXiv. https://arxiv.org/abs/2505.13136

Quick Access

View at Source
Journal Information
Publication Year
2025
Language
en
Source Database
arXiv
Access
Open Access ✓