
On the Impact of Cross-Domain Data on German Language Models

Amin Dada, Aokun Chen, Cheng Peng, Kaleb E Smith, Ahmad Idrissi-Yaghir, +10 more

Abstract

Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to $4.45\%$ over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen
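For reference, the released checkpoints can be loaded with the Hugging Face `transformers` library. The sketch below is a minimal, unofficial example that assumes a masked-language-model checkpoint hosted under the linked organization; the concrete repository ID (`geberta-base`) is an assumption, not confirmed by this page.

```python
# Minimal sketch (not from the paper): loading one of the released German
# models from the Hugging Face Hub. The organization URL comes from the
# abstract; the repository name "geberta-base" is an assumption -- check
# https://huggingface.co/ikim-uk-essen for the actual model IDs.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "ikim-uk-essen/geberta-base"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Fill-mask probe on a German sentence.
text = f"Berlin ist die Hauptstadt von {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring token at the mask position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```

For the downstream benchmarks reported in the paper, such checkpoints would typically be fine-tuned on the target task rather than probed via fill-mask.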

Topics & Keywords

Authors (15)

Amin Dada
Aokun Chen
Cheng Peng
Kaleb E Smith
Ahmad Idrissi-Yaghir
Constantin Marc Seibold
Jianning Li
Lars Heiliger
Xi Yang
Christoph M. Friedrich
Daniel Truhn
Jan Egger
Jiang Bian
Jens Kleesiek
Yonghui Wu

Citation Format

Dada, A., Chen, A., Peng, C., Smith, K. E., Idrissi-Yaghir, A., Seibold, C. M., et al. (2023). On the Impact of Cross-Domain Data on German Language Models. arXiv:2310.07321. https://arxiv.org/abs/2310.07321

Journal Information
Publication Year
2023
Language
en
Database Source
arXiv
Access
Open Access ✓