
BgGPT 1.0: Extending English-centric LLMs to other languages

Anton Alexandrov, Veselin Raychev, Dimitar I. Dimitrov, Ce Zhang, Martin Vechev, Kristina Toutanova

Abstract

We present BgGPT-Gemma-2-27B-Instruct and BgGPT-Gemma-2-9B-Instruct: continually pretrained and fine-tuned versions of Google's Gemma-2 models, specifically optimized for Bulgarian language understanding and generation. Leveraging Gemma-2's multilingual capabilities and over 100 billion tokens of Bulgarian and English text data, our models demonstrate strong performance in Bulgarian language tasks, setting a new standard for language-specific AI models. Our approach maintains the robust capabilities of the original Gemma-2 models, ensuring that the English language performance remains intact. To preserve the base model capabilities, we incorporate continual learning strategies based on recent Branch-and-Merge techniques as well as thorough curation and selection of training data. We provide detailed insights into our methodology, including the release of model weights with a commercial-friendly license, enabling broader adoption by researchers, companies, and hobbyists. Further, we establish a comprehensive set of benchmarks based on non-public educational data sources to evaluate models on Bulgarian language tasks as well as safety and chat capabilities. Our findings demonstrate the effectiveness of fine-tuning state-of-the-art models like Gemma-2 to enhance language-specific AI applications while maintaining cross-lingual capabilities.
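The Branch-and-Merge continual learning strategy referenced above alternates between training parallel branches of a model on different data slices and merging the branch weights back into a single model, which helps limit catastrophic forgetting of the base (English) capabilities. Below is a minimal sketch of the merging step, assuming PyTorch models and simple linear weight averaging (one common merging scheme; the helper merge_state_dicts and the toy models are illustrative assumptions, not the authors' released code):

# Sketch: merge parallel "branch" checkpoints by averaging weights.
# Linear averaging is one common merging scheme; the paper's exact
# merging variant may differ.
from collections import OrderedDict
import torch
import torch.nn as nn

def merge_state_dicts(state_dicts, coeffs=None):
    """Average branch checkpoints parameter-by-parameter.

    coeffs: optional per-branch mixing weights summing to 1;
    defaults to a uniform average across branches.
    """
    if coeffs is None:
        coeffs = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = OrderedDict()
    for name in state_dicts[0]:
        merged[name] = sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
    return merged

# Toy usage: two branches of the same architecture stand in for
# checkpoints continually pretrained on different data shards.
branch_a, branch_b = nn.Linear(4, 4), nn.Linear(4, 4)
merged_model = nn.Linear(4, 4)
merged_model.load_state_dict(
    merge_state_dicts([branch_a.state_dict(), branch_b.state_dict()])
)

In the full Branch-and-Merge procedure, the merged model then serves as the starting point for the next round of branching and training, so each merge anchors the branches back toward a shared set of weights.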


Authors (6)

Anton Alexandrov
Veselin Raychev
Dimitar I. Dimitrov
Ce Zhang
Martin Vechev
Kristina Toutanova

Citation

Alexandrov, A., Raychev, V., Dimitrov, D. I., Zhang, C., Vechev, M., & Toutanova, K. (2024). BgGPT 1.0: Extending English-centric LLMs to other languages. arXiv. https://arxiv.org/abs/2412.10893

Publication Information
Year Published
2024
Language
en
Source Database
arXiv
Access
Open Access ✓