arXiv Open Access 2023

Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results

Philipp Ennen, Po-Chun Hsu, Chan-Jan Hsu, Chang-Le Liu, Yen-Chen Wu, +4 others

Abstract

In this paper, we present BLOOM-zh, a multilingual language model with enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models presented by BigScience in 2022. Starting from the released models, we extended the pre-training of BLOOM by an additional 7.4 billion tokens in Traditional Chinese and English, covering a variety of domains such as news articles, books, encyclopedias, and educational materials, as well as spoken language. To show the properties of BLOOM-zh, we evaluate its performance on both existing and newly created benchmark scenarios. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability. We release all our models to the research community.

Authors (9)

Philipp Ennen

Po-Chun Hsu

Chan-Jan Hsu

Chang-Le Liu

Yen-Chen Wu

Yin-Hsiang Liao

Chin-Tung Lin

Da-Shan Shiu

Wei-Yun Ma

Citation Format

Ennen, P., Hsu, P., Hsu, C., Liu, C., Wu, Y., Liao, Y. et al. (2023). Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results. https://arxiv.org/abs/2303.04715

Journal Information
Publication Year
2023
Language
en
Source Database
arXiv
Access
Open Access ✓