arXiv Open Access 2024

Towards Building Multilingual Language Model for Medicine

Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, +3 more

Abstract

The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience across regions. To advance this area, we make the following contributions. First, we construct a multilingual medical corpus of approximately 25.5B tokens spanning 6 main languages, termed MMedC, which enables auto-regressive domain adaptation of general LLMs. Second, to track the development of multilingual medical LLMs, we propose a multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of open-source large language models (LLMs) on our benchmark, along with those further auto-regressively trained on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, this work presents a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.

Topics & Keywords

Authors (8)

Pengcheng Qiu
Chaoyi Wu
Xiaoman Zhang
Weixiong Lin
Haicheng Wang
Ya Zhang
Yanfeng Wang
Weidi Xie

Citation Format

Qiu, P., Wu, C., Zhang, X., Lin, W., Wang, H., Zhang, Y. et al. (2024). Towards Building Multilingual Language Model for Medicine. https://arxiv.org/abs/2402.13963

Journal Information
Publication Year
2024
Language
en
Source Database
arXiv
Access
Open Access ✓