SEA-LION: Southeast Asian Languages in One Network
Abstract
Large Language Models (LLMs) have recently dominated much of the artificial intelligence landscape with their ability to process and generate natural language. However, the majority of LLM research and development remains English-centric, leaving low-resource languages, such as those of the Southeast Asian (SEA) region, under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance among LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.
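The abstract lists model merging as one stage of the post-training regime. The paper's exact merging recipe is not given here; the sketch below only illustrates the general idea of linear weight averaging on toy parameter dictionaries, with hypothetical layer names.

```python
# Toy sketch of model merging via linear weight averaging.
# NOTE: the actual SEA-LION merging method is not specified in this abstract;
# the parameter names ("layer.weight", "layer.bias") are illustrative only.

def merge_weights(state_dicts, coeffs):
    """Linearly combine parameter dicts: merged[k] = sum_i coeffs[i] * sd_i[k]."""
    assert abs(sum(coeffs) - 1.0) < 1e-9, "coefficients should sum to 1"
    keys = state_dicts[0].keys()
    return {k: sum(c * sd[k] for c, sd in zip(coeffs, state_dicts)) for k in keys}

# Example: merge two checkpoints with equal weight.
sd_a = {"layer.weight": 1.0, "layer.bias": 0.5}
sd_b = {"layer.weight": 3.0, "layer.bias": -0.5}
merged = merge_weights([sd_a, sd_b], [0.5, 0.5])
print(merged)  # {'layer.weight': 2.0, 'layer.bias': 0.0}
```

In practice the same element-wise combination is applied to full model tensors (e.g. PyTorch `state_dict` entries), with the mixing coefficients chosen by validation performance.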
Authors (31)
Raymond Ng
Thanh Ngan Nguyen
Yuli Huang
Ngee Chia Tai
Wai Yi Leong
Wei Qi Leong
Xianbin Yong
Jian Gang Ngui
Yosephine Susanto
Nicholas Cheng
Hamsawardhini Rengarajan
Peerat Limkonchotiwat
Adithya Venkatadri Hulagadri
Kok Wai Teng
Yeo Yeow Tong
Bryan Siow
Wei Yi Teo
Wayne Lau
Choon Meng Tan
Brandon Ong
Zhi Hao Ong
Jann Railey Montalan
Adwin Chan
Sajeban Antonyrex
Ren Lee
Esther Choa
David Ong Tat-Wee
Bing Jie Darius Liu
William Chandra Tjhi
Erik Cambria
Leslie Teo
Quick Access
- Year Published: 2025
- Language: en
- Source Database: arXiv
- Access: Open Access ✓