SEA-LION: Southeast Asian Languages in One Network
Abstract
Large Language Models (LLMs) have recently dominated much of the artificial intelligence landscape with their ability to process and generate natural language. However, the majority of LLM research and development remains English-centric, leaving low-resource languages, such as those of the Southeast Asian (SEA) region, under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance among LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.
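The abstract lists model merging as one stage of the post-training regime. The paper's exact merging recipe is not given here; the sketch below only illustrates the general idea of linear weight averaging on toy parameter dictionaries, with hypothetical layer names.

```python
# Toy sketch of model merging via linear weight averaging.
# NOTE: the actual SEA-LION merging method is not specified in this abstract;
# the parameter names ("layer.weight", "layer.bias") are illustrative only.

def merge_weights(state_dicts, coeffs):
    """Linearly combine parameter dicts: merged[k] = sum_i coeffs[i] * sd_i[k]."""
    assert abs(sum(coeffs) - 1.0) < 1e-9, "coefficients should sum to 1"
    keys = state_dicts[0].keys()
    return {k: sum(c * sd[k] for c, sd in zip(coeffs, state_dicts)) for k in keys}

# Example: merge two checkpoints with equal weight.
sd_a = {"layer.weight": 1.0, "layer.bias": 0.5}
sd_b = {"layer.weight": 3.0, "layer.bias": -0.5}
merged = merge_weights([sd_a, sd_b], [0.5, 0.5])
print(merged)  # {'layer.weight': 2.0, 'layer.bias': 0.0}
```

In practice the same element-wise combination is applied to full model tensors (e.g. PyTorch `state_dict` entries), with the mixing coefficients chosen by validation performance.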
Authors (31)
Raymond Ng
Thanh Ngan Nguyen
Yuli Huang
Ngee Chia Tai
Wai Yi Leong
Wei Qi Leong
Xianbin Yong
Jian Gang Ngui
Yosephine Susanto
Nicholas Cheng
Hamsawardhini Rengarajan
Peerat Limkonchotiwat
Adithya Venkatadri Hulagadri
Kok Wai Teng
Yeo Yeow Tong
Bryan Siow
Wei Yi Teo
Wayne Lau
Choon Meng Tan
Brandon Ong
Zhi Hao Ong
Jann Railey Montalan
Adwin Chan
Sajeban Antonyrex
Ren Lee
Esther Choa
David Ong Tat-Wee
Bing Jie Darius Liu
William Chandra Tjhi
Erik Cambria
Leslie Teo
Quick Access
- Year Published: 2025
- Language: en
- Source Database: arXiv
- Access: Open Access ✓