Traditional Chinese medicine text segmentation model with multi-metadata embedding based on Bidirectional LSTM
Abstract
In recent years, natural language processing (NLP) has gained increasing attention in traditional Chinese medicine (TCM). Chinese word segmentation (CWS) is a fundamental task in NLP, focusing on accurately segmenting words based on context. However, TCM texts often exhibit incompleteness, ambiguity, and structural challenges, leading to suboptimal segmentation results. The limited size of available datasets and the inability of existing models to capture the unique characteristics of TCM texts further exacerbate this issue. This paper proposes a multi-metadata embedding TCM text segmentation model (MBMC) based on a Bidirectional Long Short-Term Memory network (BiLSTM) combined with a multi-head self-attention mechanism (MHA). The work first constructs a comprehensive TCM corpus (TCM2023) explicitly designed for CWS tasks; this corpus fills a significant gap in the existing literature, as no dedicated datasets for TCM word segmentation previously existed. The MBMC model employs Word2Vec and a CNN to extract features from Chinese characters and their structural information, using these as inputs to the neural network to improve contextual feature extraction. A Conditional Random Field (CRF) layer then refines the segmentation outputs. On the TCM2023 dataset, the MBMC model achieved an F1-score of 95.64%, providing a reliable foundation for downstream NLP tasks in the TCM domain. Furthermore, it obtained F1-scores of 98.93% and 97.50% on the MSRA and PKU datasets, respectively, outperforming several pre-trained models, including BERT and Glyce-BERT, while requiring less time and memory.
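The abstract frames CWS as a sequence task whose outputs a CRF layer refines. In practice (an assumption here, since the abstract does not spell out the tagging scheme), CWS is commonly cast as character-level sequence labeling with BMES tags (Begin/Middle/End of a word, or Single-character word); the model predicts one tag per character and words are recovered from the tag sequence. A minimal sketch of that encoding and decoding:

```python
def words_to_tags(words):
    """Encode a segmented sentence (list of words) as per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                                  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])  # Begin, Middle..., End
    return tags

def tags_to_words(chars, tags):
    """Recover the word segmentation from characters and their BMES tags."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("S", "E"):   # word boundary closes here
            words.append(buf)
            buf = ""
    if buf:                   # tolerate a dangling B/M at sentence end
        words.append(buf)
    return words

# Hypothetical example sentence, already segmented:
sent = ["中医", "药", "文本"]
tags = words_to_tags(sent)                       # ['B', 'E', 'S', 'B', 'E']
chars = list("".join(sent))
assert tags_to_words(chars, tags) == sent        # round-trip recovers the words
```

Under this framing, the BiLSTM+MHA encoder scores the four tags per character and the CRF chooses a globally consistent tag sequence (e.g. ruling out transitions such as B followed by S).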
Authors (4)
Dangguo Shao
Nuoyun Duan
Lei Ma
Sanli Yi
- Year Published: 2026
- Source Database: DOAJ
- DOI: 10.1016/j.ijcce.2025.10.006
- Access: Open Access ✓