DOAJ Open Access 2026

Traditional Chinese medicine text segmentation model with multi-metadata embedding based on Bidirectional LSTM

Dangguo Shao Nuoyun Duan Lei Ma Sanli Yi

Abstract

In recent years, natural language processing (NLP) has gained increasing attention in traditional Chinese medicine (TCM). Chinese word segmentation (CWS) is a fundamental NLP task that splits running text into words according to context. TCM texts, however, are often incomplete, ambiguous, and structurally irregular, which leads to suboptimal segmentation. The problem is compounded by the limited size of available datasets and by the inability of existing models to capture the distinctive characteristics of TCM texts. This paper proposes MBMC, a multi-metadata embedding TCM text segmentation model based on a Bidirectional Long Short-Term Memory network (BiLSTM) combined with a Multi-Head Self-Attention mechanism (MHA). The study first constructs TCM2023, a comprehensive TCM corpus designed explicitly for CWS tasks. This corpus fills a significant gap in the literature, as no dedicated dataset for TCM word segmentation previously existed. MBMC uses Word2Vec and convolutional neural networks (CNN) to extract features from Chinese characters and their structural information, and feeds these features into the neural network to improve contextual feature extraction. Conditional Random Fields (CRF) then refine the segmentation output. On the TCM2023 dataset, MBMC achieved an F1-score of 95.64%, providing a reliable foundation for downstream NLP tasks in the TCM domain. On the MSRA and PKU datasets, it obtained F1-scores of 98.93% and 97.50%, respectively. It outperformed several pre-trained models, including BERT and Glyce-BERT, while requiring less time and memory.
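The F1-scores reported above are the usual way CWS systems are compared. The abstract does not describe the evaluation script, so the following is only a minimal sketch of the common word-level convention, assuming a predicted word counts as correct when its character span exactly matches a gold word. The function names (`to_spans`, `segmentation_prf`) and the sample sentence are hypothetical and are not taken from the paper or from TCM2023.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 over a list of sentences.

    Assumption: a predicted word is correct only if both of its
    boundaries match a gold word (exact span match).
    """
    correct = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        if "".join(g) != "".join(p):
            raise ValueError("gold and predicted sentences differ in characters")
        gold_spans, pred_spans = to_spans(g), to_spans(p)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction wrongly splits one two-character word.
gold = [["黄芪", "补气", "固表"]]
pred = [["黄芪", "补", "气", "固表"]]
p, r, f1 = segmentation_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.50 R=0.67 F1=0.57
```

In this toy case two of the four predicted words match gold spans and two of the three gold words are recovered, so a single over-segmentation error lowers both precision and recall.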

Authors (4)

Dangguo Shao

Nuoyun Duan

Lei Ma

Sanli Yi

Citation Format

Shao, D., Duan, N., Ma, L., Yi, S. (2026). Traditional Chinese medicine text segmentation model with multi-metadata embedding based on Bidirectional LSTM. https://doi.org/10.1016/j.ijcce.2025.10.006

Journal Information
Year of Publication: 2026
Source Database: DOAJ
DOI: 10.1016/j.ijcce.2025.10.006
Access: Open Access ✓