DOAJ Open Access 2021

Chinese Word Segmentation Based on Self‐Learning Model and Geological Knowledge for the Geoscience Domain

Wenjia Li, Kai Ma, Qinjun Qiu, Liang Wu, Zhong Xie, +2 others

Abstract

Chinese word segmentation (CWS) is the foundation of text mining for geological reports and strongly influences downstream tasks such as named entity recognition and relation extraction. In recent years, the accuracy of domain‐general CWS models has been limited by the domain and scale of their training corpora, especially for Chinese geological texts. Training these models also requires large amounts of manually annotated data, which costs considerable time and effort. When existing models and methods are applied directly to the geoscience domain, segmentation accuracy drops dramatically. To address this problem, we pretrain Bidirectional Encoder Representations from Transformers (BERT), which can leverage unlabeled domain‐specific knowledge, on unlabeled Chinese geological text and then feed its outputs into a bidirectional long short‐term memory and conditional random field (BiLSTM‐CRF) model to extract text features. Finally, the predicted tags are decoded by the CRF. Experimental results show that the proposed model reaches an F1 score of 96.2% on the constructed test set of geological texts. Additional experiments show that the proposed model performs comparably to other state‐of‐the‐art models and that the proposed cyclic self‐learning strategy can be extended to other domains.
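The abstract states that the CRF decodes predicted tags and that quality is reported as an F1 score, but it does not say which tagging scheme or scoring procedure was used. The sketch below is only an illustration under two assumptions: a B/M/E/S character tagging scheme (begin, middle, end, single), which is common in CWS work, and word-level F1 computed over matching character spans. The function names `tags_to_words`, `word_spans`, and `segmentation_f1` are hypothetical and do not come from the paper.

```python
from typing import List, Set, Tuple


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Group characters into words from per-character tags.

    Assumes the B/M/E/S scheme; the paper's abstract does not name one.
    """
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        # A new word starts at B or S; flush any unfinished word first.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # A word ends at E or S.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence, not taken from the paper's test set.
sentence = "花岗岩分布广泛"
gold_words = ["花岗岩", "分布", "广泛"]
pred_words = tags_to_words(sentence, ["B", "E", "S", "B", "E", "B", "E"])
print(pred_words, segmentation_f1(gold_words, pred_words))
```

In this example the predicted tags split the geological term 花岗岩 (granite) into two pieces, so only two of the four predicted words match the gold segmentation. Errors of this kind on domain terms are what the abstract says general-purpose models tend to make on geoscience text.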

Authors (7)

Wenjia Li
Kai Ma
Qinjun Qiu
Liang Wu
Zhong Xie
Sanfeng Li
Siqiong Chen

Citation Format

Li, W., Ma, K., Qiu, Q., Wu, L., Xie, Z., Li, S. et al. (2021). Chinese Word Segmentation Based on Self‐Learning Model and Geological Knowledge for the Geoscience Domain. https://doi.org/10.1029/2021EA001673

Journal Information

Publication year: 2021
Source database: DOAJ
DOI: 10.1029/2021EA001673
Access: Open Access