arXiv Open Access 2023

Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition

Taehee Jeon Bongseok Yang Changhwan Kim Yoonseob Lim

Abstract

We introduce a morpheme-aware subword tokenization method that uses sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean, a language characterized by rich morphology and a unique writing system. Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs). Our evaluations show that this technique performs well overall and notably improves results on the syntactic task NIKL-CoLA. This suggests that integrating morpheme type information can enhance language models' syntactic and semantic capabilities, and that incorporating further linguistic insights can improve performance beyond what standard morphological analysis provides.
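
To make "sub-character decomposition" concrete, the following is a minimal sketch of how a precomposed Hangul syllable can be split into its constituent jamo using standard Unicode arithmetic. It is an illustration only and is not taken from the paper: the function name `decompose` and the choice of conjoining jamo (U+1100 block) as the output alphabet are assumptions, and the authors' actual decomposition scheme and its interaction with morpheme-aware BPE may differ.

```python
# Illustrative sketch only; not the authors' implementation.
# Uses the standard Unicode layout of precomposed Hangul syllables:
#   code point = 0xAC00 + (lead * 21 + vowel) * 28 + tail

HANGUL_BASE = 0xAC00
NUM_SYLLABLES = 19 * 21 * 28  # 11172 precomposed syllables

# Conjoining jamo tables (assumed output alphabet for this sketch).
LEADS = [chr(0x1100 + i) for i in range(19)]         # initial consonants
VOWELS = [chr(0x1161 + i) for i in range(21)]        # medial vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]  # optional final consonants


def decompose(text: str) -> str:
    """Split every precomposed Hangul syllable in `text` into jamo.

    Characters outside the Hangul Syllables block are passed through unchanged.
    """
    pieces = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < NUM_SYLLABLES:
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            pieces.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:
            pieces.append(ch)
    return "".join(pieces)
```

Under this scheme a single syllable block becomes two or three code points, so a subword learner such as BPE can in principle merge or split units below the syllable level. For precomposed syllables the output matches Unicode canonical decomposition (NFD).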

Citation Format

Jeon, T., Yang, B., Kim, C., Lim, Y. (2023). Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition. https://arxiv.org/abs/2311.03928

Journal Information
Publication Year: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓