Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition
Abstrak
We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean, a language characterized by its rich morphology and unique writing system. Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs). Our evaluations show that this technique achieves good performances overall, notably improving results in the syntactic task of NIKL-CoLA. This suggests that integrating morpheme type information can enhance language models' syntactic and semantic capabilities, indicating that adopting more linguistic insights can further improve performance beyond standard morphological analysis.
Topik & Kata Kunci
Penulis (4)
Taehee Jeon
Bongseok Yang
Changhwan Kim
Yoonseob Lim
Akses Cepat
- Tahun Terbit
- 2023
- Bahasa
- en
- Sumber Database
- arXiv
- Akses
- Open Access ✓