Semantic Scholar · Open Access · 2018 · 4026 citations

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Taku Kudo, John Richardson

Abstract

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
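The abstract's claim that segmentation can start from raw sentences, with no pre-tokenization, rests on treating whitespace as an ordinary symbol: SentencePiece escapes spaces with the meta symbol "▁" (U+2581), so detokenization reduces to concatenating the pieces and unescaping. The following is a minimal sketch of that idea using only the Python standard library. It is not the SentencePiece library or its API: the function names, the greedy longest-match segmenter, and the tiny vocabulary are all hypothetical stand-ins for a trained subword model, and the sketch assumes the input text does not itself contain "▁".

```python
# Toy illustration only: NOT the SentencePiece library or its API.
# All names below (escape, encode, decode, toy_vocab) are hypothetical.

WS = "\u2581"  # "▁", the meta symbol used to mark whitespace


def escape(text):
    """Replace spaces with the meta symbol so whitespace is an ordinary character."""
    return text.replace(" ", WS)


def encode(text, vocab):
    """Greedy longest-match segmentation over the escaped raw string.

    A stand-in for a trained subword model. Unknown spans fall back to
    single characters, so every input can be segmented.
    """
    s = escape(text)
    max_len = max((len(p) for p in vocab), default=1)
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest candidate first; a single character always matches.
        for n in range(min(max_len, len(s) - i), 0, -1):
            cand = s[i:i + n]
            if n == 1 or cand in vocab:
                pieces.append(cand)
                i += n
                break
    return pieces


def decode(pieces):
    """Detokenize: concatenate the pieces, then turn the meta symbol back into spaces."""
    return "".join(pieces).replace(WS, " ")


# Hypothetical vocabulary; a real one would be learned from a raw-text corpus.
toy_vocab = {"Hello", WS + "wor", "ld", "."}

pieces = encode("Hello world.", toy_vocab)
restored = decode(pieces)
```

Because the space survives inside the piece "▁wor", `decode(encode(text, vocab))` returns the original string without any language-specific detokenization rules. The same round trip works for text written without spaces, such as Japanese, where every character simply falls back to a single-character piece under this toy vocabulary.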

Topics & Keywords

Computer Science

Authors (2)

Taku Kudo

John Richardson

Citation Format

Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. https://doi.org/10.18653/v1/D18-2012

Journal Information

Publication Year: 2018
Language: en
Total Citations: 4026
Source Database: Semantic Scholar
DOI: 10.18653/v1/D18-2012
Access: Open Access ✓