arXiv Open Access 2021

gaBERT -- an Irish Language Model

James Barry, Joachim Wagner, Lauren Cassidy, Alan Cowap, Teresa Lynn, Abigail Walsh, Mícheál J. Ó Meachair, Jennifer Foster

Abstract

The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich context-sensitive token encodings that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.


Authors (8)

James Barry
Joachim Wagner
Lauren Cassidy
Alan Cowap
Teresa Lynn
Abigail Walsh
Mícheál J. Ó Meachair
Jennifer Foster

Citation Format

Barry, J., Wagner, J., Cassidy, L., Cowap, A., Lynn, T., Walsh, A., Ó Meachair, M. J., & Foster, J. (2021). gaBERT -- an Irish Language Model. arXiv:2107.12930. https://arxiv.org/abs/2107.12930

Journal Information
Publication Year
2021
Language
en
Source Database
arXiv
Access
Open Access ✓