arXiv Open Access 2024

Classification of Radiological Text in Small and Imbalanced Datasets in a Non-English Language

Vincent Beliveau, Helene Kaas, Martin Prener, Claes N. Ladefoged, Desmond Elliott, and 3 more

Abstract

Natural language processing (NLP) in the medical domain can underperform in real-world applications involving small datasets in a non-English language with few labeled samples and imbalanced classes. There is as yet no consensus on how to approach this problem. We evaluated a set of NLP models including BERT-like transformers, few-shot learning with sentence transformers (SetFit), and prompted large language models (LLM), using three datasets of radiology reports on magnetic resonance images of epilepsy patients in Danish, a low-resource language. Our results indicate that BERT-like models pretrained in the target domain of radiology reports currently offer the best performance in this scenario. Notably, the SetFit and LLM models underperformed compared to BERT-like models, with LLM performing the worst. Importantly, none of the models investigated was sufficiently accurate to allow for text classification without any supervision. However, they show potential for data filtering, which could reduce the amount of manual labeling required.
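To illustrate the problem setting the abstract describes (text classification with few labeled samples and imbalanced classes), the snippet below is a minimal sketch of a classical supervised baseline, not any of the models evaluated in the paper. It uses scikit-learn with `class_weight="balanced"`, a common mitigation for class imbalance; the texts and labels are invented stand-ins, not data from the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny synthetic stand-in for a small, imbalanced report dataset
# (hypothetical labels: 1 = abnormal finding, 0 = normal).
texts = [
    "no abnormal findings", "normal study", "unremarkable scan",
    "normal appearance of the hippocampus", "no focal lesion seen",
    "normal signal intensity",
    "hippocampal sclerosis on the left",
    "focal cortical dysplasia suspected",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# class_weight="balanced" reweights the loss by inverse class
# frequency, so the minority class is not drowned out.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(texts, labels)

preds = clf.predict(["left hippocampal sclerosis", "unremarkable scan"])
```

With only a handful of labels per class, such a baseline is fragile; the paper's finding that domain-pretrained BERT-like models work best suggests pretrained representations matter more than the classifier head in this regime.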


Authors (8)

Vincent Beliveau
Helene Kaas
Martin Prener
Claes N. Ladefoged
Desmond Elliott
Gitte M. Knudsen
Lars H. Pinborg
Melanie Ganz

Citation Format

Beliveau, V., Kaas, H., Prener, M., Ladefoged, C.N., Elliott, D., Knudsen, G.M. et al. (2024). Classification of Radiological Text in Small and Imbalanced Datasets in a Non-English Language. https://arxiv.org/abs/2409.20147

Journal Information
Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access ✓