arXiv Open Access 2025

FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

Johann Pignat, Milena Vucetic, Christophe Gaudet-Blavignac, Jamil Zaghir, Amandine Stettler, +6 others
Abstract

Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology), an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as the reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: the 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset represents 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.
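The distinction the abstract draws between unique codes and the different expressions that map to them can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Annotation` fields, the layer labels, and the sample records are hypothetical and do not reflect the corpus's actual file format or contents. The codes 8140/3 (adenocarcinoma, NOS) and C50.9 (breast, NOS) are standard ICD-O examples.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One normalised entity span (hypothetical schema, not FRACCO's format)."""
    doc_id: str
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    text: str    # surface expression as written in the document
    layer: str   # assumed labels: "morphology", "topography", "differentiation", "composite"
    code: str    # ICD-O code assigned to the span


def expressions_per_code(annotations, layer):
    """Group the distinct lower-cased expressions of one layer by their code."""
    by_code = defaultdict(set)
    for ann in annotations:
        if ann.layer == layer:
            by_code[ann.code].add(ann.text.lower())
    return dict(by_code)


# Illustrative records only; these are not taken from the corpus.
sample = [
    Annotation("doc1", 0, 14, "adénocarcinome", "morphology", "8140/3"),
    Annotation("doc2", 10, 24, "Adénocarcinome", "morphology", "8140/3"),
    Annotation("doc2", 30, 34, "sein", "topography", "C50.9"),
]

morphology = expressions_per_code(sample, "morphology")
unique_codes = len(morphology)
unique_expressions = len(set().union(*morphology.values()))
```

In this toy sample, two morphology spans collapse to one code and one case-insensitive expression. Whether the corpus statistics were computed with case folding is not stated in the abstract, so the lower-casing here is a design choice of the sketch.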

Authors (11)

Johann Pignat

Milena Vucetic

Christophe Gaudet-Blavignac

Jamil Zaghir

Amandine Stettler

Fanny Amrein

Jonatan Bonjour

Jean-Philippe Goldman

Olivier Michielin

Christian Lovis

Mina Bjelogrlic

Citation Format

Pignat, J., Vucetic, M., Gaudet-Blavignac, C., Zaghir, J., Stettler, A., Amrein, F. et al. (2025). FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation. https://arxiv.org/abs/2510.13873

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access