arXiv Open Access 2024

LongKey: Keyphrase Extraction for Long Documents

Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal

Abstract

In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.

Topics & Keywords

cs.CL cs.AI cs.IR cs.LG

Authors (4)

Jeovane Honorio Alves

Radu State

Cinthia Obladen de Almendra Freitas

Jean Paul Barddal

Citation Format

Alves, J. H., State, R., Freitas, C. O. d. A., & Barddal, J. P. (2024). LongKey: Keyphrase Extraction for Long Documents. https://arxiv.org/abs/2411.17863

Journal Information

Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access ✓