arXiv Open Access 2025

Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering

Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Baotian Hu, Min Zhang
Abstract

Existing long-document question answering systems typically process texts as flat sequences or use heuristic chunking, which overlook the discourse structures that naturally guide human comprehension. We present a discourse-aware hierarchical framework that leverages rhetorical structure theory (RST) for long document question answering. Our approach converts discourse trees into sentence-level representations and employs LLM-enhanced node representations to bridge structural and semantic information. The framework involves three key innovations: language-universal discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval. Extensive experiments on four datasets demonstrate consistent improvements over existing approaches through the incorporation of discourse structure, across multiple genres and languages. Moreover, the proposed framework exhibits strong robustness across diverse document types and linguistic settings.
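To make the idea of structure-guided hierarchical retrieval concrete, here is a minimal sketch of a top-down search over a toy discourse tree. It is not the authors' implementation: the `Node` class, the `retrieve` function, the `beam` parameter, the sample tree, and the relation labels are all hypothetical, and plain token overlap stands in for whatever learned similarity and LLM-enhanced node representations the paper actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy discourse tree (hypothetical structure)."""
    text: str                   # a sentence for a leaf, a summary for an internal node
    relation: str = "leaf"      # RST-style relation label, illustrative only
    children: list[Node] = field(default_factory=list)


def tokens(text: str) -> set[str]:
    """Lowercase the words and strip basic punctuation."""
    return {w.strip(".,?!:").lower() for w in text.split()} - {""}


def score(query: str, text: str) -> float:
    """Jaccard token overlap, an assumed stand-in for embedding similarity."""
    q, t = tokens(query), tokens(text)
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve(root: Node, query: str, beam: int = 1) -> list[str]:
    """Descend level by level, keeping the `beam` best children of each node."""
    frontier, leaves = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:
                leaves.append(node.text)    # reached a sentence-level leaf
                continue
            ranked = sorted(
                node.children, key=lambda c: score(query, c.text), reverse=True
            )
            next_frontier.extend(ranked[:beam])
        frontier = next_frontier
    return leaves


# Toy document: internal nodes carry short summaries of their subtrees.
tree = Node(
    text="A report on solar panels: cost and installation.",
    relation="Joint",
    children=[
        Node(
            text="Solar panel cost has fallen.",
            relation="Elaboration",
            children=[
                Node("Panel cost dropped sharply over ten years."),
                Node("Cheaper silicon explains the drop."),
            ],
        ),
        Node(
            text="Installation takes a few days.",
            relation="Sequence",
            children=[
                Node("Crews mount the racks first."),
                Node("Wiring follows the mounting."),
            ],
        ),
    ],
)

print(retrieve(tree, "Why did panel cost drop?", beam=1))
```

With `beam=1` the search follows only the best-matching branch at each level, so the whole installation subtree is never scored at the sentence level. A larger `beam` trades that pruning for recall.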

Authors (6)

Huiyao Chen
Yi Yang
Yinghui Li
Meishan Zhang
Baotian Hu
Min Zhang

Citation Format

Chen, H., Yang, Y., Li, Y., Zhang, M., Hu, B., & Zhang, M. (2025). Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering. https://arxiv.org/abs/2506.06313

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓