arXiv Open Access 2025

SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting

Mohammed Ali, Abdelrahman Abdallah, Adam Jatowt

Abstract

The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems require high-quality, domain-specific question-answering datasets for this task. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline that generates comprehensive QA pairs from corporate sustainability and annual reports by integrating semantic chunk classification, a hybrid span extraction pipeline, and a specialized table-to-paragraph transformation. To ensure high quality, generation is followed by a novel automated assessment and refinement pipeline that systematically validates each QA pair for faithfulness and relevance, repairing or discarding low-quality entries. The result is a robust dataset of over 195,000 diverse factoid and non-factoid QA pairs. Its effectiveness is demonstrated by initial fine-tuning experiments in which a compact 8B-parameter model significantly outperforms much larger state-of-the-art models. SustainableQA proves to be a highly effective resource for developing and benchmarking advanced knowledge assistants capable of navigating complex sustainability compliance data.

Topics & Keywords

cs.IR

Authors (3)

Mohammed Ali

Abdelrahman Abdallah

Adam Jatowt

Citation Format

Ali, M., Abdallah, A., & Jatowt, A. (2025). SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting. https://arxiv.org/abs/2508.03000

Journal Information

Publication Year: 2025
Language: English
Source Database: arXiv
Access: Open Access ✓