SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting
Abstrak
The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports, a task for which Large Language Models and Retrieval-RAG systems require high-quality, domain-specific question-answering datasets. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline that generates comprehensive QA pairs from corporate sustainability and annual reports by integrating semantic chunk classification, a hybrid span extraction pipeline, and a specialized table-to-paragraph transformation. To ensure high quality, the generation is followed by a novel automated assessment and refinement pipeline that systematically validates each QA pair for faithfulness and relevance, repairing or discarding low-quality entries. This results in a final, robust dataset of over 195,000 diverse factoid and non-factoid QA pairs, whose effectiveness is demonstrated by initial fine-tuning experiments where a compact 8B parameter model significantly outperforms much larger state-of-the-art models. SustainableQA proves to be a highly effective resource for developing and benchmarking advanced knowledge assistants capable of navigating complex sustainability compliance data.
Topik & Kata Kunci
Penulis (3)
Mohammed Ali
Abdelrahman Abdallah
Adam Jatowt
Akses Cepat
- Tahun Terbit
- 2025
- Bahasa
- en
- Sumber Database
- arXiv
- Akses
- Open Access ✓