arXiv Open Access 2026

Attention Head Entropy of LLMs Predicts Answer Correctness

Sophie Ostmeier Brian Axelrod Maya Varma Asad Aali Yabin Zhang +4 others


Abstract

Large language models (LLMs) often generate plausible yet incorrect answers, posing risks in safety-critical settings such as medicine. Human evaluation is expensive, and LLM-as-judge approaches risk introducing hidden errors. Recent white-box methods detect contextual hallucinations using model internals, focusing on the localization of the attention mass, but two questions remain open: do these approaches extend to predicting answer correctness, and do they generalize out of domain? We introduce Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically by measuring the spread of the attention mass. Using sparse logistic regression on per-head 2-Rényi entropies, Head Entropy matches or exceeds baselines in-distribution and generalizes substantially better out of domain, outperforming the closest baseline by +8.5% AUROC on average. We further show that attention patterns over the question/context alone, before answer generation, already carry predictive signal: in this setting, Head Entropy improves on the closest baseline by +17.7% AUROC on average. We evaluate across 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine.
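
To make the feature described in the abstract concrete, the following is a minimal sketch of how a per-head 2-Rényi entropy could be computed from attention weights. It is not the authors' implementation. The function names, the nested-list layout `attention[layer][head][query]`, and the choice to average over query positions are all assumptions made for illustration; the abstract does not specify how entropies are aggregated. The 2-Rényi entropy of a distribution p is -log(sum_i p_i^2), so it is 0 when all attention sits on one token and log(n) when attention is spread uniformly over n tokens.

```python
import math


def renyi2_entropy(attn_row):
    """2-Renyi (collision) entropy of one attention distribution.

    H_2(p) = -log(sum_i p_i^2). Low values mean concentrated attention,
    high values mean attention spread over many tokens.
    """
    total = sum(attn_row)
    probs = [a / total for a in attn_row]  # renormalize defensively
    return -math.log(sum(p * p for p in probs))


def head_entropy_features(attention):
    """Build one feature per (layer, head).

    ASSUMPTION: attention[layer][head][query] is a list of attention
    weights over key positions, and the per-head feature is the mean
    entropy over query positions. The aggregation is hypothetical.
    """
    features = []
    for layer in attention:
        for head in layer:
            entropies = [renyi2_entropy(row) for row in head]
            features.append(sum(entropies) / len(entropies))
    return features


if __name__ == "__main__":
    # Toy input: 1 layer, 2 heads, 1 query position, 4 key positions.
    toy_attention = [[
        [[0.25, 0.25, 0.25, 0.25]],  # head 0: uniform attention
        [[1.0, 0.0, 0.0, 0.0]],      # head 1: fully concentrated attention
    ]]
    print(head_entropy_features(toy_attention))
```

In a pipeline like the one the abstract describes, a feature vector of this kind (one entry per attention head) would then be passed to a sparse, for example L1-penalized, logistic regression trained to predict whether the answer was correct.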

Topics & Keywords

cs.LG

Authors (9)

Sophie Ostmeier

Brian Axelrod

Maya Varma

Asad Aali

Yabin Zhang

Magdalini Paschali

Sanmi Koyejo

Curtis Langlotz

Akshay Chaudhari

Citation Format


Ostmeier, S., Axelrod, B., Varma, M., Aali, A., Zhang, Y., Paschali, M. et al. (2026). Attention Head Entropy of LLMs Predicts Answer Correctness. https://arxiv.org/abs/2602.13699


Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓