arXiv Open Access 2026

Attention Head Entropy of LLMs Predicts Answer Correctness

Sophie Ostmeier, Brian Axelrod, Maya Varma, Asad Aali, Yabin Zhang, +4 others

Abstract

Large language models (LLMs) often generate plausible yet incorrect answers, posing risks in safety-critical settings such as medicine. Human evaluation is expensive, and LLM-as-judge approaches risk introducing hidden errors. Recent white-box methods detect contextual hallucinations using model internals, focusing on the localization of the attention mass, but two questions remain open: do these approaches extend to predicting answer correctness, and do they generalize out of domain? We introduce Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically measuring the spread of the attention mass. Using sparse logistic regression on per-head 2-Rényi entropies, Head Entropy matches or exceeds baselines in-distribution and generalizes substantially better out of domain, outperforming the closest baseline by +8.5% AUROC on average. We further show that attention patterns over the question/context alone, before answer generation, already carry predictive signal: in this setting, Head Entropy achieves on average +17.7% AUROC over the closest baseline. We evaluate across 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine.
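The per-head feature the abstract describes is the 2-Rényi (collision) entropy of an attention distribution, H₂(p) = −log Σᵢ pᵢ², which is maximal for uniform attention and near zero for attention concentrated on one token. A minimal NumPy sketch of this feature (the input shape and softmax-normalized attention rows are assumptions here; how the paper pools these entropies over tokens and layers before the sparse logistic regression is not specified in this abstract):

```python
import numpy as np

def renyi2_entropy(attn):
    """2-Rényi (collision) entropy of each attention distribution.

    attn: array of shape (..., seq_len); each row is a softmax-normalized
    attention distribution over tokens (rows sum to 1).
    Returns H_2(p) = -log(sum_i p_i^2), computed along the last axis.
    """
    return -np.log(np.sum(attn ** 2, axis=-1))

# Toy example: two attention heads attending over 4 tokens.
attn = np.array([
    [0.25, 0.25, 0.25, 0.25],  # uniform attention -> maximal entropy, log(4)
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked attention -> entropy near 0
])
H = renyi2_entropy(attn)  # one entropy value per head
```

These per-head entropies would then serve as features for a sparse (L1-penalized) logistic regression predicting answer correctness, e.g. scikit-learn's `LogisticRegression(penalty="l1")`; that pipeline choice is an illustration, not the paper's exact setup.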


Authors (9)

Sophie Ostmeier
Brian Axelrod
Maya Varma
Asad Aali
Yabin Zhang
Magdalini Paschali
Sanmi Koyejo
Curtis Langlotz
Akshay Chaudhari

Citation Format

Ostmeier, S., Axelrod, B., Varma, M., Aali, A., Zhang, Y., Paschali, M. et al. (2026). Attention Head Entropy of LLMs Predicts Answer Correctness. https://arxiv.org/abs/2602.13699

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓