arXiv Open Access 2026

Diagnosing Pathological Chain-of-Thought in Reasoning Models

Manqing Liu, David Williams-King, Ida Caspary, Linh Le, Hannes Whittingham, +3 others

Abstract

Chain-of-thought (CoT) reasoning is fundamental to modern LLM architectures and represents a critical intervention point for AI safety. However, CoT reasoning may exhibit failure modes, which we refer to as pathologies, that prevent it from being useful for monitoring. Prior work has identified three distinct pathologies: post-hoc rationalization, where models generate plausible explanations backwards from predetermined answers; encoded reasoning, where intermediate steps conceal information within seemingly interpretable text; and internalized reasoning, where models replace explicit reasoning with meaningless filler tokens while computing internally. To better understand and discriminate between these pathologies, we create a set of concrete metrics that are simple to implement, computationally inexpensive, and task-agnostic. To validate our approach, we develop model organisms deliberately trained to exhibit specific CoT pathologies. Our work provides a practical toolkit for assessing CoT pathologies, with direct implications for training-time monitoring.

Topics & Keywords

cs.AI

Authors (8)

Manqing Liu

David Williams-King

Ida Caspary

Linh Le

Hannes Whittingham

Puria Radmard

Cameron Tice

Edward James Young

Citation Format

Liu, M., Williams-King, D., Caspary, I., Le, L., Whittingham, H., Radmard, P. et al. (2026). Diagnosing Pathological Chain-of-Thought in Reasoning Models. https://arxiv.org/abs/2602.13904

Journal Information

Publication Year: 2026
Language: en
Database Source: arXiv
Access: Open Access ✓