arXiv Open Access 2025

Japanese Children's Riddles as a Benchmark for Machine Insight and Metacognition

Masaharu Mizumoto, Dat Nguyen, Zhiheng Han, Jiyuan Fang, Heyuan Guan, +4 others

Abstract

Benchmark saturation and contamination have obscured genuine advances in reasoning for large language models (LLMs). We introduce NazoNazo Benchmark, a low-cost, renewable test built from Japanese children's riddles that demand insight-based reasoning, or representational shifts rather than knowledge recall. We evaluate 38 frontier LLMs (2023-2025) on 201 riddles and a 120-item human-comparison subset, finding that non-reasoning models average 7.6%, reasoning models 17.6%, and humans ~53% accuracy. Importantly, thought-log analysis reveals that reasoning in Japanese did not necessarily improve accuracy, indicating that language understanding alone is insufficient for insight reasoning. Notably, models sometimes generated correct candidates but failed to endorse them, suggesting weak metacognitive control rather than a lack of knowledge. This "verification failure" indicates that CoT outputs can reflect genuine intermediate reasoning states rather than post-hoc rationalizations. By exposing this metacognitive bottleneck - models' inability to recognize when they are right - the benchmark provides a scalable, cross-linguistic testbed for studying machine insight, confidence calibration, and self-evaluation. NazoNazo Benchmark thus offers not only a fresh challenge to current LLMs but also a concrete target for developing AI metacognitive psychology and enhancing machine Aha! capability.
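The distinction the abstract draws between plain accuracy and "verification failure" (a correct candidate appears in the thought log but is not the final answer) can be illustrated with a small tally. This is only a minimal sketch, not the authors' evaluation code: the `Trial` record, the `summarize` function, and exact-match scoring after simple normalization are all assumptions made for illustration, and the sample riddle answers are invented.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One model attempt at one riddle (hypothetical record layout)."""
    gold: str              # reference answer
    candidates: list[str]  # answers the model considered in its thought log
    final: str             # answer the model actually committed to


def normalize(text: str) -> str:
    # Assumption: exact match after trimming and lowercasing.
    # The paper's real scoring rule may differ.
    return text.strip().lower()


def summarize(trials: list[Trial]) -> dict[str, float]:
    """Return accuracy and the share of trials with a verification failure."""
    total = len(trials)
    correct = 0
    verification_failures = 0
    for trial in trials:
        gold = normalize(trial.gold)
        if normalize(trial.final) == gold:
            correct += 1
        elif gold in {normalize(c) for c in trial.candidates}:
            # The right answer was generated but not endorsed.
            verification_failures += 1
    if total == 0:
        return {"accuracy": 0.0, "verification_failure_rate": 0.0}
    return {
        "accuracy": correct / total,
        "verification_failure_rate": verification_failures / total,
    }


# Invented example data, for illustration only.
example_trials = [
    Trial(gold="piano", candidates=["guitar", "piano"], final="guitar"),
    Trial(gold="clock", candidates=["clock"], final="clock"),
    Trial(gold="shadow", candidates=["mirror"], final="mirror"),
    Trial(gold="egg", candidates=["Egg ", "stone"], final="stone"),
]
print(summarize(example_trials))
```

Separating the two counts matters because a trial where the correct answer was never generated and a trial where it was generated but rejected are both scored as wrong, yet only the latter points to the metacognitive bottleneck the abstract describes.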

Topics & Keywords

cs.AI

Authors (9)

Masaharu Mizumoto

Dat Nguyen

Zhiheng Han

Jiyuan Fang

Heyuan Guan

Xingfu Li

Naoya Shiraishi

Yo Nakawake

Le Minh Nguyen

Citation Format

APA

Mizumoto, M., Nguyen, D., Han, Z., Fang, J., Guan, H., Li, X. et al. (2025). Japanese Children's Riddles as a Benchmark for Machine Insight and Metacognition. https://arxiv.org/abs/2509.14704

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓