arXiv Open Access 2025

Japanese Children's Riddles as a Benchmark for Machine Insight and Metacognition

Masaharu Mizumoto, Dat Nguyen, Zhiheng Han, Jiyuan Fang, Heyuan Guan, +4 others
Abstract

Benchmark saturation and contamination have obscured genuine advances in reasoning for large language models (LLMs). We introduce NazoNazo Benchmark, a low-cost, renewable test built from Japanese children's riddles that demand insight-based reasoning, or representational shifts rather than knowledge recall. We evaluate 38 frontier LLMs (2023-2025) on 201 riddles and a 120-item human-comparison subset, finding that non-reasoning models average 7.6%, reasoning models 17.6%, and humans ~53% accuracy. Importantly, thought-log analysis reveals that reasoning in Japanese did not necessarily improve accuracy, indicating that language understanding alone is insufficient for insight reasoning. Notably, models sometimes generated correct candidates but failed to endorse them, suggesting weak metacognitive control rather than a lack of knowledge. This "verification failure" indicates that CoT outputs can reflect genuine intermediate reasoning states rather than post-hoc rationalizations. By exposing this metacognitive bottleneck - models' inability to recognize when they are right - the benchmark provides a scalable, cross-linguistic testbed for studying machine insight, confidence calibration, and self-evaluation. NazoNazo Benchmark thus offers not only a fresh challenge to current LLMs but also a concrete target for developing AI metacognitive psychology and enhancing machine Aha! capability.
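
The "verification failure" described above (a correct candidate appears in the thought log, but the model commits to a different final answer) can be made concrete with a small scoring sketch. This is only an illustration under stated assumptions, not the authors' evaluation code: the `Attempt` record, its field names, and the exact-match `normalize` rule are all hypothetical, and real riddle answers would likely need more lenient matching.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Attempt:
    """One model attempt at one riddle (hypothetical record layout)."""
    gold: str               # reference answer for the riddle
    candidates: list[str]   # answers floated in the thought log
    final: str              # answer the model finally committed to


def normalize(text: str) -> str:
    # Assumption: plain case-insensitive exact match after trimming.
    return text.strip().lower()


def classify(attempt: Attempt) -> str:
    """Label an attempt as correct, verification_failure, or miss."""
    gold = normalize(attempt.gold)
    if normalize(attempt.final) == gold:
        return "correct"
    # The right answer was generated but not endorsed as the final answer.
    if any(normalize(c) == gold for c in attempt.candidates):
        return "verification_failure"
    return "miss"


def summarize(attempts: list[Attempt]) -> dict[str, float]:
    """Return the share of attempts falling into each outcome."""
    if not attempts:
        raise ValueError("need at least one attempt")
    counts = Counter(classify(a) for a in attempts)
    total = len(attempts)
    return {
        label: counts[label] / total
        for label in ("correct", "verification_failure", "miss")
    }


# Placeholder data, not drawn from the benchmark.
demo = [
    Attempt(gold="a", candidates=["b", "a"], final="a"),
    Attempt(gold="a", candidates=["a", "c"], final="c"),
    Attempt(gold="a", candidates=["b"], final="b"),
    Attempt(gold="a", candidates=[], final="d"),
]
rates = summarize(demo)
```

Separating `verification_failure` from `miss` is the point of the sketch: plain accuracy would lump both together, hiding cases where the model had the answer in hand and rejected it.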

Authors (9)

Masaharu Mizumoto
Dat Nguyen
Zhiheng Han
Jiyuan Fang
Heyuan Guan
Xingfu Li
Naoya Shiraishi
Yo Nakawake
Le Minh Nguyen

Citation Format

Mizumoto, M., Nguyen, D., Han, Z., Fang, J., Guan, H., Li, X. et al. (2025). Japanese Children's Riddles as a Benchmark for Machine Insight and Metacognition. https://arxiv.org/abs/2509.14704

Journal Information
Publication year: 2025
Language: en
Source database: arXiv
Access: Open Access