Thread | Brutkey

"This framework reveals that benchmarks are only valid if LLMs misunderstand concepts in the same ways that humans do. If models deviate from human-like misunderstandings, they can still correctly answer keystone questions without truly understanding the underlying concept, producing potemkins."

https://arxiv.org/html/2506.21521v1