Brutkey

Jonathan Yu
@jawnsy@mastodon.social
🌐

"This framework reveals that benchmarks are only valid if LLMs misunderstand concepts in the same ways that humans do. If models deviate from human-like misunderstandings, they can still correctly answer keystone questions without truly understanding the underlying concept, producing potemkins."

https://arxiv.org/html/2506.21521v1