The ARC-AGI-3 leaderboard is a scatter plot with score in percent on the Y-axis and cost in dollars on the X-axis. All frontier models score below 0.6 percent at costs ranging from $1,000 to $10,000. No frontier model cracks the 1 percent mark; GPT-5.5 leads with 0.4 percent at a cost of around $10,000. | Image: ARC Prize Foundation

According to the benchmark developers, the more interesting part is the reasoning behind the failures. The recorded “reasoning traces,” where the model documents its solution attempts, let them trace exactly where a model formed a hypothesis, where it rejected a correct one, and where it got stuck on a wrong one.

Models see the details but miss the big picture

The analysis identified three systematic error patterns that both models share, though in different ways. The most common pattern: models correctly pick up on local effects but can’t turn them into a working world model. A model might recognize that a certain action rotates an object, but it can’t figure out that the rotation determines which side receives a new value and that the object needs to be aligned before the next action.

According to the analysis, Opus 4.7 knew by step 4 in the game cd82 that ACTION3 rotates a container. By step 6, it had recognized that ACTION5 pours paint. But the model never connected these observations into the realization that it needed to align the bucket and then dip it to reproduce the target image in the top left.

Opus 4.7 understands that ACTION3 rotates objects but fails to grasp the overarching game mechanics. | Image: ARC Prize Foundation

A similar pattern showed up in cn04: Opus found the correct rotate-then-place interaction at step 23 but then optimized for the wrong target and started tracking a progress bar that didn’t exist.

Training data leads to false analogies

The second error pattern: models confuse unknown environments with familiar games from their training data. Across the runs, models repeatedly mistook unknown mechanics for Tetris, Frogger, Sokoban, Breakout, Pong, or Boulder Dash.

A loose visual resemblance spirals into a full gameplay theory, and the model wastes its actions on the wrong mechanics. GPT-5.5, for example, interpreted the ls20 environment as Breakout when it was actually about key combinations.

“Then again, it could be more like ‘Breakout,’ with bricks at the top and a paddle. The central object might be the ball,” the model wrote in its reasoning traces. This completely baseless assumption killed any chance of progress, a mistake a human familiar with Breakout would almost never make.

GPT-5.5 confuses the ls20 environment with the arcade classic Breakout. | Image: ARC Prize Foundation

Solving a level doesn’t mean understanding the game

The third error pattern might be the most consequential. Even when a model solves a level, that success doesn’t translate into deeper understanding because the model never checks why its strategy worked.

In ka59, Opus solved level 1 in 37 actions, but on the basis of a false theory: it assumed a click would teleport the active character. In reality, the game requires matching and pushing shapes. Level 1 was solved only because its simple structure happened to lead to the goal even with the wrong mechanics.

Since the model treated its success as confirmation of the teleportation theory, the wrong assumption hardened into “click each target to fill it” by level 2. The model didn’t recover from this mistake.

Opus 4.7 gets stuck in a click loop on ka59 after a wrong theory appeared to be confirmed by its level 1 success. | Image: ARC Prize Foundation

In ar25, the same pattern played out on a different level: Opus solved level 1 with a correct insight about mirrored motion and even spotted the new mechanics of a movable axis in level 2. But instead of following up on this correct observation, the model drifted into hallucinated rules and tried to “punch holes” or mirror objects. The right approach got buried under false hypotheses.

Both cases show that without examining why a level was won, models carry misconceptions into the next one.

Opus locks onto wrong theories, GPT-5.5 can’t commit to right ones

According to the analysis, Opus 4.7 is better at picking up mechanics early. On ar25, it identified the mirror structure almost immediately and solved level 1. But Opus tends to aggressively lock onto a false rule and never let go. In cn04, for example, it invented a progress and conversion theory and spent the early game clicking aimlessly within that framework. It had a theory, just the wrong one.

GPT-5.5 has the opposite problem. Its hypothesis generation is broader, so it’s more likely to land on the right idea but can’t turn it into an action plan. On ar25, it correctly identified the mirror effect but then kept expanding the possibility space, cycling through Tetris, Frogger, Pong, and Tower of Hanoi instead of committing. The model saw the right approach but couldn’t bring itself to follow through.

“The difference comes down to compression. Opus compressed its observations into a confident but wrong theory. GPT-5.5 had difficulty compressing at all,” writes Greg Kamradt from the ARC Prize Foundation.

Error patterns could matter beyond benchmarks

The ARC Prize Foundation argues that these error patterns are directly relevant to real AI agents. Each of the 135 environments was solved by at least two humans without any special training.

What makes the tasks hard for models is the same thing AI agents face in real work environments, whether it's an unfamiliar website, an internal tool, or an undocumented API: navigating something completely unknown, forming a theory, testing it, and updating it when things don't add up.

“Scores tell you what a model achieved. Replays tell you whether or not the reasoning is likely to generalize,” Kamradt writes. The foundation plans to keep auditing every major frontier release with ARC-AGI-3.

Other studies point to the same conclusion

The analysis is likely to bolster AI critics who have argued for years that large language models are sophisticated pattern matchers that lack real understanding. When GPT-5.5 reflexively labels an unknown game environment as Breakout, it illustrates the idea that language models interpolate between learned patterns instead of forming abstract rules. And Opus 4.7 solving a level by chance and treating the false theory behind it as confirmed fits the criticism that current AI systems don’t build causal world models but chase statistical correlations.

Several other studies have reached similar conclusions. Apple researchers showed that reasoning models not only fail when complexity increases in controllable puzzle environments but paradoxically reason less. A large-scale cognitive science analysis of over 171,000 reasoning traces found that language models fall back on simple default strategies instead of actually reasoning when faced with hard tasks. And a medical study showed that even reasoning models current at the time of the study, such as DeepSeek-R1 and o3-mini, fail on slightly reworded questions, suggesting pattern matching rather than genuine understanding.