{"id":931530,"date":"2026-05-01T21:48:26","date_gmt":"2026-05-01T21:48:26","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/931530\/"},"modified":"2026-05-01T21:48:26","modified_gmt":"2026-05-01T21:48:26","slug":"analyzing-gpt-5-5-opus-4-7-with-arc-agi-3","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/931530\/","title":{"rendered":"Analyzing GPT-5.5 &#038; Opus 4.7 with ARC-AGI-3"},"content":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2026\/05\/v3_analysis_gpt5-5-opus4-7.png\" alt=\"Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3\" style=\"max-width:105%;width:105%;position:relative;left:50%;transform:translateX(-50%);display:block\"\/><\/p>\n<p>AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed. With ARC-AGI-3, however, we can see the thought process behind the score, not just the outcome.<\/p>\n<p>This week we went through 160 replays and reasoning traces from <a href=\"https:\/\/openai.com\/index\/introducing-gpt-5-5\/\" target=\"_blank\" rel=\"noopener noreferrer\">OpenAI\u2019s GPT-5.5<\/a> and <a href=\"https:\/\/www.anthropic.com\/news\/claude-opus-4-7\" target=\"_blank\" rel=\"noopener noreferrer\">Anthropic\u2019s Opus 4.7<\/a> attempting novel, long-horizon environments. 
The scores were just one data point; the more interesting story is how each model earned its score.<\/p>\n<p>Today we\u2019re open-sourcing our analysis package.<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>ARC-AGI-3 Score*<\/th>\n<th>Public Demo Replays<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GPT-5.5<\/td>\n<td>0.43%<\/td>\n<td><a href=\"https:\/\/arcprize.org\/scorecards\/model\/openai-gpt-5-5-2026-04-23-high\" target=\"_blank\" rel=\"noopener noreferrer\">Link<\/a><\/td>\n<\/tr>\n<tr>\n<td>Opus 4.7<\/td>\n<td>0.18%<\/td>\n<td><a href=\"https:\/\/arcprize.org\/scorecards\/model\/anthropic-opus-4-7-high\" target=\"_blank\" rel=\"noopener noreferrer\">Link<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>* Scores tested with the semi-private dataset<\/p>\n<p>With ARC-AGI-3 we can replay every action alongside the model&#8217;s reasoning to see where it formed a hypothesis, where it abandoned a correct one, and where it locked onto a wrong idea and couldn&#8217;t let go.<\/p>\n<p>We found three common failure modes:<\/p>\n<ul>\n<li><strong>True Local Effect, False World Model<\/strong> &#8211; The models understand which action produced a change, but they fail to translate that effect into a global rule<\/li>\n<li><strong>Wrong Level of Abstraction From Training Data<\/strong> &#8211; The models mistake an ARC-AGI-3 environment for another game they already know<\/li>\n<li><strong>Solved The Level, Didn\u2019t Learn The Game<\/strong> &#8211; Even when a model beat a level, it was unable to use that reward signal to reinforce the correct actions<\/li>\n<\/ul>\n<h2>ARC-AGI-3 as an analysis tool<\/h2>\n<p><a href=\"https:\/\/arcprize.org\/arc-agi\/3\" target=\"_blank\" rel=\"noopener noreferrer\">ARC-AGI-3<\/a> is a series of 135 novel environments, each hand-crafted by a human to test the ability of AI models to adapt to novelty. 
<a href=\"https:\/\/arcprize.org\/tasks\/ls20\" target=\"_blank\" rel=\"noopener noreferrer\">Play them<\/a> yourself or watch our <a href=\"https:\/\/www.youtube.com\/watch?v=f_xT45Pi0UQ\" target=\"_blank\" rel=\"noopener noreferrer\">launch video<\/a>.<\/p>\n<p>The test-takers, whether human or AI, are not given instructions on how to play an environment. To make progress they must:<\/p>\n<ul>\n<li>Explore unfamiliar interfaces<\/li>\n<li>Infer rules from sparse feedback (aka build a world model)<\/li>\n<li>Form &amp; test hypotheses<\/li>\n<li>Recover from wrong assumptions<\/li>\n<li>Transfer what they learned from one level to the next (aka continual learning)<\/li>\n<\/ul>\n<p>Each environment is built without the cultural knowledge a model would usually lean on. This means the environments isolate abstract reasoning.<\/p>\n<p>You can think of ARC-AGI-3 as the lowest common denominator across novelty, ambiguity, planning, and adaptation. These are the same demands real-world tasks make of agents.<\/p>\n<p>ARC-AGI-3 Public Demo environments: <a href=\"https:\/\/arcprize.org\/tasks\/ar25\" target=\"_blank\" rel=\"noopener noreferrer\">ar25<\/a>, <a href=\"https:\/\/arcprize.org\/tasks\/lf52\" target=\"_blank\" rel=\"noopener noreferrer\">lf52<\/a>, <a href=\"https:\/\/arcprize.org\/tasks\/sb26\" target=\"_blank\" rel=\"noopener noreferrer\">sb26<\/a><\/p>\n<h2>Failure modes on ARC-AGI-3<\/h2>\n<p>ARC-AGI-3 was built with testing and model auditing in mind. Each AI run is recorded along with its reasoning traces. 
We\u2019ve had <a href=\"https:\/\/x.com\/arcprize\/status\/2049878321576743263?s=20\" target=\"_blank\" rel=\"noopener noreferrer\">over 1,000,000 games<\/a> played on ARC-AGI-3 so far.<\/p>\n<p>For this analysis we:<\/p>\n<ul>\n<li>Downloaded all the logs\/reasoning\/steps from every public game run for GPT-5.5 and Opus 4.7<\/li>\n<li>Wrote a strategy for each game to serve as our ground-truth answer<\/li>\n<li>Asked Codex\/Claude Code to analyze the reasoning steps against the level strategy to find failure modes<\/li>\n<li>Ran a meta-analysis across games for a single model, then repeated the meta-analysis across models<\/li>\n<li>Validated the findings by hand<\/li>\n<\/ul>\n<p>This led us to discover why each model passed or failed, and which failure modes were shared versus unique to each model.<\/p>\n<h3>Failure mode 1: true local effect, false world model<\/h3>\n<p>The first failure mode we saw was the most dominant pattern. Models were able to perceive a local effect:<\/p>\n<blockquote class=\"arc-quote\"><p>When I press ACTION3, this object rotates.<\/p><\/blockquote>\n<p>but they weren\u2019t able to translate that into a world model:<\/p>\n<blockquote class=\"arc-quote\"><p>ACTION3 rotates the object, and rotation controls which side gets a new value, so I should orient the object to match the target before acting.<\/p><\/blockquote>\n<p>Put another way, the models don\u2019t fail because they observe nothing; they fail because they can\u2019t anchor their observations in a world model.<\/p>\n<p>For example, Opus, when playing cd82, knew by <a href=\"https:\/\/arcprize.org\/replay\/67ca6a82-fc59-4bcb-8646-e66edd66d2eb?frame=4&amp;quote=rotated&amp;quoteFrame=4&amp;quotePrefix=Interesting%21+ACTION3+&amp;quoteSuffix=+the+middle-bottom+15+container+into+a+diamond+shape.+Let+me+try+ACTION4.%0A%0AACTIO&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">step 4<\/a> that ACTION3 rotated the container, and by <a 
href=\"https:\/\/arcprize.org\/replay\/67ca6a82-fc59-4bcb-8646-e66edd66d2eb?frame=6&amp;quote=ACTION5+is+pouring+15s+from+the+bottom-right+container+into+the+0+container&amp;quoteFrame=6&amp;quotePrefix=I+see+-+&amp;quoteSuffix=+below+it.+The+15s+fall+down.+Let+me+continue+the+flow+and+see.%0A%0AACTION5&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">step 6<\/a> it saw ACTION5 pour\/dip paint, but it never converted those observations into &#8220;orient the bucket, then dip to recreate the top-left target.&#8221;<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2026\/05\/Opus-4-7-cd82.png\" alt=\"Opus 4.7 playing CD82\" style=\"max-width:105%;width:105%;position:relative;left:50%;transform:translateX(-50%);display:block\"\/>Opus 4.7 understood that ACTION3 rotates the object but failed to grasp the game\u2019s core concept. <br \/><a href=\"https:\/\/arcprize.org\/replay\/67ca6a82-fc59-4bcb-8646-e66edd66d2eb\" target=\"_blank\" rel=\"noopener noreferrer\">Opus 4.7 Playing cd82<\/a>. 
Game Score: 0%<\/p>\n<p>Or in cn04, Opus found a successful rotate-then-place interaction (the correct hypothesis, <a href=\"https:\/\/arcprize.org\/replay\/c99b68f5-0d61-46ff-a7b3-3491ef886f36?frame=23&amp;quote=Let+me+move%2Frotate+to+find+where+to+place+this+new+shape.&amp;quoteFrame=23&amp;quotePrefix=ertical+with+8+markers+on+left%29.+The+original+source+location+is+now+marked+12.+&amp;quoteSuffix=%0A%0AACTION3&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">step 23<\/a>) but optimized for whole-shape overlap (incorrect) and fake top-row progress (<a href=\"https:\/\/arcprize.org\/replay\/c99b68f5-0d61-46ff-a7b3-3491ef886f36?frame=60&amp;quote=Looking+at+the+progress%3A+the+0-bar+is+at+16%2F32+cells+%2850%25%29&amp;quoteFrame=60&amp;quoteSuffix=.+I+need+to+convert+the+bottom+Y-shape+%2814-colored%29+to+0.+Let+me+click+firmly+on&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">step 60<\/a>).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2026\/05\/opus-4-7-cn04.png\" alt=\"Opus 4.7 playing CN04\" style=\"max-width:105%;width:105%;position:relative;left:50%;transform:translateX(-50%);display:block\"\/><a href=\"https:\/\/arcprize.org\/replay\/c99b68f5-0d61-46ff-a7b3-3491ef886f36\" target=\"_blank\" rel=\"noopener noreferrer\">Opus 4.7 Playing cn04<\/a>. Game Score: 0%<\/p>\n<h3>Failure mode 2: wrong level of abstraction from training data<\/h3>\n<p>The second failure mode stems from models importing the wrong level of abstraction from their training data. 
Across the runs, the models repeatedly explained unfamiliar mechanics by mapping them to known games: Tetris, <a href=\"https:\/\/arcprize.org\/replay\/3beb8811-6f2a-4ce1-a1b6-bf53d1b4ad1d?frame=477&amp;quote=-+A+grid+with+paths+%280%2C0%2C0%29+and+stepping+stones+%282%2C2%2C2%29+forming+a+Frogger-like+puzzle&amp;quoteFrame=477&amp;quotePrefix=+rows+27-29%2C+cols+27-29%0A-+A+target+%2814%2C14%2C14+pattern%29+at+rows+46-48%2C+cols+45-47%0A&amp;quoteSuffix=%0A-+The+player+needs+to+navigate+to+the+target%0A%0AAcross+the+frames%2C+the+player+app&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">Frogger<\/a>, <a href=\"https:\/\/arcprize.org\/replay\/70ecc44d-2fda-4c15-8e3f-c9d29e067943?frame=1028&amp;quote=this+appears+to+be+a+sokoban-like+puzzle+where+I+need+to+move+the+9-shape+through+the+maze.&amp;quoteFrame=1028&amp;quotePrefix=ze+%28from+rows+14-18+to+rows+8-12%29.+The+indicator+in+the+top-left+also+changed+-+&amp;quoteSuffix=%0A%0ALet+me+continue+trying+to+navigate.+I%27ll+try+ACTION5+again+to+see+if+it+contin&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">Sokoban<\/a>, Powder Toy, Flood-It, MiniGrid, CoinRun, Breakout, Pong, Boulder Dash, and others. While recalling abstractions from core prior knowledge is helpful in theory, in practice the literal analogies from the model\u2019s training data hijacked action selection.<\/p>\n<p>The problem is that a local visual resemblance becomes a full gameplay theory, and the model then wastes actions testing the wrong affordances.<\/p>\n<p>In cd82, GPT-5.5 anchored on sand\/physics\/Flood-It mechanics; ls20 became Breakout instead of a game about key combinations.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2026\/05\/gpt-5-5-ls20.png\" alt=\"GPT-5.5 playing LS20\" style=\"max-width:105%;width:105%;position:relative;left:50%;transform:translateX(-50%);display:block\"\/>GPT-5.5 mistakes ls20 for other games. 
<a href=\"https:\/\/arcprize.org\/replay\/672f4ae0-c481-4e10-aa35-55a71e412ec5?frame=11\" target=\"_blank\" rel=\"noopener noreferrer\">GPT-5.5 Playing ls20<\/a>. Game Score: 0%<\/p>\n<h3>Failure mode 3: solved the level, didn\u2019t learn the game<\/h3>\n<p>The final failure mode we saw was that even when a model beat a level, that reward did not translate into further success. Beating a level is not the same as understanding it.<\/p>\n<p>Two Opus runs make this especially clear. On ka59, Opus <a href=\"https:\/\/arcprize.org\/replay\/1dfbb619-2f6a-484a-89c8-bbe64abc977b?frame=43\" target=\"_blank\" rel=\"noopener noreferrer\">solved Level 1 in 37 actions<\/a>, but its working theory of the click (teleporting the active character) was wrong. Although it looked like a clean win, it was a coincidence of a misread primitive and a forgiving level.<\/p>\n<p>When Level 2 demanded the real mechanic (shape-matching and pushing), Opus\u2019s mistaken theory hardened into &#8220;<a href=\"https:\/\/arcprize.org\/replay\/1dfbb619-2f6a-484a-89c8-bbe64abc977b?frame=251&amp;quote=Let+me+click+on+the+4s+in+the+box+at+rows+47-48%2C+cols+44-45&amp;quoteFrame=251&amp;quotePrefix=Looking+at+remaining+4s+inside+14-bordered+boxes.+&amp;quoteSuffix=.%0A%0AACTION6+44+47&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">click each target to fill it<\/a>,&#8221; and the run never recovered.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2026\/05\/ka59_opus.gif\" alt=\"Opus 4.7 playing KA59\" style=\"max-width:75%;margin:0 auto;display:block\"\/>Opus 4.7 gets stuck in a click-fishing loop. <a href=\"https:\/\/arcprize.org\/replay\/1dfbb619-2f6a-484a-89c8-bbe64abc977b\" target=\"_blank\" rel=\"noopener noreferrer\">Opus 4.7 playing ka59<\/a>. Game Score: 2.04%<\/p>\n<p>ar25 shows the same pattern at a different abstraction level. 
Opus cleared Level 1 with a correct read of mirrored movement (<a href=\"https:\/\/arcprize.org\/replay\/7752f433-a3a4-4026-ae8c-6c32b4afcf9d?frame=4&amp;quote=They+mirror+around+the+10-divider&amp;quoteFrame=4&amp;quotePrefix=s%29+at+specific+positions%0A-+The+4-shape+is+the+%22complete%22+version+%28no+holes%29++%0A-+&amp;quoteSuffix=%0A-+The+right+column+fills+with+5s+each+action+%28maybe+action+counter%29%0A-+ACTION6+t&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">step 4<\/a>), then in Level 2 actually discovered the new movable-axis mechanic (<a href=\"https:\/\/arcprize.org\/replay\/7752f433-a3a4-4026-ae8c-6c32b4afcf9d?frame=227&amp;quote=ACTION3+moved+the+10-column+left+by+3%21&amp;quoteFrame=227&amp;quoteSuffix=+Let+me+continue+with+ACTION3%3A%0A%0AACTION3&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">step 227<\/a>), but it still drifted into hallucinated rules about punching holes and needing a flip.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2026\/05\/opus-4-7-ar25.png\" alt=\"Opus 4.7 playing AR25\" style=\"max-width:105%;width:105%;position:relative;left:50%;transform:translateX(-50%);display:block\"\/><a href=\"https:\/\/arcprize.org\/replay\/7752f433-a3a4-4026-ae8c-6c32b4afcf9d\" target=\"_blank\" rel=\"noopener noreferrer\">Opus 4.7 Playing ar25<\/a>. Game Score: 0.15%<\/p>\n<p>In both cases the Level 1 success masked a missing or distorted primitive, and the partial win became a confident scaffold for the wrong Level 2 strategy.<\/p>\n<p>This also shows that early level progression can be a noisy signal of comprehension. Without an explicit check on why the prior level was won, models will carry their misconceptions into the next level.<\/p>\n<h2>Opus 4.7: wrong compression, GPT-5.5: failure to compress<\/h2>\n<p>Comparing the runs of GPT-5.5 and Opus 4.7 shows that they failed in different ways. 
This is important because <strong>aggregate scores alone would hide this distinction<\/strong>.<\/p>\n<p>Opus had the wrong compression; GPT-5.5 failed to compress.<\/p>\n<p>Opus 4.7 is stronger at short-horizon mechanic discovery. On ar25 it identifies the mirror structure almost immediately and clears Level 1. On ka59 it reads the two-character, two-target layout and executes the short Level 1 sequence even with an incomplete world model.<\/p>\n<p>The flipside is that Opus is also more likely to latch onto a false invariant and execute it aggressively. On cn04 it leans into a fake progress\/timer\/conversion theory and spends its opening moves click-fishing inside that story (<a href=\"https:\/\/arcprize.org\/replay\/c99b68f5-0d61-46ff-a7b3-3491ef886f36?frame=60&amp;quote=Let+me+click+firmly+on+its+body&amp;quoteFrame=60&amp;quotePrefix=s+at+16%2F32+cells+%2850%25%29.+I+need+to+convert+the+bottom+Y-shape+%2814-colored%29+to+0.+&amp;quoteSuffix=.%0A%0AACTION6+48+27&amp;reasoning=decision\" target=\"_blank\" rel=\"noopener noreferrer\">step 60<\/a>). It does form a working theory; it&#8217;s just the wrong one.<\/p>\n<p>GPT-5.5 sits at the other end. Its hypothesis generation is wider, which makes it more likely to articulate the right idea but less likely to turn it into a plan. On ar25 it names the mirror effect but keeps reopening the genre space, drifting through Tetris, Frogger, Pong, and Tower of Hanoi instead of committing to reflection. On ka59 it reaches the correct object ontology (two target outlines and a switchable second character) but never commits to it.<\/p>\n<p>The difference comes down to compression. Opus compressed its observations into a confident-but-wrong theory. 
GPT-5.5 had difficulty compressing at all.<\/p>\n<h2>ARC-AGI-3 measures agent autonomy<\/h2>\n<p>Each ARC-AGI-3 environment has been <a href=\"https:\/\/arcprize.org\/blog\/arc-agi-3-human-dataset\" target=\"_blank\" rel=\"noopener noreferrer\">solved by at least two humans<\/a> without special training. What makes them hard for agents is that they require something closer to real intelligence: encountering an unfamiliar environment, forming a working theory, testing it, updating it when the evidence disagrees, and carrying forward what was learned.<\/p>\n<p>That same level of meta-learning is also what real-world agents will need.<\/p>\n<p>Real-world agents will not just operate inside clean benchmark prompts or memorized task templates. They will face unfamiliar websites, internal tools, dashboards, forms, APIs, workflows, and edge cases that were not described in advance. In those settings, failure will often look exactly like the failure modes in ARC-AGI-3.<\/p>\n<p>That&#8217;s why ARC Prize Foundation will continue auditing every major frontier release. Scores tell you what a model achieved. Replays tell you whether the reasoning is likely to generalize.<\/p>\n<p>If you\u2019d like to help us audit the frontier of AI with ARC-AGI-3, come <a href=\"https:\/\/arcprize.org\/jobs\" target=\"_blank\" rel=\"noopener noreferrer\">join the team<\/a>.<\/p>\n<h2>Analysis notes<\/h2>\n<ul>\n<li>With our default tests, GPT-5.5 did not return its reasoning traces. The qualitative analysis above therefore uses an analysis-mode run, an alternate testing setup intended to elicit more explicit descriptions of how the model is interpreting the environment and deciding what to try next. The official GPT-5.5 score is not from that mode. 
All reported scores use the standard ARC-AGI-3 harness, matching the setup used for Opus 4.7.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed.&hellip;\n","protected":false},"author":2,"featured_media":931531,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3163],"tags":[323,1942,53,16,15],"class_list":{"0":"post-931530","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-technology","11":"tag-uk","12":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/116501527251766760","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/931530","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=931530"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/931530\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/931531"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=931530"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=931530"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=931530"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templ
ated":true}]}}