{"id":208050,"date":"2025-11-30T13:12:12","date_gmt":"2025-11-30T13:12:12","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/208050\/"},"modified":"2025-11-30T13:12:12","modified_gmt":"2025-11-30T13:12:12","slug":"the-arc-benchmarks-fall-marks-another-casualty-of-relentless-ai-optimization","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/208050\/","title":{"rendered":"The ARC benchmark&#8217;s fall marks another casualty of relentless AI optimization"},"content":{"rendered":"<p><strong>For years, the ARC benchmark was considered a nearly insurmountable obstacle for AI systems, a true test of fluid intelligence rather than simple memorization. But new results show that even this barrier is crumbling under the relentless optimization machinery of modern AI labs.<\/strong><\/p>\n<p>The &#8220;Abstraction and Reasoning Corpus&#8221;\u2014later renamed ARC-AGI\u2014was originally designed to separate true learning from statistical parroting. Now, it faces the same fate as many benchmarks before it: newer methods are simply overpowering it.<\/p>\n<p>New results from AI company Poetiq suggest the original ARC-AGI-1 benchmark is effectively solved. In a recent <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/poetiq.ai\/posts\/arcagi_announcement\/\">announcement<\/a>, the company claims its systems, built on models like OpenAI&#8217;s and Google&#8217;s, have maxed out performance on the first dataset. 
More notably, the system reportedly beat the human average of 60 percent on the significantly harder ARC-AGI-2 dataset.<\/p>\n<p><a href=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/11\/Poetiq-arcagi1-scaled.png\"><img data-lazyloaded=\"1\" fetchpriority=\"high\" decoding=\"async\" class=\"wp-image-48118 size-large\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/11\/Poetiq-arcagi1-scaled.png\" alt=\"Poetiq's results indicate that the original ARC-AGI-1 benchmark has been largely solved, while performance on the harder ARC-AGI-2 dataset now exceeds human averages. | Image: Poetiq\" width=\"1200\" height=\"840\"\/><\/a>Poetiq&#8217;s results indicate that the original ARC-AGI-1 benchmark has been largely solved, while performance on the harder ARC-AGI-2 dataset now exceeds human averages. | Image: Poetiq<\/p>\n<p><a href=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/11\/Poetiq-arcagi2-770x538.png\"><img loading=\"lazy\" data-lazyloaded=\"1\" decoding=\"async\" class=\"wp-image-48119 size-large\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/11\/Poetiq-arcagi2-770x538.png\" alt=\"\" width=\"1200\" height=\"839\"\/><\/a>Image: Poetiq<\/p>\n<p>Poetiq\u2019s approach combines advanced language models, including Gemini 3 and GPT-5.1, with open-source models integrated into a custom architecture. 
According to <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/poetiq.ai\/posts\/arcagi_announcement\/\">Poetiq<\/a>, the system operates in an iterative loop: it generates proposed solutions, evaluates feedback, and refines answers through a self-audit before finalizing the result.<\/p>\n<p>Specialized models turn abstraction into an optimization problem<\/p>\n<p>When AI researcher and Keras creator Fran\u00e7ois Chollet introduced ARC in 2019, he pitched it as an antidote to the deep learning paradigm. The goal was to measure &#8220;skill acquisition efficiency&#8221;\u2014how well a system learns new tasks\u2014rather than how much data it could memorize.<\/p>\n<p>Researchers struggled with these colorful grid puzzles for years. While language models crushed other benchmarks, ARC success rates remained low. For some, it became the &#8220;North Star&#8221; of AGI research; for others, it highlighted the limitations of scaling large models.<\/p>\n<p>That dynamic shifted with the arrival of specialized reasoning models and techniques like Test-Time Training (TTT). A major turning point occurred in December 2024, when OpenAI&#8217;s o3-preview suddenly scored over 75 percent on ARC-AGI-1. What began as a test of human-like abstraction is fast becoming an optimization target for reinforcement learning and search algorithms. Labs are now tuning their systems to master ARC&#8217;s specific logic.<\/p>\n<p>Efficiency is improving alongside performance. 
According to Poetiq, its &#8220;Poetiq (GPT-OSS-b)&#8221; system, based on the open model <a href=\"https:\/\/the-decoder.com\/openai-releases-its-first-open-weight-language-models-since-gpt-2-with-gpt-oss\/\" rel=\"nofollow noopener\" target=\"_blank\">GPT-OSS-120B<\/a>, achieves over 40 percent accuracy on ARC-AGI-1 for less than a cent per task. The era of ARC solutions requiring massive compute appears to be ending, a trend further supported by the non-LLM &#8220;Tiny Recursive Model.&#8221;<\/p>\n<p>Performance drops suggest models are still memorizing public data<\/p>\n<p>These high scores currently apply only to &#8220;public&#8221; datasets, not the &#8220;semi-private&#8221; sets held back by ARC administrators. In its own analysis, Poetiq notes that many underlying LLMs perform significantly worse when switching from public evaluation sets to semi-private ones.<\/p>\n<p>The culprit is likely &#8220;data contamination&#8221;: public benchmarks often end up in the training data for large models. True generalization is only proven on tasks a model has definitely never seen. Poetiq expects its own systems to see a similar performance dip on ARC-AGI-1 for this reason.<\/p>\n<p>However, the newer ARC-AGI-2 might be more resistant to this effect. Poetiq describes the sets as &#8220;more tightly calibrated&#8221; and claims its system was never trained on ARC-AGI-2 tasks, although the foundation models it uses might be.<\/p>\n<p>The industry shifts focus toward test-time adaptation<\/p>\n<p>Chollet has watched this evolution closely. He views recent successes as evidence of a fundamental strategic shift in AI development.<\/p>\n<p>Describing results from reasoning models like <a href=\"https:\/\/the-decoder.com\/openai-unveils-o3-its-most-advanced-reasoning-model-yet\/\" rel=\"nofollow noopener\" target=\"_blank\">o3 as &#8220;a surprising and important step-function increase in AI capabilities,&#8221;<\/a> Chollet argues that the old strategy of scaling intelligence via larger models and more data is hitting a wall with tasks like ARC. 
Instead, <a href=\"https:\/\/the-decoder.com\/francois-chollet-on-the-end-of-scaling-arc-3-and-his-path-to-agi\/\" rel=\"nofollow noopener\" target=\"_blank\">the field has entered the era of test-time adaptation<\/a>.<\/p>\n<p>Models are no longer static responders. They adapt at runtime, using techniques similar to program synthesis and chain-of-thought reasoning to reconfigure themselves for specific problems. For Chollet, this validates his theory that intelligence is a process of adaptation, not a static knowledge warehouse.<\/p>\n<p>He maintains that solving ARC is a necessary step toward AGI, but not AGI itself. Current models still fail basic tasks and lack a profound understanding of the world. The benchmark&#8217;s purpose was to push research toward better systems. And it worked.<\/p>\n<p>The industry responded, though perhaps more pragmatically than cognitive scientists hoped. Instead of &#8220;general intelligence,&#8221; we got specialized reasoning machines that tackle puzzles through iterative loops and code generation.<\/p>\n<p>With ARC-AGI-1 effectively saturated, even the tougher ARC-AGI-2 is now falling. Poetiq&#8217;s system beat the human average despite <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/poetiq.ai\/posts\/arcagi_announcement\/\">never training on those specific tasks<\/a>.<\/p>\n<p>Solving the benchmark proves its value as a catalyst<\/p>\n<p>ARC-AGI is experiencing the typical lifecycle of a benchmark: it becomes a metric for marketing departments. Once a target is defined and incentives exist, like the ARC Prize&#8217;s million-dollar purse, labs will optimize until they hit the number.<\/p>\n<p>This doesn&#8217;t mean AI is thinking like a human. It demonstrates the adaptability of modern AI research, which can hit almost any abstract target by combining compute, synthetic data, and sophisticated search methods.<\/p>\n<p>ARC-AGI-1 and ARC-AGI-2 succeeded by forcing a focus on reasoning and adaptation. 
That they are now being &#8220;solved&#8221; isn&#8217;t a failure of the test but proof of its effectiveness in driving development. It remains to be seen whether these methods lead to true fluid intelligence. Many people, including Chollet, believe that something is still missing.<\/p>\n<p>He is already looking ahead to ARC-AGI-3, <a href=\"https:\/\/the-decoder.com\/richard-sutton-says-the-ai-industry-has-lost-its-way-by-ignoring-core-principles-of-intelligence\/\" rel=\"nofollow noopener\" target=\"_blank\">which will use interactive environments to test model &#8220;agency&#8221;\u2014the ability to act<\/a>.<\/p>\n<p>Poetiq has released its code and results on <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/github.com\/poetiq-ai\/poetiq-arc-agi-solver?tab=readme-ov-file\">GitHub<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"Summary For years, the ARC benchmark was considered a nearly insurmountable obstacle for AI systems, a true test&hellip;\n","protected":false},"author":2,"featured_media":208051,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[74],"tags":[4739,1645,16950,114088,18,19,17,82],"class_list":{"0":"post-208050","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-technology","8":"tag-benchmark","9":"tag-agi","10":"tag-ai-research","11":"tag-arc-agi","12":"tag-eire","13":"tag-ie","14":"tag-ireland","15":"tag-technology"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@ie\/115638825744845086","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/208050","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"
}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=208050"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/208050\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/208051"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=208050"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=208050"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=208050"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}