{"id":8874,"date":"2025-04-10T19:55:10","date_gmt":"2025-04-10T19:55:10","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/8874\/"},"modified":"2025-04-10T19:55:10","modified_gmt":"2025-04-10T19:55:10","slug":"the-rise-of-ai-reasoning-models-is-making-benchmarking-more-expensive","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/8874\/","title":{"rendered":"The rise of AI &#8216;reasoning&#8217; models is making benchmarking more expensive"},"content":{"rendered":"<p id=\"speakable-summary\" class=\"wp-block-paragraph\">AI labs like OpenAI claim that their <a href=\"https:\/\/techcrunch.com\/2024\/11\/20\/ai-scaling-laws-are-showing-diminishing-returns-forcing-ai-labs-to-change-course\/\" target=\"_blank\" rel=\"noopener\">so-called \u201creasoning\u201d AI models<\/a>, which can \u201cthink\u201d through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. But while this generally appears to be the case, reasoning models are also much more expensive to benchmark, making it difficult to independently verify these claims.<\/p>\n<p class=\"wp-block-paragraph\">According to data from Artificial Analysis, a third-party AI testing outfit, it costs $2,767.05 to evaluate OpenAI\u2019s <a href=\"https:\/\/techcrunch.com\/2025\/03\/19\/openais-o1-pro-is-its-most-expensive-model-yet\/\" target=\"_blank\" rel=\"noopener\">o1<\/a> reasoning model across a suite of seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity\u2019s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.<\/p>\n<p class=\"wp-block-paragraph\">Benchmarking Anthropic\u2019s recent <a href=\"https:\/\/techcrunch.com\/2025\/02\/24\/anthropic-launches-a-new-ai-model-that-thinks-as-long-as-you-want\/\" target=\"_blank\" rel=\"noopener\">Claude 3.7 Sonnet<\/a>, a \u201chybrid\u201d reasoning model, on the same set of tests cost $1,485.35, while testing OpenAI\u2019s <a href=\"https:\/\/techcrunch.com\/2025\/01\/31\/openai-launches-o3-mini-its-latest-reasoning-model\/\" target=\"_blank\" rel=\"noopener\">o3-mini-high<\/a> cost $344.59, per Artificial Analysis.<\/p>\n<p class=\"wp-block-paragraph\">Some reasoning models are cheaper to benchmark than others. Artificial Analysis spent $141.22 evaluating OpenAI\u2019s o1-mini, for example. But on average, they tend to be pricey. All told, Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, close to twice the amount the firm spent analyzing over 80 non-reasoning models ($2,400).<\/p>\n<p class=\"wp-block-paragraph\">OpenAI\u2019s non-reasoning <a href=\"https:\/\/techcrunch.com\/2024\/05\/13\/openais-newest-model-is-gpt-4o\/\" target=\"_blank\" rel=\"noopener\">GPT-4o<\/a> model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet \u2014 Claude 3.7 Sonnet\u2019s non-reasoning predecessor \u2014 cost $81.41.<\/p>\n<p class=\"wp-block-paragraph\">Artificial Analysis co-founder George Cameron told TechCrunch that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models. <\/p>\n<p class=\"wp-block-paragraph\">\u201cAt Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these,\u201d Cameron said. 
"At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these," Cameron said. "We are planning for this spend to increase as models are more frequently released."

Artificial Analysis isn't the only outfit of its kind dealing with rising AI benchmarking costs.

Ross Taylor, the CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique prompts. Taylor estimates a single run-through of MMLU Pro, a question set designed to benchmark a model's language comprehension skills, would have cost more than $1,800.

"We're moving to a world where a lab reports x% on a benchmark where they spend y amount of compute, but where the resources available to academics are far less than y," Taylor wrote in a recent post on X. "[N]o one is going to be able to reproduce the results."

Why are reasoning models so expensive to test? Mainly because they generate a lot of tokens. Tokens represent bits of raw text, such as the word "fantastic" split into the syllables "fan," "tas," and "tic." According to Artificial Analysis, OpenAI's o1 generated over 44 million tokens during the firm's benchmarking tests, around eight times the amount GPT-4o generated.

The vast majority of AI companies charge for model usage by the token, so you can see how this cost adds up.
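To make that arithmetic concrete, here is a minimal Python sketch of how per-token billing turns into a benchmark bill. The roughly 44 million output tokens for o1 and the one-eighth ratio for GPT-4o come from the figures above; the input-token volumes and the per-million-token rates are illustrative assumptions, not numbers reported in this piece.

```python
# A minimal sketch of per-token API billing for a benchmark run.
# Output-token counts are taken from the article; input-token volumes
# and per-million-token rates below are assumed for illustration.

def eval_cost_usd(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one benchmark suite run under simple per-token pricing."""
    return (input_tokens / 1e6) * usd_per_m_input \
         + (output_tokens / 1e6) * usd_per_m_output

# o1: ~44M output tokens across the suite (per Artificial Analysis),
# plus an assumed 8M input tokens for the benchmark prompts.
o1_cost = eval_cost_usd(input_tokens=8_000_000, output_tokens=44_000_000,
                        usd_per_m_input=15.0, usd_per_m_output=60.0)

# GPT-4o: the article says it generated about one-eighth as many tokens.
gpt4o_cost = eval_cost_usd(input_tokens=8_000_000, output_tokens=5_500_000,
                           usd_per_m_input=2.5, usd_per_m_output=10.0)

print(f"o1 suite estimate:     ${o1_cost:,.2f}")     # ~$2,760
print(f"GPT-4o suite estimate: ${gpt4o_cost:,.2f}")  # ~$75
```

Under these assumed rates the o1 estimate lands near the $2,767.05 Artificial Analysis reported, which is the point: long, output-heavy reasoning traces, not the prompts, dominate the bill.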
Modern benchmarks also tend to elicit a lot of tokens from models because they contain questions involving complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI, which develops its own model benchmarks.

"[Today's] benchmarks are more complex [even though] the number of questions per benchmark has overall decreased," Denain told TechCrunch. "They often attempt to evaluate models' ability to do real-world tasks, such as write and execute code, browse the internet, and use computers."

Denain added that the most expensive models have gotten more expensive per token over time. For example, Anthropic's Claude 3 Opus was the priciest model when it was released in March 2024, costing $75 per million output tokens. OpenAI's GPT-4.5 and o1-pro, both of which launched earlier this year, cost $150 and $600 per million output tokens, respectively.

"[S]ince models have gotten better over time, it's still true that the cost to reach a given level of performance has greatly decreased over time," Denain said. "But if you want to evaluate the best, largest models at any point in time, you're still paying more."

Many AI labs, including OpenAI, give benchmarking organizations free or subsidized access to their models for testing purposes. But this colors the results, some experts say: even if there's no evidence of manipulation, the mere suggestion of an AI lab's involvement threatens to harm the integrity of the evaluation scores.

"From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?" Taylor wrote in a follow-up post on X (https://x.com/rosstaylor90/status/1908490444176052538). "(Was it ever science, lol)"