{"id":6581,"date":"2025-04-10T02:22:14","date_gmt":"2025-04-10T02:22:14","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/6581\/"},"modified":"2025-04-10T02:22:14","modified_gmt":"2025-04-10T02:22:14","slug":"meta-gets-caught-gaming-ai-benchmarks-with-llama-4","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/6581\/","title":{"rendered":"Meta gets caught gaming AI benchmarks with Llama 4"},"content":{"rendered":"<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph _1ymtmqpi _17nnmdy1 _17nnmdy0 _1xwtict1\">Over the weekend, Meta dropped two new <a href=\"https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence\/\" target=\"_blank\" rel=\"noopener\">Llama 4 models<\/a>: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash \u201cacross a broad range of widely reported benchmarks.\u201d<\/p>\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph _1ymtmqpi _17nnmdy1 _17nnmdy0 _1xwtict1\">Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta\u2019s <a href=\"https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence\/\" target=\"_blank\" rel=\"noopener\">press release<\/a>, the company highlighted Maverick\u2019s ELO score of 1417, which placed it above OpenAI\u2019s 4o and just under Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)<\/p>\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph _1ymtmqpi _17nnmdy1 _17nnmdy0 _1xwtict1\">The achievement seemed to position Meta\u2019s open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta\u2019s documentation discovered something unusual.<\/p>\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph _1ymtmqpi _17nnmdy1 _17nnmdy0 _1xwtict1\">In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn\u2019t the same as what\u2019s available to the public. According to Meta\u2019s own materials, it deployed an <a href=\"https:\/\/x.com\/natolambert\/status\/1908913635373842655\">\u201cexperimental chat version\u201d<\/a> of Maverick to LMArena that was specifically \u201coptimized for conversationality,\u201d TechCrunch first <a href=\"https:\/\/techcrunch.com\/2025\/04\/06\/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading\/\" target=\"_blank\" rel=\"noopener\">reported<\/a>.<\/p>\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph _1ymtmqpi _17nnmdy1 _17nnmdy0 _1xwtict1\">\u201cMeta\u2019s interpretation of our policy did not match what we expect from model providers,\u201d LMArena <a href=\"https:\/\/x.com\/lmarena_ai\/status\/1909397817434816562\">posted<\/a> on X two days after the model\u2019s release. \u201cMeta should have made it clearer that \u2018Llama-4-Maverick-03-26-Experimental\u2019 was a customized model to optimize for human preference. 
The achievement seemed to position Meta's open-weight Llama 4 as a serious challenger to the state-of-the-art closed models from OpenAI, Anthropic, and Google. Then AI researchers digging through Meta's documentation discovered something unusual.

In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn't the same as what's available to the public. According to Meta's own materials, it deployed an ["experimental chat version"](https://x.com/natolambert/status/1908913635373842655) of Maverick to LMArena that was specifically "optimized for conversationality," TechCrunch first [reported](https://techcrunch.com/2025/04/06/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading/).

"Meta's interpretation of our policy did not match what we expect from model providers," LMArena [posted](https://x.com/lmarena_ai/status/1909397817434816562) on X two days after the model's release. "Meta should have made it clearer that 'Llama-4-Maverick-03-26-Experimental' was a customized model to optimize for human preference. As a result of that, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn't occur in the future."

A spokesperson for Meta, Ashley Gabriel, said in an emailed statement that "we experiment with all types of custom variants."

"'Llama-4-Maverick-03-26-Experimental' is a chat optimized version we experimented with that also performs well on LMArena," Gabriel said. "We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We're excited to see what they will build and look forward to their ongoing feedback."

While what Meta did with Maverick isn't explicitly against LMArena's rules, the site has shared concerns [about gaming the system](https://blog.lmarena.ai/blog/2024/policy/) and has taken steps to "prevent overfitting and benchmark leakage." When companies can submit specially tuned versions of their models for testing while releasing different versions to the public, benchmark rankings like LMArena's become less meaningful as indicators of real-world performance.

"It's the most widely respected general benchmark because all of the other ones suck," independent AI researcher Simon Willison tells The Verge. "When Llama 4 came out, the fact that it came second in the arena, just after Gemini 2.5 Pro — that really impressed me, and I'm kicking myself for not reading the small print."

Shortly after Meta released Maverick and Scout, the AI community started [talking about a rumor](https://x.com/Yuchenj_UW/status/1909061004207816960) that Meta had also trained its Llama 4 models to perform better on benchmarks while hiding their real limitations. Meta's VP of generative AI, Ahmad Al-Dahle, addressed the accusations [in a post on X](https://x.com/Ahmad_Al_Dahle/status/1909302532306092107): "We've also heard claims that we trained on test sets — that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations."
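For context on what "training on test sets" would mean in practice: benchmark scores are only meaningful if the test questions never appeared in a model's training data. A common first-pass heuristic researchers use to probe for such contamination is n-gram overlap between training documents and benchmark items. The sketch below is purely illustrative, with made-up strings; it is not Meta's, LMArena's, or any lab's actual contamination audit.

```python
# Toy illustration of a naive train/test contamination check: if benchmark
# test items share long verbatim n-grams with the training corpus, scores
# may be inflated. The corpus and test strings here are invented examples.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All n-grams of whitespace tokens in a string."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs: list[str], test_items: list[str], n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return hits / len(test_items) if test_items else 0.0

train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
test = [
    "quick brown fox jumps over the lazy dog near the stream",
    "completely unrelated question about thermodynamics and entropy",
]
print(contamination_rate(train, test))  # 0.5 -> one of the two test items overlaps
```

Real audits use much larger corpora, deduplication pipelines, and fuzzy matching, but the principle is the same: verbatim overlap between training and test data inflates scores without improving real capability.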
Some [also noticed](https://x.com/kalomaze/status/1908706389922324599) that Llama 4 was released at an odd time: Saturday doesn't tend to be when big AI news drops. After someone on Threads asked why Llama 4 was released over the weekend, Meta CEO Mark Zuckerberg [replied](https://www.threads.net/@zuck/post/DIFAsupTS7Z): "That's when it was ready."

"It's a very confusing release generally," says Willison, who [closely follows and documents AI models](https://simonwillison.net/). "The model score that we got there is completely worthless to me. I can't even use the model that they got a high score on."

Meta's path to releasing Llama 4 wasn't exactly smooth. According to [a recent report](https://www.theinformation.com/articles/meta-nears-release-new-ai-model-performance-hiccups) from The Information, the company repeatedly pushed back the launch because the model failed to meet internal expectations. Those expectations are especially high after DeepSeek, an open-source AI startup from China, released an open-weight model that generated a ton of buzz.

Ultimately, using an optimized model in LMArena puts developers in a difficult position. When selecting models like Llama 4 for their applications, they naturally look to benchmarks for guidance. But as the case of Maverick shows, those benchmarks can reflect capabilities that aren't actually available in the models the public can access.

As AI development accelerates, this episode shows how benchmarks are becoming battlegrounds.
It also shows how Meta is eager to be seen as an AI leader, even if that means gaming the system.

**Update, April 7th:** The story was updated to add Meta's statement.