{"id":211728,"date":"2025-09-09T02:50:10","date_gmt":"2025-09-09T02:50:10","guid":{"rendered":"https:\/\/www.europesays.com\/us\/211728\/"},"modified":"2025-09-09T02:50:10","modified_gmt":"2025-09-09T02:50:10","slug":"popular-ai-model-performance-benchmark-may-be-flawed-meta-researchers-warn","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/us\/211728\/","title":{"rendered":"Popular AI model performance benchmark may be flawed, Meta researchers warn"},"content":{"rendered":"<p>A popular benchmark for measuring the performance of <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/topics\/artificial-intelligence?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">artificial intelligence<\/a> models could be flawed, a group of <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/topics\/meta-platforms?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">Meta Platforms<\/a> researchers warned, raising fresh questions on the veracity of evaluations that have been made on major AI systems. \u201cWe\u2019ve identified multiple loopholes with SWE-bench Verified,\u201d wrote Jacob Kahn, manager at Meta AI research lab <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/tech\/big-tech\/article\/3304853\/metas-ai-research-chief-exit-jolting-us65-billion-investment-drive?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">Fair<\/a>, in a post last week on the developer platform <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/tech\/tech-trends\/article\/3214518\/microsofts-github-add-openai-chat-functions-coding-tool?module=inline&amp;pgtype=article\" title=\"\" 
data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">GitHub<\/a>. The post from Fair, which stands for Fundamental AI Research, found several prominent AI models \u2013 including <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/topics\/anthropic?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">Anthropic<\/a>\u2019s Claude and <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/topics\/alibaba-cloud?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">Alibaba Cloud<\/a>\u2019s Qwen \u2013 had \u201ccheated\u201d on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/topics\/alibaba?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">Alibaba Group Holding<\/a>, owner of the South China Morning Post. <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/topics\/openai?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">OpenAI<\/a>-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how these systems fix hundreds of real-world software issues collected from GitHub, a <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/topics\/microsoft?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">Microsoft<\/a> subsidiary.<\/p>\n<p datatype=\"p\" data-qa=\"Component-Component\" class=\"e8zc9q40 css-1c6uqr6 ec74h0k1\">Fair\u2019s 
post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.<\/p>\n<p>The AI models found to have shown such behaviour included Anthropic\u2019s Claude 4 Sonnet, <a target=\"_self\" class=\"e1yy41x40 ef9u0v01 css-1ankfgb ecgc78b0\" href=\"https:\/\/www.scmp.com\/topics\/beijing-zhipu-huazhang-technology?module=inline&amp;pgtype=article\" title=\"\" data-qa=\"BaseLink-renderAnchor-StyledAnchor\" rel=\"nofollow noopener\">Z.ai<\/a>\u2019s GLM-4.5 and Alibaba Cloud\u2019s Qwen3-Coder-30B-A3B \u2013 with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.<\/p>\n<p datatype=\"p\" data-qa=\"Component-Component\" class=\"e8zc9q40 css-1c6uqr6 ec74h0k1\">\u201cWe\u2019re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,\u201d Kahn wrote.<\/p>\n","protected":false},"excerpt":{"rendered":"A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of 
Meta&hellip;\n","protected":false},"author":3,"featured_media":211729,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21],"tags":[691,12462,22146,90637,116021,738,116020,116023,116017,31699,116018,116016,116022,7062,252,305,51949,116024,116019,158,67,132,68,94121],"class_list":{"0":"post-211728","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-ai-models","10":"tag-ai-systems","11":"tag-alibaba-cloud","12":"tag-anthropics-claude","13":"tag-artificial-intelligence","14":"tag-benchmark-saturation","15":"tag-carlos-jimenez","16":"tag-data-leakage","17":"tag-fair","18":"tag-github","19":"tag-hongshan-capital-group","20":"tag-jacob-kahn","21":"tag-meta-platforms","22":"tag-microsoft","23":"tag-openai","24":"tag-qwen","25":"tag-reward-hacking","26":"tag-swe-bench-verified","27":"tag-technology","28":"tag-united-states","29":"tag-unitedstates","30":"tag-us","31":"tag-z-ai"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@us\/115172070032504428","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/211728","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/comments?post=211728"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/211728\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media\/211729"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media?parent=211728"}],"wp:term"
:[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/categories?post=211728"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/tags?post=211728"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}