Researchers with Scale AI have found that search-based AI models may cheat on benchmark tests by fetching the answers directly from online sources rather than deriving those answers through a “reasoning” process.
Scale AI computer scientists Ziwen Han, Meher Mankikar, Julian Michael, and Zifan Wang refer to the phenomenon as “Search-Time Data Contamination,” which they describe in a paper published to the AI data provider’s website.
On their own, AI models suffer from a significant limitation: They’re trained at a specific point in time on a limited set of data and thus lack information about anything after that training data cut-off date.
So to better handle inquiries about current events, firms like Anthropic, Google, OpenAI, and Perplexity have integrated search capabilities into their AI models, giving them access to recent online information.
The Scale AI researchers looked specifically at Perplexity’s agents – Sonar Pro, Sonar Reasoning Pro, and Sonar Deep Research – to see how often the agents, while undergoing a capability evaluation, accessed the relevant benchmark questions and answers from HuggingFace, an online repository for AI models and related material such as benchmark datasets.
“On three commonly used capability benchmarks – Humanity’s Last Exam (HLE), SimpleQA, and GPQA – we demonstrate that for approximately 3 percent of questions, search-based agents directly find the datasets with ground truth labels on HuggingFace,” the authors state in their paper.
This is search-time contamination (STC) – when a search-based LLM under evaluation retrieves material that provides clues about, or outright contains, the answer to the evaluation question.
When Perplexity agents were denied access to HuggingFace, their accuracy on the contaminated subset of benchmark questions dropped by about 15 percent. What’s more, Scale AI researchers note that further experiments suggest HuggingFace may not be the only source of STC for the tested models.
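For illustration only – this is not Scale AI’s actual tooling – here is a minimal sketch of how one might flag a potentially contaminated question by checking whether the URLs an agent cites resolve to a benchmark’s dataset repository on HuggingFace. The repository paths below are assumptions, not taken from the paper.

```python
# Toy illustration (not Scale AI's methodology): flag a benchmark question as
# potentially contaminated if any URL the search agent retrieved while answering
# points at a HuggingFace dataset repo that hosts that benchmark's ground-truth labels.
from urllib.parse import urlparse

# Assumed (illustrative) dataset repo paths for the three benchmarks tested.
BENCHMARK_DATASET_PATHS = {
    "/datasets/cais/hle",            # Humanity's Last Exam (assumed repo path)
    "/datasets/basicv8vc/SimpleQA",  # SimpleQA (assumed repo path)
    "/datasets/Idavidrein/gpqa",     # GPQA (assumed repo path)
}

def is_contaminated(retrieved_urls: list[str]) -> bool:
    """Return True if any retrieved URL resolves to a known benchmark dataset repo."""
    for url in retrieved_urls:
        parsed = urlparse(url)
        if parsed.netloc.endswith("huggingface.co"):
            # Compare only the leading repo path, ignoring file-level suffixes.
            if any(parsed.path.startswith(p) for p in BENCHMARK_DATASET_PATHS):
                return True
    return False

# Example: an agent trace whose citations include a benchmark dataset page is flagged.
trace = [
    "https://en.wikipedia.org/wiki/Quantum_chromodynamics",
    "https://huggingface.co/datasets/Idavidrein/gpqa/viewer",
]
print(is_contaminated(trace))  # True
```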
The authors say that while 3 percent may seem to matter only for frontier model benchmarks like HLE, where a change of just 1 percent in a model’s overall score can affect its ranking, the bigger concern is that the findings call into question any evaluation conducted while a model has online access, and undermine the integrity of AI benchmarks more broadly.
But AI benchmarks may not have much integrity to begin with. As we reported previously, AI benchmarks suck. They may be poorly designed, biased, contaminated, or gamed.
A recent survey of 283 AI benchmarks by researchers in China echoes this assessment, finding that “current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments.” ®