With BioMysteryBench, Anthropic wants to show that Claude can solve real bioinformatics problems at an expert level. The results are promising, but come with important caveats.

Measuring how well AI models actually perform in biological research is difficult. According to Anthropic, existing benchmarks each have blind spots: knowledge tests like MMLU-Pro or GPQA probe factual recall but not practical research skills. Benchmarks like BixBench that use real datasets evaluate models against individual scientists' conclusions, which are themselves subjective and shaped by methodological choices. And simulated lab environments like SciGym have clear answers but don't capture the messiness of real biological data.

That's why Anthropic developed BioMysteryBench: 99 questions across multiple bioinformatics domains, written by specialists and based on real, noisy datasets. The key design choice is that answers are derived not from scientific interpretations but from objectively verifiable properties of the data or from independently validated metadata. Every question author had to submit a validation notebook proving the signal actually exists in the data. This approach also makes it possible to ask questions that humans might not be able to solve.

Typical tasks include identifying which organ a single-cell RNA dataset came from, or figuring out which gene was knocked out in experimental samples. Claude gets a container with bioinformatics tools, access to databases like NCBI and Ensembl, and full freedom to choose its own analysis methods. Only the final answer is scored, not the path it takes to get there.
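To make the knockout example concrete, here is a minimal sketch of the kind of signal check a validation notebook might contain. Everything here is invented for illustration: the gene names, sample columns, and expression values are not from the benchmark, and the real analysis on noisy single-cell data would be far more involved. The idea is simply that a knocked-out gene should show near-zero expression in knockout samples while staying expressed in controls.

```python
import numpy as np
import pandas as pd

# Hypothetical bulk expression matrix (genes x samples); all values invented.
expr = pd.DataFrame(
    {
        "ctrl_1": [120, 85, 300],
        "ctrl_2": [110, 90, 280],
        "ko_1":   [115, 2, 290],
        "ko_2":   [105, 1, 310],
    },
    index=["GeneA", "GeneB", "GeneC"],
)

def infer_knockout(expr, ko_cols, ctrl_cols):
    """Return the gene most strongly depleted in knockout samples,
    ranked by log2 fold change of mean expression (KO vs. control)."""
    ko_mean = expr[ko_cols].mean(axis=1)
    ctrl_mean = expr[ctrl_cols].mean(axis=1)
    log2fc = np.log2((ko_mean + 1) / (ctrl_mean + 1))  # +1 avoids log(0)
    return log2fc.idxmin()

print(infer_knockout(expr, ["ko_1", "ko_2"], ["ctrl_1", "ctrl_2"]))  # GeneB
```

In this toy matrix only GeneB collapses in the knockout samples, so it is the obvious answer; the benchmark's point is that the same kind of check can objectively verify a question's answer without relying on anyone's interpretation.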

Strong results on solvable problems, but hard tasks remain fragile

Anthropic split the tasks into two groups: 76 were classified as "human-solvable" because at least one of up to five experts found the correct answer, while the other 23 stumped every expert. (Four additional questions were dropped from the original set due to flawed formulations.) For the 23 unsolved tasks, Anthropic acknowledges that it's unclear whether they are fundamentally unsolvable or just extremely difficult. Whether a larger or differently composed expert panel could have solved them also remains an open question.

On the solvable problems, Claude now matches human expert performance, according to Anthropic.

On the human-solvable set, newer Claude models perform significantly better on biology tasks: Mythos Preview reaches 82.6 percent accuracy, while Haiku 4.5 stays at 36.8 percent. | Image: Anthropic

On the hard problems that none of the selected experts could solve, Claude Mythos Preview achieves a success rate of just under 30 percent.

On the hardest BioMysteryBench tasks, success rates remain low across the board: Mythos Preview scores 29.6 percent, while Haiku 4.5 manages just 5.2 percent. | Image: Anthropic

However, a consistency analysis of Claude Mythos Preview paints a more nuanced picture. Each task was attempted five times. On the solvable problems, Claude almost always gets either all five attempts right or none at all. On the hard problems, successes typically come in just one or two of the five attempts, suggesting the model stumbles onto a lucky solution path rather than following a reproducible strategy.
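The pattern described above can be sketched in a few lines. The per-task pass/fail records below are invented for illustration, not Anthropic's actual data; the point is how a five-attempt protocol separates reproducible performance from lucky hits.

```python
from collections import Counter

# Hypothetical results of five repeated attempts per task (1 = correct).
attempts = {
    "solvable_task_1": [1, 1, 1, 1, 1],  # reproducibly right
    "solvable_task_2": [0, 0, 0, 0, 0],  # reproducibly wrong
    "hard_task_1":     [0, 1, 0, 0, 0],  # occasional lucky hit
    "hard_task_2":     [0, 0, 1, 1, 0],
}

def consistency_profile(attempts):
    """Count how many tasks succeeded in 0..5 of their five attempts."""
    return Counter(sum(runs) for runs in attempts.values())

profile = consistency_profile(attempts)
# All-or-nothing behavior shows up as mass at 0 and 5 successes;
# lucky-path behavior shows up as mass at 1 or 2.
print(dict(profile))
```

A distribution concentrated at 0 and 5 indicates a stable strategy that either works or doesn't, while mass at 1 or 2 indicates non-reproducible successes, which is the signature Anthropic reports for the hard tasks.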

Anthropic identifies two strategies that set Claude apart from human testers: the model draws on a broad knowledge base and combines information directly with its ongoing analysis. When uncertain, Claude also layers multiple methods on top of each other and picks the answer that different approaches converge on.

Independent confirmation comes from CompBioBench, a similarly designed benchmark developed concurrently by Genentech and Roche that shows comparable results. BioMysteryBench is available on Hugging Face.