NVIDIA has highlighted new third-party performance data that it says shows its Blackwell Ultra systems can sharply cut the cost of running agentic AI and coding assistant workloads, as cloud providers begin rolling out the GB300 NVL72 platform at scale.

According to SemiAnalysis InferenceX data cited by NVIDIA, GB300 NVL72 systems can deliver up to 50x higher throughput per megawatt than the earlier Hopper platform. NVIDIA said that translates into up to 35x lower cost per token for low-latency inference, typical of interactive agents and coding tools.
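The link between throughput per megawatt and cost per token can be sketched with back-of-the-envelope arithmetic. All figures below are hypothetical, not NVIDIA's: the point is only that, at a fixed power price, energy cost per token falls in proportion to tokens produced per megawatt.

```python
# Illustrative arithmetic only; throughput and power-price figures are
# hypothetical, not NVIDIA's published numbers.

def cost_per_million_tokens(tokens_per_sec_per_mw: float,
                            dollars_per_mw_hour: float) -> float:
    """Energy cost per million tokens for a fixed 1 MW power budget."""
    tokens_per_hour = tokens_per_sec_per_mw * 3600
    return dollars_per_mw_hour / tokens_per_hour * 1_000_000

baseline = cost_per_million_tokens(tokens_per_sec_per_mw=20_000,
                                   dollars_per_mw_hour=100.0)

# A system with 50x the throughput per MW cuts the *energy* cost per
# token by the same 50x at an equal power price.
improved = cost_per_million_tokens(tokens_per_sec_per_mw=20_000 * 50,
                                   dollars_per_mw_hour=100.0)

print(baseline / improved)  # → 50.0
```

Note that total cost per token also includes factors beyond energy, such as hardware cost and utilisation, which is consistent with the headline cost figure (35x) being smaller than the throughput-per-megawatt figure (50x).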

The claims come as inference takes a larger share of AI spending. OpenRouter data cited by NVIDIA shows software-programming-related AI queries rose from 11% to about 50% last year. Agentic coding and coding assistants typically need fast responses across multi-step workflows, as well as longer context windows so models can work across large codebases.

Systems at scale

Several cloud providers are already deploying GB300 NVL72 systems in production, according to NVIDIA. Microsoft, CoreWeave and Oracle Cloud Infrastructure are rolling out the platform for low-latency and long-context use cases, including agentic coding and coding assistants.

Inference providers have also moved quickly onto Blackwell. NVIDIA listed Baseten, DeepInfra, Fireworks AI and Together AI as adopters. It said customers have reduced cost per token by up to 10x compared with earlier deployments.

GB300 NVL72 is a rack-scale system based on the Blackwell Ultra GPU. NVIDIA positioned it as part of a broader approach that pairs hardware advances with software optimisation across the inference stack.

Performance claims

SemiAnalysis commentary cited by NVIDIA focused on generational improvements and total cost of ownership for inference.

“The improvement in generation on generation for NVIDIA is stark… From a performance per TCO basis, this suggests newer chips are not only the most performant but also the most economical option for inferencing workloads today” – SemiAnalysis, InferenceX.

NVIDIA also pointed to earlier work on GB200 NVL72, based on Blackwell rather than Blackwell Ultra. A Signal65 analysis cited by NVIDIA found that GB200 NVL72 delivers more than 10x the tokens per watt of Hopper. NVIDIA said that equates to one-tenth the cost per token versus Hopper in the tested scenarios.

Performance on Blackwell NVL72 systems continues to improve as software and kernels are tuned for specific workloads, including mixture-of-experts inference, NVIDIA said. It cited updates across TensorRT-LLM, Dynamo, Mooncake and SGLang, and said TensorRT-LLM improvements delivered up to 5x better low-latency performance on GB200 compared with results from four months earlier.

NVIDIA also highlighted system-level features and software techniques it links to higher throughput and lower latency. These include NVLink Symmetric Memory for direct GPU-to-GPU memory access and more efficient communication. It also described “programmatic dependent launch” as a way to reduce idle time by beginning setup for the next kernel before the prior one completes.

Long-context economics

NVIDIA said the advantage of GB300 NVL72 is clearer in long-context scenarios, where models ingest very large prompts and generate sizable outputs. For a workload with 128,000-token inputs and 8,000-token outputs, it said GB300 NVL72 delivers up to 1.5x lower cost per token than GB200 NVL72.

NVIDIA attributed the improvement to higher compute performance in NVFP4 format and faster attention processing. It said Blackwell Ultra has 1.5x higher NVFP4 compute performance and 2x faster attention processing than Blackwell.
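Why faster attention matters most at long context can be shown with a rough estimate: self-attention work grows roughly quadratically with sequence length, while the rest of a transformer grows roughly linearly. The model dimensions below are hypothetical, chosen only to illustrate the scaling.

```python
# Rough, illustrative estimate; hidden size and layer count are
# hypothetical, not tied to any specific model.

def attention_flops(seq_len: int, hidden: int, layers: int) -> int:
    # QK^T score computation plus attention-weighted values:
    # roughly 4 * seq_len^2 * hidden per layer.
    return 4 * seq_len**2 * hidden * layers

short_ctx = attention_flops(seq_len=8_000, hidden=8_192, layers=80)
long_ctx = attention_flops(seq_len=128_000, hidden=8_192, layers=80)

# A 16x longer prompt needs ~256x the attention compute, so a 2x
# speed-up in attention processing pays off disproportionately on
# 128,000-token inputs.
print(long_ctx / short_ctx)  # → 256.0
```

Under this scaling, the prefill phase for the 128,000-token workload described above is dominated by attention, which is where NVIDIA says Blackwell Ultra's gains are concentrated.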

CoreWeave linked the platform shift to broader changes in AI production workloads, where serving models at scale matters as much as training.

“As inference moves to the center of AI production, long-context performance and token efficiency become critical,” said Chen Goldberg, Senior Vice President of Engineering at CoreWeave. “Grace Blackwell NVL72 addresses that challenge directly, and CoreWeave’s AI cloud, including CKS and SUNK, is designed to translate GB300 systems’ gains, building on the success of GB200, into predictable performance and cost efficiency. The result is better token economics and more usable inference for customers running workloads at scale.”

Next platform

NVIDIA also used the update to outline what comes after Blackwell Ultra. It expects ongoing software optimisation to lift performance and reduce costs across the installed base of Blackwell systems.

It also pointed to Vera Rubin, its next platform, described as combining six new chips into what it calls an AI supercomputer. NVIDIA said Rubin will deliver up to 10x higher throughput per megawatt than Blackwell for mixture-of-experts inference, which it said would translate into one-tenth the cost per million tokens. It also said Rubin can train large mixture-of-experts models using one-fourth the number of GPUs compared with Blackwell.

NVIDIA said cloud deployments of GB300 NVL72 are underway and will expand for low-latency and long-context inference workloads, including agentic coding and interactive coding assistants.