Google introduced two new AI chips at its Cloud Next conference this week, splitting training and inference into distinct processors for the first time with the eighth generation of its tensor processing units.

The TPU 8t is built for training AI models, while the TPU 8i is designed for inference and handling AI agent workloads. Both chips will be available later this year, Google said.

Until now, a single chip handled both training and inference across all TPU generations. Specialized hardware for each workload became a priority as AI agent deployments grew and the performance requirements for training and inference continued to diverge, Google said.

Google designed the chips in partnership with Google DeepMind. CEO Sundar Pichai said the architecture is designed “to deliver the massive throughput and low latency needed to concurrently run millions of agents cost-effectively.”

The customer base for Google’s TPUs has been growing. Among the users Google cited are Citadel Securities, which relies on TPUs for quantitative research, and all 17 national laboratories in the U.S. Energy Department system, which run AI software on the chips. Anthropic has pledged to draw on multiple gigawatts of TPU capacity from Google, and Yahoo Finance reports that Google is in discussions to supply TPU capacity to OpenAI as well. A multiyear, multibillion-dollar agreement giving Meta access to Google’s TPUs was also struck in February, CNBC reported, citing The Information.

Since 2018, when Google first made TPUs available to outside cloud customers, its in-house chips have served as an option for companies seeking alternatives to procuring Nvidia hardware. The company does not compare the new chips’ performance directly against Nvidia’s processors.

Compared with the seventh-generation Ironwood, the TPU 8t offers 2.8 times the performance at equivalent cost, Google said. At superpod scale, the chip supports configurations of up to 9,600 units with two petabytes of shared high-bandwidth memory, targeting more than 97% useful compute time. According to Google, that cuts the time needed to develop frontier models from months to weeks.
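For a concrete picture of how a training job spreads across that many chips, the sketch below uses JAX, one of the frameworks Google lists as supported, to shard a toy matrix multiplication over whatever TPU devices are attached. It is an illustration only: the mesh axis name and array sizes are placeholders, and nothing here reflects the TPU 8t’s actual superpod topology.

```python
# Minimal JAX sketch: shard a computation across attached TPU devices.
# Illustrative only; the mesh axis name and array sizes are placeholders,
# not the TPU 8t superpod topology Google describes.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()  # the accelerator chips visible to this host
mesh = Mesh(mesh_utils.create_device_mesh((len(devices),)), axis_names=("data",))

# Split the batch dimension of the activations across the "data" mesh axis
# (the batch size must be divisible by the number of devices).
x = jax.device_put(jnp.ones((4096, 1024)), NamedSharding(mesh, P("data", None)))
w = jnp.ones((1024, 1024))  # small weight matrix, replicated on every device

@jax.jit
def forward(x, w):
    return jnp.dot(x, w)  # XLA compiles this and runs it in parallel across the mesh

y = forward(x, w)
print(y.shape, y.sharding)
```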

The TPU 8i carries 384 megabytes of on-chip SRAM, three times what Ironwood provided, along with 288 gigabytes of high-bandwidth memory, in a design oriented toward low-latency inference. For inference workloads, Google says performance per dollar improves 80% over the prior generation; at the same spend, that works out to about 1.8 times the throughput, which the company describes as letting customers handle roughly double the request volume without increasing spending.

Both chips are paired with Google’s Axion ARM-based CPUs and support frameworks including JAX, PyTorch, SGLang, and vLLM. They can be used as part of Google’s AI Hypercomputer system, which combines hardware, software, and networking into a unified stack.
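As a rough sense of what that framework support looks like in practice, here is a minimal vLLM sketch using its standard offline-inference Python API; the model checkpoint is a placeholder, and nothing in the snippet is specific to the TPU 8i or to Google’s stack.

```python
# Minimal vLLM sketch: offline batch inference with the standard Python API.
# The model name is a placeholder; nothing here is specific to the TPU 8i.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize the difference between training and inference accelerators.",
    "List three considerations when serving AI agents at scale.",
]

# vLLM batches and schedules these requests internally.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

In principle the same script runs unchanged against different hardware backends, which is what framework-level support of this kind is meant to enable.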