By Marie Hattar

AI is changing the world, but it requires massive amounts of processing capacity. Demand for AI compute is doubling roughly every 100 days, driving an investment boom in AI infrastructure. Data centres underpin the future of AI innovation, putting their performance firmly in the spotlight. Ensuring they are robust and reliable is arduous given the scale and complexity involved. Every element of the infrastructure, from chips and GPUs to servers, network components, and software, must be evaluated both individually and as a whole to ensure seamless operation.

Let’s look at how AI is taxing data centres. AI’s rapid growth is driven by complex algorithms and models that consume significant computing power and energy. The large language models (LLMs) behind GenAI are particularly demanding, requiring massive computational capacity and placing further strain on data centre resources.

Case in point: Sam Altman recently said the rollout of OpenAI’s latest model was slowed because the company was “out of GPUs.” What’s more, Goldman Sachs forecasts that AI will drive a 165% increase in data centre power demand by 2030. This puts infrastructure in the spotlight as the industry looks for ways to build a technological environment that can support future generations of AI models.

As infrastructure evolves, system-level evaluation is critical to ensure reliable performance.

Scale: Every aspect of data centre operations has to grow, including power, cooling, infrastructure, storage, and bandwidth. A critical part of achieving this is addressing latency in distributed computing environments. AI clusters are prone to performance bottlenecks caused by tail latency, the lag introduced by the system’s slowest components.

Individual components may pass compliance tests, but compliance alone is not enough; a component’s performance must also be evaluated to see how it handles network protocol data and forward error correction (FEC). Testing helps identify systemic inefficiencies, optimise resource allocation, and ensure that the system maintains high performance across all nodes.
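
To make the tail-latency point concrete, the short Python simulation below (an illustration only, not a depiction of any particular test tool) models a synchronous collective step across a hypothetical 512-node cluster in which 1% of latency samples straggle. Because the step finishes only when the slowest node reports in, the measured tail, not the average, sets cluster throughput.

```python
import random
import statistics

random.seed(42)

NUM_NODES = 512   # hypothetical cluster size
NUM_STEPS = 1000  # synchronous collective steps to simulate

def node_latency_ms() -> float:
    """Per-node step latency: ~10 ms on average, with rare slow outliers."""
    latency = random.gauss(10.0, 0.5)
    if random.random() < 0.01:             # 1% of samples straggle
        latency += random.uniform(20, 50)  # e.g. retransmits, congestion, contention
    return max(latency, 0.1)

step_times = []
for _ in range(NUM_STEPS):
    latencies = [node_latency_ms() for _ in range(NUM_NODES)]
    # A synchronous collective (e.g. an all-reduce) finishes only when the
    # slowest participant reports in, so the step time is the per-step maximum.
    step_times.append(max(latencies))

quantiles = statistics.quantiles(step_times, n=100)
print("Nominal per-node mean latency: ~10.0 ms")
print(f"Median cluster step time     : {quantiles[49]:.1f} ms")
print(f"p99 cluster step time        : {quantiles[98]:.1f} ms")
# Although 99% of individual samples are fast, nearly every step is held back
# by a straggler, which is why system-level testing must target tail latency.
```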

Specialised hardware: AI-specific hardware is critical to delivering more computational resources. For example, Nvidia’s latest superchip delivers a 30-fold improvement in performance while using 25 times less energy. However, these advances demand rigorous evaluation beyond compliance tests to establish performance under peak loads. System-level validation is crucial to ensure these components operate reliably under real-world conditions.
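
As an illustration of what evaluating performance under peak load involves, the sketch below ramps offered concurrency against a simulated system and reports throughput and p99 latency at each level. Everything in it, including run_inference and the notional 32-request capacity, is an assumption standing in for a real system under test, not any vendor’s API.

```python
import concurrent.futures
import random
import statistics
import time

SIMULATED_CAPACITY = 32  # notional number of concurrent requests handled comfortably

def run_inference(concurrency: int) -> float:
    """Simulated request against a hypothetical system under test.

    Latency is modelled to inflate once offered concurrency exceeds the
    notional capacity, mimicking queueing at saturation.
    """
    base = random.uniform(0.008, 0.012)                            # ~10 ms service time
    queueing = max(0, concurrency - SIMULATED_CAPACITY) * 0.001    # overload penalty
    straggle = random.uniform(0.05, 0.10) if random.random() < 0.02 else 0.0
    latency = base + queueing + straggle
    time.sleep(latency)
    return latency

def stress_level(concurrency: int, requests: int = 200) -> dict:
    """Drive the simulated workload at one concurrency level and summarise it."""
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: run_inference(concurrency), range(requests)))
    elapsed = time.time() - start
    return {
        "concurrency": concurrency,
        "throughput_rps": round(requests / elapsed, 1),
        "p99_latency_ms": round(1000 * statistics.quantiles(latencies, n=100)[98], 1),
    }

if __name__ == "__main__":
    # Ramp the load and look for the knee where throughput stops scaling
    # while tail latency keeps climbing: that, not the datasheet figure,
    # is the usable peak capacity.
    for level in (1, 8, 32, 128):
        print(stress_level(level))
```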

Intelligent workload management: Meeting computational demands requires moving to a disaggregated architecture so that resources can be allocated dynamically. Testing can validate this intelligent management and should incorporate emulation to benchmark network fabrics alongside dynamic resource allocation and auto-scaling.
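
As a sketch of the kind of intelligent management such testing needs to exercise, the Python snippet below shows a minimal, hypothetical scaling policy: a control loop that resizes a disaggregated worker pool from observed utilisation. The thresholds, pool sizes, and emulated demand pattern are assumptions for illustration; a test harness would drive exactly this sort of loop with emulated traffic to confirm it scales up before queues build and scales down without thrashing.

```python
# Hypothetical auto-scaling policy: resize a disaggregated worker pool
# based on observed utilisation. All thresholds and numbers are assumptions.

MIN_WORKERS, MAX_WORKERS = 4, 64
SCALE_UP_AT, SCALE_DOWN_AT = 0.80, 0.30   # utilisation thresholds

def desired_workers(current: int, utilisation: float) -> int:
    """Return the pool size for the next control interval."""
    if utilisation > SCALE_UP_AT:
        return min(MAX_WORKERS, current * 2)   # double under pressure
    if utilisation < SCALE_DOWN_AT:
        return max(MIN_WORKERS, current // 2)  # halve when mostly idle
    return current                             # otherwise hold steady

# Emulated demand pattern (in worker-equivalents): ramp, spike, then quiet.
workers = MIN_WORKERS
for step, demand in enumerate([5, 10, 30, 60, 55, 20, 8, 3, 2]):
    utilisation = min(1.0, demand / workers)
    new_workers = desired_workers(workers, utilisation)
    print(f"step {step}: demand={demand:>2} utilisation={utilisation:.2f} "
          f"workers {workers} -> {new_workers}")
    workers = new_workers
```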

AI models will continue to fuel exponential growth in demand for computational resources, driving an arms race to modernise infrastructure. However, if demand grows as Goldman projects, rigorous evaluation at the component and system level is vital to uncover inefficiencies and ensure that every aspect of the data centre is robust and optimised at the necessary scale.

The writer is a senior VP at Keysight Technologies