Ethernet has connected IT assets and carried their traffic for decades, but it needs some reshaping to meet the demands of emerging AI workloads.

After a period in which the public cloud dominated as the enterprise infrastructure of choice, AI workloads are swinging the pendulum back towards on-premises infrastructure.

Recent research by the Uptime Institute suggests that almost half of AI inference workloads – where a trained model is applied to new data it hasn’t seen before – run on centralised on-premises infrastructure, with another 34% running out of colocation data centres (which could feasibly be private cloud or other equivalent infrastructure for a single organisation). By contrast, only 14% of such workloads run in the public cloud and 3% at the edge.

This shift is occurring for multiple reasons, including concerns over data privacy, regulatory compliance, security, latency, and rising cloud bandwidth costs. The Uptime survey suggests that storage and networking costs are also a factor in decision-making.

What’s important is that on-premises infrastructure can meet the often intensive demands of AI workloads. In another survey, 54% of respondents ranked deploying AI effectively across their organisation as a top-three priority for this year.

For organisations, the next step is to ensure on-premises environments are, and can remain, fit for purpose as AI workloads become increasingly sophisticated and data-intensive.

Network requirements in a nutshell

AI workloads generate massive, sustained streams of data – known as “elephant flows” – that move east-west across the data centre between the GPUs in a cluster. Up to 90% of the traffic associated with an AI model is this machine-to-machine communication between GPUs, meaning only a small proportion of traffic ever enters or leaves the environment over more traditional north-south networking links.

The network is a central performance factor that demands careful, specialised design to meet the unique needs of AI workloads. In particular, special attention needs to be paid to the links connecting all the machines in the AI cluster, and to the capacity of those links.

Unlike traditional workloads, whose tasks can proceed independently in parallel, GPU clusters synchronise frequently and depend on having all the necessary data in place before moving forward. A delay or bottleneck affecting even a single GPU can trigger a cascading slowdown, making overall job completion time critically dependent on the slowest path in the system.
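To see why, consider a toy model of a synchronous training step in which every GPU must finish exchanging the same volume of gradient data before any GPU can begin the next step. The gradient size, link speeds, and single congested link below are illustrative assumptions, not measurements from a real cluster:

    # Toy model: in a synchronous training step, every GPU must finish
    # exchanging its gradients before any GPU can start the next step,
    # so step time is gated by the slowest participant.
    # All figures are assumed for illustration.

    gradient_bytes = 2 * 10**9  # assume 2 GB of gradients per GPU per step

    # Per-GPU link speeds: seven healthy 400 Gb/s links and one
    # congested link running at a quarter of that rate.
    link_speeds_gbps = [400, 400, 400, 400, 400, 400, 400, 100]

    transfer_times = [gradient_bytes / (g * 1e9 / 8) for g in link_speeds_gbps]

    print(f"Fastest GPU finishes in {min(transfer_times) * 1e3:.0f} ms")
    print(f"Step completes only after {max(transfer_times) * 1e3:.0f} ms")

In this sketch, a single link running at a quarter speed quadruples the communication time of the entire step, because every GPU waits on the slowest path.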

A fundamental design principle applies: avoid oversubscription. The links connecting storage and compute nodes to the network must be provisioned with sufficient capacity to ensure that no component becomes a bottleneck. 

As AI cluster size grows, maintaining the right port density, bandwidth, and architecture becomes essential to preserving efficiency and performance.
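A quick back-of-the-envelope check shows what a non-oversubscribed leaf switch looks like. The port counts and speeds below are assumptions chosen for illustration, not a recommendation for any particular switch:

    # Non-oversubscription check for a leaf switch: aggregate downlink
    # capacity (towards the GPUs) should not exceed aggregate uplink
    # capacity (towards the spine). Port counts and speeds are assumed.

    gpu_ports = 32          # downlinks: 32 GPUs per leaf at 400 Gb/s each
    gpu_port_gbps = 400
    uplinks = 16            # uplinks to the spine at 800 Gb/s each
    uplink_gbps = 800

    downlink_capacity = gpu_ports * gpu_port_gbps   # 12,800 Gb/s
    uplink_capacity = uplinks * uplink_gbps         # 12,800 Gb/s

    ratio = downlink_capacity / uplink_capacity
    print(f"Oversubscription ratio: {ratio:.1f}:1")  # 1.0:1, no bottleneck

Any ratio above 1:1 means the east-west elephant flows can exceed uplink capacity and queue up, slowing the whole job.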

What the reinvention of Ethernet looks like

Ethernet has been at the heart of high-performance on-premises architectures for more than five decades. 

Work is now underway to reinvent Ethernet to ensure it can continue to play this critical role while meeting the unique demands of AI workloads.

One challenge in achieving high-performance networking for AI workloads is that traditional TCP/IP stacks struggle at these speeds because of their high CPU overhead.

Remote Direct Memory Access (RDMA) offers a solution. By offloading transport processing from the CPU to specialised hardware on the network interface card, it lets applications read and write memory directly across the network, dramatically increasing performance.
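Some rough arithmetic illustrates the scale of the problem RDMA sidesteps. An oft-cited rule of thumb puts host TCP processing at roughly 1 Hz of CPU per 1 bit/s of throughput; treat that figure, and the core clock speed below, as assumptions for illustration rather than benchmarks:

    # Back-of-the-envelope cost of pushing one fast link through the
    # kernel TCP stack, using an assumed rule of thumb of ~1 CPU cycle
    # per bit and an assumed 3 GHz core.

    link_gbps = 400
    cycles_per_bit = 1.0    # assumed rule-of-thumb cost of host TCP
    core_ghz = 3.0          # assumed clock speed of one CPU core

    cpu_ghz_needed = link_gbps * cycles_per_bit     # ~400 GHz of CPU
    cores_needed = cpu_ghz_needed / core_ghz

    print(f"~{cpu_ghz_needed:.0f} GHz of CPU, roughly {cores_needed:.0f} cores,")
    print("just to drive one 400 Gb/s link through the TCP stack.")

By moving transport processing into the NIC and writing directly into application memory, RDMA frees those cores for the workload itself.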

Specifically, RDMA over Converged Ethernet (RoCE), combined with techniques like Data Centre Quantized Congestion Notification (DCQCN), Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and dynamic load balancing, creates a lossless Ethernet fabric purpose-built for AI.
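The feedback loop at the heart of DCQCN can be sketched in a few lines. The model below is heavily simplified (real DCQCN runs in NIC and switch hardware, adapts its rate-decrease factor over time, and keeps PFC only as a last-resort backstop), and every threshold and constant is invented for the illustration:

    # Highly simplified sketch of DCQCN-style rate control: a switch
    # marks packets with ECN once its queue passes a threshold, and the
    # sender cuts its rate on marked feedback and probes back up
    # otherwise. All constants are invented for this toy model.

    LINE_RATE_GBPS = 400.0
    ECN_THRESHOLD_KB = 100.0   # queue depth at which marking starts (assumed)
    DRAIN_GBPS = 380.0         # downstream drain rate (assumed congestion)

    rate = LINE_RATE_GBPS
    queue_kb = 0.0

    for step in range(20):
        # Queue grows by the excess of arrival over drain (the 0.5 is
        # an arbitrary scaling that turns Gb/s into KB for the toy).
        queue_kb = max(0.0, queue_kb + (rate - DRAIN_GBPS) * 0.5)

        if queue_kb > ECN_THRESHOLD_KB:
            rate *= 0.7        # multiplicative decrease on ECN feedback
        else:
            rate = min(LINE_RATE_GBPS, rate + 5.0)  # additive recovery

        print(f"step {step:2d}: queue {queue_kb:6.1f} KB, rate {rate:5.1f} Gb/s")

Run for a few iterations, the sender settles into oscillating around the drain rate instead of overrunning the queue, which is the behaviour a lossless fabric needs in order to avoid drops without constant PFC pauses.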

RoCE offers compelling advantages: it integrates easily into existing Ethernet environments and typically comes at a lower cost than alternative interconnects, making it a strong choice for AI data centres.

With growing concern that traditional network interconnects cannot provide the performance, scale, and bandwidth needed to keep up with AI demands, industry groups such as the Ultra Ethernet Consortium are working to extend and enhance the proven Ethernet standard. Their goal is to overcome the bottlenecks that arise when massive volumes of data are exchanged between compute nodes in Ethernet-based clusters, as in AI workloads.

By adding new capabilities and features to the known and proven Ethernet specification, Ultra Ethernet aims to solve the challenges that current Ethernet poses in AI and HPC data centre clusters, above all the efficient exchange of data between compute units connected over an Ethernet network.

The convergence of high-performance Ethernet, innovative AI models, and evolving enterprise needs is reshaping the infrastructure landscape. Organisations that act now to modernise their infrastructure by blending proven Ethernet technologies with next-generation fabrics like Ultra Ethernet will be best positioned to harness the full potential of AI, turning technical capability into real competitive advantage.