Opinion In recent discussions with industry vendor sales/marketing types, I’ve been hearing that HPC demand is falling off while AI system demand is continuing to increase. I’ve also seen articles implying that AI is somehow displacing HPC. Huh?

Ok, this is The Reg, so I’d like to open this semi-rant by stating that the idea that AI growth is smothering HPC spending is, well, stupid. Why? Because AI is a subset of the very broad HPC (High Performance Computing) category of workloads.

First, what is HPC? It’s not an application; it’s a loose term that covers apps and workflows in many domains, from financial services to pharma to manufacturing to a whole bunch of others. These are workloads that are demanding enough and important enough to justify the investment of time and money it takes to run them.

AI is an important thing, and it seems to be gaining some visibility (insert a snarky phrase here), but it’s still a subset of HPC. Here are some of the things they have in common.

Both AI (including ML, inference, and generative models) and traditional HPC workflows are complex and computationally intensive. Garden-variety systems and infrastructures simply can’t meet the time-to-solution and accuracy-of-solution requirements posed by HPC or AI workflows.

A high-performance infrastructure is critical to both types of workloads. While you could certainly run OpenFOAM or train an LLM on a laptop, the time to solution would be so long that the results would no longer be relevant (or you’d be retired) by the time you finally got them. The complexity of the models you use and the size of your data sets would also have to be severely constrained.
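To put a rough number on that, here’s a back-of-envelope sketch using the widely cited approximation that transformer training takes about 6 × parameters × tokens FLOPs – the model size, token count, and hardware throughput figures are all my assumptions for illustration, not measurements:

```python
# Back-of-envelope: why you don't train an LLM on a laptop.
# Assumes the common ~6 * parameters * tokens estimate for total
# training FLOPs; every number below is an illustrative guess.

def training_days(params: float, tokens: float,
                  sustained_flops: float) -> float:
    """Rough training time in days at a given sustained FLOP/s."""
    total_flops = 6 * params * tokens
    return total_flops / sustained_flops / 86_400  # seconds per day

LAPTOP_FLOPS = 5e12    # ~5 TFLOP/s from a decent mobile GPU (assumed)
CLUSTER_FLOPS = 5e17   # ~a few thousand datacenter GPUs (assumed)

# A hypothetical 7B-parameter model trained on 1T tokens:
print(f"Laptop:  {training_days(7e9, 1e12, LAPTOP_FLOPS):,.0f} days")
print(f"Cluster: {training_days(7e9, 1e12, CLUSTER_FLOPS):,.1f} days")
```

That works out to roughly 266 years on the laptop versus about a day on the cluster – hence “you’d be retired.”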

There is a never-ending need/desire for better accuracy and to address larger problems for both HPC and AI. This means, for example, analyzing more compounds in more permutations in drug discovery, adding more data to a machine learning model, or throwing another 100 billion parameters into an LLM.

Some will argue: “But AI uses accelerators like GPUs and many HPC applications don’t, so they’re different.” Errr… so what? HPC is not an application – reread the “What is HPC?” paragraph above.

Just because AI sellers and users might not consider AI as being related to HPC doesn’t mean it isn’t. There are many users of HPC who don’t use the term “HPC” to describe what they’re doing. They might call it technical computing or refer to it by any number of terms. But when you look under the hood, you will see HPC-like applications and HPC-type gear (fast CPUs, accelerators, clustered systems with high-speed interconnects, MPI, etc.) making it all work at scale.

Soon, we won’t even be having this discussion. AI will be folded into most applications and workflows. We’re already seeing this in commercial apps – even popular ones that send you automatic notifications when someone is using your microwave (which will be so much better with AI, won’t it?)

Drug discovery? Figuring out fusion power? Finding new materials? Manufacturing nearly anything? All are traditional HPC, and all will soon have AI grafted onto them. But there are many new areas that aren’t traditional HPC and will soon need HPC infrastructures. Areas like predictive analytics for personalized health (or financial) issues, optimized agriculture, improved manufacturing efficiency and quality, cybersecurity, and, yeah, customer service – although it’s pretty crappy right now (he said after spending way too long on a simple travel change).

Industry analyst types (like me) love to toss out numbers, particularly spending and market predictions. But trying to quantify “AI spending” vs “HPC spending” right now is mostly guesswork: a spreadsheet exercise in teasing out sales numbers and predictions from financial disclosures, market chest-beating, and, more likely, magic 8 balls.

HPC isn’t dying, it’s not even sick

All of the things that we do now will continue to be done, including looking at molecules and other small things, designing jetliners, or predicting the weather more precisely. But figuring out the absolute maximum price you’ll pay for a flight next month from A to B? That’s not traditional HPC – it’s new, and it’s AI – but it will still require high-performance systems to meet its quality-of-solution and speed-of-solution requirements.

The industry supporting HPC with products and services isn’t in danger either. AI and HPC have virtually the same requirements. This means the market for these offerings is going to expand as datacenters realize they need to remake their IT infrastructure so that it runs new AI workloads and AI-infused apps with the scale and performance they need. Oh, and at a price that doesn’t break the bank. How datacenters are going to cope with this sea change is a much more relevant issue.

The real challenges

Estimates are all over the map, but I think the average rack today (pre-AI mania) probably consumes 15-18 kW under full load. In the AI world, a single 8U node with eight GPUs will easily slurp down 10 kW – put four of those nodes in a rack and you’re at 40 kW. And that’s today; these numbers are heading upwards with each new generation of GPUs and CPUs.
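The arithmetic is simple enough to sanity-check yourself; here’s a trivial sketch, where the four-nodes-per-rack packing is my assumption, not a vendor spec:

```python
# Rough rack-power arithmetic from the figures above. The node count
# per rack is an assumption for illustration, not a vendor spec.

LEGACY_RACK_KW = 17     # midpoint of the 15-18 kW legacy estimate
GPU_NODE_KW = 10        # one 8U node with eight GPUs
NODES_PER_RACK = 4      # four 8U nodes in a 42U rack (assumed)

ai_rack_kw = GPU_NODE_KW * NODES_PER_RACK
print(f"AI rack draw: {ai_rack_kw} kW, "
      f"about {ai_rack_kw / LEGACY_RACK_KW:.1f}x a legacy rack")
```

Call it two-and-a-half legacy racks’ worth of juice in a single cabinet.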

When you’re adding AI infrastructure, the first thing to investigate is whether your electrical service can provide enough juice to power the systems (including storage and networking). One of the most overlooked ways of getting more electrical capacity is to prune your existing power-drawing equipment. I think most datacenters that perform a rigorous system audit will find that they’re wasting 20 percent of their electrical consumption. I’ve test-driven this assumption past quite a few datacenter types, and their response was: “If we’re only wasting 20 percent, I’d be pretty happy.” There are some tools out there that can help you figure out what’s being used, what isn’t, and how you can combine workloads to get higher system utilization.
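If you want to see what that audit looks like in miniature, here’s a sketch – the inventory, utilization figures, and 10 percent idle threshold are all invented for illustration, not from any real audit tool:

```python
# A minimal sketch of the audit idea: tally power going to systems
# whose utilization suggests they could be consolidated or retired.
# The inventory, fields, and threshold are all assumptions.

inventory = [
    # (name, avg_cpu_utilization, avg_draw_kw) -- hypothetical
    ("web-legacy-01", 0.03, 0.4),
    ("batch-07",      0.62, 0.9),
    ("dev-sandbox",   0.01, 0.3),
    ("db-prod-02",    0.55, 1.1),
]

IDLE_THRESHOLD = 0.10  # below this, treat the box as a pruning candidate

total_kw = sum(kw for _, _, kw in inventory)
idle_kw = sum(kw for _, util, kw in inventory if util < IDLE_THRESHOLD)

print(f"Total draw: {total_kw:.1f} kW")
print(f"Pruning candidates: {idle_kw:.1f} kW "
      f"({idle_kw / total_kw:.0%} of consumption)")
```

Even this toy inventory surfaces a quarter of its draw as pruning candidates – which is exactly the kind of headroom you need before the AI racks arrive.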

Next, you have to take a hard look at your cooling capacity. Computer equipment does a fantastic job of converting electricity to heat – essentially every watt you feed in comes back out as heat that has to be removed. Can your current cooling handle the extra load?
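The sizing math itself is straightforward, precisely because the heat load essentially equals the electrical load. A quick sketch using the standard HVAC conversion factors, with the rack figures carried over from the power discussion above:

```python
# Quick cooling-load check: nearly every watt a rack draws must be
# removed as heat. Conversion factors are the standard HVAC ones.

BTU_HR_PER_KW = 3412     # 1 kW of IT load ~= 3,412 BTU/hr of heat
BTU_HR_PER_TON = 12_000  # 1 "ton" of cooling = 12,000 BTU/hr

def cooling_tons(rack_kw: float) -> float:
    """Cooling capacity, in tons, needed for a rack's full draw."""
    return rack_kw * BTU_HR_PER_KW / BTU_HR_PER_TON

for kw in (17, 40):      # legacy rack vs the 40 kW AI rack above
    print(f"{kw} kW rack -> {cooling_tons(kw):.1f} tons of cooling")
```

Going from roughly five tons to over eleven tons per rack is not a change your facilities team will absorb quietly.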

If you’re still using air cooling exclusively, this might be a great time to investigate liquid cooling. Modern computing equipment can run at higher temperatures than most realize – in liquid cooling terms, that means inlet liquid temperatures of around 80°F (27°C). If your ambient temperature isn’t on the tropical side, you can get most of your cooling “free” through rooftop air-to-air heat exchangers, with much lower reliance on traditional chillers. The combination of advances in cooling technology, more choice in cooling solutions and vendors, and higher electricity costs makes liquid cooling a TCO win for mid-to-large-sized datacenters.
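How much “free” is free? A hedged sketch: with roughly 27°C inlet liquid and an assumed 5°C approach temperature on the heat exchanger, outside air below about 22°C can carry the load without chillers. The monthly temperatures below are invented for a generic temperate site:

```python
# A rough free-cooling check. The 5 degC heat-exchanger approach and
# the monthly average temperatures are assumptions for illustration.

INLET_C = 27            # ~80 degF inlet liquid, per the text above
APPROACH_C = 5          # assumed dry-cooler approach temperature
free_cooling_limit = INLET_C - APPROACH_C

# Hypothetical monthly average ambient temps for a temperate site:
monthly_avg_c = [2, 4, 8, 12, 17, 21, 24, 23, 19, 13, 7, 3]

free_months = sum(1 for t in monthly_avg_c if t < free_cooling_limit)
print(f"Free cooling possible ~{free_months} of 12 months "
      f"(ambient below {free_cooling_limit} degC)")
```

For a site like that, the chillers only earn their keep a couple of months a year – which is where the TCO win comes from.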

The real key is embracing ruthless IT efficiency and relentless measurement of efficiency metrics. It’s making sure workloads are running on the best and most efficient configurations, and that you’re not wasting time, energy, and money on idle or underutilized systems. It’s tracking changing conditions and adjusting for them, doing your best to plan ahead and, hopefully, get out of the firefighting business. Hard? Yes. Possible? Also yes.
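To make “relentless measurement” concrete, here’s a minimal sketch of the kind of per-system scorecard I have in mind – the names and numbers are entirely hypothetical, not output from any real monitoring tool:

```python
# A minimal sketch of "relentless measurement": a per-system
# energy-per-useful-work scorecard. Every name and number here is
# hypothetical, not output from any real monitoring tool.

systems = {
    # name: (jobs_completed, kwh_consumed, hours_idle_per_168h_week)
    "cluster-a": (1_200, 9_500, 12),
    "cluster-b": (300, 8_800, 95),
}

for name, (jobs, kwh, idle_hours) in systems.items():
    print(f"{name}: {kwh / jobs:,.1f} kWh/job, "
          f"idle {idle_hours / 168:.0%} of the week")
```

Whatever tooling you actually use, the point stands: if you can’t put a number on energy per unit of useful work and on idle time, you can’t manage either. ®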

Dan Olds is a veteran analyst of the HPC industry and runs his own research agency these days. He has written extensively for The Reg and his work is listed here.