The industry is becoming increasingly concerned about the amount of power being consumed by AI, but there is no simple solution to the problem. It requires a deep understanding of the application, the software and hardware architectures at both the semiconductor and system levels, and how all of this is designed and implemented. Each piece plays a role in the total power consumed and the utility provided. That is the ultimate tradeoff that must be made.

But first, the question of utility must be addressed. Is power being wasted? “We are using the power for something that has value,” says Marc Swinnen, director of product marketing at Ansys, now part of Synopsys. “It’s not wasted. It’s an industrial application of electricity, and it’s just another industry — just like steel and copper.”

In many cases, that power is offset by significant savings. “Farmers use AI to run autonomous tractors for cultivation,” says Diptesh Nandi, senior product marketing manager at Microchip. “They use AI for inferencing when spraying pesticides and fertilizers. This not only saves time, but also reduces the amount of chemicals used. It takes power to produce those, so the use of AI could save power. We’ve seen a boom in AI-powered edge devices, particularly in agriculture.”

At the Design Automation Conference this year, several academics claimed that the easy gains in power reduction already have been made. “I don’t agree with that,” says Frank Schirrmeister, executive director of strategic programs, system solutions, in Synopsys’ System Design Group. “We are not even close to having optimized everything. In addition, demand for the applications is growing so fast that it’s hard to catch up with power. The question may be, ‘How do I get to the lowest-power implementation?’ The NoC has an impact, the chiplet partitioning has an impact, the workload-specific architecture has an impact, and so does the desire for more performance. It’s a tradeoff versus power.”

Some compare the power consumed by a computer against nature. “If you look at something like a cockatiel and consider what it can do with 20 watts inside of its brain, it can fly,” says Jason Lawley, director of product marketing in Cadence’s Compute Solutions Group. “It can mimic words. It can do complex visual understanding of its surroundings, calculate 3D, and fly between trees. When you look at it from that perspective, lots of things are possible. It’s just a matter of how long until we can catch up with those. I don’t think that AI is going to continue on the same path forever. There will be other innovations and inventions that let us keep going, such as neuromorphic computing.”

Many of the headline numbers relate to training in the data centers, but longer term, this may be the wrong focus. “Traditionally, training has dominated our compute requirements with massive datasets and extended cycles,” says Doyun Kim, senior AI engineer at Normal Computing. “Today, we are seeing a fundamental shift with test-time compute techniques, where models perform multi-step reasoning such as chain-of-thought and tree-of-thought, along with agentic workflows that trigger dozens of inference operations per query, with power consumption now rivaling training intensity. For chip designers and data center operators, this represents a major shift. Inference is becoming a first-class power consideration. But how do we address this power challenge?”
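
A rough back-of-the-envelope calculation shows why that shift matters. Every number below is an illustrative assumption rather than a measured figure, but the arithmetic makes the point: once each query triggers dozens of inference passes, aggregate serving energy can overtake a one-off training run within days.

```python
# Hypothetical back-of-envelope comparison of training vs. inference energy.
# All figures below are illustrative assumptions, not measured values.

TRAIN_ENERGY_KWH = 1.0e6          # assumed one-off training cost for a large model
ENERGY_PER_INFERENCE_KWH = 0.001  # assumed energy per single forward pass
QUERIES_PER_DAY = 50e6            # assumed daily query volume
PASSES_PER_QUERY_SIMPLE = 1       # plain single-shot inference
PASSES_PER_QUERY_AGENTIC = 30     # chain-of-thought / agentic workflows

def daily_inference_kwh(passes_per_query: float) -> float:
    """Aggregate inference energy per day for a given reasoning style."""
    return QUERIES_PER_DAY * passes_per_query * ENERGY_PER_INFERENCE_KWH

simple = daily_inference_kwh(PASSES_PER_QUERY_SIMPLE)
agentic = daily_inference_kwh(PASSES_PER_QUERY_AGENTIC)

print(f"Single-shot inference: {simple:,.0f} kWh/day")
print(f"Agentic inference:     {agentic:,.0f} kWh/day")
print(f"Days of agentic serving to match training energy: "
      f"{TRAIN_ENERGY_KWH / agentic:.2f}")
```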

That may force design teams to be more frugal with power. “Edge AI power consumption will increase, but it’s going to be more battery-operated,” says Cadence’s Lawley. “It’s going to be more dispersed into our everyday lives. Relatively speaking, the power that we can put into those batteries is small compared to what goes on in the data centers themselves, and users will demand long battery life.”

The top level
If we assume that AI provides value, we can look at the other side of the equation. “The existing power grid was not built for AI and it can’t handle it,” says Rich Goldman, director at Ansys, now part of Synopsys. “It’s going to take a long time and be very expensive to upgrade the infrastructure. We have to look at local energy creation instead of trying to transport it from where the energy is created to where it’s needed. It’s going to be the age of small nuclear reactors.”

There are other non-carbon ways to generate the necessary power. “The good thing is that data centers can be placed anywhere there is access to power,” adds Ansys’ Swinnen. “Consider the Sahara. There is plenty of land and plenty of sunshine, and you could build solar farms. The beauty of AI is that you just run a fiber optic cable there, and you can get all your data in and out without a lot of infrastructure. You don’t need ports and roads.”

While that addresses the data center, the edge also has to be considered. “On-device edge AI execution is still an incredibly energy-intensive process when running LLMs,” says Maxim Khomiakov, senior AI engineer at Normal Computing. “Steering model outputs efficiently is a big challenge. Brute-forcing solutions is energy-intensive. One of the known techniques is generating many output traces and sub-setting the useful ones in tandem, optimizing prompt and answer. Long term, the path forward is building ASIC chips optimized for LLMs and inference-heavy workloads. The demand for inference is skyrocketing, and that is catching up to training costs.”

The amount of edge autonomy is evolving. “The main requirement for customers that use edge AI is to reduce latency,” says Microchip’s Nandi. “It takes too much time and power to send something to the data center and get the response back. One solution is to perform some compute at the edge before sending it to the data center. For example, if you are monitoring license plates on a highway, 75% of the workload is detecting the location of the license plate and tracking it along the road. Once you are able to lock on to that location, you send that data back to the cloud to do character recognition.”
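
The partitioning Nandi describes can be sketched in a few lines. The detector, tracker, and OCR calls below are hypothetical stubs rather than any real API; the point is which work stays on the device and how little data crosses to the cloud.

```python
# Structural sketch of the edge/cloud split for license-plate monitoring.
# detect_plate() and cloud_ocr() are hypothetical stubs standing in for real
# models; only the partitioning of work between edge and cloud is the point.

def detect_plate(frame):          # edge: cheap detector, runs on every frame
    return frame.get("plate_box")

def cloud_ocr(crop):              # cloud: expensive character recognition
    return f"OCR({crop})"

def process_stream(frames, lock_after=3):
    consecutive_hits, results = 0, []
    for frame in frames:
        box = detect_plate(frame)
        consecutive_hits = consecutive_hits + 1 if box else 0
        # Only a stable, locked-on track justifies shipping data upstream.
        if consecutive_hits >= lock_after:
            results.append(cloud_ocr(frame["crop"]))
            consecutive_hits = 0
    return results

frames = [{"plate_box": (10, 20, 80, 40), "crop": "plate_42"}] * 4
print(process_stream(frames))   # one cloud round-trip for four edge frames
```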

Model evolution
Models continue to get larger. “The rate of change for these large language models continues, and that directly increases the consumption rate for training,” says Lawley. “It will be interesting to see at what point they reach diminishing returns, but it doesn’t seem like they’re anywhere near that point yet. They continue to put more data in and get better results. They have different refinement techniques with the initial data set, and they’ve got secondary training and other forms of training that go into these large language models.”

The goal at the moment appears to be the creation of bigger, more unified models. “The first thing you can optimize is the model itself,” says Synopsys’ Schirrmeister. “There are lots of gains to be made by making the models specific to their needs. You make these models much more specific to applications, which results in being able to constrain them. The applications you run on it, which consume all that energy, are becoming more optimized, moving away from generalization.”

That may take things in a different direction. “Just as the silicon industry introduced multiple voltage domains, clock gating, and power gating to save power, we can apply similar concepts to AI systems,” says Normal’s Kim. “Like mixture-of-experts (MoE) architectures that avoid running entire models simultaneously, we can make AI systems more modular. By predicting which modules are needed in real time and dynamically activating only necessary components — similar to workload prediction — we could achieve significant energy savings through intelligent system-level power management.”
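
A minimal sketch of that idea, in plain NumPy, routes each input to only the top-k expert modules so the remaining weights stay untouched. The sizes, random weights, and router below are all illustrative assumptions, not any particular model.

```python
import numpy as np

# Minimal mixture-of-experts sketch: a router activates only the top-k expert
# modules per input, so most of the model's weights stay idle for any given
# token. Sizes and random weights are purely illustrative.

rng = np.random.default_rng(0)
N_EXPERTS, D, K = 8, 16, 2
experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
router_w = rng.standard_normal((D, N_EXPERTS))

def moe_forward(x):
    logits = x @ router_w                       # router scores per expert
    top_k = np.argsort(logits)[-K:]             # pick the K most relevant experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()
    # Only K of N_EXPERTS weight matrices are ever touched for this input.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top_k))

x = rng.standard_normal(D)
y = moe_forward(x)
print(f"activated {K}/{N_EXPERTS} experts, output shape {y.shape}")
```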

Some optimizations require a co-design focus. “There are lots of opportunities for improvements in the software stack, such as operator fusion, layout transformations, and compiler-aware scheduling,” says Prem Theivendran, director of software engineering at Expedera. “These can unlock latent hardware efficiency, but only if the hardware exposes those hooks. This requires close coordination between hardware capabilities and software optimizations. When models, compilers, and hardware are co-optimized, the gains can be substantial, even on already-efficient accelerators.”
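
Operator fusion is one of the more concrete of those opportunities. The sketch below only illustrates the concept (NumPy still allocates temporaries internally), but it shows what a fusing compiler removes: the intermediate tensors that would otherwise round-trip through memory between the matmul, the bias add, and the activation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 128))
w = rng.standard_normal((128, 256))
b = rng.standard_normal(256)

def unfused(x, w, b):
    t1 = x @ w                    # intermediate tensor written out
    t2 = t1 + b                   # second pass over the data
    return np.maximum(t2, 0.0)    # third pass (ReLU)

def fused(x, w, b):
    # A fusing compiler keeps the intermediates in registers or local SRAM
    # instead of spilling them to DRAM between operators.
    return np.maximum(x @ w + b, 0.0)

assert np.allclose(unfused(x, w, b), fused(x, w, b))
```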

There are more opportunities for the edge. “Quantization is one of the most important things to get right,” says Lawley. “There are two directions we see people going. One is to go smaller. While many are using Int8 today, Int4 and sometimes even Int1 are also being looked at. Int1 gets you smaller storage, smaller bandwidth, and less compute, which are the three main areas where we’re consuming power. We see more research into mixed modes of quantization, where you might have some layers that are operating at FP16, because they’re very important, and other layers operating at Int4. We are also seeing a move back to floating point from integer, even doing FP16 and FP8 because they are finding they’re getting better results with models that are not linear in terms of how you use those 8 bits or 16 bits. You can have more granularity with floating point representations.”
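
As a simplified illustration of the first direction, the sketch below performs symmetric per-tensor Int8 quantization of a weight matrix in NumPy. The scale choice and rounding are deliberately naive assumptions; production flows typically use per-channel scales, calibration data, or quantization-aware training.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor Int8 quantization (simplified sketch)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(2).standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

# Storage (and, roughly, bandwidth) drops 4x relative to fp32 weights.
print(f"fp32 weights: {w.nbytes/1024:.0f} KiB, int8 weights: {q.nbytes/1024:.0f} KiB")
print(f"max abs error after round-trip: {np.abs(w - dequantize(q, scale)).max():.4f}")
```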

Designing better hardware
There are two main approaches. The first is to design architectures that are better suited to executing AI workloads, while the second is to improve the efficiency of existing architectures. “Engineering is always a matter of abstraction, and it’s a tradeoff from this perspective because you will never really have full optimization across the design hierarchy,” says Benjamin Prautsch, group manager for advanced mixed-signal automation in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “You try to abstract as much as possible to generate more value in a shorter time. But that comes at the cost of sub-optimal designs, at least for a specific purpose. We will never really be able to find the optimum. But of course, we want to optimize, and that requires work on the whole stack, along the value chain from the top to the very bottom. We probably need both a holistic view and good point tools that can optimize point problems. The biggest problem is that AI is changing so rapidly, which is not really compatible with manufacturing timelines.”

Power can be saved at every step of the process, but it also can be wasted. “While reducing power seems straightforward — minimizing the terms in P = fCV² — it’s complicated by the inherent tradeoffs between power, performance, and area (PPA),” says Jeff Roane, group director of product management for the Digital and Sign-Off Group at Cadence. “These complications increase multi-fold in AI chip math functions due to glitch power, which is hard to measure and optimize. Thus, effective optimization, driven by accurate analysis, must occur at every design abstraction level, with architectural-level optimizations offering the largest reductions, up to 50%; RT-level, up to 20%; and gate through physical yielding up to 10%.”
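
That equation is usually written with an activity factor, P = αCV²f, and the quadratic voltage term is where most of the leverage lies. The numbers in the sketch below are assumed purely for illustration, but they show how modest voltage and frequency scaling compounds.

```python
# Dynamic power P = alpha * C * V^2 * f, with illustrative (not measured) values.
def dynamic_power_w(alpha, c_farads, v_volts, f_hz):
    return alpha * c_farads * v_volts**2 * f_hz

base = dynamic_power_w(alpha=0.2, c_farads=2e-9, v_volts=0.9, f_hz=2.0e9)
scaled = dynamic_power_w(alpha=0.2, c_farads=2e-9, v_volts=0.75, f_hz=1.6e9)
print(f"baseline: {base:.2f} W, voltage/frequency scaled: {scaled:.2f} W "
      f"({100 * (1 - scaled / base):.0f}% lower)")
```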

At each design step, understanding the workload is important. “Power is mostly dominated by dynamic power, and that is highly vector-dependent,” says Godwin Maben, a Synopsys fellow. “AI-specific workloads are very well defined, and hence generating workloads is not an issue. Power is significantly dominated by data movement from compute to memory and back. Having a power-efficient bus architecture is critical, and even architectural decisions such as compressing data going in and out of memory are very important. Power reduction is scalable. Because the same compute unit is repeated thousands of times, optimizing one unit will reduce overall power significantly.”

In all discussions, data movement stands at the top of the power concerns list. “AI workloads involve large volumes of data transferred between compute units, memory, and accelerators,” says Andy Nightingale, vice president of product management and marketing at Arteris. “To reduce energy per inference, you need to look at localized communication. Tiling, or spatial clustering techniques, are preferred over long-distance transfers. We see a future where clever interconnect design is the most significant lever SoC architects have to bend AI’s power curve.”
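
Data movement is also where tiling pays off. The sketch below uses a crude, assumed traffic model for a tiled matrix multiply: the larger the tile whose working set fits in local SRAM, the less operand data has to be re-fetched from DRAM. The byte counts come from the model itself, not from any measured silicon.

```python
# Simplified DRAM-traffic model for a tiled N x N matmul with fp16 operands.
# Assumes each T x T output tile streams a T x N panel of A and an N x T panel
# of B from DRAM while the output tile stays in local SRAM. Illustrative only.

def dram_traffic_bytes(n, tile, bytes_per_elem=2):
    tiles_per_dim = n // tile
    a_reads = tiles_per_dim**2 * (tile * n)     # A panel per output tile
    b_reads = tiles_per_dim**2 * (n * tile)     # B panel per output tile
    c_writes = n * n                            # each output written once
    return (a_reads + b_reads + c_writes) * bytes_per_elem

n = 4096
for tile in (64, 256, 1024):
    gb = dram_traffic_bytes(n, tile) / 1e9
    print(f"tile {tile:4d}: ~{gb:6.1f} GB of DRAM traffic")
```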

Other parts of the processor are less workload-dependent. “If you look back over time, matrix multiplication has been the one thing that’s constant throughout all of AI,” says Lawley. “That piece hasn’t really changed. The quantization has changed. The activation functions have changed. The bandwidth has changed. The way that people arrange the different layers has changed. But the matrix multiplication functionality is constant. We make sure that we have a very robust matrix multiplication solution, but then we have more programmability in what we do in terms of things like activation functions.”

Research continues into more significant architectural changes. “The discussions about in-memory compute are not over,” says Schirrmeister. “It helps performance and power because you don’t have to transmit the things across boundaries. You don’t need to move data to perform compute. Those are all things that haven’t been fully exploited yet. Others are looking at neuromorphic very seriously. I don’t think we are stuck with von Neumann. It’s just that there’s so much legacy depending on it. Can you do it differently, especially in the context of AI acceleration? Absolutely. There are so many approaches out there. Will anything stick? Potentially for those specific needs, like lowering energy and lowering power.”

To get closer to the operation of the brain, analog has to be considered. “There are some really interesting analog startups out there that have shown tremendous results,” says Lawley. “Unfortunately, they’ve not been able to scale across the range of operators that are needed. For things that the analog accelerators are good at, they’re really good, and at exceptionally low power. But a lot of times they have to fall back to digital. Analog is a hard manufacturing process. To have the necessary level of control, you need to make sure that all your currents are correct, your resistances are correct, your wire links have the right capacitance. It’s a more difficult problem to solve. Maybe it will get solved in the future, but companies have been trying to do that for a long time.”

EDA’s role
EDA helps with AI power reduction in two main ways. The first is to provide information that enables decision-making. The second is to provide tools that enable efficient implementations and optimizations. “EDA can shape AI architectures by turning what used to be guesswork into data-driven design,” says Expedera’s Theivendran. “Through design space exploration, workload profiling, and AI-assisted tuning, EDA can help architects build hardware that’s not just functional, but optimal for real AI workloads.”

The true extent of shift left becomes evident at the system level. “We’re at a point where we can’t just think about chip-level optimization anymore – we need to consider the entire stack from package to board to rack level,” says Kim. “What’s particularly critical is workload-aware system design. Different AI workloads – whether it’s training, inference, or these new test-time compute patterns – have vastly different power and thermal profiles. EDA tools need to evolve to help us analyze and optimize these full-system interactions based on actual workload characteristics. Only then can we design systems that truly maximize silicon utilization rather than being thermally throttled most of the time.”

Fast iteration of hardware architectures allows more options to be considered. “EDA needs to incorporate high-level, physically-aware planning tools,” says Arteris’ Nightingale. “Automation must allow quick iterations on topology and floorplan, simulating power-performance tradeoffs. AI-based design-space exploration can also help achieve optimal partitioning, routing, and resource placement.”
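
In spirit, such design-space exploration can be as simple as sweeping a few architectural knobs against a cost model and keeping the best tradeoff points. The cost model below is invented for illustration; a real flow would substitute simulation results or measured power and performance data.

```python
from itertools import product

# Toy design-space exploration: sweep two architectural knobs against a crude,
# invented cost model and report the best energy-delay-style scores.

def cost_model(num_tiles, sram_kb_per_tile):
    latency = 100.0 / num_tiles + 0.02 * num_tiles          # parallelism vs. NoC overhead
    power = 0.5 * num_tiles + 0.001 * num_tiles * sram_kb_per_tile
    # Assume bigger local SRAM cuts DRAM traffic, which shows up as lower latency.
    latency *= 1.0 / (1.0 + sram_kb_per_tile / 512.0)
    return latency, power

candidates = []
for tiles, sram in product([4, 8, 16, 32], [128, 256, 512]):
    lat, pwr = cost_model(tiles, sram)
    candidates.append((lat * pwr, tiles, sram, lat, pwr))

for score, tiles, sram, lat, pwr in sorted(candidates)[:3]:
    print(f"tiles={tiles:2d} sram={sram:3d}KB -> "
          f"latency={lat:5.2f} power={pwr:5.2f}W score={score:6.1f}")
```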

The complexity of the interactions between the workload, the architecture, and the transactions to memory has increased. “They have simply become too intricate for people to predict,” says Schirrmeister. “There still will be some components where spreadsheets will help you to identify the impact of a cache on how much traffic will pass across a chip or chiplet boundary, which may consume more power. You may still do a back-of-the-envelope calculation and use stochastic models. But the interactions are so complex that people require AI workloads to be run on your target architecture so that you’re confident you’re doing the right things from a performance perspective.”
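
The kind of spreadsheet estimate Schirrmeister mentions still has its place. The sketch below estimates how a cache hit rate changes the power spent moving data across a die-to-die boundary; the bandwidth and energy-per-bit figures are assumptions chosen only to make the arithmetic concrete.

```python
# Back-of-envelope estimate of data-movement power vs. cache hit rate.
# Bandwidth and energy-per-bit figures are assumptions for illustration only.

TRAFFIC_GB_PER_S = 200            # assumed raw request bandwidth from the compute die
PJ_PER_BIT_D2D = 0.5              # assumed die-to-die link energy per bit
PJ_PER_BIT_LOCAL = 0.05           # assumed on-die SRAM access energy per bit

def watts(gb_per_s, pj_per_bit):
    return gb_per_s * 8e9 * pj_per_bit * 1e-12

for hit_rate in (0.0, 0.5, 0.9):
    cross = watts(TRAFFIC_GB_PER_S * (1 - hit_rate), PJ_PER_BIT_D2D)
    local = watts(TRAFFIC_GB_PER_S * hit_rate, PJ_PER_BIT_LOCAL)
    print(f"hit rate {hit_rate:.0%}: ~{cross + local:.2f} W of data-movement power")
```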

There remains a lot of room for human innovation. “It’s a matter of complexity, and it’s basically too complicated, and that indicates the sheer amount of waste that can be expected in the design path,” says Fraunhofer’s Prautsch. “This is not uncommon when breaking down a problem into your solution. It’s always a tradeoff, and it’s always biased. Good communication between the stakeholders is essential so that you can quickly exclude options and narrow down the most reasonable options quickly, but it doesn’t necessarily mean that the optimal solution is even on the table.”

Conclusion
The power consumption of AI is beginning to ring alarm bells, and for good reason. But this is no different from the emergence of other industries. How we deal with it is what is important. Do we create more clean power, or do we reduce the power profile in some way? Can we do better by designing outside of the semiconductor industry's comfort zone, or would that restrict the rate at which benefits may come online? Can anyone hope to fully understand the implications of the decisions they are making?

Solutions require many stakeholders to come together, something that has been difficult in the past. Today, the rate of development of software far exceeds the rate at which hardware can respond, and some are hoping that AI can help speed that up. “The holy grail is AI fully designing the chips that make AI itself more efficient,” says Normal’s Khomiakov.
