Key Takeaways:

Model development is outpacing silicon design cycles, so edge AI architectures must prioritize adaptability.
The required cadence for model updates is highly application-dependent and is closely tied to product lifetime and operational risk.
Adaptability can conflict with power, performance, and area targets, so effective heterogeneous architectures and robust software/compiler toolchains are essential.

Experts At The Table: Today’s chip architect must contend with multiple factors when designing AI processors for fast, efficient performance against a backdrop of rapidly evolving AI models. Semiconductor Engineering sat down to discuss this with James McNiven, vice president of client computing, Edge AI at Arm; Amol Borkar, group director, product management for Tensilica DSPs at Cadence; Jason Lawley, director of product marketing, AI IP at Cadence; Sharad Chole, chief scientist and co-founder at Expedera; Justin Endo, director of marketing at Mixel, a Silvaco company; Steve Roddy, chief marketing officer at Quadric; Dr. Steven Woo, fellow and distinguished inventor at Rambus; Sathishkumar Balasubramanian, head of products for IC verification and EDA AI at Siemens EDA; and Gordon Cooper, principal product manager at Synopsys. What follows are excerpts of that discussion. Click here for part one.

Top Row: Arm’s McNiven; Cadence’s Borkar; Cadence’s Lawley; Expedera’s Chole; and Mixel’s Endo
Bottom Row: Quadric’s Roddy; Rambus’ Woo; Siemens EDA’s Balasubramanian; and Synopsys’ Cooper

SE: AI model porting is an important aspect of edge AI processor design. But when we look at ‘fast and efficient’ in that process, how often are the target AI models changing? And how fast does a silicon vendor or core vendor need to respond to port new models for their customers? Does that vary by end market segment?

Roddy: In some segments, there’s an acceleration in the rate of change in those models. You see it, for example, in anything automotive or robotics, where vast changes are happening, including the shift from individual standalone models chained together to the world models emerging now, like vision-language-action (VLA) models, which combine vision processing, language processing, and control actions. There are differences in characteristics between those workloads. Traditional vision processing is really compute-bound. You’ve got a relatively small model and a ton of pixels in a 4K image. That’s one type of compute, where you’re worried about MAC density. For language models, you’re typically worried about streaming a huge volume of weights. You’ve got your 30 billion parameters. VLAs merge those, and now you have to do both. So it leans on the general-purpose nature of the compute to be appropriate for a whole variety of tasks. That means you’re pushing more generality, and those models are rapidly iterating and changing.
There are some applications where that may not be the case. Our technology is most valued as more people realize they need to handle new models that maybe haven’t even been invented yet, with new operators. You could have the same type of appliance, like a $49 camera you put above your front door that looks for porch pirates stealing your packages. You buy it, you stick it up there, and it runs on lithium-ion batteries for two to three years. I happen to have one on my front door. I’ve never updated the firmware, and I’m never going to change it. When it dies, I’ll probably rip it down, throw it away, and buy a new one. It’s a disposable thing. At the other end of the spectrum, you’ve got $1,000 cameras that sit up on light poles and monitor traffic, with 10-year lifespans; cars, with 20-year lifespans; anything with a long lifespan. That’s where this question about models changing really gets emphasized.
So the designer of the appliance, or the designer of the chip, really has to think through the lifespan of their product. What’s the likelihood of change? For most interesting applications today, the answer is that it will change before you even take the thing out of the box. It will change before the end of the month. It’s happening at that rapid a rate. So there’s a premium on flexibility today that wasn’t there three years ago, when static vision tasks were the predominant thing that edge AI engines focused on. It’s a broader set of tasks now.
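
Roddy’s compute-bound versus weight-streaming distinction can be put in rough numbers. Below is a back-of-envelope sketch; the layer shape, parameter count, and int8 precision are illustrative assumptions, not figures from the discussion.

```python
# Back-of-envelope arithmetic intensity: vision CNN vs. LLM decode.
# All numbers are illustrative assumptions, not figures from the panel.

# Vision: one 3x3 conv layer, 64 -> 64 channels, over a 4K frame, int8 weights.
h, w, cin, cout, k = 2160, 3840, 64, 64, 3
conv_macs = h * w * cin * cout * k * k      # multiply-accumulates
conv_weight_bytes = cin * cout * k * k      # tiny weight set, fetched once
print(f"conv: {conv_macs / 1e9:.0f} GMACs over {conv_weight_bytes / 1e3:.0f} KB of weights")
print(f"conv MACs per weight byte: {conv_macs / conv_weight_bytes:,.0f}")  # compute-bound

# Language: one decode step of a 30B-parameter model, int8 weights.
params = 30e9
llm_macs = params              # roughly one MAC per weight per generated token
llm_weight_bytes = params      # every weight streamed from memory per token
print(f"llm:  {llm_macs / 1e9:.0f} GMACs over {llm_weight_bytes / 1e9:.0f} GB of weights")
print(f"llm MACs per weight byte: {llm_macs / llm_weight_bytes:,.0f}")     # bandwidth-bound
```

The conv layer does millions of MACs per weight byte, so MAC density dominates; the decode step does roughly one, so memory bandwidth dominates. A VLA-class device has to handle both regimes.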

Woo: New models and optimizations are coming out so quickly that it’s impossible for hardware vendors to turn designs fast enough to chase every new model and optimization. Customers understand this, but they also expect rapid enablement of faster processing, higher memory bandwidths, and some specialization when a model family becomes dominant. That puts pressure on chip architectures and software performance engineers to support quick and efficient porting that improves job throughput and latency. In consumer and vision-centric edge devices, response windows are short, and competitive differentiation depends on speed and accuracy. In safety-critical markets, model updates put a high priority on safety, since the cost of getting it wrong is much higher.

Balasubramanian: It really depends on the application. For example, we do a lot of factory automation at Siemens. If you get into factory automation, like an automotive manufacturing line, and you have an edge AI [device], the environment is not supposed to change, so the frequency of model changes is much lower than when I drive my Rivian or Tesla. My automotive application encounters far more unknowns; the problem is unbounded, so you really need to keep up with updating your models. You need a mechanism to do it in real time, or as soon as possible, because it’s mission-critical. So it really depends on the application. Even in an industrial setting, the moment a failure happens, or something happens that you’re not trained for, there should be a way to go and change it or update it, even though the frequency might be different.

Cooper: I agree it’s application-dependent. If you’re designing something that takes a year to get to chip and a year to productize, and it’s going to live in the market for 5 to 10 years, the models are going to change, so there must be some flexibility built into the IP for that to happen. The rate of change is interesting. With CNNs, we saw a 10-year evolution, where first there was a focus on performance and then a focus on efficiency. We’re seeing that with large language models now. They had huge parameter counts, but now they’re shrinking into small language models (SLMs). That is a continuous churn, so you must have some level of flexibility in your architecture. On the other hand, we’re all striving for zero power, zero area, and infinite performance, so there’s a tradeoff as to how efficient I can make my architecture and still make it programmable. That’s a challenge we all face.

Borkar: The models are changing really fast, to nobody’s surprise. They’re changing daily, hourly, even by the minute. If you’re getting updates from Hugging Face, you’re probably getting emails every couple of hours telling you there’s a new variant of an SLM or VLM, or a new multi-modal model. The other challenge is that it’s a market-driven effort, because companies across the industry are being incentivized to put AI into their solutions. Whether you like it or not, we have to start putting in some AI. It may not be the best fit, but that’s what’s happening. We’re getting a downstream push of AI models into places where, classically, you didn’t see AI being used. Now everyone’s using AI, and we’ve got to go along with the other cool kids.
That said, the biggest problem we all face is that we’re in the embedded space. It’s typically not like a Windows system, where you double-click an .exe and suddenly the model works. All of us strive to do our best, but we all face challenges, like these models coming in every day and new types of operators and layers coming in. Realistically, our companies don’t have the workforce of the big GPU company, which has enormous teams porting models to run on its IP. But it’s a good problem to have, that there is demand to run these models on embedded systems. How do we keep up with it?
There are two aspects to it: hardware and software. On the hardware side, if you look back a few years, there were DSPs, NPUs, GPUs, and GPGPUs, but there isn’t one magic bullet that can solve everything. We’ve got NPUs, and we’ve got DSPs. The challenge we’re facing is that not everything runs on one piece. You usually need some type of heterogeneous subsystem, maybe an AI co-processor with an NPU plus a CPU, to provide that level of flexibility for consuming the networks. And as these models evolve, new operators and new layers are being introduced every day. An NPU, compared to a DSP, is a bit more hardened. You need to know a lot of what you’re going to encounter upfront, and if you do know much of that ahead of time, you will get the best performance, the best power, the best energy. But if the hardware is hardened and it encounters something it wasn’t designed for, it can either fail miserably or find some way to fall back. This is where the heterogeneous solution comes in.
The software side is also very important, because many of us on the hardware side take software for granted. ‘Oh, if you can’t get this working in hardware, we’ll just throw it over to the software guys.’ And when you’re on the software side, you’re like, ‘Hey, if this doesn’t work, we’ll make it work in the hardware.’ So it’s not a magic bullet. Having the whole compiler flow, and having it map to your hardware, is a lot easier said than done. And it’s not just mapping to your hardware. If you cannot map an operator directly, do you have countermeasures, some way to emulate those operators or layers so they run?
As these networks have evolved, the end customer deploying the solution has primarily focused on how to feed data into the network and how to get the inference out the other side. Many of them are looking at how much runs on the NPU, how much runs on a GPU, how much runs on the DSP, and how much runs on the CPU. But when these networks are coming in and evolving every day, one of the big things we have seen is that, yes, customers want the best PPA and the best energy. But at the same time, can I feed it in from the left side and get a result out on the right side? All these factors play a very important role.
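
A minimal sketch of the operator-fallback idea Borkar describes, assuming a hypothetical compiler pass over a hypothetical three-engine subsystem; the engine names, operator tables, and graph are invented for illustration.

```python
# Minimal sketch of operator fallback in a heterogeneous subsystem: assign
# each op to the fastest engine that supports it, otherwise fall back.
# Engine names, operator tables, and the graph are all hypothetical.

NPU_OPS = {"conv2d", "matmul", "relu", "pool"}    # hardened, fastest path
DSP_OPS = NPU_OPS | {"softmax", "layernorm"}      # programmable fallback
# The CPU is the catch-all: everything runs there, just slowly.

def assign_engine(op_type: str) -> str:
    if op_type in NPU_OPS:
        return "NPU"
    if op_type in DSP_OPS:
        return "DSP"   # emulate ops the NPU lacks
    return "CPU"       # last resort for brand-new operators

graph = ["conv2d", "relu", "layernorm", "rotary_embedding", "matmul"]
for op in graph:
    print(f"{op:20s} -> {assign_engine(op)}")
# A new operator like rotary_embedding lands on the CPU until the toolchain
# adds NPU/DSP support: it runs, but nowhere near peak performance.
```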

Chole: Based on my experience, how fast the model changes is largely a function of where the NPU sits in the pipeline. Is it close to the sensor, or close to the application? For NPUs that are close to the sensors, noise-reduction applications don’t necessarily change that often. There can be parameter changes and small architectural changes, but not too often, because the model is very closely tied to the sensor. The sensor doesn’t change the workload, the FPS, or the latency, so there’s no requirement on the vendor side to be able to change them. But if you go toward applications, especially the control plane or any user interaction, there’s a lot more flexibility. And as models evolve in the data centers or in academia, new techniques come in, such as different quantization techniques or different ways of optimizing the structure and architecture of the model, and those have to be supported through the entire software and hardware stack. What we see as challenging is not really supporting new models. It’s much harder to support new models with good performance, because the optimization techniques that have to be used might not exist in that generation of the NPU. Then it’s a race between what architectural changes are allowed versus how much benefit you can get while constraining the AI model architecture to the NPU, or to what the hardware makes possible. For instance, if the hardware only supports a certain type of quantization, what can you get out of that? Our customers understand that at a certain level, but the industry as a whole does not necessarily care, because the industry as a whole always wants to move ahead, and then we’re always playing that catch-up game.
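
Chole’s quantization example can be made concrete with a small sketch. Assuming, hypothetically, that the hardware supports only symmetric per-tensor int8, the achievable accuracy is bounded by that scheme, and a single outlier weight inflates the error.

```python
import numpy as np

# Sketch of a hardware-fixed quantization scheme: symmetric, per-tensor int8.
# Purely illustrative; not modeled on any specific NPU.

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0     # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 10_000)
w[0] = 8.0                              # one outlier inflates the scale
q, s = quantize_int8(w)
err = np.abs(w - q.astype(np.float64) * s).mean()
print(f"scale = {s:.4f}, mean abs error = {err:.4f}")
# Per-channel scales or outlier-aware schemes would cut this error, but only
# if the hardware generation supports them -- hence the catch-up game.
```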

Lawley: There are two important models for a customer. The second most important model is the one they can give us, the public one, whose results they will want to know. But the first, most important model is the one they can’t give out. It’s their secret one, their secret sauce. That’s where it becomes incredibly important for the software and the compilers to be able to take that network we can’t see and don’t have access to, compile it, and lower it so that it works optimally on whatever hardware it’s going to run on. As these models evolve, having the compilers keep up with the evolution of the networks and the operators is incredibly challenging, incredibly important, and incredibly expensive. That’s where a lot of IP companies will start to win out, because we can spread that software cost across lots of different customers, whereas for somebody developing their own custom accelerator, it gets a lot more difficult on the software side, not necessarily the hardware side.

Roddy: Good point. That’s absolutely critical. No downstream OEM wants to be reliant upon the box builder or the chip builder, or, God forbid, the IP licensor, who is three levels removed, to port a new model. The tooling has to be bulletproof. That’s the direction of travel. Whatever your underlying architecture is (and I can argue all day long that heterogeneous architectures are a bad idea, but it doesn’t matter), the software has to enable the car manufacturer, which buys an ADAS system from a Tier One supplier, which bought a chip from a semiconductor supplier, which bought a processor core from Cadence or Expedera or Quadric or whomever. That automotive company’s data scientists have to be able to land their new, updated algorithm with high performance on whatever the high-performance piece of the architecture is. If that’s an accelerator, okay, fine, it has to run on the accelerator. There’s no, “Yeah, it runs, but it only runs on the CPU at 1/20th the speed.” It has to run at speed, easily, without 12 layers of NDAs between the data scientist somewhere at the car company and the processor architect. That doesn’t work. The capability has to be pushed out to the edge. That’s how all of this has to happen. Whether it’s an agentic thing at an industrial site or something in a car, we, the IP vendors in this small group of architects, can’t be the pinch point for all these new models.

SE: With the flurry of activity around applying agentic AI, how does that excitement change the type or the frequency of the workloads that you’re seeing?

Balasubramanian: What we are seeing with agentic AI is a lot of folks, including some very big companies and GPU providers, playing with floating-point precision to trade off accuracy against being able to handle a lot more within a given memory, because the requirements are different in terms of orchestration with the whole open cloud. There’s quite a bit of experimentation happening. The workload is also increasing. There’s a lot more orchestration, a lot more unknowns. How do we adapt the edge AI to handle it? If they change the floating-point precision in a model update, is the IP flexible enough to handle that? Or do they need to change something very basic, even swap out the architecture? How does that get handled?

Roddy: First level, we don’t care. The whole agentic AI thing is fascinating because it represents a huge step function in the demand for inference, the demand for tokens. Up till now, for the most part, it’s been a human walking up and typing something, whether you’re in a session with Claude on your PC, or a technician shows up at an industrial boiler or an elevator shaft and queries the machine. That triggers the inference, that triggers the run, that triggers the demand for tokens to go figure something out. When it was only triggered by human behavior, it was plausible to think that all edge AI could go back to the cloud for big inference. Suddenly, you’ve got 24/7/365. You’ve got some agent you set up so that every minute you’re going to run the monitoring on the elevator shaft, or you’re going to listen to vibrations on a piece of equipment and run that model. That has to be fully self-contained on the edge. It just has to be. If you run a big factory and you’ve got a thousand things that are instrumented, and those thousand things are pumping out hundreds of thousands of queries a day to the cloud and consuming tokens, you’re not going to spend $10,000, $20,000, $30,000 a day on tokens to monitor your factory. It all has to happen locally. All these SLMs, all these VLAs, have to be fully self-contained. For those of us in the edge AI world, it means making sure we have flexible platforms. And it boosts the amount of silicon you put in the device: more horsepower, more TOPS. It’s one thing if you have a two-TOPS solution that does some local processing and goes back to the cloud twice a day, but you can’t go back to the cloud 500 times a day. So you have to beef up the solution, beef up the memory, beef up what’s happening in the device, and only go back to the cloud when an aberration occurs, not every single time something normal happens. It’s going to be a fascinating change in the architecture. Will it mean more big data centers or fewer? That remains to be seen. The total demand for tokens will just explode, and it will saturate the data centers. But it’s also going to saturate what we have in our hands and in our devices.
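
Roddy’s cost argument reduces to simple arithmetic. The sketch below works through it with illustrative assumptions for device count, query rate, tokens per query, and cloud pricing.

```python
# Back-of-envelope token economics for Roddy's factory example.
# Every number below is an illustrative assumption.

devices = 1_000             # instrumented machines in one factory
queries_per_day = 24 * 60   # one monitoring query per minute, per device
tokens_per_query = 2_000    # prompt + response for one small check
price_per_mtok = 5.00       # assumed cloud price, $ per million tokens

daily_tokens = devices * queries_per_day * tokens_per_query
daily_cost = daily_tokens / 1e6 * price_per_mtok
print(f"{daily_tokens / 1e9:.1f}B tokens/day -> ${daily_cost:,.0f}/day")
# ~2.9B tokens and ~$14,400 per day: in the range Roddy cites, and a strong
# argument for keeping routine monitoring inference self-contained at the edge.
```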

Woo: What we are seeing with agentic AI is not just more inference, but longer-lived workloads that build up deeper context over time. That shifts the hardware conversation from short-lived, ephemeral work to sustained efficiency, data movement, reliability and availability, and power management. As agents communicate with other agents, individual workloads will amplify, and memory capacity and bandwidth needs will also grow. For chip designers, this pushes architectures toward more efficient compute and data movement through tighter integration, and intelligent data handling through memory tiering, beyond simply improving compute.

Chole: The workloads we are seeing in agentic AI are very large in terms of tokens, so we have to break them down into two parts. First are the inputs we’re actually sending in the system prompt. Two to three years ago, we started fine-tuning models. As the models became larger and larger, the benefit of fine-tuning diminished compared to prompt engineering. That means if you can write a good system prompt and a good user prompt, you get a huge benefit in accuracy for what you’d like to do, rather than just fine-tuning the model for certain use cases. Based on my industry experience, you actually lose accuracy when you start fine-tuning the model, because you lose the generalization that is very critical for LLMs. So system prompts are getting larger. To give you an idea of the scale, they range from 4,000 to 5,000 tokens up to 20,000 to 30,000 tokens, depending on the kind of application. I’m talking about a server-class application here.
Consider agentic applications like code reviews, building an architecture, or reviewing and summarizing documents. Those are purely one-shot agent applications, but because the ability of LLMs to comprehend a very complex task has increased, nowadays we have to type very little for the AI to understand our intent. That wasn’t the case six months ago. Six months ago, I had to write a complete PID to say what I needed to do: ‘Don’t deviate from this.’ I don’t have to do that anymore, because there’s a much better training data set available for understanding human intent. That means we can give very large, complicated tasks to the agents and expect them to actually perform them. And this has a secondary effect: along with the increase in thinking tokens, the output tokens increase too. Before, we might have defined a workload as 1,000 or 2,000 tokens. That’s no longer the case. We’re talking about tens of thousands of tokens, and this determines where the application should reside.
I’m not 100% sure an agentic application can reside on the edge. I don’t even think it’s going to benefit, in a certain sense, from being on the edge, if this is the workload class we need to have. As an industry, we need to figure out what kinds of agents should be on the edge. They cannot be the heavy-duty agents that data centers are using nowadays, because that’s practically impossible. We cannot run my cell phone for two hours, for instance, and then get back, ‘Hey, this is the answer to your question.’ That’s not what we’re expecting. In terms of privacy sensitivity and latency sensitivity, we still have to figure out what kinds of applications we need to run on the edge.
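
A rough latency estimate shows why Chole doubts that heavy-duty agents belong on the edge. The token counts below follow the scale he cites; the prefill and decode throughputs are assumed for illustration.

```python
# Rough latency estimate for a heavy agentic task on an edge device.
# Token counts follow the scale Chole cites; the edge throughput figures
# (prefill and decode tokens/sec) are assumptions for illustration.

prompt_tokens = 30_000      # large system prompt + user prompt
output_tokens = 20_000      # thinking tokens + answer tokens
prefill_tps = 500           # assumed edge NPU prefill rate, tokens/sec
decode_tps = 15             # assumed edge decode rate, bandwidth-bound

latency_s = prompt_tokens / prefill_tps + output_tokens / decode_tps
print(f"end-to-end: ~{latency_s / 60:.0f} minutes")
# Roughly 23 minutes for a single task: far from interactive, which is why
# heavy-duty agents are expected to stay in the data center for now.
```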

Cooper: From an NPU point of view, I look at agentic AI almost as a system-level problem. It’s really about how well you can do traditional perceptive AI, where you’re dealing with the sensors, and how well you can pivot and handle the compute- or memory-bound challenge of the large language models and VLAs that will then support agentic AI. It’s not like I have customers asking, ‘How well can your NPU run agentic AI?’ They ask, ‘How many tokens per second can you generate?’ Or, ‘How well can you run these specific models?’ It’s an important thing, but we don’t necessarily measure agentic AI performance at the NPU level, because it’s a system-level problem more than a specific NPU problem.

Lawley: If you think you understand how agentic AI will be used at the edge, then you don’t understand agentic AI yet. There are going to be a lot of things we don’t see right now. As this evolves, agentic AI is going to be a big evolutionary step, probably the next one when it comes to inference at the edge. How it all plays out always comes back to three things: How much power does it consume? How much data movement does it need? How much compute will it need?