Key Takeaways:
Agentic edge AI drives long-lived, tool-mediated loops with variable demands for compute, tokens, and memory.
Edge PPA is dominated by memory hierarchy and data movement, forcing tight feature triage and robust RAS.
Rapid model churn (multimodal, MoE, new formats) requires programmable, headroom-rich compute, interconnect, and runtime.
Experts At The Table: Chip architects must contend with multiple factors when designing AI processors for fast, efficient performance, not the least of which is rapidly evolving AI models. Semiconductor Engineering sat down to discuss this with Ronan Naughton, director of product management, Edge AI at Arm; Amol Borkar, group director, product management for Tensilica DSPs at Cadence; Jason Lawley, director of product marketing, AI IP at Cadence; Sharad Chole, chief scientist and co-founder at Expedera; Justin Endo, director of marketing at Mixel, a Silvaco company; Steve Roddy, chief marketing officer at Quadric; Dr. Steven Woo, fellow and distinguished inventor at Rambus; Sathishkumar Balasubramanian, head of products for IC verification and EDA AI at Siemens EDA; and Gordon Cooper, principal product manager at Synopsys. What follows are excerpts of that discussion. Click here for part one, and here for part two.

Top Row: Arm’s Naughton; Cadence’s Borkar; Cadence’s Lawley; Expedera’s Chole; and Mixel’s Endo
Bottom Row: Quadric’s Roddy; Rambus’ Woo; Siemens EDA’s Balasubramanian; and Synopsys’ Cooper
SE: What are the different types of agents being used on the edge today?
Woo: Most edge agents today fall into sensing, reasoning, and in the case of robots, planning and acting. These tasks often run together on the same device, and what matters is not just inference but how quickly the system can observe, decide, and respond. That pushes designers to rethink memory hierarchies, interconnects, and security boundaries. The agent is really the whole system working in concert, not just a neural network on a block diagram.
Chole: Let’s define why agentic AI is different from generative AI. First and foremost, there is a notion of autonomy. Generative AI is a prompt, and then you come up with a response. Agentic AI has more autonomy. You give it high-level tasks, and it is responsible for orchestrating, planning them out, and working out how to follow through. Then agents have access to some sort of memory. Not all agents have memory access, but typically there is memory where user instructions are provided — similar to CLAUDE.md files — and they have access to tool calls. So they’re not passive. It’s not just the prompt you’re given, and that’s all you can do. They are active. That means they can look up the current date, the weather, whether you recently clicked on a picture or not. They have access to the API calls or tool calls that you have enabled for them. I’m not saying they have access to your root file system, but they do have access to a lot of the things that we, as humans, would do on our laptops, on our own systems. And this is very useful for anything coding-related, because they can compile, they can run tests, and things like that. That’s where it all comes from — tool calls. After that, they are thinking machines. They are not just generating something. They are planning, thinking, and running things end to end, or they’re iterating. When the tool calls happen, they get feedback, and based on the feedback they rethink the plan. This differentiates agents from generative AI. You can think of it as a multi-turn interaction, but the turns happen through tools, not through human intervention. Because of this, the overall processing can get complicated. It’s not limited to, ‘Hey, I’m going to give you this image, and you need to generate a different image from it.’ If I bound a problem like that, my input and output tokens are bounded. That’s not the case for agents. There will be a max token limit, but the work isn’t bounded to some fixed number. And this creates certain challenges, especially around what kind of task you can give it. You can scale things down to a smaller size, limit the complexity of the tasks, or limit the tools the agent can use. But even then, the complexity of the task will dictate the amount of processing needed to finish it.
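To make the loop Chole describes concrete (plan, call a tool, observe the result, replan), here is a minimal Python sketch of an agentic loop. The call_model function, the tools dictionary, and the turn limit are hypothetical placeholders for whatever model runtime and tool APIs an edge agent actually exposes; this illustrates the pattern, not any vendor's implementation.

```python
# Minimal agentic-loop sketch: the agent plans, calls tools, and replans
# based on tool feedback until it decides the task is done.
# call_model(), the tools dict, and MAX_TURNS are hypothetical placeholders.

MAX_TURNS = 20  # bound the loop even though the task itself is open-ended

def run_agent(task, call_model, tools):
    """tools: dict mapping tool name -> callable(args) that returns a result string."""
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        step = call_model(history)          # model proposes its next action
        if step["type"] == "final_answer":  # agent decided it is finished
            return step["content"]
        # Otherwise the model requested a tool call, e.g. compile code, run tests,
        # read the calendar, or fetch the weather.
        result = tools[step["tool"]](step["args"])
        history.append({"role": "assistant", "content": str(step)})
        history.append({"role": "tool", "content": result})  # feedback drives replanning
    return "stopped: turn limit reached"
```

The key point is that each turn is driven by a tool result rather than a human prompt, which is why the token and compute demands of an agent are far less predictable than a bounded prompt-response workload.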
Naughton: What we’re seeing with respect to private agents is that my own hosted LLM has access to my private media, but also my calendar. And so, for example, I can have my private LLM running either scheduled or automated tasks for me and essentially be my admin and do some work for me. But we also all see a rise in coding agents on edge devices, where I can deploy multiple tasks in parallel to my coding agent on the edge, and that could be working autonomously and provide feedback to me with results afterwards. Those are two kinds of common, personal, edge-based agentic use cases. And on mobile, as well, we see new tools coming out that do rapid app navigation. I might give it an instruction, and it might open two or three apps to complete that instruction.
SE: How does an architect start a project and make decisions when model changes are inevitable?
Woo: Performance and power efficiency are increasingly dominated by memory system design and data movement. Architects need to understand target use cases and be ruthless about what earns silicon area, because every extra feature taxes PPA and adds complexity that you end up paying for later. Chip designers need to design for data movement first, because that is where the performance and power-efficiency battles are won or lost. Additional complexity comes from incorporating the right RAS (reliability, availability, serviceability) solutions to ensure predictable and trustworthy operation.
Roddy: It calls for as much generality and as much flexibility as the architect can put into the system, not knowing what shape the embedded agent is going to take in the future, and what kind of horsepower it might want from a compute or communications standpoint. Think of it in terms of things your next car might have, such as an embedded vehicle health agent. When should I take my car in to get serviced? Today, you figure that out with your own human input. You think about what kind of driving you do. If you share a car with your spouse or your kids, who’s driving it when and where? What if the agent was smart enough to know who drives it the most? It does all the predictive maintenance, monitors all the systems, understands the time of year, the weather, the weekend coming up. Steven and his family like to go skiing every weekend. The tires are balding. The snow is going to be bad. Maybe we should take it in for new tires. There are all kinds of contextual things the agent could know. The same vehicle sitting in a garage, where grandma only takes it out on Sundays to go to church, is going to have very different needs because it is driven differently. Does that kind of thing exist in the future? Does it adapt to the situation around it, communicate with the owner or the driver, and learn from the driver’s interactions with it? It starts monitoring different things, or recommending different things, testing different things that are likely to happen going forward. What kind of generality do you need in the compute infrastructure to handle that?
Lawley: To me, these agents come back to multimodal AI. To Steve’s point, you have your agent in your cars doing this, but can your agent actually pick up a phone and call somebody, a human in the loop, and talk to that human? So now it’s using audio techniques to do noise suppression. It’s doing language recognition. It’s doing a language model to make an appointment for you. And then it comes back and says, ‘Hey, your car’s going to have an appointment.’ I see this whole world of agents that’s going to fundamentally change the way that we interact with compute, and especially compute at the edge.
Roddy: And to your point there, you’ve driven to Southern California from the Bay Area because you have an event, and now the car is having problems. Now it has to find a service department there. Does it figure out you have a service contract because you bought the extended warranty? Where’s your dealer? Or which independent service shops does it recommend, because you like to use Yelp and you like five-star places? It’s going to be smart enough to figure out how to direct you and save you time on those kinds of things. And that’s not something a current vehicle does. A current vehicle puts a light on the dashboard that says the oil pressure is low. That’s about it. It doesn’t do anything about remedying that situation for you.
Lawley: From an architect’s point of view, the one thing we know is that flexibility with models is really important. There are going to be different floating-point representations. There could be new data types like ANT for models. There could be lots of different models that these agents will need to rely upon, so having the compute you build be flexible enough to handle a variety of different model types is incredibly important for architects.
Cooper: I agree with that. You mentioned the multimodal need. For those of us who are defining the next generation of what our NPU looks like, we’re building an accelerator that gets combined with a host processor in the system. There’s a system-level problem here from an NPU point of view: how flexible can you be in handling these emerging multimodal models that are coming out — VLA (vision-language-action), VLM (vision-language model), whatever they might be? That’s the challenge at the edge for those of us making NPUs.
Chole: I’d like to answer this from the deployment perspective. When we run agentic workloads, they run long term, and that means they need to be running in the background. That becomes the priority. And when things are running in the background, we want to make sure they are as optimized as possible. So support for MoE (mixture of experts) becomes essential, because we don’t have batching on the edge. These don’t have to be large models. Even for small models, MoE matters for the same reason. Support for KV (key-value) cache quantization, for techniques like turbo content, becomes essential because we don’t want to keep wasting bandwidth loading very large KV caches, which these agents will end up having even with sparse attention. Sparse attention would be interesting as well, as a way to save 2X to 3X bandwidth. Then the runtime deployment needs to support memory optimizations like prefix caching. You also need to be able to do tool calls. So we are basically bringing the server-class capabilities that current data center inference providers support to the edge, and we are trying to let the agents be as powerful as they can be with a minimal footprint. That’s my perspective from the deployment side. If you ask me how much the models will evolve — zero, I hope. If you ask me in what sense agents running on the edge are better than running them in a data center, I still don’t know exactly. Unfortunately, if you have connected devices, I still cannot recommend running them on the edge other than for privacy reasons.
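As a rough illustration of the KV-cache bandwidth point, the sketch below stores a cached key/value tensor in int8 with a per-token scale and dequantizes it on read, roughly halving the bytes moved versus fp16. It is a minimal NumPy example under assumed shapes (4,096 cached tokens, 8 heads, 64-dim heads); production runtimes use fused kernels, finer-grained scaling groups, and often 4-bit formats for larger savings.

```python
import numpy as np

def quantize_kv(kv_fp16):
    """Per-token symmetric int8 quantization of a KV tensor.
    kv_fp16: [tokens, heads, head_dim] in fp16. Returns int8 data plus per-token scales."""
    flat = kv_fp16.reshape(kv_fp16.shape[0], -1).astype(np.float32)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0 + 1e-8
    q = np.clip(np.round(flat / scale), -127, 127).astype(np.int8)
    return q.reshape(kv_fp16.shape), scale

def dequantize_kv(kv_int8, scale):
    # Reverse the scaling on read; attention math then runs on the reconstructed values.
    flat = kv_int8.reshape(kv_int8.shape[0], -1).astype(np.float32) * scale
    return flat.reshape(kv_int8.shape)

# Assumed example shapes: 4,096 cached tokens, 8 heads, 64-dim heads.
kv = np.random.randn(4096, 8, 64).astype(np.float16)
q, s = quantize_kv(kv)
print("fp16 bytes:", kv.nbytes, "int8 bytes:", q.nbytes + s.nbytes)  # roughly 2X smaller
recon = dequantize_kv(q, s)
print("max abs error:", float(np.abs(recon - kv.astype(np.float32)).max()))
```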
SE: What are the most interesting applications you’re seeing out there today for edge AI or agentic edge AI?
Woo: Some of the most compelling applications are in systems that are time-bound, like industrial automation, robotics, and automotive sensing. These systems use agentic behavior to adapt in real time to changing inputs, not just classify what they see. From a hardware perspective, the challenge is sustaining low latency while handling continuous data streams. That combination is forcing innovation in memory bandwidth, power efficiency, and system-level integration.
Lawley: It goes everywhere. Everybody’s using it for anything under the sun, so it’s hard to pinpoint one particular thing. It’s everything that you could think of at the edge, and probably then people come up with new ideas in areas that we’ve not even thought of.
Roddy: We see a lot of manufacturers and a lot of systems companies thinking about how LLMs, in particular, and SLMs can change the human-machine interface, whether it’s how you interact with your car, how a technician interacts with a piece of equipment in a factory, or how you interact with the microwave in your kitchen. If you don’t have buttons on the microwave, and you just talk to it, does that reduce the cost? You don’t have to have the touch panel, you don’t have to have all the things that break, so can it lower the cost of the microwave? Can a microphone, a speaker, and a display panel on a piece of equipment at a manufacturing site mean you don’t need a 600-page manual tucked into a side panel of that equipment? Think about saving the printing costs on the manual, or the manual getting lost. When you buy a car these days, you no longer get a 600-page book with all the error codes. You don’t need that now. You just talk to your vehicle, and it tells you what’s going on. So there are changes in the way things are physically built that can lower costs and increase user satisfaction, and a lot of that is changing just because you can put a 30 billion parameter model on the edge. It doesn’t have to be an agent, necessarily, but it certainly is a way to interact with these things in a much different way.
Balasubramanian: I’ve seen quite a few personal health assistants coming up on the agentic side, where something is taking action, not just sensing. There are a lot more applications getting built as we speak. One of the things Siemens did was partner with Meta on Ray-Ban, where we are equipping factory floors with Ray-Ban Meta glasses. That’s a perfect case of humans with AI processing on the edge. You essentially have people walking the factory floor, and as you walk up to a certain section, each machine comes up with a dashboard that says everything is green, something is wrong, or something needs maintenance. I don’t know the exact details of where the processing is happening. Is it still connected to a central hub? That is most likely the case, or it might be on the edge. Those are the industry use cases we are seeing, where you’re inferring something, sensing something, getting the information. ‘How do I act on it?’ is going to be the next big thing. It’s an interesting time, and there are a lot of interesting applications happening. I have played with a lot of the note takers, and the challenge there is also the power supply. As you do a lot more, power efficiency becomes much more important.
Cooper: We have this perceptive AI, and people are now really starting to figure out, ‘Oh, I have a real use case,’ or, ‘I have an example. I can add generative AI to that.’ In the automotive space, it might be in the cabin, where, in theory, you’ll be able to point out the window and say, ‘What building is that?’ And with multimodal, it can say, ‘I see where you’re pointing. I can see outside. I know where I am geographically. I understand your prompt.’ All this multimodal capability has moved forward. Then there is this whole idea of physical AI and robotics — cars, drones, humanoids. Nvidia is bullish on that. I don’t think everybody’s on board with having a humanoid robot in the home folding our clothes, but they’re bullish. That’s certainly an interesting application to watch, to see where robotics is going to go, as well.
SE: Have we ever seen a rate of change like we’re seeing with AI?
Balasubramanian: No, I haven’t seen it in my experience. I have 25-plus years of experience, but over the last 20 years, I haven’t seen this much of a change. Every week there are new customers popping up, and new design starts coming up for new applications, and we are catching up with them.
Lawley: If you look at history, Intel came online with x86, and there was the race with Fairchild. That was a pretty inventive time. But this is so much broader-based than the semiconductor race. Everybody knows about it. My kids know about it. My wife knows about it. My parents know about it.
Chole: Robotics and autonomy are going to push the boundary quite a bit. We will see PetaOPS engines. We started this conversation with world models. That’s quite interesting because those will have to be run on these autonomous platforms, and they do have significant processing requirements in terms of vision, as well as tokens. So that’s maybe what we’ll be talking about in a year.
Woo: The pace of change with AI is unlike anything we have seen in modern semiconductor design. AI is compressing timelines across the entire stack, and hardware feels that pressure immediately. Requirements continue to be rewritten as new capabilities come to market, with models evolving so quickly that assumptions from just a year ago may no longer hold. This is forcing a holistic approach to system design, where compute, memory, security, and I/O are planned together with software needs from the start. It is a fundamental shift in how we think about building chips for the future.
Naughton: It’s quite exponential. And the difference now is that it’s not just hype anymore. We are seeing significant productivity boosters, personal lifestyle boosters, and innovation and discoveries in AI. Maybe that’s drifting a bit away from edge AI, but certainly the first things I mentioned are really improving people’s lives. But it comes with risks, and we all have to be aware of those risks and take measured steps to ensure the productivity enhancements and lifestyle augmentations we are achieving are weighed against the potential risks associated with them.