{"id":13730,"date":"2026-04-23T07:57:11","date_gmt":"2026-04-23T07:57:11","guid":{"rendered":"https:\/\/www.europesays.com\/ai\/13730\/"},"modified":"2026-04-23T07:57:11","modified_gmt":"2026-04-23T07:57:11","slug":"can-edge-ai-keep-up","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ai\/13730\/","title":{"rendered":"Can Edge AI Keep Up?"},"content":{"rendered":"<p style=\"font-weight: 400;\">Key Takeaways:<\/p>\n<p>Model development is outpacing silicon design cycles, so edge AI architectures must prioritize adaptability.<br \/>\nThe required cadence for model updates is highly application-dependent and is closely tied to product lifetime and operational risk.<br \/>\nAdaptability can conflict with power, performance, and area targets, so effective heterogeneous architectures and robust software\/compiler toolchains are essential.<\/p>\n<p style=\"font-weight: 400;\">Experts At The Table: Today\u2019s chip architect must contend with multiple factors when designing AI processors for fast and efficient performance against the backdrop of rapidly evolving AI models. 
Semiconductor Engineering sat down to discuss this with James McNiven, vice president of client computing, Edge AI at <a href=\"https:\/\/semiengineering.com\/entities\/arm\/\" rel=\"nofollow noopener\" target=\"_blank\">Arm<\/a>; Amol Borkar, group director, product management for Tensilica DSPs at <a href=\"https:\/\/semiengineering.com\/entities\/cadence-design-systems\/\" rel=\"nofollow noopener\" target=\"_blank\">Cadence<\/a>; Jason Lawley, director of product marketing, AI IP at <a href=\"https:\/\/semiengineering.com\/entities\/cadence-design-systems\/\" rel=\"nofollow noopener\" target=\"_blank\">Cadence<\/a>; Sharad Chole, chief scientist and co-founder at <a href=\"https:\/\/semiengineering.com\/entities\/expedera\/\" rel=\"nofollow noopener\" target=\"_blank\">Expedera<\/a>; Justin Endo, director of marketing at <a href=\"https:\/\/semiengineering.com\/entities\/mixel-inc\/\" rel=\"nofollow noopener\" target=\"_blank\">Mixel<\/a>, a Silvaco company; Steve Roddy, chief marketing officer at <a href=\"https:\/\/semiengineering.com\/entities\/quadric\/\" rel=\"nofollow noopener\" target=\"_blank\">Quadric<\/a>; Dr. Steven Woo, fellow and distinguished inventor at <a href=\"https:\/\/semiengineering.com\/entities\/rambus-inc\/\" rel=\"nofollow noopener\" target=\"_blank\">Rambus<\/a>; Sathishkumar Balasubramanian, head of products for IC verification and EDA AI at <a href=\"https:\/\/semiengineering.com\/entities\/mentor-a-siemens-business\/\" rel=\"nofollow noopener\" target=\"_blank\">Siemens EDA<\/a>; and Gordon Cooper, principal product manager at <a href=\"https:\/\/semiengineering.com\/entities\/synopsys-inc\/\" rel=\"nofollow noopener\" target=\"_blank\">Synopsys<\/a>. What follows are excerpts of that discussion. 
Click <a href=\"https:\/\/semiengineering.com\/fast-isnt-fast-enough-redefining-metrics-for-edge-ai\/\" rel=\"nofollow noopener\" target=\"_blank\">here<\/a> for part one.<\/p>\n<p><img data-recalc-dims=\"1\" fetchpriority=\"high\" decoding=\"async\" class=\"alignnone size-full wp-image-24275148\" src=\"https:\/\/www.europesays.com\/ai\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-08-at-12.34.14-PM-1.png\" alt=\"\" width=\"906\" height=\"470\"  \/><\/p>\n<p>Top Row: Arm\u2019s McNiven; Cadence\u2019s Borkar; Cadence\u2019s Lawley; Expedera\u2019s Chole; and Mixel\u2019s Endo<br \/>Bottom Row: Quadric\u2019s Roddy; Rambus\u2019 Woo; Siemens EDA\u2019s Balasubramanian; and Synopsys\u2019 Cooper<\/p>\n<p style=\"font-weight: 400;\">SE: AI model porting is an important aspect of edge AI processor design. But when we look at \u2018fast and efficient\u2019 in that process, how often are the target AI models changing? And how fast does a silicon vendor or core vendor need to respond to port new models for their customers? Does that vary by end market segment?<\/p>\n<p style=\"font-weight: 400;\">Roddy: In some segments, there\u2019s acceleration in the rate of change in those models. You see it, for example, in anything automotive or robotics, where vast changes are happening, including the shift now underway from individual standalone models chained together to world models, like vision language action (VLA) models, where you\u2019re combining vision processing and language processing and control action-type stuff. There are differences in characteristics between traditional vision processing, which is really compute-bound. You\u2019ve got a relatively small model and a ton of pixels in a 4K image. That\u2019s one type of compute where you\u2019re worried about MAC density. For language models, you\u2019re typically worried about streaming X amount of weights. You\u2019ve got your 30 billion parameters and VLAs. 
Merge those, and now you can do both those things. So it leans on the general-purpose nature of the compute to be appropriate for a whole variety of tasks. That means you\u2019re pushing more generality, and those models are rapidly iterating and changing. There are some applications where that may not be the case. Our technology is most valued as more people realize they need to handle new models that maybe haven\u2019t even been invented yet, like new operators. You could have the same type of appliance, like a $49 camera that you put above your front door, that\u2019s looking for porch pirates stealing your packages. You buy it, you stick it up there, and it runs on lithium-ion batteries for two to three years. I happen to have one on my front door. I\u2019ve never updated the firmware, and I\u2019m never going to change it. When it dies, I\u2019ll probably rip it down, throw it away, and buy a new one. It\u2019s a disposable thing. At the other end of the spectrum, you\u2019ve got $1,000 cameras that sit up on light poles and monitor traffic, 10-year lifespans; cars, 20-year lifespans \u2014 anything with a long lifespan. This question about models changing really gets emphasized. So the designer of the appliance, or the designer of the chip, really has to think through the lifespan of their product. What\u2019s the likelihood of change? And for most interesting applications today, the answer is it will change before you even take the thing out of the box. It will change before the end of the month. It\u2019s happening at that rapid a rate. So there\u2019s a premium on flexibility today that wasn\u2019t there three years ago, when static vision kinds of tasks were the predominant thing that AI engines at the edge really focused on. It\u2019s a broader set of tasks now.<\/p>\n<p>Woo: New models and optimizations are coming out so quickly that it\u2019s impossible for hardware vendors to turn designs quickly enough to chase every new model and optimization. 
Customers understand this, but they also expect rapid enablement of faster processing, higher memory bandwidths, and some specialization when a model family becomes dominant. That puts pressure on chip architects and software performance engineers to support quick and efficient porting to improve job throughput and latency. In consumer and vision-centric edge devices, response windows are short, and competitive differentiation depends on speed and accuracy. In safety-critical markets, model updates put a high priority on safety, since the cost of getting it wrong is much higher.<\/p>\n<p style=\"font-weight: 400;\">Balasubramanian: It really depends on the application. For example, we do a lot of factory automation at Siemens. If you get into factory automation, like an automotive manufacturing line, and you have an edge AI [device], the environment is not supposed to change. So the frequency of models being changed is much less compared to when I drive my Rivian or Tesla. And my automotive application, because it faces far more unknowns, is unbounded, so you really need to keep up with updating your models. You need to have the mechanism to do it in real-time, or as soon as possible, because it\u2019s mission-critical. So it really depends on the applications. Even in an industrial setting, the moment that failure happens, or something happens that you\u2019re not trained for, there should be a way to go and change it or update it, even though the frequency might be different.<\/p>\n<p style=\"font-weight: 400;\">Cooper: I agree it\u2019s application-dependent. If you\u2019re designing something that takes a year to get to chip, a year to productize, and it\u2019s going to live in the market for 5 to 10 years, the models are going to change, so there must be some flexibility built into the IP for that to happen. The rate of change is interesting. 
With CNNs, we saw a 10-year evolution, where there was a focus on performance, and then there was a focus on efficiency. We\u2019re seeing that with large language models now. They had huge parameter counts, but now they\u2019re becoming smaller as small language models (SLMs). That is a continuous churn, so you must have some level of flexibility in your architecture. On the other hand, we\u2019re all striving to have that zero power, zero area, infinite performance, so there\u2019s a tradeoff as to how efficient I can make my architecture and still make it programmable. That\u2019s a challenge we all face.<\/p>\n<p style=\"font-weight: 400;\">Borkar: The models are changing really fast, to nobody\u2019s surprise. They\u2019re changing daily, hourly, even by the minute. If you\u2019re getting updates from Hugging Face, you\u2019re probably getting emails every couple of hours telling you there\u2019s a new variant of an SLM or VLM, or a multi-modal model that is coming up over here. The other challenge with it is also that it\u2019s a market-driven effort, because all the industry companies are being incentivized to start putting AI into their solutions. Whether you like it or not, we have to start putting in some AI over there. It may not be the best fit, but that\u2019s what\u2019s happening. We\u2019re getting a downstream push of a lot of AI models coming into places where, classically, you didn\u2019t see AI being used, but now everyone\u2019s using AI, and we\u2019ve got to go along with the other cool kids. We also have to start using AI. That being said, the biggest problem we all face in the industry is that we\u2019re in the embedded space. It\u2019s typically not like a Windows system where you just double-click an X and suddenly the model works. All of us strive to do our best, but we all face challenges, like these models coming in every day. There are new types of operator layers coming in. 
Realistically, the workforce in our companies is not the size of the big GPU company\u2019s, where there are millions of people working on porting models to get them to run on their IP. But it is a problem. It\u2019s a good problem to have that there is demand to run it on these embedded systems. But how do we keep up with it? There are two aspects to it \u2014 hardware and software. The hardware aspect being that, if you look back a few years, there were DSPs, there were NPUs, there were GPUs, there were GPGPUs. But there isn\u2019t one magic bullet that can solve everything. We\u2019ve got NPUs, and we\u2019ve got DSPs. The challenge that we\u2019re facing is that not everything runs on one piece. You usually need to have some type of heterogeneous subsystem, maybe an AI co-processor with an NPU plus a CPU, and provide that level of flexibility for consuming the networks. The big challenge is that as these models are evolving every day, typically we end up with new operators and new layers being introduced every day, many times with our solutions. If it\u2019s things like NPUs compared to DSPs, it\u2019s a little bit more hardened. You need to know a lot of what you\u2019re going to be encountering upfront, and if you do know much of that ahead of time, you will get the best performance, best power, best energy. But as we all know, if it\u2019s hardened and you encounter something that you are not supposed to encounter, then you can either crash miserably or you can find some way to fall back. This is where the heterogeneous solution comes in. Then, the software side is also very important, because, as many of us are on the hardware side, we take software for granted. 
\u2018Oh, if you can\u2019t get this working in hardware, we\u2019ll just throw it over to the software guys.\u2019 And when you\u2019re on the software side, you\u2019re like, \u2018Hey, if this doesn\u2019t work, we\u2019ll make it work on the hardware.\u2019 So it\u2019s not a magic bullet. Having the whole compiler flow, having it be able to map to your hardware, is a lot easier said than done, and not just mapping to your hardware. If you cannot map directly, do you have some countermeasures to do some emulation for those operators or layers to run? The end customer who\u2019s going to be deploying the solution, as these networks have evolved, has primarily focused on how to feed data into the network and how to get the inference out the other side. Many of them are looking at how much runs on the NPU, how much runs on a GPU, how much runs on the DSP, and how much runs on the CPU. But when these networks are coming in every day and evolving every day, one of the big things we have seen is, yes, best PPA, best energy. But at the same time, can I feed it in from the left side, and get a result from the right side? All these factors play a very important role.<\/p>\n<p style=\"font-weight: 400;\">Chole: Based on my experience, how fast the model changes is a function of where the NPU sits in the pipeline. Is it close to the sensor, or close to the application? So for NPUs, which are close to the sensors, noise reduction applications don\u2019t necessarily change that often. There can be parameter changes, there can be small architectural changes, but not too often, because it\u2019s very closely tied to the sensor. The sensor doesn\u2019t change, the workload doesn\u2019t change, the FPS doesn\u2019t change, the latency doesn\u2019t change, so there\u2019s no requirement on the vendor side to be able to change that. But if you go toward applications, especially the control plane or any user interaction, they have a lot more flexibility. 
And as the models evolve in the data centers or in academia, all those techniques come in: different quantization techniques, different ways of optimizing the structure and architecture of the model. They all have to be supported through the entire software and hardware stack. What we see as challenging is not really supporting new models. It\u2019s much harder to support new models with performance because the optimization techniques that have to be used might not always exist in that generation on the NPU. Then it becomes a race between what architectural changes are allowed and how much benefit you can get while constraining the AI model architecture to the NPU or to the hardware\u2019s possibilities. For instance, if the hardware only supports a certain type of quantization, what can you get out of that? I do see that our customers understand that at a certain level, but the industry as a whole does not necessarily care for that, because the industry as a whole wants to always move ahead, and then we are always playing that catch-up game.<\/p>\n<p style=\"font-weight: 400;\">Lawley: There are two important models for a customer. The second most important model is the model that they can give us, the public one, which they will want to know the results of. But the first, most important model to them is the one they can\u2019t give out. It\u2019s their secret one, their secret sauce. And that\u2019s where it becomes incredibly important for the software and the compilers to be able to take that network that we can\u2019t see, that we don\u2019t have access to, and compile it and lower it down so that it works optimally on whatever hardware it\u2019s going to run on. As you see these models evolving, the compilers being able to keep up with the evolution of the networks and the operators is incredibly challenging, incredibly important, and incredibly expensive. 
That\u2019s where a lot of IP companies will start to win out, because we can spread that software cost across lots of different customers, whereas for somebody developing their own custom accelerator, it gets a lot more difficult for them on the software side, not necessarily the hardware side.<\/p>\n<p style=\"font-weight: 400;\">Roddy: Good point. That\u2019s absolutely critical. No downstream OEM wants to be reliant upon the box builder or the chip builder, or, God forbid, the IP licenser, who are three levels removed to port a new model. The tooling has to be bulletproof. That\u2019s the direction of travel. Whatever your underlying architecture is, and I can argue all day long that heterogeneous architectures are a bad idea but it doesn\u2019t matter, the software has to be able to allow the car manufacturer, who\u2019s buying an ADAS system from a Tier One supplier, who bought a chip from a semiconductor supplier, who bought a processor core from Cadence or Expedera or Quadric or whomever, that automotive company with their data scientists have to be able to land their new, updated algorithm with high performance on whatever that high performance piece of the architecture is. If it\u2019s an accelerator, okay, fine, you have to be an accelerator. There\u2019s no, \u201cYeah, it runs, but it only runs on the CPU and runs at 1\/20th the speed.\u201d It has to run at speed, easily, without 12 layers of NDAs between the data scientist somewhere at the car company and the processor architect. That doesn\u2019t work. It has to push it out to the edge. And that\u2019s how all this has to happen. 
Whether it\u2019s an agentic thing in an industrial site or something in a car, we, the IP vendors in this small group of architects, can\u2019t be the pinch point for all these new models.<\/p>\n<p style=\"font-weight: 400;\">SE: With the flurry of activity around applying agentic AI, how does that excitement change the type or the frequency of the workloads that you\u2019re seeing?<\/p>\n<p style=\"font-weight: 400;\">Balasubramanian: What we are seeing with agentic AI is a lot of folks, some big, big companies, GPU providers, playing with the floating-point precision to trade off accuracy and being able to handle a lot more with the given memory, because the requirements are different in terms of orchestrations with the whole open cloud. There\u2019s quite a bit of experimentation happening. Also, the workload is increasing. There\u2019s a lot more orchestration, a lot more unknowns. How do we adapt the edge AI to do it? If they change the floating-point precision in model updates, is the IP flexible enough to handle that? Or do they need to change something very basic, even swap out the architecture or anything? How does that get handled?<\/p>\n<p style=\"font-weight: 400;\">Roddy: First level, we don\u2019t care. The whole agentic AI thing is fascinating because it represents a huge step function in the demand for inference, or demand for tokens. Up till now, for the most part, it\u2019s the human walking up and typing something, whether you\u2019re in a session with Claude on your PC, or you\u2019ve got an industrial boiler or an elevator shaft or something where a technician shows up and queries the machine that triggers the inference, that triggers the run. It triggers that demand for tokens to go figure something out. It\u2019s plausible to think that all edge AI could go back to the cloud for big inference when it was only triggered by human behavior. Suddenly, you\u2019ve got 24\/7\/365. 
You\u2019ve got some agent that you set up such that every minute we\u2019re going to run the monitoring on the elevator shaft, or we\u2019re going to listen to vibrations on our piece of equipment and run that model. That has to be fully self-contained on the edge. It just has to be. If you run a big factory, and you\u2019ve got a thousand things that are instrumented, if those thousand things are pumping out hundreds of thousands of queries a day out to the cloud and consuming tokens, you\u2019re not going to spend $10,000, $20,000, $30,000 a day for tokens monitoring your factory. It has to all happen locally. All these SLMs, all these VLAs, have to be fully self-contained. For those of us in the edge AI world, it means making sure that we can have flexible platforms. And it boosts the amount of silicon that you put in the device \u2014 more horsepower, more TOPS. It\u2019s one thing if you have a two-TOPS solution that does some local processing and goes back to the cloud twice a day, but you can\u2019t go back to the cloud 500 times a day. So you have to beef up the solution there, beef up the memory there, beef up what\u2019s happening in the device, and only go back to the cloud when an aberration occurs, not every single time something normal happens. It\u2019s going to be a fascinating change in the architecture. Will it mean more big data centers or fewer big data centers? That remains to be seen. It\u2019s one of those things where the total demand for tokens will just explode, and it will saturate the data centers. But it\u2019s also going to saturate what we have in our hands and our devices.<\/p>\n<p>Woo: What we are seeing with agentic AI is not just more inference, but longer-lived workloads that build up deeper contexts over time. That changes the hardware conversation from shorter-term, more ephemeral work into sustained efficiency, data movement, reliability and availability, and power management. 
As agents communicate with other agents, individual workloads will amplify, and memory capacity and bandwidth needs will also grow. For chip designers, this pushes architectures towards more efficient compute and data movement through tighter integration, and intelligent data handling through memory tiering beyond simply improving compute.<\/p>\n<p style=\"font-weight: 400;\">Chole: The workloads that we are seeing on agentic AI are very large in terms of tokens, so we have to break them down into two parts. First are the inputs that we are actually sending to the system prompt. Two to three years ago, we started fine-tuning the models. As the models started becoming larger and larger, the benefit of fine-tuning the models actually diminished compared to the prompt engineering. That means if you can write a good system prompt and a good user prompt, you actually get a huge benefit out of what you would like to do in terms of accuracy, rather than just fine-tuning the model for certain use cases. Based on my industry experience, you actually lose accuracy when you start fine-tuning the model, because you lose the generalization that is very critical for LLMs, so system prompts are getting larger. And just to give you the idea of the scale, they range from 4,000 to 5,000 tokens, to like 20,000 to 30,000 tokens. It varies based on the kind of application. I\u2019m talking about a server-class application here. For any agentic application, like code reviews, you want to build an architecture, you want to actually review documents, summarize documents, those kinds of things. Those are one-shot agent applications, purely one-shot, but because the ability of the LLMs has increased to comprehend a very complex task, nowadays, we have to type very little for AI to understand our intent. That wasn\u2019t the case six months ago. Six months ago, I had to write a complete PID to say what I needed to do. 
\u2018Don\u2019t deviate from this.\u2019 I don\u2019t have to do that anymore, because there\u2019s a much better training data set available for understanding human intent. That means we can give very large, complicated tasks to the agents while expecting them to actually perform them. And this generates a secondary effect: as the thinking tokens increase, the output tokens also increase. Before, we used to define a workload as maybe 1,000 or 2,000 tokens. That\u2019s no longer the case. We are talking about tens of thousands of tokens, and this determines where the application should reside. I\u2019m not 100% sure that an agentic application can reside on the edge. I don\u2019t even think it\u2019s going to be beneficial, in a certain sense, for them to be on the edge, if this is the workload class that we need to have. As an industry, we need to figure out what kind of agents should be on the edge. They cannot be the heavy-duty agents that data centers are using nowadays, because that\u2019s practically impossible. We cannot really run, for instance, my cell phone for two hours and then get back, \u2018Hey, this is the answer to your question.\u2019 That\u2019s not what we are expecting. In terms of privacy sensitivity, in terms of latency sensitivity, we still have to figure out what kind of applications we need to run on the edge to be able to solve them.<\/p>\n<p style=\"font-weight: 400;\">Cooper:\u00a0 I look at agentic AI almost as a system-level problem from an NPU point of view. It\u2019s really about how well you can do this traditional perceptive AI, where you\u2019re dealing with the sensors, and how well you can pivot and do this compute, or this memory-bound challenge of large language models, VLAs, that will then support agentic AI. 
\u00a0It\u2019s not like I have customers asking, \u2018How well can your NPU run agentic AI?\u2019 They ask, \u2018How well can you generate your tokens per second?\u2019 Or, \u2018How well can you run these specific models?\u2019 It\u2019s an important thing, but it\u2019s not necessarily something from an NPU that we measure the performance of an agentic AI, because that\u2019s a system-level problem more than a specific NPU problem.<\/p>\n<p style=\"font-weight: 400;\">Lawley:\u00a0 If you think you understand how agentic AI will be used at the edge, then you don\u2019t understand agentic AI yet. There\u2019s going to be a lot of things that we don\u2019t see right now. As this evolves, this is going to be a big evolutionary step, and probably the next evolutionary step when it comes to inference at the edge will be this agentic step. How it all plays out always comes back to the three things. How much power does it consume? How much data movement does it need? How much compute will it need?<\/p>\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"Key Takeaways: Model development is outpacing silicon design cycles, so edge AI architectures must prioritize adaptability. 
The required&hellip;\n","protected":false},"author":2,"featured_media":13731,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[24,10594,5079,25,10595,3317,10596,10597,10598,10599,10600,9383],"class_list":{"0":"post-13730","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-ai","8":"tag-ai","9":"tag-architecting-ai-processors","10":"tag-arm","11":"tag-artificial-intelligence","12":"tag-cadence","13":"tag-edge-ai","14":"tag-expedera","15":"tag-mixel","16":"tag-quadric","17":"tag-rambus","18":"tag-siemens-eda","19":"tag-synopsys"},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts\/13730","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/comments?post=13730"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts\/13730\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media\/13731"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media?parent=13730"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/categories?post=13730"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/tags?post=13730"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}