{"id":209344,"date":"2025-06-24T03:29:10","date_gmt":"2025-06-24T03:29:10","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/209344\/"},"modified":"2025-06-24T03:29:10","modified_gmt":"2025-06-24T03:29:10","slug":"introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/209344\/","title":{"rendered":"Introducing Mu language model and how it enabled the agent in Windows Settings\u202f"},"content":{"rendered":"<p>We are excited to introduce our newest on-device small language model, Mu. This model addresses scenarios that require inferring complex input-output relationships and has been designed to operate efficiently, delivering high performance while running locally. Specifically, this is the language model that powers the <a href=\"https:\/\/blogs.windows.com\/windowsexperience\/2025\/05\/06\/introducing-a-new-generation-of-windows-experiences\/\" target=\"_blank\" rel=\"noopener\">agent in Settings<\/a>,\u202f available to <a href=\"https:\/\/blogs.windows.com\/windows-insider\/2025\/06\/13\/announcing-windows-11-insider-preview-build-26200-5651-dev-channel\/\" target=\"_blank\" rel=\"noopener\">Windows Insiders in the Dev Channel with Copilot+ PCs<\/a>, by mapping natural language input queries to Settings function calls.<\/p>\n<p>Mu is fully offloaded onto the Neural Processing Unit (NPU) and responds at over 100 tokens per second, meeting the demanding UX requirements of the agent in Settings scenario. 
This blog will provide further details on Mu\u2019s design, its training and how it was fine-tuned to build the agent in Settings.<\/p>\n<p><strong>Model training Mu<\/strong><\/p>\n<p>Enabling <a href=\"https:\/\/blogs.windows.com\/windowsexperience\/2024\/12\/06\/phi-silica-small-but-mighty-on-device-slm\/\" target=\"_blank\" rel=\"noopener\">Phi Silica<\/a> to run on NPUs provided us with valuable insights about tuning models for optimal performance and efficiency. These insights informed the development of Mu, a micro-sized, task-specific language model designed from the ground up to run efficiently on NPUs and edge devices.<\/p>\n<p><img fetchpriority=\"high\" decoding=\"async\" class=\"wp-image-179775 size-large\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/EncDec-1024x795.png\" alt=\"Encoder-Decoder Architecture compared to Decoder-only Architecture.\" width=\"1024\" height=\"795\"\/>Encoder-Decoder Architecture compared to Decoder-only Architecture<\/p>\n<p>Mu is an efficient 330M encoder\u2013decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder\u2013decoder architecture, meaning an encoder first converts the input into a fixed-length latent representation, and a decoder then generates output tokens based on that representation.<\/p>\n<p>This design yields significant efficiency benefits. The figure above illustrates how an encoder-decoder model reuses the input\u2019s latent representation, whereas a decoder-only model must consider the full input-plus-output sequence. By separating the input tokens from output tokens, Mu\u2019s one-time encoding greatly reduces computation and memory overhead. In practice, this translates to lower latency and higher throughput on specialized hardware. 
For example, on a Qualcomm Hexagon NPU (a mobile AI accelerator), Mu\u2019s encoder\u2013decoder approach achieved about 47% lower first-token latency and 4.7\u00d7 higher decoding speed compared to a decoder-only model of similar size. These gains are crucial for on-device and real-time applications.<\/p>\n<p>Mu\u2019s design was carefully tuned for the constraints and capabilities of NPUs. This involved adjusting model architecture and parameter shapes to better fit the hardware\u2019s parallelism and memory limits. We chose layer dimensions (such as hidden sizes and feed-forward network widths) that align with the NPU\u2019s preferred tensor sizes and vectorization units, ensuring that matrix multiplications and other operations run at peak efficiency. We also optimized the parameter distribution between the encoder and decoder \u2013 empirically favoring a 2\/3\u20131\/3 split (e.g. 32 encoder layers vs 12 decoder layers in one configuration) to maximize performance per parameter.<\/p>\n<p>Additionally, Mu employs weight sharing in certain components to reduce the total parameter count. For instance, it ties the input token embeddings and output embeddings, so that one set of weights is used for both representing input tokens and generating output logits. This not only saves memory (important on memory-constrained NPUs) but can also improve consistency between encoding and decoding vocabularies.<\/p>\n<p>Finally, Mu restricts its operations to those NPU-optimized operators supported by the deployment runtime. By avoiding any unsupported or inefficient ops, Mu fully utilizes the NPU\u2019s acceleration capabilities. 
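<\/p>
<p>As a toy illustration of why this split pays off at decode time, consider counting layer passes (our own sketch using the 32-layer\/12-layer configuration cited above; not a measurement of Mu):<\/p>

```python
# Illustrative layer-pass count (our own sketch, not a measurement of Mu):
# with a 2/3-1/3 parameter split, the encoder runs once over the input,
# while only the small decoder stack runs again for every output token.

ENC_LAYERS, DEC_LAYERS = 32, 12         # split cited for one Mu configuration
TOTAL_LAYERS = ENC_LAYERS + DEC_LAYERS  # a decoder-only stack of similar size

def encdec_layer_passes(n_output_tokens: int) -> int:
    # one parallel encoder pass over the input, then the decoder per token
    return ENC_LAYERS + DEC_LAYERS * n_output_tokens

def deconly_layer_passes(n_output_tokens: int) -> int:
    # one prefill pass, then the full stack again for every generated token
    return TOTAL_LAYERS + TOTAL_LAYERS * n_output_tokens
```

<p>The per-token decode cost differs by roughly the ratio of total layers to decoder layers, the same order of magnitude as the measured decoding speedup.<\/p>
<p>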
These hardware-aware optimizations collectively make Mu highly suited for fast, on-device inference.<\/p>\n<p><strong>Packing performance in a tenth the size<\/strong><br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"wp-image-179777 size-full\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/InferenceTime.png\" alt=\"Graph showing two lines: Inference Time versus Input Sequence Length.\" width=\"1000\" height=\"600\"\/><\/p>\n<p>Mu adds three key transformer upgrades that squeeze more performance from a smaller model:<\/p>\n<ul>\n<li><strong>Dual LayerNorm (pre- and post-LN)<\/strong> \u2013 normalizing both before and after each sub-layer keeps activations well-scaled, stabilizing training with minimal overhead.<\/li>\n<li><strong>Rotary Positional Embeddings (RoPE)<\/strong> \u2013 complex-valued rotations embed relative positions directly in attention, improving long-context reasoning and allowing seamless extrapolation to sequences longer than those seen in training.<\/li>\n<li><strong>Grouped-Query Attention (GQA)<\/strong> \u2013 sharing keys\/values across head groups slashes attention parameters and memory while preserving head diversity, cutting latency and power on NPUs.<\/li>\n<\/ul>\n<p>Training techniques such as warmup-stable-decay schedules and the Muon optimizer were used to further refine its performance. Together, these choices deliver stronger accuracy and faster inference within Mu\u2019s tight edge-device budget.<\/p>\n<p>We trained Mu on Azure Machine Learning using A100 GPUs, in several phases. 
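<\/p>
<p>Of the three upgrades listed above, grouped-query attention is the easiest to sketch. The following is a generic GQA example in NumPy (illustrative only; it is not Mu\u2019s implementation):<\/p>

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Generic GQA sketch: q has shape (n_q_heads, seq, d); k and v have
    shape (n_kv_heads, seq, d), with n_kv_heads dividing n_q_heads so each
    key/value head serves one group of query heads."""
    n_q_heads, _, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                                # shared K/V head index
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)             # softmax rows sum to 1
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((2, 4, 16))
v = np.ones((2, 4, 16))
out = grouped_query_attention(q, k, v)   # eight query heads share two K/V heads
```

<p>Because two key\/value heads serve all eight query heads here, the key\/value cache shrinks four-fold while each head still computes its own queries.<\/p>
<p>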
Following the techniques pioneered in the development of the Phi models, we began with pre-training on hundreds of billions of the highest-quality educational tokens to learn language syntax, grammar, semantics and some world knowledge.<\/p>\n<p>To continue to enhance accuracy, the next step was distillation from <a href=\"https:\/\/azure.microsoft.com\/en-us\/products\/phi\/?msockid=2e79684194f567391a487daf954f66a4\" target=\"_blank\" rel=\"noopener\">Microsoft\u2019s Phi models<\/a>. By capturing some of the Phi models\u2019 knowledge, Mu achieves remarkable parameter efficiency. All of this yields a base model that is well-suited to a variety of tasks \u2013 but pairing it with task-specific data and additional fine-tuning through low-rank adaptation (LoRA) methods can dramatically improve its performance.<\/p>\n<p>We evaluated Mu\u2019s accuracy by fine-tuning on several tasks, including <a href=\"https:\/\/rajpurkar.github.io\/SQuAD-explorer\/\" target=\"_blank\" rel=\"noopener\">SQuAD<\/a>, <a href=\"https:\/\/microsoft.github.io\/CodeXGLUE\/\" target=\"_blank\" rel=\"noopener\">CodeXGlue<\/a>\u00a0and the Windows Settings agent (which we will talk more about later in this blog). 
For many tasks, the task-specific Mu achieves remarkable performance despite its micro-size of a few hundred million parameters.<\/p>\n<p>When comparing Mu to a similarly fine-tuned Phi-3.5-mini, we found that Mu is nearly comparable in performance despite being one-tenth of the size, while handling input contexts of tens of thousands of tokens and generating over a hundred output tokens per second.<\/p>\n<table>\n<tbody>\n<tr>\n<td width=\"33%\">Task \\ Model<\/td>\n<td width=\"33%\">Fine-tuned Mu<\/td>\n<td width=\"33%\">Fine-tuned Phi<\/td>\n<\/tr>\n<tr>\n<td width=\"33%\">SQuAD<\/td>\n<td width=\"33%\">0.692<\/td>\n<td width=\"33%\">0.846<\/td>\n<\/tr>\n<tr>\n<td width=\"33%\">CodeXGlue<\/td>\n<td width=\"33%\">0.934<\/td>\n<td width=\"33%\">0.930<\/td>\n<\/tr>\n<tr>\n<td width=\"33%\">Settings Agent<\/td>\n<td width=\"33%\">0.738<\/td>\n<td width=\"33%\">0.815<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>Model quantization and model optimization<\/strong><\/p>\n<p>To enable the Mu model to run efficiently on-device, we applied advanced model quantization techniques tailored to NPUs on Copilot+ PCs.<\/p>\n<p>We used Post-Training Quantization (PTQ) to convert the model weights and activations from floating point to integer representations \u2013 primarily 8-bit and 16-bit. PTQ allowed us to take a fully trained model and quantize it without requiring retraining, significantly accelerating our deployment timeline and optimizing for efficient execution on Copilot+ devices. Ultimately, this approach preserved model accuracy while drastically reducing memory footprint and compute requirements without impacting the user experience.<\/p>\n<p>Quantization was just one part of the optimization pipeline. We also collaborated closely with our silicon partners at AMD, Intel and Qualcomm to ensure that the quantized operations used when running Mu were fully optimized for the target NPUs. 
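<\/p>
<p>The post-training quantization idea described above can be sketched generically (a minimal symmetric per-tensor int8 example; the production pipeline is more sophisticated and also covers activations and 16-bit formats):<\/p>

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor PTQ sketch: map float weights onto int8
    with a single scale, chosen after training - no retraining needed."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

# Quantize an already-trained weight tensor after the fact.
w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
max_err = np.abs(dequantize_int8(q, scale) - w).max()  # bounded by scale / 2
```

<p>Storing int8 weights cuts memory four-fold versus float32, with a per-weight reconstruction error bounded by half the quantization scale.<\/p>
<p>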
This included tuning mathematical operators, aligning with hardware-specific execution patterns and validating performance across different silicon. These optimization steps result in highly efficient inference on edge devices, producing outputs at more than 200 tokens\/second on a Surface Laptop 7.<\/p>\n<p><a href=\"https:\/\/www.youtube.com\/watch?si=P1nIObhNUVckI7yl&amp;v=A2geTQes0Pw&amp;feature=youtu.be\" target=\"_blank\" rel=\"noopener\">Watch the video demo<\/a><\/p>\n<p style=\"text-align: center;\">Mu running a question-answering task on an edge device, using context sourced from Wikipedia (https:\/\/en.wikipedia.org\/wiki\/Microsoft)<\/p>\n<p>Notice the fast token throughput and ultra-fast time-to-first-token responses despite the large amount of input context provided to the model.<\/p>\n<p>By pairing state-of-the-art quantization techniques with hardware-specific optimizations, we ensured that Mu is highly effective for real-world deployment in resource-constrained applications. In the next section, we go into detail on how Mu was fine-tuned and applied to build the new Windows agent in Settings on Copilot+ PCs.<\/p>\n<p><strong>Model tuning the agent in Settings<\/strong><\/p>\n<p>To improve Windows\u2019 ease of use, we focused on addressing the challenge of changing hundreds of system settings. Our goal was to create an AI-powered agent within Settings that understands natural language and changes relevant, undoable settings seamlessly. 
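<\/p>
<p>To make the goal concrete, the mapping can be pictured as emitting a structured function call for a query. Everything below is hypothetical: the real Settings schema and function names are not public, and a simple lookup stands in for the model itself:<\/p>

```python
# Hypothetical illustration of query-to-function-call mapping; the function
# names and schema here are invented, and the lookup table stands in for the
# fine-tuned model that actually produces the structured call.
from typing import Optional

EXAMPLES = {
    "turn on night light": {"function": "set_night_light", "args": {"enabled": True}},
    "make my text bigger": {"function": "set_text_scale", "args": {"percent": 125}},
}

def map_query(query: str) -> Optional[dict]:
    # Stand-in for Mu: return a structured, executable Settings call, if any.
    return EXAMPLES.get(query.lower().strip())
```

<p>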
We aimed to integrate this agent into the existing search box for a smooth user experience, requiring ultra-low latency across numerous possible settings. After testing various models, a Phi LoRA fine-tune initially met precision goals but was too large to meet latency targets. Mu had the right characteristics, but required task-specific tuning for optimal performance in Windows Settings.<\/p>\n<p>While baseline Mu excelled in this scenario in terms of performance and power footprint, it incurred a 2x precision drop on the same data without any fine-tuning.\u202f To close the gap, we scaled training to 3.6M samples (1300x) and expanded from roughly 50 settings to hundreds of settings. By employing synthetic approaches for automated labelling, prompt tuning with metadata, diverse phrasing, noise injection and smart sampling, the Mu fine-tune used for the Settings agent successfully met our quality objectives. The Mu fine-tune achieved response times of under 500 milliseconds, aligning with our goals for a responsive and reliable agent in Settings that scaled to hundreds of settings. The image below shows the integrated experience, with an example of a natural language query being mapped to a Settings action surfaced in the UI.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-179778 size-large\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/image-1024x585.png\" alt=\"Screenshot demonstrating the agent in Settings.\" width=\"1024\" height=\"585\"\/>Screenshot demonstrating the agent in Settings<\/p>\n<p>To further address the challenge of short and ambiguous user queries, we curated a diverse evaluation set combining real user inputs, synthetic queries and common settings, ensuring the model could handle a wide range of scenarios effectively. 
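<\/p>
<p>One illustrative way to gate such queries, in the spirit of the search-box integration described in this section (our own reading, not the shipped logic), is a simple length check before invoking the agent:<\/p>

```python
# Illustrative routing heuristic (not the shipped logic): short queries keep
# classic search results, while multi-word queries invoke the agent for a
# high-precision actionable response.

def route(query: str) -> str:
    words = query.strip().split()
    if len(words) >= 2:
        return "agent"    # enough context for a precise, actionable answer
    return "search"       # fall back to lexical and semantic search results
```

<p>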
We observed that the model performed best on multi-word queries that conveyed clear intent, as opposed to short or partial-word inputs, which often lack sufficient context for accurate interpretation. To address this gap, the agent in Settings is integrated into the Settings search box, enabling short queries that don\u2019t meet the multi-word threshold to continue to surface lexical and semantic search results in the search box, while allowing multi-word queries to surface the agent to return high precision actionable responses.\u202f\u00a0<\/p>\n<p>Managing the extensive array of Windows settings posed its own challenges, particularly with overlapping functionalities. For instance, even a simple query like \u201cIncrease brightness\u201d could refer to multiple settings changes \u2013 if a user has dual monitors, does that mean increasing brightness to the primary monitor or a secondary monitor?<\/p>\n<p>To address this, we refined our training data to prioritize the most used settings as we continue to refine the experience for more complex tasks.<\/p>\n<p><strong>What\u2019s ahead<\/strong><\/p>\n<p>We welcome feedback from users in the Windows Insiders program as we continue to refine the experience for the agent in Settings.<\/p>\n<p>As we\u2019ve shared in our previous blogs, these breakthroughs wouldn\u2019t be possible without the support of efforts from the Applied Science Group and our partner teams in WAIIA and WinData that contributed to this work, including: Adrian Bazaga, Archana Ramesh, Carol Ke, Chad Voegele, Cong Li, Daniel Rings, David Kolb, Eric Carter, Eric Sommerlade, Ivan Razumenic, Jana Shen, John Jansen, Joshua Elsdon, Karthik Sudandraprakash, Karthik Vijayan, Kevin Zhang, Leon Xu, Madhvi Mishra, Mathew Salvaris, Milos Petkovic, Patrick Derks, Prateek Punj, Rui Liu, Sunando Sengupta, Tamara Turnadzic, Teo Sarkic, Tingyuan Cui, Xiaoyan Hu, Yuchao Dai.<\/p>\n","protected":false},"excerpt":{"rendered":"We are excited to introduce our newest 
on-device small language model, Mu. This model addresses scenarios that require&hellip;\n","protected":false},"author":2,"featured_media":209345,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3163],"tags":[323,1942,53,16,15],"class_list":{"0":"post-209344","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-technology","11":"tag-uk","12":"tag-united-kingdom"},"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/209344","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=209344"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/209344\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/209345"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=209344"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=209344"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=209344"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}