{"id":481852,"date":"2026-05-13T02:28:18","date_gmt":"2026-05-13T02:28:18","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/481852\/"},"modified":"2026-05-13T02:28:18","modified_gmt":"2026-05-13T02:28:18","slug":"defense-at-ai-speed-microsofts-new-multi-model-agentic-security-system-tops-leading-industry-benchmark","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/481852\/","title":{"rendered":"Defense at AI speed: Microsoft\u2019s new multi-model agentic security system tops leading industry benchmark"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Today Microsoft announced a major step forward in AI-powered cyber defense: our new agentic security system helped researchers find 16 new vulnerabilities across the Windows networking and authentication stack\u2014including four Critical remote code execution flaws in components such as the Windows kernel TCP\/IP stack and the IKEv2 service. They used the new Microsoft Security <strong>m<\/strong>ulti-mo<strong>d<\/strong>el <strong>a<\/strong>gentic <strong>s<\/strong>canning <strong>h<\/strong>arness (codename MDASH) which was built by Microsoft\u2019s Autonomous Code Security team. Unlike single-model approaches, the harness orchestrates more than 100 specialized AI agents across an ensemble of frontier and distilled models to discover, debate, and prove exploitable bugs end-to-end.<\/p>\n<p class=\"wp-block-paragraph\">The results speak for themselves: 21 of 21 planted vulnerabilities found with zero false positives on a private test driver; 96% recall against five years of confirmed Microsoft Security Response Center (MSRC) cases in clfs.sys and 100% in tcpip.sys; and an industry-leading 88.45% score on the public CyberGym benchmark of 1,507 real-world vulnerabilities\u2014the top score on the leaderboard, roughly five points ahead of the next entry. 
<\/p>\n<p class=\"wp-block-paragraph\">The strategic implication is clear: AI vulnerability discovery has crossed from research curiosity into production-grade defense at enterprise scale, and the durable advantage lies in the agentic system around the model rather than any single model itself. Codename MDASH is being used by Microsoft security engineering teams and tested by a small set of customers as part of a limited private preview.<\/p>\n<p class=\"wp-block-paragraph\">This post explains how\u00a0codename MDASH\u00a0works, what we shipped today, what we learned along the way,\u00a0and how you can sign up for the\u00a0private\u00a0preview.\u00a0\u00a0<\/p>\n<p>AI-powered\u00a0vulnerability discovery at hyper-scale<\/p>\n<p class=\"wp-block-paragraph\">The Microsoft <strong>Autonomous Code Security (ACS)<\/strong> team was assembled to take AI-powered vulnerability research from a research curiosity to production engineering at enterprise scale. Several members of this team came to Microsoft from Team Atlanta, the team that won the $20 million DARPA AI Cyber Challenge by building an autonomous cyber-reasoning system that found and patched real bugs in complex open-source projects. 
The lessons from that work, especially the level of engineering required to make the frontier language models perform professional-level security auditing, are what our new multi-model agentic scanning harness (codename MDASH) is built around.<\/p>\n<p class=\"wp-block-paragraph\">Microsoft\u2019s code base\u00a0is\u00a0challenging for security auditing for a few reasons:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Massive proprietary surface.<\/strong>\u00a0Windows, Hyper-V, Azure, and the device-driver and service ecosystems around them are private Microsoft codebases\u2014not part of any commodity\u00a0language model\u2019s\u00a0training corpus, and genuinely hard to reason about: kernel calling conventions, IRP and lock invariants, IPC trust boundaries, and component-internal idioms do not yield to pattern matching.\u00a0On this surface, a model has to actually reason.\u00a0<\/li>\n<\/ul>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>DevSecOps\u00a0at scale.<\/strong>\u00a0Every finding has a real owner, a triage process, and a Patch Tuesday to land on. There is no quiet drawer for speculative findings; if a tool produces noise, the noise is everyone\u2019s problem.\u00a0<\/li>\n<\/ul>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>High-value targets.<\/strong>\u00a0Windows, Hyper-V, Xbox, and Azure serve billions of users. The payoff for finding a single hard bug is unusually high\u2014and so is the cost of a false positive in a tier-one\u00a0component.\u00a0<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The findings in this post are the result of close collaboration between <strong>ACS <\/strong>and <strong>Microsoft Windows Attack Research and Protection (WARP)<\/strong>. WARP owns the deep, hard end of Windows offensive research; ACS brings the AI-powered discovery and validation pipeline. 
Together, the teams built a mature harness.<\/p>\n<p>Codename: MDASH\u2014Microsoft Security\u2019s new multi-model agentic scanning harness<\/p>\n<p class=\"wp-block-paragraph\">Codename MDASH is, at its core, an <strong>agentic vulnerability discovery and remediation system<\/strong>. The model is one input. The system is the product.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"900\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2026\/05\/Picture2.svg\" alt=\"Diagram of an automated code security workflow showing stages from repository analysis and code scanning to bug triage, proof-of-concept generation, and automated patch creation and validation.\" class=\"wp-image-147319\"\/><\/p>\n<p class=\"wp-block-paragraph\">A useful mental model is a structured pipeline that takes a code base and emits validated, proven findings:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Prepare\u00a0stage<\/strong>:\u00a0Ingests the source target, builds language-aware indices, and then maps the attack surface and threat models by analyzing past commits.\u00a0<\/li>\n<li class=\"wp-block-list-item\"><strong>Scan\u00a0stage<\/strong>:\u00a0Runs\u00a0specialized auditor agents over candidate code paths, emitting candidate findings with hypotheses and evidence.\u00a0<\/li>\n<li class=\"wp-block-list-item\"><strong>Validate\u00a0stage<\/strong>:\u00a0Runs a second cohort of agents\u2014debaters\u2014that argue for and against each finding\u2019s reachability and exploitability.\u00a0<\/li>\n<li 
class=\"wp-block-list-item\"><strong>Dedup\u00a0stage<\/strong>:\u00a0Collapses\u00a0semantically equivalent findings\u00a0(for example, patch-based grouping).\u00a0<\/li>\n<li class=\"wp-block-list-item\"><strong>Prove\u00a0stage<\/strong>:\u00a0Constructs and executes triggering inputs where the bug class admits it. The prove stage\u00a0validates\u00a0the pre-condition dynamically and formulates the bug-triggering inputs\u00a0to prove existence of vulnerability (for example,\u00a0ASan\u00a0in C\/C++).\u00a0<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Three properties make this work\u00a0in\u00a0practice:\u00a0\u00a0<\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>An ensemble of diverse models that are effectively managed by codename MDASH<\/strong>. No single model is best at every stage. The multi-model agentic scanning harness runs a configurable panel of models. That includes SOTA models as the heavy reasoner, <strong>distilled models<\/strong> as a cost-effective debater for high-volume passes, and a <strong>second separate SOTA model<\/strong> as an independent counterpoint. Disagreement between models is itself a signal: when an auditor flags something as suspect and the debater can\u2019t refute it, that finding\u2019s posterior credibility goes up.<\/li>\n<\/ol>\n<ol start=\"2\" class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Specialized agents<\/strong>. An auditor does not reason like a debater, which does not reason like a prover. Each pipeline stage has its own role, prompt regime, tools, and stop criteria. We don\u2019t expect one prompt to do everything; we don\u2019t expect one agent to recognize, validate, and exploit a bug in a single pass. 
Codename MDASH has more than 100 specialized agents, each constructed through deep research on past common vulnerabilities and exposures (CVEs) and their patches. The agents work independently to discover bugs, and their audit results are combined into a single report.<\/li>\n<\/ol>\n<ol start=\"3\" class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>End-to-end pipeline with extensible plugins.<\/strong>\u00a0The pipeline is opinionated, but it is not closed. Plugins let domain experts inject context the foundation models can\u2019t see\u00a0on their own\u2014kernel calling conventions, IRP rules, lock invariants, IPC trust boundaries, codec state machines. The CLFS proving plugin we describe below is one such example: a domain plugin that knows how to construct a triggering log file given a candidate finding. Similarly, the Windows team extended the reasoning with a custom code analysis database, and a CodeQL database can also be leveraged.\u00a0<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">The payoff for this architecture is\u00a0<strong>portability across model generations<\/strong>. The pipeline\u2019s targeting, validation,\u00a0dedup, and prove stages are model-agnostic by construction,\u00a0which allows the\u00a0harness to get\u00a0the best of what any model has to offer. When a new model lands, A\/B testing it against the current panel is one configuration flip. When a model improves, the customer\u2019s prior investment\u2014scope files, plugins, configurations,\u00a0calibrations\u2014all carries over,\u00a0allowing customers to ride the frontier of security value.\u00a0\u00a0<\/p>\n<p>Using\u00a0codename\u00a0MDASH\u00a0for security research<\/p>\n<p class=\"wp-block-paragraph\">To evaluate the bug-finding capabilities of the multi-model agentic scanning harness, you first need to ground the evaluation on code that no model has seen. 
This eliminates the possibility that a model \u201clearned the answers to the test.\u201d We scanned StorageDrive, a sample device driver used in Microsoft interviews for offensive security researchers. The driver contains 21 deliberately injected vulnerabilities, including kernel use-after-frees (UAFs), integer handling issues, IOCTL validation gaps, and locking errors. Because StorageDrive is a private codebase that has never been published, we can safely assume it was not included in the training data of modern language models.<\/p>\n<p class=\"wp-block-paragraph\">We ran the harness on StorageDrive using its default configuration. The results were striking: all 21 ground-truth vulnerabilities were correctly identified, with zero false positives in this run.<\/p>\n<p class=\"wp-block-paragraph\">This simple test shows that the reasoning and vulnerability discovery capabilities of codename MDASH can approximate those of professional offensive researchers.<\/p>\n<p class=\"wp-block-paragraph\">We then used the harness to audit the most security-critical part of Windows, namely the TCP\/IP network stack.<\/p>\n<p>The May 12, 2026 Patch Tuesday cohort<\/p>\n<p class=\"wp-block-paragraph\">Across the Windows network stack and adjacent services, today\u2019s Patch Tuesday includes 16 CVEs our engineering teams\u00a0found using codename MDASH.<\/p>\n<tr>\n<th>Component<\/th>\n<th>Description<\/th>\n<th>CVE<\/th>\n<th>Severity<\/th>\n<th>Type<\/th>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys<\/strong><\/td>\n<td>Remote\u00a0unauth\u00a0<br \/>SSRR IPv4 packets causing UAF\u00a0<\/td>\n<td>CVE-2026-33827\u00a0<\/td>\n<td>Critical\u00a0<\/td>\n<td>Remote Code Execution<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>NULL\u00a0deref\u00a0via crafted IPv6 extension headers<\/td>\n<td>CVE-2026-40413\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Denial of Service\u00a0(DoS)<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>Kernel DoS via ESP SA refcount 
underflow<\/td>\n<td>CVE-2026-40405\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Denial of Service\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>ikeext.dll\u00a0<\/strong><\/td>\n<td>Unauth IKEv2 SA_INIT double-free triggers LocalSystem RCE<\/td>\n<td>CVE-2026-33824\u00a0<\/td>\n<td>Critical\u00a0<\/td>\n<td>Remote Code Execution\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>Use-after-free in Ipv4pReassembleDatagram leading to disclosure\u00a0<\/td>\n<td>CVE-2026-40406\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Information Disclosure\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>IPsec cross-SA fragment splicing via reassembly\u00a0<\/td>\n<td>CVE-2026-35422\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Security Feature Bypass\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>Unauthenticated local Windows Filtering Platform (WFP) RPC disables name cache\u00a0<\/td>\n<td>CVE-2026-32209\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Security Feature Bypass\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>ikeext.dll\u00a0<\/strong><\/td>\n<td>Memory leak\u00a0<\/td>\n<td>CVE-2026-35424\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Denial of Service\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>telnet.exe\u00a0\u00a0<\/strong><\/td>\n<td>Out-of-bounds (OOB) read in FProcessSB via malformed TO_AUTH<\/td>\n<td>CVE-2026-35423\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Information Disclosure\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>IPv6+TCP MDL-split packet triggers NULL deref<\/td>\n<td>CVE-2026-40414\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Denial of Service\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>ICMPv6 packet triggers\u00a0NdisGetDataBuffer\u00a0NULL\u00a0<br \/>deref\u00a0<\/td>\n<td>CVE-2026-40401\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Denial of 
Service\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>Pre-auth remote UAF via SA double-decrement<\/td>\n<td>CVE-2026-40415\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Remote Code Execution\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>http.sys\u00a0<\/strong><\/td>\n<td>Unauth remote QUIC control-stream OOB read<\/td>\n<td>CVE-2026-33096\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Denial of Service\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>tcpip.sys\u00a0<\/strong><\/td>\n<td>Kernel stack buffer overflow via RPC blob<\/td>\n<td>CVE-2026-40399\u00a0<\/td>\n<td>Important\u00a0<\/td>\n<td>Elevation of Privilege\u00a0<\/td>\n<\/tr>\n<tr>\n<td><strong>netlogon.dll\u00a0<\/strong><\/td>\n<td>Unauthenticated CLDAP User= filter stack overflow<\/td>\n<td>CVE-2026-41089\u00a0<\/td>\n<td>Critical\u00a0<\/td>\n<td>Remote Code Execution\u00a0<\/td>\n<\/tr>\n<tr>\n<td>dnsapi.dll<\/td>\n<td>Crafted UDP DNS response triggers heap OOB<\/td>\n<td>CVE-2026-41096\u00a0<\/td>\n<td>Critical\u00a0<\/td>\n<td>Remote Code Execution\u00a0<\/td>\n<\/tr>\n<p class=\"wp-block-paragraph\">These vulnerabilities are 10 kernel-mode \/ 6 usermode. The majority are reachable from a network position with no credentials. Let\u2019s take a closer look.<\/p>\n<p>Two deep dives<\/p>\n<p class=\"wp-block-paragraph\">The two findings below are characteristic of what the new Microsoft Security <strong>m<\/strong>ulti-mo<strong>d<\/strong>el\u00a0<strong>a<\/strong>gentic\u00a0<strong>s<\/strong>canning\u00a0<strong>h<\/strong>arness pipeline can do that a single model harness cannot. The first is a kernel race-condition use-after-free that requires reasoning about object lifetime across non-trivial control flow and three independent concurrent free paths. 
The second is a pointer-aliasing double-free that spans six source files and is only visible when contrasted with a correctly handled site elsewhere in the same code base.<\/p>\n<p>CVE-2026-33827\u2014Remote unauthenticated UAF in tcpip.sys via SSRR<\/p>\n<p class=\"wp-block-paragraph\">The vulnerability arises in the Windows IPv4 receive path due to improper lifetime management of a reference-counted Path object within Ipv4pReceiveRoutingHeader. After invoking a routing lookup, the function drops its sole owned reference to the Path through a dereference operation, but later reuses the same pointer when handling the Strict Source and Record Route (SSRR) option. Because the object\u2019s reference count might reach zero at the earlier release point, the underlying memory can be returned to a per-processor lookaside allocator and subsequently reused, turning the later access into a classic use-after-free in kernel context.<\/p>\n<p class=\"wp-block-paragraph\">This occurs on a network-triggerable path that processes attacker-controlled packet metadata, making it reachable at elevated IRQL within the networking stack. The core issue is exacerbated by the concurrency model of the path cache and associated cleanup routines. Once the caller relinquishes ownership, the Path object\u2019s liveness depends entirely on external references held by shared data structures. Multiple independent subsystems\u2014including the path-cache scavenger, explicit flush routines, and interface state-driven garbage collection\u2014can concurrently remove the object and drop the final reference. These operations are not synchronized with the receive-side execution window in this function, and no lock is held to serialize access. 
As a result, on SMP systems the freed object can be reclaimed and overwritten before the subsequent dereference, converting a simple ordering bug into a race-driven use-after-free with real execution feasibility.<\/p>\n<p class=\"wp-block-paragraph\">From an exploitation standpoint, the vulnerability is reachable by a remote, unauthenticated attacker through crafted IPv4 packets carrying the SSRR option that pass standard validation checks. The stale pointer dereference can trigger a chain of access through freed memory, potentially leading to controlled reads and a stronger corruption primitive if the reclaimed allocation is attacker-influenced. Although exploitation requires winning a narrow timing window and shaping allocator reuse, the combination of remote reachability, kernel execution context, and the potential for controlled memory manipulation elevates the issue to Critical severity.<\/p>\n<p><strong>Why\u00a0single-model\u00a0systems\u00a0missed\u00a0this\u00a0bug<\/strong><\/p>\n<p class=\"wp-block-paragraph\">A single-model harness tends to miss this bug because the lifetime violation is not locally visible even within the same function. The release of the Path reference and its later reuse are separated by non-trivial control flow\u2014an alternate branch, multiple validation checks, and several early-drop conditions\u2014which break the straightforward \u201crelease-then-use\u201d pattern most detectors rely on. Without tracking reference ownership across these intermediate states, the model sees two independent operations rather than a temporal dependency. As a result, the dereference does not look suspicious in isolation, even though the reference count semantics mean the pointer might already be invalid.<\/p>\n<p class=\"wp-block-paragraph\">The decisive signal also lives outside the immediate context. The same logical operation appears elsewhere in the correct order: all needed data is derived from the object before the reference is dropped. 
This makes this call-site an inconsistency rather than an obvious misuse.<\/p>\n<p class=\"wp-block-paragraph\">Detecting that requires cross-file reasoning: identifying analogous patterns, aligning their intent, and noticing the deviation. On top of that, reachability depends on composing multiple conditions\u2014an input that sets the SSRR flag, default configuration that allows the path, and concurrent subsystems that can reclaim the object during the exposed window. A single-shot analysis collapses these steps and loses the interaction between them, whereas a staged approach can connect the ownership violation, the concurrency model, and the externally controlled trigger into a coherent exploitation path.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Disclosure<\/strong>.\u00a0CVE-2026-33827, patched in\u00a0April\u00a0Patch Tuesday.\u00a0<\/p>\n<p>CVE-2026-33824: Unauthenticated IKEv2 SA_INIT + fragmentation \u2192 double-free \u2192 LocalSystem RCE<\/p>\n<p class=\"wp-block-paragraph\">The vulnerability lived in the IKEEXT service, the Windows component responsible for IKE and AuthIP keying for IPsec, and was reachable by a remote, unauthenticated attacker over UDP\/500 on any host configured as an IKEv2 responder (RRAS VPN, DirectAccess, Always-On VPN infrastructure, or any machine with an inbound connection security rule). By sending a crafted IKE_SA_INIT carrying Microsoft\u2019s \u201cIPsec Security Realm Id\u201d vendor-ID payload, followed by a single IKEv2 fragment (RFC 7383 SKF) that reassembles immediately, an attacker could trigger a deterministic double-free of a 16-byte heap allocation inside the service. <\/p>\n<p class=\"wp-block-paragraph\">Because IKEEXT runs as LocalSystem inside svchost.exe, this represents a pre-authentication remote code execution path into one of the highest-privilege contexts on the system. The root cause is a textbook ownership bug. 
When IKEEXT reinjects a reassembled fragment back through its receive pipeline, it duplicates the packet\u2019s receive context with a flat memcpy. This is a shallow copy: it clones the struct\u2019s bytes but not the heap allocations it points to. One of those allocations is the attacker-supplied security-realm identifier. After the copy, both the queued context and the live Main Mode SA hold the same pointer, and both believe they own it. <\/p>\n<p class=\"wp-block-paragraph\">On teardown, each one frees it, resulting in a double-free. The trigger sequence is two UDP packets, with no race and no special timing. A double-free of a fixed-size heap chunk is a well-understood corruption primitive in modern Windows; we are not publishing further exploitation details. Reachability requires that the host has an IKEv2 responder policy that accepts the proposed transforms\u2014the bug is reachable on RRAS VPN, DirectAccess, Always-On VPN, and IPsec connection security rules in their typical configurations, but a bare Start-Service IKEEXT with no responder policy is not vulnerable. The IKEEXT service is DEMAND_START by default; where responder policy exists, the Base Filtering Engine (BFE) will start it on the first inbound IKE packet, so the attacker does not need IKEEXT to already be running.<\/p>\n<p><strong>Why\u00a0single-model\u00a0systems\u00a0missed\u00a0this\u00a0bug<\/strong><\/p>\n<p class=\"wp-block-paragraph\">This is an aliasing lifecycle bug spanning six files: ike_A.c (the bad memcpy), ike_B.c (the alias origin and the first stack-local copy), ike_C.c (the wrong free), ike_D.c (both the right pattern and the second free), ike_E.c (where the buffer gets populated remotely), and ike_F.c (the IKEv2 dispatcher and the UAF read site that precedes the second free). No single-file analysis sees it. 
The strongest piece of evidence that the bug is real is the correct version of the same pattern, which appears in the same code base in ike_D.c, immediately after the memcpy of the selector. Catching this requires the auditor to recognize the missing step at one site by reference to the present step at another. Our specialized auditor agents are designed to surface exactly these comparisons; the debate stage forces them to stand up under cross-examination.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Disclosure.<\/strong>\u00a0CVE-2026-33824, patched in\u00a0April\u00a0Patch Tuesday.\u00a0<\/p>\n<p>How capable is\u00a0codename\u00a0MDASH?<\/p>\n<p class=\"wp-block-paragraph\">The Patch Tuesday cohort and the StorageDrive test are forward-looking signals. Two retrospective benchmarks tell us how the system performs against ground truth on real, well-reviewed code.\u00a0<\/p>\n<p class=\"wp-block-paragraph\"><strong>Recall on historical MSRC cases.<\/strong>\u00a0We ran\u00a0codename MDASH\u00a0against\u00a0pre-patch snapshots of two heavily reviewed Windows components and measured whether the historical MSRC-confirmed bugs would have been rediscovered:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">clfs.sys: <strong>96% recall on\u00a028\u00a0MSRC cases<\/strong>\u00a0spanning\u00a0five\u00a0years.\u00a0<\/li>\n<li class=\"wp-block-list-item\">tcpip.sys:\u00a0<strong>100% recall on 7 MSRC cases\u00a0<\/strong>spanning\u00a0five\u00a0years.\u00a0<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">These are the strongest internal numbers we publish, and they are meaningful for a specific reason: the MSRC case database is the ground truth for what real attackers exploited, what required a Patch Tuesday, and what defenders had to react to. 
A system that recovers 96% of a five-year MSRC backlog in a\u00a0heavily reviewed\u00a0kernel\u00a0component\u00a0is not finding theoretical weaknesses; it is finding the bugs that mattered.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">We are deliberate about what these numbers do and do not claim. They are\u00a0retrospective recall\u00a0benchmarks on internal code with a finite case count. They tell us that the system would have been useful had it existed at the time. They do not, by themselves, predict that the next 38 bugs in CLFS will be found at the same rate. The forward-looking signal is the Patch Tuesday\u00a0cohort itself.\u00a0<\/p>\n<p class=\"wp-block-paragraph\"><strong>The CLFS proving extension as a worked example<\/strong>. The 96% CLFS recall number is in part a story about the prove stage. Many CLFS findings look interesting until you try to construct a triggering log file; a candidate finding without a proof is, in practice, an entry on a triage backlog. The CLFS-specific proving plugin we wrote knows how to construct triggering logs given a candidate finding: it understands the on-disk container layout, the block-validation sequence, and the in-memory state machine well enough to drive a candidate path to its sink. This is precisely what plugin extensibility is for: the foundation models do not, and should not be expected to, internalize Microsoft-specific filesystem invariants. The plugin embeds them, the model uses them, and the outcome is bugs that survive being proven, not bugs that get filed and forgotten.<\/p>\n<p class=\"wp-block-paragraph\"><strong>CyberGym<\/strong>. On the public CyberGym benchmark\u2014a corpus of 1,507 real-world vulnerability reproduction tasks drawn from across 188 OSS-Fuzz projects\u2014the Microsoft Security multi-model agentic scanning harness reaches an 88.45% success rate, the highest score on CyberGym\u2019s published leaderboard at the time of writing and roughly five points above the next entry, 83.1%. 
This result was obtained using generally available models. The strong results suggest that the surrounding agentic system contributes substantially to end-to-end performance, beyond raw model capability. For evaluation, we used CyberGym\u2019s default configuration (level 1), which provides the vulnerable source code and a high-level vulnerability description. To interface with CyberGym\u2019s evaluation protocol, we extended the harness\u2019s prove stage to autonomously submit proof-of-concept (PoC) inputs and retrieve flags.<\/p>\n<p class=\"wp-block-paragraph\">Our failure analysis of the remaining roughly 12% reveals two notable structural patterns. First, among findings that targeted the wrong code area, 82% came from tasks with vague descriptions that also lacked function or file identifiers, suggesting that description quality is a major factor in scan accuracy. Second, we found cases where the agent constructed libFuzzer-style inputs but the benchmark task actually required honggfuzz-format inputs, which caused otherwise sound reproductions to fail on a harness-format mismatch.<\/p>\n<p>What this\u00a0all\u00a0means<\/p>\n<p class=\"wp-block-paragraph\">We are at a moment in the industry where AI-powered vulnerability discovery stops being speculative and starts being an engineering problem. The findings in this Patch Tuesday and the retrospective recall on five years of CLFS MSRC cases are evidence that AI vulnerability discovery can scale.<\/p>\n<p class=\"wp-block-paragraph\">What we have learned building MDASH and using it across Microsoft is more portable: <strong>the harness does the work, and the model is one input<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">This matters in three concrete ways.<\/p>\n<p class=\"wp-block-paragraph\"><strong>First, discovery requires composition that no single prompt can achieve<\/strong>. The bugs in this post\u2014the tcpip.sys race, the ikeext.dll alias chain\u2014are not visible to a model handed a single function. 
They are visible to a system that can sequence cross-file pattern comparison, multi-step reachability analysis, debate between specialized agents, and end-to-end proof construction. Single-model harnesses undersold what models can do; over-trusted single agents overshoot what models can do reliably. The art is the harness around the model, and the harness is most of the engineering.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Second, validation is the difference between a finding and a fix<\/strong>. A scanner that flags candidate bugs is a scanner that produces a triage backlog. The Patch Tuesday cohort is what it is because the system that produced it does not stop at candidate\u2014it debates, dedups, and proves. Validation is not a checkbox; it is its own pipeline of agents and plugins, and it is where most of the day-over-day engineering ends up.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Third, the system absorbs model improvements<\/strong>, which is what makes it durable. When a new model lands, the targeting, debating, dedup, and proof stages do not need to be rewritten; we change a configuration and re-run an A\/B test. The customer\u2019s investment\u2014per-project context, scan plugins, proving agents\u2014carries over. This is the architectural property that matters most over time, because the model lottery is going to keep playing out, and any system whose value is gated on a particular model is a system that has to be rebuilt every six months.<\/p>\n<p class=\"wp-block-paragraph\">For defenders\u2014at any scale, on any code they own\u2014the implication is the same. The right question to ask of an AI vulnerability tool is not which model does it use? 
but what does it do <strong>with <\/strong>the model, and what survives when the next model arrives?<\/p>\n<p>Conclusion<\/p>\n<p class=\"wp-block-paragraph\">The Microsoft Security <strong>m<\/strong>ulti-mo<strong>d<\/strong>el <strong>a<\/strong>gentic <strong>s<\/strong>canning <strong>h<\/strong>arness (codename MDASH) is helping our engineering teams meaningfully improve security outcomes using generally available AI models\u2014today. It is also being tested by customers as part of our limited private preview. To join the private preview, please <a href=\"https:\/\/aka.ms\/AI-drivenScanningHarness\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">sign up here<\/a>. <\/p>\n<p class=\"wp-block-paragraph\">Many thanks to the teams\u00a0across Microsoft working to improve the security of our customers, including the\u00a0<strong>Autonomous Code Security<\/strong>\u00a0team and the <strong>Microsoft\u00a0Windows Attack Research &amp; Protection\u00a0(WARP)<\/strong>\u00a0whose work led to the findings in this post.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">We look forward to sharing more updates with customers and the industry as we work to make the world a safer place for all.\u00a0<\/p>\n","protected":false},"excerpt":{"rendered":"In this article Today Microsoft announced a major step forward in AI-powered cyber defense: our new agentic 
security&hellip;\n","protected":false},"author":2,"featured_media":481853,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[261],"tags":[291,289,290,18,19,17,82],"class_list":{"0":"post-481852","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-eire","12":"tag-ie","13":"tag-ireland","14":"tag-technology"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@ie\/116564912635208432","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/481852","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=481852"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/481852\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/481853"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=481852"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=481852"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=481852"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}