{"id":30143,"date":"2026-05-06T21:36:10","date_gmt":"2026-05-06T21:36:10","guid":{"rendered":"https:\/\/www.europesays.com\/ai\/30143\/"},"modified":"2026-05-06T21:36:10","modified_gmt":"2026-05-06T21:36:10","slug":"validating-agentic-behavior-when-correct-isnt-deterministic","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ai\/30143\/","title":{"rendered":"Validating agentic behavior when \u201ccorrect\u201d isn\u2019t deterministic"},"content":{"rendered":"<p>Modern software testing is built on a fragile assumption: correct behavior is repeatable. For deterministic code, that assumption mostly holds. But for autonomous agents like Github Copilot Coding Agent (aka Agent Mode), especially as we explore the frontiers of integrated \u201cComputer Use,\u201d that assumption breaks down almost immediately.\u00a0<\/p>\n<p>As agents move beyond simple code suggestions to interacting with real environments like UIs, browsers, and IDEs, correctness becomes multi-path. Loading screens can appear or disappear, timing shifts, and multiple valid action sequences can lead to the same result. Unless our GitHub Actions workflows are robust enough to account for this variability, it\u2019s common for an agent to succeed at a task while the test still fails\u2014a \u201cfalse negative\u201d that halts production.\u00a0<\/p>\n<p>This blog post explores how to move past brittle, step-by-step scripts and toward an independent \u201cTrust Layer\u201d for agentic validation. We will demonstrate a model that focuses on essential outcomes rather than rigid paths, providing a way to validate behavior that is explainable, lightweight, and ready for real-world CI pipelines.<\/p>\n<p>The challenges of agent-driven validation<\/p>\n<p>Imagine you\u2019re responsible for a GitHub Actions pipeline that relies on Copilot Agent Mode to validate real-world workflows. 
The agent could be leveraging Computer Use, navigating within a containerized cloud environment, to validate the workflow.

On Tuesday, the build is green. On Wednesday, the test fails, even though no code has changed.

Here's what happened: a minor network lag on the hosted runner caused a loading screen to persist for a few extra seconds. The agent waited, adapted, and completed the task correctly. But your CI pipeline still flagged the run as a failure, not because the task failed, but because the execution path no longer matched the recorded script or assertion timing.

The agent didn't fail. The validation did.

This surfaces three recurring pain points that create a "trust gap" in agent-driven testing:

- False negatives: The task succeeded, but the test runner could not tolerate variation.
- Fragile infrastructure: Tests fail due to timing, rendering, or environmental noise unrelated to correctness.
- The compliance trap: The outcome may be correct, but a regression is flagged because the agent's behavior diverges from what the automated test expected.

We're in a transition period where agentic systems like GitHub Copilot Coding Agent are enabling faster development, but our traditional validation approaches remain rigid. In deterministic software, correctness is as simple as matching a specific input to a known output. With agents, the process in between is intentionally non-deterministic. As agents are increasingly deployed in production, correctness isn't about following a prescribed set of steps; it's about reliably achieving the essential outcomes.

To scale these systems, we need a validation framework that can distinguish between incidental noise (e.g., a loading screen) and critical failures (e.g., failing to save data).
Correctness shifts from "did this happen?" to "what had to happen for success to be real?"

Why existing testing approaches break down for autonomous agents

Traditional testing tools work well when execution paths are fixed. They struggle when behavior branches; the tools begin to fracture, not because they're poorly engineered, but because they assume a stable sequence.

When we apply them to a Copilot Coding Agent, including one navigating a containerized environment, the limitations become clear across four common paradigms:

- Assertion-based testing: Requires manual, labor-intensive specifications for every check and fails to account for valid alternative execution paths.
- Record-and-replay tools: Highly sensitive to environmental noise; minor rendering differences or timing variations often trigger false failures.
- Visual regression testing: Compares screenshots in isolation, without understanding the broader execution flow or semantic meaning.
- ML oracles: These "black boxes" require thousands of training examples and offer no explainability when they flag a behavior as incorrect.

While these approaches differ in implementation, they share a common structural assumption: correctness is defined by adherence to a particular sequence of observable states.

For agentic systems, that assumption breaks down. To build real developer trust in these systems, including GitHub Copilot, we must move beyond checking linear scripts and start validating structured behaviors.

Reframing correctness: Essential vs. optional behavior

To move past brittle tests and build the Trust Layer, we have to fundamentally change how we define "correct." In agentic systems, correct executions don't have to look identical.
They do need to share a common logical structure.

The conceptual shift

Think of a Computer Use-enabled GitHub Copilot Coding Agent performing a search in VS Code in a containerized cloud environment. In one run, a loading screen appears for several seconds; in another, the UI loads instantly (shown below).

[Figure: A Computer Use-enabled GitHub Copilot Coding Agent performing a search in VS Code in a containerized cloud environment. Scenario: Opening VS Code]

A traditional test sees these as two different results. But to a developer, the loading screen is incidental; it doesn't change whether the task was successful.

We can classify agent behavior into three categories:

- Essential states: Milestones that must occur for success to be real, such as reaching the "Search Results" screen.
- Optional variations: Incidental states, such as loading spinners or decorative UI changes, that vary with the environment.
- Convergent paths: Different sequences of steps (like using a hotkey vs. a menu) that ultimately rejoin at the same outcome.

A loading screen may appear or not. But search results must appear.
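One way to make this classification concrete in code. This is only an illustrative sketch; the state names, labels, and types below are hypothetical and not part of the framework's API:

```python
from dataclasses import dataclass
from enum import Enum, auto

class StateKind(Enum):
    ESSENTIAL = auto()   # must occur for success to be real
    OPTIONAL = auto()    # incidental; may or may not appear
    # Convergent paths are a property of the graph's edges,
    # not of individual states, so they are not modeled here.

@dataclass(frozen=True)
class ObservedState:
    name: str
    kind: StateKind

trace = [
    ObservedState("Desktop", StateKind.ESSENTIAL),
    ObservedState("Loading screen", StateKind.OPTIONAL),
    ObservedState("Search Dialog", StateKind.ESSENTIAL),
    ObservedState("Search Results", StateKind.ESSENTIAL),
]

# Only essential states participate in the correctness check.
milestones = [s.name for s in trace if s.kind is StateKind.ESSENTIAL]
print(milestones)  # ['Desktop', 'Search Dialog', 'Search Results']
```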
Only one of these determines correctness.

From intuition to theory: Dominator analysis

The distinction between "must-have" and "incidental" behaviors is rooted in a concept from compiler theory: dominator relationships.

In a control-flow graph, a node A "dominates" a node B if every path from the start to B must go through A.

By applying dominator analysis to agent execution traces, we can automatically identify:

- Which states are mandatory
- Which states are optional
- Where different paths converge

This lets us extract a minimal, explainable definition of correctness.

Modeling executions as graphs, not scripts

To capture the complexity of agentic behavior, we must stop treating executions as linear, one-dimensional scripts. Instead, our framework models behavior using a graph-based structure known as a Prefix Tree Acceptor (PTA).

From linear traces to structured graphs

In this model, an execution is not a series of commands but a directed graph where:

- Nodes represent observable states, such as screenshots for UI agents or code snapshots for development agents.
- Edges represent transitions, capturing the actions (clicks, keystrokes, or API calls) taken to move between states.

Why graphs matter

Treating executions as graphs allows us to represent branching and convergence, concepts that are impossible to capture in a linear script.

- Branching accounts for non-deterministic environment changes, like the presence or absence of a loading screen.
- Convergence identifies where these different paths rejoin, signaling that the agent has successfully navigated a variation and returned to the primary task flow.

By shifting the representation from a sequence of steps to a structured behavior model, we stop penalizing agents for taking a different
path and start validating whether they followed a logically sound one.

How we solve it: A structural approach to correctness

To move agents from experimental demos to production-grade infrastructure, our team developed a novel validation algorithm that moves away from rigid scripts and instead learns by example. To test this, we focused on a complex, non-deterministic environment: an AI agent navigating Visual Studio Code via Computer Use. By observing just 2–10 successful sessions, our algorithm automatically constructs a "ground truth" model that distinguishes an agent's valid variations from actual failures.

The workflow: From traces to a "master" graph

1. Capture (PTA construction): We collected 2–10 successful execution traces and converted them into Prefix Tree Acceptors (PTAs), directed graphs where nodes represent observable UI states and edges represent actions.
2. Generalize (semantic merging): Our algorithm merged these traces into a unified graph. It employed a three-tiered equivalence detection framework, combining fast visual metrics with LLM semantic analysis, to decide whether two states are logically equivalent, such as ignoring a timestamp change while flagging a missing UI control.
3. Extract the skeleton (dominator analysis): We applied dominator analysis to the merged graph to identify "essential states," the milestones every successful run must pass through, while automatically filtering out "optional" states like loading spinners.

This approach is uniquely powerful for developers because it requires no manual specification and no large-scale model training. Because the resulting model is a graph of actual execution states, its decisions are entirely explainable.
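The capture-and-merge steps can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: each trace is assumed to already be a list of (action, state) pairs, states are plain string labels, and a trivial exact-match predicate stands in for the tiered visual/LLM equivalence check:

```python
from collections import defaultdict

def states_equivalent(a: str, b: str) -> bool:
    """Stand-in for the tiered equivalence check. Here states are
    plain labels, so equivalence is exact match; the real framework
    compares screenshots visually and, when ambiguous, semantically."""
    return a == b

def build_pta(traces):
    """Merge successful traces into a prefix-tree-acceptor-like graph.

    edges[state] maps an action to the successor state. Shared
    prefixes collapse because equivalent states merge into one node."""
    edges = defaultdict(dict)
    for trace in traces:
        prev = "START"
        for action, state in trace:
            existing = edges[prev].get(action)
            if existing is not None and states_equivalent(existing, state):
                prev = existing  # merge into the existing node
            else:
                # In the full framework, a non-equivalent successor
                # would become a separate branch; this sketch simply
                # records the transition.
                edges[prev][action] = state
                prev = state
    return dict(edges)

# Two runs of the same task: one hits a loading screen, one doesn't.
run_a = [("launch", "Desktop"), ("wait", "Loading"),
         ("open_search", "Search Dialog"), ("submit", "Search Results")]
run_b = [("launch", "Desktop"),
         ("open_search", "Search Dialog"), ("submit", "Search Results")]

graph = build_pta([run_a, run_b])
# From "Desktop" the graph branches: "wait" leads to the optional
# loading screen, "open_search" goes straight to the dialog, and
# both paths converge on the same "Search Dialog" node.
print(sorted(graph["Desktop"]))  # ['open_search', 'wait']
```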
When validation fails, our algorithm provides clear failure reasoning by identifying exactly which essential state was missed.

Deciding when two states are "the same"

State equivalence is the hardest problem in agent validation. How do we know whether two different screenshots represent the same logical UI state?

We solve this with a three-tier equivalence detection framework that moves from fast visual metrics to deep semantic understanding:

- Visual metrics: Fast perceptual hashes and structural similarity (SSIM) catch near-identical states immediately.
- Semantic analysis via LLM: When visual metrics are ambiguous, a multimodal LLM decides whether the differences are semantically meaningful. For example, the LLM knows to ignore a timestamp change or a different window decoration but will flag a different error message or a missing UI control.
- Conservative merging: We only merge states when the model is certain they are equivalent, allowing the graph to branch naturally where execution paths genuinely diverge.

This is not a naive pixel-by-pixel comparison, nor is it "LLM hand-waving" where the model is asked to judge the whole task. By using the LLM defensively and sparingly, to resolve specific ambiguities, the framework remains robust enough to handle UI noise yet precise enough to detect a real regression.

Once the execution traces are merged into a unified graph, our algorithm applies dominator analysis to isolate the core skeleton of the task.

Defining "essential" through dominance: In graph theory, state A dominates state B if every possible path from the start to B must pass through A.
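Dominator sets can be computed with a standard iterative dataflow pass. The sketch below runs it on a toy version of the VS Code scenario; the graph and state names are illustrative, not taken from the paper:

```python
def dominators(graph, start):
    """Compute dom[n]: the set of nodes on *every* path from start to n.

    Standard iterative dataflow: dom[n] = {n} union the intersection
    of dom over n's predecessors, repeated until a fixed point."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    preds = {n: set() for n in nodes}
    for n, succs in graph.items():
        for s in succs:
            preds[s].add(n)

    dom = {n: set(nodes) for n in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes - {start}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# Toy task graph: the loading screen is bypassed on fast runs.
graph = {
    "Start":          ["Loading", "Search Dialog"],
    "Loading":        ["Search Dialog"],
    "Search Dialog":  ["Search Results"],
    "Search Results": [],
}
dom = dominators(graph, "Start")
# Every path to the results passes through the dialog, but not
# through the loading screen:
print("Search Dialog" in dom["Search Results"])  # True
print("Loading" in dom["Search Results"])        # False
```

Production implementations typically use faster algorithms (e.g., Cooper-Harvey-Kennedy or Lengauer-Tarjan), but the fixed-point version above is the easiest to inspect.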
In our model, we define a state as essential if it is a dominator of the task's successful completion.

The filtering process: By computing these relationships, the algorithm automatically distinguishes "must-have" milestones from incidental noise.

In our VS Code experiments, the "Search Dialog" state is identified as an essential milestone because it is a mathematical dominator: it is logically impossible to reach the results without first triggering the search. Conversely, a "Loading" screen dominates nothing; because it is bypassed in faster runs, the algorithm flags it as an optional variation rather than a requirement for success. This ensures the Trust Layer framework only alerts you when a critical step is missed, not when the environment fluctuates.

[Figure: The same VS Code scenario annotated as a graph. A loading screen or intermediate blank desktop screen (State 1.2) between State 1.1 and State 1.3 is optional; example states such as the desktop with the VS Code icon being clicked or the Start Menu (State 2.1) must precede the open VS Code window (State 2.2). Scenario: Opening VS Code]

By extracting these essential nodes into a dominator subtree, we create a "ground truth" model that represents the minimal, explainable definition of correctness.
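Once the essential skeleton is extracted, checking a new trace against it is a small ordered-subsequence test. The sketch below uses string labels in place of the real equivalence-matched states, and the skeleton list is illustrative:

```python
def validate(trace, skeleton):
    """Topological subsequence match: essential states must appear in
    order; extra states between them are treated as incidental noise.

    Returns (coverage, failure_reason)."""
    it = iter(trace)
    # `m not in it` consumes the iterator up to the first match, so
    # each milestone must appear *after* the previous one.
    missed = [m for m in skeleton if m not in it]
    coverage = (len(skeleton) - len(missed)) / len(skeleton)
    reason = None if not missed else (
        f"essential state {missed[0]!r} never reached")
    return coverage, reason

# Skeleton as extracted by the dominator analysis (illustrative).
skeleton = ["Start", "Search Dialog", "Search Results"]

# Extra states (loading screen, a popup) do not cause a failure:
ok = validate(["Start", "Loading", "Search Dialog",
               "Extra Popup", "Search Results"], skeleton)
print(ok)  # (1.0, None)

# Skipping an essential milestone does:
bad = validate(["Start", "Loading"], skeleton)
print(bad[1])  # essential state 'Search Dialog' never reached
```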
This shifts the validation focus away from the specific steps the agent took and toward the critical checkpoints it was required to hit.

Validating new executions in practice

With the dominator tree established as our ground truth, validating a new, unseen execution becomes a structural comparison rather than a search for a perfect match. As long as Agent Mode hits the "must-have" milestones, it is free to navigate the environment, or adapt its integrated Computer Use path, as it sees fit.

When a new execution trace arrives, our validation algorithm extracts its sequence of states and checks it against the dominator tree using topological subsequence matching.

- The logic: We don't require the new trace to be identical to the reference; we only require that the essential states appear in the correct relative order.
- Handling extras: If the reference sequence is A → B → C and the agent produces A → X → B → Y → C, the test still passes because the extra states (X, Y) are treated as incidental noise.
- Detecting failure: A failure is triggered only if an essential state is skipped or if the states appear out of the required logical order.

Scoring and explainability

Our framework produces more than a binary pass/fail; it provides a coverage metric and a clear explanation:

- Coverage: The percentage of matched essential states relative to the total number of states in the reference model.
- Failure reasoning: If a trace fails, our algorithm identifies exactly which state was missing (e.g., "Failed: state 'Search Results' never reached after 'Search Dialog'").

This level of detail transforms the validation from a black box into a diagnostic tool that developers can actually use to debug their agents and their environments.

What we
learned from evaluation

To prove the efficacy of this structural approach for the Trust Layer, we ran a controlled experiment comparing our dominator tree method against an agent's self-assessment (where the Computer-Use Agent, or CUA, reports its own success) in a real-world scenario: a Copilot Agent custom VS Code extension test suite.

The accuracy gap

In tests designed to differentiate successful executions from those failing due to product bugs or agent errors, the results were definitive:

| Metric    | CUA self-assessment | PTA (dominator tree) |
|-----------|---------------------|----------------------|
| Accuracy  | 82.2%               | 100% (+17.8)         |
| Precision | 83.3%               | 100% (+16.7)         |
| Recall    | 60.0%               | 100% (+40.0)         |
| F1-score  | 69.8%               | 100% (+30.2)         |

While the agent (CUA) frequently misreported failures as successes, often due to timing out or misinterpreting its own state, the dominator tree achieved perfect differentiation by focusing on whether essential milestones were actually reached.

Identifying "not a bug" scenarios

The most significant impact for developers is the reduction of false alarms. When a test fails, you need high-signal feedback to know whether the product code is broken or the agent simply stumbled over environmental noise.

- The "self-verification" gap: In our evaluation, the agent's internal self-assessment (CUA) was completely unable to identify "not a bug" scenarios (0% F1-score). Agents cannot yet reliably grade their own homework in non-deterministic environments.
- The structural advantage: By using state and action equivalence within the dominator model, our independent Trust Layer achieved a 52.2% F1-score in correctly identifying when a failure was an agent execution error rather than a product regression.

The takeaway

Structural validation beats self-reported success by a wide margin.
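As a quick sanity check on the table, the CUA row's F1 score follows directly from its precision and recall:

```python
precision, recall = 0.833, 0.600

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.698, matching the 69.8% reported above
```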
By moving the "source of truth" from the agent's internal logic to a learned external structure, we can significantly reduce the manual review time wasted on flaky test results and false positives in CI pipelines.

Where this fits in developer workflows today

For this Trust Layer framework to be effective, it must move beyond a research prototype and integrate directly into the systems developers use every day. By treating correctness as a learned structure rather than a rigid script, we can significantly improve the reliability of production-grade automation within the GitHub ecosystem.

Integration points

This approach is designed to strengthen several critical areas of the software development lifecycle:

- GitHub Actions pipelines: By reducing false negatives caused by environmental noise (like transient loading screens), this method provides a higher signal for automated builds, preventing unnecessary pipeline blocks.
- Regression testing: Developers can use a handful of verified traces from a stable version to create a ground-truth model that automatically validates future updates.
- Agent evaluation: Instead of relying on an agent to report its own success, teams can use structural validation to measure how often an agent actually hits essential milestones.
- UI automation: The framework allows for more robust automation of complex desktop and web apps whose UI elements or paths may shift slightly between versions.

The ultimate goal of this framework is to move agents from experimental demos to production infrastructure. By providing reasoning, where a failure clearly points to a missing essential state, we give developers the transparency they need to trust autonomous systems in their workflows.

What's next

While structural validation represents a significant leap forward, our current framework has a few
boundaries as it moves toward full maturity.

Current limitations include:

- Requirement for success traces: The algorithm learns by example, so it requires 2–10 successful execution traces to build its ground-truth model. It cannot yet learn or define correctness exclusively from failure logs.
- LLM dependency: Our semantic equivalence checking currently relies on multimodal LLM access. While this enables the "intelligence" to ignore timestamps or window decorations, it introduces an external API dependency and associated latency into the validation layer.
- Temporal blind spots: The current implementation validates the order of events but cannot yet flag when a specific state (like a loading spinner) persists for too long.

Future work includes:

- Temporal and negative constraints: Capturing timing requirements (e.g., "loading must resolve within five seconds") and learning from negative examples to explicitly block known failure paths.
- Hierarchical and multimodal abstraction: Clustering low-level screenshots into high-level concepts (e.g., a "launch sequence") while integrating non-visual signals like DOM structures, accessibility trees, and network traffic.
- Online learning: Real-time model refinement. As the algorithm validates new successful runs, it will recompute dominators to continuously improve its understanding of what is truly essential.

Why this matters now

As AI agents move from experimental demos to core infrastructure, validation has to evolve with them, moving past brittle scripts to resilient systems.

We don't need black-box models to judge other black-box models.
We need structural guarantees developers can inspect, reason about, and trust.

By combining classic compiler theory (i.e., dominator analysis) with multimodal AI, we've demonstrated that it's possible to learn an explainable, robust definition of success from just a handful of examples. This Trust Layer framework provides:

- Efficient learning: Automatic derivation of ground truth from passing examples.
- Operational robustness: Graceful handling of non-deterministic behavior and environmental noise.
- Total transparency: Explainable results with clear reasoning that developers can act upon.

As we move forward, focusing on these practical, explainable paths will be essential to ensuring that the GitHub Copilot Coding Agent is not just powerful but also a trustworthy component of the developer workflow. This is particularly critical as Computer Use is increasingly adopted in the AI-native development lifecycle. By moving the "source of truth" from an agent's internal logic to a learned external structure, we provide the guarantees needed to make autonomous agents viable, production-grade tools in modern infrastructure.

Our journey toward verifiable autonomy is just beginning. For a deep dive into our dominator-analysis-based framework, you can read the complete paper: https://arxiv.org/pdf/2605.03159

Written by

Gaurav Mittal
Principal Researcher, Microsoft Code | AI.
I am a tech lead focused on product-driven AI research to improve the developer ecosystem and the GitHub Copilot experience via intelligent and reliable models and agentic frameworks.

Reshabh Kumar Sharma
Student Researcher (former), Microsoft Code | AI. I am a PhD student at UW focused on improving the reliability and maintainability of LLM agents, using best practices from traditional software engineering.
:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media\/30144"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media?parent=30143"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/categories?post=30143"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/tags?post=30143"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}