{"id":318895,"date":"2025-08-05T04:28:10","date_gmt":"2025-08-05T04:28:10","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/318895\/"},"modified":"2025-08-05T04:28:10","modified_gmt":"2025-08-05T04:28:10","slug":"our-framework-for-developing-safe-and-trustworthy-agents-anthropic","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/318895\/","title":{"rendered":"Our framework for developing safe and trustworthy agents \\ Anthropic"},"content":{"rendered":"<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">The most popular AI tools today are assistants that respond to specific questions or prompts. But we\u2019re now seeing the emergence of <a href=\"https:\/\/www.youtube.com\/watch?v=LP5OCa20Zpg\" target=\"_blank\" rel=\"noopener\">AI agents<\/a>, which pursue tasks autonomously when given a goal. Think of an agent like a virtual collaborator that can independently handle complex projects from start to finish &#8211; all while you focus on other priorities.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Agents direct their own processes and tool usage, maintaining control over how they accomplish tasks with minimum human input. If you ask an agent to &#8220;help plan my wedding,&#8221; it might autonomously research venues and vendors, compare pricing and availability, and create detailed timelines and budgets. Or if you ask it to \u201cprepare my company\u2019s board presentation&#8221;, it might search through your connected Google Drive for relevant sales reports and financial documents, extract key metrics from multiple spreadsheets, and produce a report.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Last year, we introduced <a href=\"https:\/\/www.anthropic.com\/claude-code\" target=\"_blank\" rel=\"noopener\">Claude Code<\/a>, an agent that can autonomously write, debug, and edit code, and is used widely by software engineers. 
Many companies are also building their own agents using our models. <a href=\"https:\/\/www.anthropic.com\/customers\/trellix\" target=\"_blank\" rel=\"noopener\">Trellix<\/a>, a cybersecurity firm, uses Claude to triage and investigate security issues. And <a href=\"https:\/\/www.anthropic.com\/customers\/block\" target=\"_blank\" rel=\"noopener\">Block<\/a>, a financial services company, has built an agent that allows non-technical staff to access its data systems using natural language, saving its engineers time.<\/p>\n<p>Principles for trustworthy agents<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">The rapid adoption of agents means it&#8217;s crucial that developers like Anthropic build agents that are safe, reliable, and trustworthy. Today, we&#8217;re sharing an early framework for responsible agent development. We hope this framework can help establish emerging standards, offer adaptable guidance for different contexts, and contribute to building an ecosystem where agents align with human values.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We aim to adhere to the following principles when developing agents:<\/p>\n<p><img loading=\"lazy\" width=\"7200\" height=\"4050\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/08\/1754368090_724_image\"\/>Keeping humans in control while enabling agent autonomy<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">A central tension in agent design is balancing agent autonomy with human oversight. Agents must be able to work autonomously\u2014their independent operation is exactly what makes them valuable. But humans should retain control over how their goals are pursued, particularly before high-stakes decisions are made. For example, an agent helping with expense management might identify that the company is overspending on software subscriptions. 
Before it starts canceling subscriptions or downgrading service tiers, the company would likely want a human to give approval.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In Claude Code, humans can stop Claude whenever they want and redirect its approach. It has read-only permissions by default, meaning it can analyze and review information within the directory it&#8217;s initialized in without asking for approval, but must request human approval before taking any actions that modify code or systems. Users can grant persistent permissions for routine tasks they trust Claude to handle.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">As agents become more powerful and prevalent, we\u2019ll need even more robust technical solutions and intuitive user controls. The right balance between autonomy and oversight varies dramatically across scenarios and likely includes a mix of built-in and customizable oversight features.<\/p>\n<p>Transparency in agent behavior<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Humans need visibility into agents\u2019 problem-solving processes. Without transparency, a human asking an agent to &#8220;reduce customer churn&#8221; might be baffled when the agent starts contacting the facilities team about office layouts. 
But with good transparency design, the agent can explain its logic: &#8220;I found that customers assigned to sales reps in the noisy open office area have 40% higher churn rates, so I&#8217;m requesting workspace noise assessments and proposing desk relocations to improve call quality.&#8221; This also provides an opportunity to nudge agents in the right direction by fact-checking their data or making sure they\u2019re using the most relevant sources.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In Claude Code, Claude shows its planned actions through a real-time to-do checklist, and users can jump in at any time to ask about or adjust Claude\u2019s work plan. The challenge is in finding the right level of detail. Too little information leaves humans unable to assess whether the agent is on track to achieve its goal. Too much can overwhelm them with irrelevant details. We try to take a middle ground, but we\u2019ll need to iterate on this further.<\/p>\n<p><img loading=\"lazy\" width=\"7200\" height=\"4050\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/08\/1754368090_513_image\"\/>Claude Code\u2019s to-do checklist, which users can see in real time<br \/>Aligning agents with human values and expectations<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Agents don&#8217;t always act as humans intend. Our research has shown that when AI systems pursue goals autonomously, they can sometimes take actions that seem reasonable to the system but aren&#8217;t what humans actually wanted. If a human asks an agent to &#8220;organize my files,&#8221; the agent might automatically delete what it considers duplicates and move files to new folder structures\u2014going far beyond simple organization to completely restructuring the user&#8217;s system. 
While this stems from the agent trying to be helpful, it demonstrates how agents may lack the context to act appropriately even when their goals do align.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">More concerning are cases where agents pursue goals in ways that actively work against users&#8217; interests. <a href=\"https:\/\/www.anthropic.com\/research\/agentic-misalignment\" target=\"_blank\" rel=\"noopener\">Our testing of extreme scenarios<\/a> has shown that autonomous agents can sometimes take actions that seem justified to the system but directly violate what users actually wanted. Users may also inadvertently prompt agents in ways that lead to unintended outcomes.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Building reliable measures of agents\u2019 value alignment is challenging. It\u2019s hard to evaluate both the malign and benign causes of the problem at once. But we\u2019re actively working to resolve it. Until we do, the transparency and control principles above will be particularly important.<\/p>\n<p>Protecting privacy across extended interactions<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Agents can retain information across different tasks and interactions. This creates several potential privacy problems. Agents might inappropriately carry sensitive information from one context to another. For example, an agent might learn about confidential internal decisions from one department while helping with organizational planning, then inadvertently reference this information when assisting another department \u2013 exposing sensitive matters that should remain compartmentalized.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Tools and processes that agents utilize should also be designed with the appropriate privacy protections and controls. 
The open-source <a href=\"https:\/\/www.anthropic.com\/partners\/mcp\" target=\"_blank\" rel=\"noopener\">Model Context Protocol<\/a> (MCP) we created, which allows Claude to connect to other services, includes controls that let users allow or prevent Claude from accessing specific tools and processes (what we call \u201cconnectors\u201d) in a given task. In implementing MCP, we included additional controls, such as the option to grant one-time or permanent access to information. Enterprise administrators can also set which connectors users in their organizations can connect to. We continue to explore ways to improve our privacy protection tooling.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We\u2019ve also outlined steps our customers should take to <a href=\"https:\/\/support.anthropic.com\/en\/articles\/11175166-getting-started-with-custom-integrations-using-remote-mcp\" target=\"_blank\" rel=\"noopener\">safeguard their data<\/a> through measures like access permissions, authentication, and data segregation.<\/p>\n<p>Securing agents\u2019 interactions<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Agent systems should be designed to safeguard sensitive data and prevent misuse when interacting with other systems or agents. Since agents are tasked with achieving specific goals, attackers could trick an agent into ignoring its original instructions, revealing unauthorized information, or performing unintended actions by making these seem necessary for the agent\u2019s objectives (an attack known as \u201cprompt injection\u201d). 
Or attackers could exploit vulnerabilities in the tools or sub-agents that agents use.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Claude already uses a system of <a href=\"https:\/\/www.anthropic.com\/research\/constitutional-classifiers\" target=\"_blank\" rel=\"noopener\">classifiers<\/a> to detect and guard against misuses such as prompt injections, in addition to several <a href=\"https:\/\/docs.anthropic.com\/en\/docs\/claude-code\/security\" target=\"_blank\" rel=\"noopener\">other layers of security<\/a>. Our Threat Intelligence team conducts ongoing monitoring to assess and mitigate new or emerging forms of malicious behavior. In addition, we <a href=\"https:\/\/docs.anthropic.com\/en\/docs\/test-and-evaluate\/strengthen-guardrails\/mitigate-jailbreaks\" target=\"_blank\" rel=\"noopener\">provide guidance<\/a> on how organizations using Claude can further decrease these risks. Tools added to our <a href=\"https:\/\/www.anthropic.com\/news\/connectors-directory\" target=\"_blank\" rel=\"noopener\">Anthropic-reviewed MCP directory<\/a> must adhere to our security, safety, and compatibility <a href=\"https:\/\/support.anthropic.com\/en\/articles\/11697096-anthropic-mcp-directory-policy\" target=\"_blank\" rel=\"noopener\">standards<\/a>.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">When we discover new malicious behaviors or vulnerabilities through our monitoring and research, we strive to address them quickly and continuously improve our security measures to stay ahead of evolving threats.<\/p>\n<p>Next steps<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">As we continue developing and improving our agents, we expect our understanding of their risks and trade-offs to also evolve. 
Over time, we plan to revise and update this framework to reflect our view of best practices.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">These principles will guide our current and future work on agent development, and we look forward to collaborating with other companies and organizations on this topic. Agents have tremendous potential for positive impacts in work, education, healthcare, and scientific discovery. That is why it is so important to ensure they are built to the highest standards.<\/p>\n","protected":false},"excerpt":{"rendered":"The most popular AI tools today are assistants that respond to specific questions or prompts. But we\u2019re now&hellip;\n","protected":false},"author":2,"featured_media":318896,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3163],"tags":[323,1942,53,16,15],"class_list":{"0":"post-318895","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-technology","11":"tag-uk","12":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/114974274625717687","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/318895","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=318895"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/318895\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp
\/v2\/media\/318896"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=318895"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=318895"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=318895"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}