{"id":249480,"date":"2025-07-09T01:30:06","date_gmt":"2025-07-09T01:30:06","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/249480\/"},"modified":"2025-07-09T01:30:06","modified_gmt":"2025-07-09T01:30:06","slug":"flashes-of-brilliance-and-frustration-i-let-an-ai-agent-run-my-day","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/249480\/","title":{"rendered":"&#8216;Flashes of brilliance and frustration&#8217;: I let an AI agent run my day"},"content":{"rendered":"<p><img decoding=\"async\" class=\"Image\" alt=\"New Scientist. Science news and long reads from expert journalists, covering developments in science, technology, health and the environment on the website and the magazine.\" width=\"1350\" height=\"900\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/07\/SEI_257715560.jpg\"   loading=\"eager\" fetchpriority=\"high\" data-image-context=\"Article\" data-image-id=\"2486953\" data-caption=\"\" data-credit=\"\"\/><\/p>\n<p>I will never forget the kung pao chicken I sat down to eat a few months ago. Not because the taste blew me away \u2013 20 minutes on the back of a delivery rider\u2019s scooter had sullied that somewhat. What made the meal memorable was that I hadn\u2019t really ordered it at all. Yet there it was, in front of me.\u00a0<\/p>\n<p>An AI assistant called Operator, developed by ChatGPT-maker OpenAI, had ordered the food on my behalf. The tech industry has dubbed such assistants \u201cAI agents\u201d, and several are now commercially available. These AI agents have the potential to transform our lives by carrying out mundane tasks, from answering emails to shopping for clothes and ordering food. 
Microsoft chief financial officer Amy Hood <a href=\"https:\/\/www.businessinsider.com\/microsoft-q3-2025-earnings-cfo-amy-hood-email-employees-2025-4\" target=\"_blank\" rel=\"noopener\">reportedly said<\/a> in a recent internal memo that agents \u201care pushing each of us to think differently, work differently\u201d and are \u201ca glimpse of what\u2019s ahead\u201d. In that sense, my kung pao chicken was a taste of the future.\u00a0<\/p>\n<p>But what will that future be like? To find out, I decided to put Operator and a rival product named Manus, developed by Chinese start-up <a href=\"https:\/\/www.butterflyeffect.ai\/\" target=\"_blank\" rel=\"noopener\">Butterfly Effect<\/a>, through their paces. Working with them was a mixed bag: amid the flashes of brilliance, there were moments of frustration, too. In the process, I also got a glimpse of the risks to which we are exposing ourselves. Because fully embracing these tools requires handing them the keys to our finances and our list of social contacts, as well as trusting them to perform tasks the way we want them to. Are we ready for the world of AI agents, or will they be hard to stomach?\u00a0<\/p>\n<p>Since 2023, we have lived in the era of generative AI. Built using <a href=\"https:\/\/www.newscientist.com\/article\/mg25934590-600-we-still-dont-really-understand-what-large-language-models-are\/\" target=\"_blank\" rel=\"noopener\">large language models (LLMs)<\/a> and trained on huge volumes of data scraped mainly from web pages, generative AI can create <a href=\"https:\/\/www.newscientist.com\/article\/2322056-will-ai-text-to-image-generators-put-illustrators-out-of-a-job\/\" target=\"_blank\" rel=\"noopener\">original content such as text or images<\/a> in response to commands given in everyday language. 
It would be fair to say that this AI has made quite a splash, judging by the volume of media coverage devoted to the technology, and has already changed the world significantly.\u00a0<\/p>\n<p>The rise of agentic AI<\/p>\n<p>Agentic AI promises to take things one step further. It is \u201cempowered with actually doing something for you\u201d, says <a href=\"https:\/\/www.cs.utexas.edu\/~pstone\/\" target=\"_blank\" rel=\"noopener\">Peter Stone<\/a> at the University of Texas at Austin. Over the past few years, many of us have grown used to the idea of asking a generative AI for information \u2013 recommendations of favourite dishes available in the neighbourhood, for instance, and contact details for the restaurants from which that food can be ordered. But ask agentic AI, \u201cWhat should I eat tonight?\u201d and it can pick out dishes it thinks you will like from a restaurant\u2019s website and \u2013 if there is an online order form \u2013 pay for the food using your credit card, arrange for it to be sent to your home and let you know when to expect the delivery. \u201cThat will feel like a fundamentally different experience,\u201d says Stone: AI as an autopilot rather than a copilot.\u00a0<\/p>\n<p>Building an agentic AI with this sort of capability is trickier than it might appear. LLMs are still the driving force under the surface, but with agentic AI, they focus their processing power on the decisions they can make and the real-world actions they can take based on the digital tools \u2013 including web browsers and other computer-based apps \u2013 at their disposal. When given a goal such as \u201corder dinner\u201d or \u201cbuy me some shoes\u201d, the AI agent develops a multi-step plan involving those digital tools. It then monitors and analyses how close the output at each step is to the ultimate goal, and reassesses what else needs to be done. 
This process continues until the agent is satisfied it has reached the ultimate goal \u2013 or come as close to doing so as possible. And once the act is done, the system asks whether it achieved the goal successfully, a form of feedback also present in AI chatbots, called reinforcement learning from human feedback.\u00a0<\/p>\n<p>Stone, who is the founder and director of the Learning Agents Research Group at his university, has spent decades thinking about the possibility of AI agents. They are, he says, systems that \u201csense the environment, decide what to do and take an action\u201d. Put in those terms, it may feel as if AI agents have been with us for years. For instance, IBM\u2019s Deep Blue computer appeared to have reacted to events on a real-world chessboard to <a href=\"https:\/\/www.newscientist.com\/article\/mg23431280-300-we-dont-need-to-lose-out-to-machines-says-the-man-who-did\/\" target=\"_blank\" rel=\"noopener\">beat former World Chess Champion Garry Kasparov<\/a> in 1997. But Deep Blue wasn\u2019t an agentic AI, says Stone. \u201cIt was decision-making, but it wasn\u2019t sensing or acting,\u201d he says. It relied on human operators to move chess pieces on its behalf and to inform it about Kasparov\u2019s moves. An AI agent doesn\u2019t need human help to interact with the real world. \u201cLanguage models that were disembodied or disconnected from the world are now being connected [to it],\u201d says Stone.\u00a0<\/p>\n<p>Early versions of these agentic AIs are now available from many tech firms, with each, whether it is Microsoft, Amazon or the software firm Oracle, offering its own. I was eager to see how they work in practice, but doing so isn\u2019t cheap: some come with annual subscription fees running to thousands of dollars. I reached out to OpenAI and Butterfly Effect and asked for a free trial of their products \u2013 Operator and Manus, respectively. Both accepted my request. 
My plan was to use the AIs as personal assistants, taking on my grunt work so I would have more free time.\u00a0<\/p>\n<p><img decoding=\"async\" class=\"Image\" alt=\"A person working in a cafe on a laptop\" width=\"1350\" height=\"901\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/07\/SEI_257394521.jpg\"   loading=\"lazy\" data-image-context=\"Article\" data-image-id=\"2486281\" data-caption=\"Will AI agents soon take care of our boring work admin?\" data-credit=\"Kuan Chang Chen\/Millennium Images, UK\"\/><\/p>\n<p class=\"ArticleImageCaption__Title\">Will AI agents soon take care of our boring work admin?<\/p>\n<p class=\"ArticleImageCaption__Credit\">Kuan Chang Chen\/Millennium Images, UK<\/p>\n<p>The results were mixed. I was due to give a presentation in a few weeks, so I uploaded my slide deck to Manus\u2019s online interface and asked the AI agent to reformat it. Manus seemed to have done a good job, but after opening the slide deck in PowerPoint, I realised that it had placed every line of text in a separate text box, meaning it would be annoyingly fiddly for me to make additional edits myself. Manus fared better, however, when I asked it to compile the code for an app into an app-store-ready format for upload, which it did using various tools and its remote computer\u2019s command line.\u00a0<\/p>\n<p>Turning to Operator, I began by asking the AI agent to handle my online invoicing system. Like a well-meaning but not particularly helpful intern, it insisted on filling out the form the wrong way: entering a text description of the work I was invoicing for into a box that accepts only numeric codes. I eventually managed to break it out of that habit, but then Operator got confused when copying over details from my \u201cto invoice\u201d list to the system, with potentially embarrassing results. 
Notably, it suggested I submit an invoice to the New Scientist accounts team asking for an \u00a38001 payment for a single article.\u00a0<\/p>\n<p>It was with some trepidation, then, that I gave Operator a promotion and asked for its help in reporting this story. I had already used ChatGPT to identify AI experts who could comment on the rise of agentic AIs. I asked Operator to send each expert an email on my behalf requesting an interview. The results, which I didn\u2019t see until the emails had already been sent, made me inwardly cringe \u2013 not least because Operator decided against acknowledging its role in composing them, giving the impression that I had written them myself. The language the AI agent used was simultaneously naive and too formal, with staccato sentences fired with a semi-hostility that put me \u2013 and, in all likelihood, the would-be interviewees \u2013 on edge. Operator also failed to mention some key information, including that my story would be published by New Scientist. In that way, it felt a lot like a junior assistant. Not really knowing how to write an email as I would, Operator made many mistakes.\u00a0<\/p>\n<p>In Operator\u2019s defence, however, the emails were at least partially successful. It was through an Operator email that I made contact with Stone, for instance, who took the AI-sent email in his stride. Another researcher complimented me on the approach when I later disclosed that the email had been written by Operator. \u201cThat\u2019s serious dogfooding!\u201d they said \u2013 tech slang for testing experimental new products \u2013 although they declined to speak for this story because the funders of a project they were working on wouldn\u2019t let them. <\/p>\n<p>Who does an AI agent really work for?<\/p>\n<p>The tech companies behind these AI agents present the technology as if it is an indefatigable digital assistant. But the truth is that, in my experience, we aren\u2019t quite there yet. 
Still, assuming the tech is going to improve, how should we view these new tools? To start with, it is worth pondering the commercial incentives that underpin all the hype, says <a href=\"https:\/\/www.philosophy.ox.ac.uk\/people\/carissa-veliz\" target=\"_blank\" rel=\"noopener\">Carissa V\u00e9liz<\/a> at the University of Oxford. \u201cOf course, the AI agent works for a company before they work for you, in the sense that they are produced by a company with financial interests,\u201d she says. \u201cWhat will happen when there are conflicts of interest between the company who essentially leases the AI agent and your own interests?\u201d \u00a0<\/p>\n<p>We can already see examples of this in the early AI agents: OpenAI has signed agreements with companies to collaborate on its system, so when searching for holiday flights, Operator may prefer Skyscanner over competitors, or turn first to the Financial Times and Associated Press if you ask it about the news. V\u00e9liz also suggests users consider privacy concerns before leaping headfirst into using agentic AI, given the tech\u2019s access to our personal information. \u201cThe essence of cybersecurity is to have different boxes for different things,\u201d says V\u00e9liz \u2013 using unique passwords for online banking and email, for instance, and never saving those passwords in a single document \u2013 but to use an AI agent, we must break down the barriers between those boxes. \u201cWe\u2019re giving these agents the key to a system in which everything is connected, and that makes them very unsafe,\u201d she says. \u00a0<\/p>\n<p>It is a warning I can appreciate. I wasn\u2019t particularly happy that my trial with Operator necessarily involved ceding control of my email and accounting software to the AI agent \u2013 and my level of unease hit new heights when I asked Operator to order the dish of kung pao chicken on my behalf. 
At one point, the AI agent asked me to type my credit card details into a computer window that had popped up in the Operator chatbot interface. I reluctantly did so, even though I felt I didn\u2019t fully control the window and that I was placing an enormous amount of trust in Operator.\u00a0<\/p>\n<p>Moreover, as things stand, it isn\u2019t completely clear that AI agents have earned such trust. By definition, they tend to \u201caccess a lot of tools and interact a lot more with the outside world\u201d, says <a href=\"https:\/\/www.linkedin.com\/in\/mehrnoosh-sameki\/\" target=\"_blank\" rel=\"noopener\">Mehrnoosh Sameki<\/a>, principal project manager of generative AI evaluation and governance at Microsoft. This makes them vulnerable to certain types of attack. \u00a0<\/p>\n<p><a href=\"https:\/\/tianshili.me\/\" target=\"_blank\" rel=\"noopener\">Tianshi Li<\/a> at Northeastern University in Massachusetts recently looked at six leading agents, and <a href=\"https:\/\/arxiv.org\/html\/2411.01344v2\" target=\"_blank\" rel=\"noopener\">studied those vulnerabilities<\/a>. She and her team found that agents could fall prey to relatively simple tricks. For instance, deep within the text of a privacy policy that few people would read, a malicious actor might hide a request to click a link and insert credit card details. Li\u2019s team found that an AI agent wouldn\u2019t hesitate to carry out the request. \u201cI think there are a lot of very legitimate concerns these agents might not act in accordance with people\u2019s expectations,\u201d she says. \u201cAnd there is no effective mechanism to allow people to intervene or remind them of this possibility and to avoid the possible consequences.\u201d \u00a0<\/p>\n<p>OpenAI declined to comment on the concerns raised by Li\u2019s research \u2013 although my experience using Operator suggests the company is aware of the trust-and-control issue. 
For instance, Operator seemed to go out of its way to ping me constantly with notifications to check whether the actions it wanted to take aligned with my expectations. The downside of that strategy, however, was that I spent so much time micromanaging the agent\u2019s work that I felt it would have been quicker to perform the tasks myself.\u00a0<\/p>\n<p><img decoding=\"async\" class=\"Image\" alt=\"CUBA. Guardalavaca. Playa Pesquero. All inclusive resort. 2017.\" width=\"1350\" height=\"900\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/07\/SEI_257394439.jpg\"   loading=\"lazy\" data-image-context=\"Article\" data-image-id=\"2486282\" data-caption=\"AI agents can carry out tasks with results in the real world, including booking holidays\" data-credit=\"MARTIN PARR\"\/><\/p>\n<p class=\"ArticleImageCaption__Title\">AI agents can carry out tasks with results in the real world, including booking holidays<\/p>\n<p class=\"ArticleImageCaption__Credit\">MARTIN PARR<\/p>\n<p>\u201cWe\u2019re still [in the] early days in a lot of these agentic experiences,\u201d admits <a href=\"https:\/\/www.linkedin.com\/in\/colin-jarvis-50019658\/?originalSubdomain=uk\" target=\"_blank\" rel=\"noopener\">Colin Jarvis<\/a>, who leads OpenAI\u2019s deployed engineering team. Jarvis says the current crop of AI agents are far from achieving their full potential. \u201cIt still needs quite a bit of work to get that reliability,\u201d he says. \u00a0<\/p>\n<p>Butterfly Effect made a similar point. When I reached out to the firm to discuss my problems using its agent, I was told that \u201cManus is currently in its beta stage, and we are actively working on optimising and improving its performance and functionality\u201d.\u00a0<\/p>\n<p>Tech firms have arguably been struggling to get agentic AI working for several years. 
In 2018, for instance, Google argued that a version of an AI agent it had developed, <a href=\"https:\/\/research.google\/blog\/google-duplex-an-ai-system-for-accomplishing-real-world-tasks-over-the-phone\/\" target=\"_blank\" rel=\"noopener\">called Duplex<\/a>, was going to change the world. The company touted Duplex\u2019s ability to call up restaurants and reserve tables for its customers. But, for reasons unknown, it never took off as an everyday tool with widespread appeal. \u00a0<\/p>\n<p>Beyond the hype<\/p>\n<p>Nevertheless, AI companies and tech analysts alike say the agentic AI revolution is just around the corner. The number of mentions of agentic AI on financial earnings calls at the end of last year was <a href=\"https:\/\/www.bain.com\/insights\/what-is-agentic-ai\/\" target=\"_blank\" rel=\"noopener\">51 times greater<\/a> than it was in the first quarter of 2022. The interest here lies not merely in using agents to assist human employees, but also in using them to replace those employees. For example, companies including Salesforce, which helps businesses manage customer relations, are <a href=\"https:\/\/www.cio.com\/article\/3490043\/salesforce-unveils-autonomous-agents-for-sales-teams.html\" target=\"_blank\" rel=\"noopener\">rolling out AI agents<\/a> to sell services.\u00a0\u00a0<\/p>\n<p>Stone doesn\u2019t think the technology is quite ready for that kind of application. \u201cThere\u2019s a lot of overhype right now,\u201d he says. \u201cIt\u2019s certainly not going to be within the next few years that all jobs are gone or that autonomous agents are doing everything.\u201d To make good on the most ambitious claims, he says, \u201cfundamental algorithms\u2026 would need to be discovered\u201d. 
<\/p>\n<p>Enthusiasm may be high because <a href=\"https:\/\/www.newscientist.com\/article\/2460254-over-70-per-cent-of-students-in-us-survey-use-ai-for-school-work\/\" target=\"_blank\" rel=\"noopener\">tools like ChatGPT perform so well<\/a> that they have raised expectations of what AI can achieve more generally. \u201cPeople have extrapolated to say, \u2018Oh, if they can do that, they can do everything,\u2019\u201d says Stone. Certainly, I found that agentic AI can work extremely well \u2013 some of the time. But Stone says we shouldn\u2019t infer from a few limited examples that AI agents can do it all. <\/p>\n<p>On reflection, I am inclined to agree with him \u2013 at least until my version of Operator recognises that I consider no order from a Chinese restaurant truly complete without a side of prawn crackers.\u00a0<\/p>\n","protected":false},"excerpt":{"rendered":"I will never forget the kung pao chicken I sat down to eat a few months ago. 
Not&hellip;\n","protected":false},"author":2,"featured_media":249481,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3163],"tags":[323,1942,53,16,15],"class_list":{"0":"post-249480","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-technology","11":"tag-uk","12":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/114820692615818717","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/249480","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=249480"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/249480\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/249481"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=249480"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=249480"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=249480"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}