Samuel Colvin, creator of the popular Pydantic validation library and founder of the company behind Pydantic AI and Logfire, recently took the stage at AI Engineer Europe to discuss the nuances of optimizing AI agents in production environments. His talk, titled “Playground in Prod,” focused on the practical challenges and advanced techniques for refining agent performance, emphasizing the shift from simple observability to robust evaluation and iterative improvement.

Samuel Colvin on Optimizing AI Agents in Production — from AI Engineer

The Need for Production-Ready AI Agents

Colvin began by highlighting the common misconception that AI agents are purely experimental tools. He stressed that for AI to be truly valuable, it must perform reliably and efficiently in production. This requires a more sophisticated approach than simply deploying an agent and hoping for the best. The core challenge, he explained, lies in understanding and improving agent performance over time, which necessitates moving beyond basic logging to a more rigorous evaluation framework.

Introducing GEPA and Managed Variables

A central theme of Colvin’s presentation was GEPA, or Genetic-Pareto prompt evolution. The technique applies an evolutionary algorithm to prompt engineering: candidate prompts are mutated, scored against an evaluation set, and the best performers are selected to seed the next generation. This lets GEPA explore a vast space of prompt variations and discover high-performing prompts automatically, without manual trial and error. Complementing this is Pydantic’s Logfire platform, which introduces ‘managed variables’: application parameters, including prompts, that can be changed dynamically without redeploying the system. This capability is crucial for rapid experimentation and iteration in production settings.
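To make the idea concrete, here is a minimal, self-contained sketch of a genetic prompt-evolution loop in Python. It is illustrative only: `run_agent`, `score`, and the toy dataset are hypothetical stand-ins rather than the GEPA library’s API or code from the talk, and real GEPA tracks per-example scores on a Pareto frontier rather than the single mean fitness used here.

```python
import random

# Candidate instruction tweaks the mutation step can splice in.
MUTATIONS = [
    "Think step by step. ",
    "Return only the relations requested. ",
    "Quote the sentence that supports each relation. ",
]

def run_agent(prompt: str, example: dict) -> str:
    # Toy stand-in for a real model call (e.g. a Pydantic AI agent):
    # the more useful instructions the prompt carries, the more likely
    # this fake "model" answers correctly.
    hits = sum(m in prompt for m in MUTATIONS)
    return example["expected"] if random.random() < 0.3 + 0.2 * hits else ""

def score(output: str, expected: str) -> float:
    # Exact-match scoring; a real evaluator would be task-specific.
    return 1.0 if output == expected else 0.0

def fitness(prompt: str, dataset: list[dict]) -> float:
    # Mean score over the evaluation set. (Real GEPA keeps per-example
    # scores and a Pareto frontier instead of a single mean.)
    return sum(score(run_agent(prompt, ex), ex["expected"]) for ex in dataset) / len(dataset)

def evolve(seed: str, dataset: list[dict], generations: int = 5, pop_size: int = 4) -> str:
    population = [seed]
    for _ in range(generations):
        # Breed mutated children, then select the fittest to survive.
        children = [random.choice(MUTATIONS) + random.choice(population) for _ in range(pop_size)]
        population = sorted(population + children, key=lambda p: fitness(p, dataset), reverse=True)[:pop_size]
    return population[0]

dataset = [{"input": "some article text", "expected": "(Merkel, led, CDU)"} for _ in range(10)]
print(evolve("Extract political relations from the text.", dataset))
```

In a production setup along the lines Colvin described, the winning prompt would then be pushed out as a managed variable, taking effect without a redeploy.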

From Observation to Evaluation

Colvin drew a distinction between AI observability and evaluation. While observability tools can provide insights into what an agent is doing, true optimization requires a framework for evaluating its performance against specific goals. He argued that standard logging and metrics are insufficient for this purpose. Pydantic’s approach, as demonstrated in the talk, involves creating a ‘golden dataset’ – a set of high-quality, curated examples with known correct outputs. This dataset then serves as the benchmark against which agent performance is measured. By running agents against this dataset and analyzing metrics like accuracy, precision, and recall, developers can identify areas for improvement.
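As a concrete illustration of evaluating against a golden dataset, the sketch below computes micro-averaged precision and recall for a relation-extraction task. The two golden cases and the `extract_relations` stub are invented for the example; they are not from the talk or from Pydantic’s tooling.

```python
# Golden dataset: curated inputs with known-correct relation triples.
GOLDEN = [
    {
        "text": "Angela Merkel led the CDU from 2000 to 2018.",
        "relations": {("Angela Merkel", "led", "CDU")},
    },
    {
        "text": "Emmanuel Macron founded En Marche in 2016.",
        "relations": {("Emmanuel Macron", "founded", "En Marche")},
    },
]

def extract_relations(text: str) -> set[tuple[str, str, str]]:
    # Placeholder for the agent under test; a real implementation
    # would call the model with the current prompt.
    return set()

def evaluate(task) -> dict[str, float]:
    # Micro-averaged precision/recall over all golden cases.
    tp = fp = fn = 0
    for case in GOLDEN:
        predicted = task(case["text"])
        expected = case["relations"]
        tp += len(predicted & expected)
        fp += len(predicted - expected)
        fn += len(expected - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

print(evaluate(extract_relations))
```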

Practical Demonstration and Key Takeaways

Colvin showcased a practical example involving the extraction of political relations from Wikipedia pages. He demonstrated how the system uses GEPA to evolve prompts that accurately identify these relationships, comparing the performance of different prompt strategies. The evaluation process, facilitated by Logfire’s instrumentation, provides detailed metrics that highlight the effectiveness of each prompt. He also touched upon the importance of having a robust data pipeline, including mechanisms for caching and efficient data loading, to support these iterative optimization cycles.
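The talk did not show the pipeline code itself, but a typical pattern for the caching Colvin mentioned is an on-disk cache keyed by URL, so repeated evaluation runs don’t re-fetch the same Wikipedia pages. A minimal sketch, assuming plain standard-library HTTP:

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path(".wiki_cache")
CACHE_DIR.mkdir(exist_ok=True)

def fetch_page(url: str) -> str:
    """Fetch a page, reusing an on-disk copy so repeated evaluation
    runs don't re-download the same articles."""
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():
        return path.read_text()
    req = urllib.request.Request(url, headers={"User-Agent": "relation-extraction-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    path.write_text(body)
    return body
```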

The session underscored the growing maturity of AI development, moving from theoretical possibilities to practical, production-ready solutions. By combining advanced optimization techniques like GEPA with powerful observability and evaluation tools, developers can unlock the full potential of AI agents.
