The core problem was not average quality but variance. Production traffic has spikes, tool outages, and adversarial users. If every request must depend on one model with one latency curve and one pricing curve, your tail behavior will dominate your user experience. In practice, your p95 and p99 latencies are what people remember.
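A toy simulation makes the tail argument concrete. The numbers below are illustrative assumptions, not measurements: a hypothetical primary model that is fast 98% of the time but occasionally very slow, compared against a sketch of a timeout-plus-fallback route that caps the slow path.

```python
import random

random.seed(0)

def primary_latency():
    # Hypothetical single-model latency (seconds): usually fast, rarely very slow.
    return 0.4 if random.random() > 0.02 else 8.0

def with_fallback(timeout=1.0, fallback_latency=0.9):
    # Sketch of a routed request: if the primary exceeds the timeout,
    # a cheaper fallback model answers instead (hypothetical numbers).
    t = primary_latency()
    return t if t <= timeout else timeout + fallback_latency

def percentile(samples, p):
    # Nearest-rank percentile over a sorted copy of the samples.
    s = sorted(samples)
    return s[int(p / 100 * (len(s) - 1))]

single = [primary_latency() for _ in range(10_000)]
routed = [with_fallback() for _ in range(10_000)]

print(percentile(single, 50), percentile(single, 99))  # median fine, p99 = full slow path
print(percentile(routed, 50), percentile(routed, 99))  # p99 capped near timeout + fallback
```

The median barely changes between the two setups; only the tail does, which is exactly why averages hide this failure mode.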

This is one reason operational guidance like NIST’s AI Risk Management Framework ends up mattering in agent design: it pushes teams to treat reliability, monitoring, and governance as first-class concerns, not post-launch cleanup. Once you frame agents as risk-bearing systems, single-model centralization starts to look like technical debt you are knowingly incurring.

I have also found that single-model setups make incident response slower. If model quality drops, is it a model update, a prompt regression, retrieval drift, tool contract breakage, context truncation, or an evaluation blind spot? With one giant pathway, everything is coupled, and coupling is expensive during incidents.
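One way to buy back that decoupling is stage-level instrumentation, so a regression points at one stage rather than the whole pipeline. This is a minimal sketch, assuming a hypothetical retrieval → prompt-build → model-call pipeline; the stand-in bodies are placeholders, not a real agent framework.

```python
import time
from contextlib import contextmanager

metrics = {}

@contextmanager
def stage(name):
    # Record per-stage latency under its own key; during an incident,
    # the stage whose numbers moved is the stage to investigate.
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics.setdefault(name, []).append(time.perf_counter() - start)

# Hypothetical pipeline; each stage boundary is a fault-isolation point.
with stage("retrieval"):
    docs = ["doc-1"]            # stand-in for a retrieval call
with stage("prompt_build"):
    prompt = f"Answer using: {docs}"
with stage("model_call"):
    answer = prompt.upper()     # stand-in for the model call

print(sorted(metrics))
```

The point is not the timing code itself but the boundaries: once each suspect in the list above has its own metric stream, "is it retrieval drift or a prompt regression?" becomes a query instead of an investigation.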