Why AI Agents Fail in Production (Even When Benchmarks Look Great)
I came across this interesting paper that digs into something many of us working with AI agents have been feeling: there is a big gap between how agents perform on benchmarks and how they actually behave in real-world environments.
What's the Paper About?
The paper is called “Measuring Agents in Production.”
The study examines deployed agent systems across multiple industries including financial services, insurance, healthcare, and customer support, and looks at what actually matters when these systems are used by real people in real workflows.
The Disconnect
Academic benchmarks focus on whether an agent can complete a task in a clean, controlled environment. But in production, success depends on how well the agent performs in messy, real-world workflows where reliability, safety, and integration matter far more than benchmark scores.
Suddenly, the performance questions look very different:
- How well does it integrate with existing systems and tools?
- What happens when it hits unexpected scenarios or ambiguous inputs?
- How much human oversight is required to ensure correctness?
- Are we seeing real business impact, not just a checkmark on task accuracy?
- Can it operate reliably and safely under real workload conditions?
Why This Matters
If you have ever taken an AI agent from a controlled test into a real deployment, you have probably seen how quickly things can fall apart. The paper highlights this gap with detailed case studies, providing practical insights into the conditions that determine whether agents succeed or fail once real users and real workflows are involved.
Ultimately, evaluating agents cannot stop at whether they completed a task. Success in production depends on whether the agent performs safely, consistently, and in a way that drives genuine business value. As agents move into mission-critical roles, the ability to measure, monitor, and trust their behavior becomes essential, not optional.
Key Takeaways
The paper is structured around four main research questions. Here's what they found:
1. What Are The Applications, Users, and Requirements of Agents?
AI agents are now deployed in real, operational environments across a wide range of industries. They primarily support humans, not replace them, by accelerating workflows and reducing routine labor.
Where and how they are used:
- Finance, insurance, healthcare, tech, legal, research, IT operations
- Embedded within internal workflows and enterprise systems
- Serve human operators who remain accountable for final decisions
What organizations actually care about:
- Throughput and productivity gains
- High-quality, correct results in complex domains
- Compliance and alignment with business rules
- Smooth integration with existing tech stacks
What they care less about:
- Real-time latency (minutes are often acceptable)
- Fully autonomous decision-making
The goal isn't "autonomous agents". It's augmenting human performance at scale.
2. What Models, Architectures, And Techniques Are Used To Build Agents?
Real-world agents are far simpler and more controlled than what academic narratives suggest.
Model and development choices:
- Off-the-shelf frontier models selected for strong performance on complex tasks without additional fine-tuning
- Manual prompt construction remains dominant due to higher controllability and lower data/engineering burden
- Prompts often include extensive business rules and domain context, sometimes exceeding tens of thousands of tokens
Autonomy and workflow patterns:
- Most agents restrict execution to ten or fewer steps before human intervention to ensure reliable supervision
- Workflows are well-scoped and static, using predefined tools and decision paths rather than open-ended planning
- Autonomy intentionally constrained to manage risk, maintain predictability, and simplify debugging
System architecture:
- Custom in-house orchestration is preferred over third-party agent frameworks, giving tighter control and avoiding dependency bloat that could affect reliability or security
- Security and privacy requirements enforced directly within execution environments and tool access
Production agents operate as LLM-powered workflow engines, not open-ended planners.
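
To make that shape concrete, here is a minimal sketch of a step-capped workflow loop with a predefined tool registry, reflecting the pattern described above. The `call_llm` stub, the tool names, and the ten-step cap are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a step-capped workflow loop with a predefined tool
# registry. call_llm() is a stand-in for whatever model API you use;
# the tool names and the ten-step cap are illustrative assumptions.

MAX_STEPS = 10  # hard cap before the run is handed to a human

def lookup_policy(query: str) -> str:
    """Placeholder tool: fetch a business rule or policy snippet."""
    return f"policy text for {query!r}"

def draft_reply(context: str) -> str:
    """Placeholder tool: draft a response for human review."""
    return f"draft based on: {context}"

TOOLS = {"lookup_policy": lookup_policy, "draft_reply": draft_reply}

def call_llm(prompt: str) -> dict:
    """Stand-in for a model call returning a structured next action."""
    return {"action": "finish", "output": "stub answer"}

def run_workflow(task: str) -> dict:
    history = [f"TASK: {task}"]
    for step in range(MAX_STEPS):
        decision = call_llm("\n".join(history))
        if decision["action"] == "finish":
            return {"status": "done", "output": decision["output"], "steps": step + 1}
        tool = TOOLS.get(decision["action"])
        if tool is None:
            # Unknown action: stop and escalate instead of improvising.
            return {"status": "needs_human", "reason": "unrecognized action", "history": history}
        history.append(f"{decision['action']} -> {tool(decision.get('input', ''))}")
    # Step budget exhausted: hand the run to a human reviewer.
    return {"status": "needs_human", "reason": "step limit reached", "history": history}

print(run_workflow("summarize the open support ticket"))
```

The point of the sketch is the shape, not the details: a fixed menu of tools, a bounded loop, and an explicit "needs_human" exit instead of open-ended planning.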
3. How Are Agents Evaluated for Deployment?
Evaluation in production relies heavily on human oversight because standardized benchmarks do not exist for most real business tasks. Automated techniques like LLM-as-a-judge are used to speed up review, but they complement human validation rather than replace it.
Primary evaluation mechanisms:
- Human-in-the-loop review is the default safeguard for correctness and safety
- Escalation paths route uncertain outputs to domain experts
- Live monitoring tracks quality changes as users interact with the system
Practical evaluation strategies:
- LLM-as-a-judge used to triage outputs quickly, but always paired with human verification (see the sketch after this list)
- Automated checks catch structural or policy violations before reaching users
- Human auditing on a subset of interactions to detect subtle or evolving issues
- Custom evaluation datasets are built gradually from real interactions because ground truth examples rarely exist in advance. Creating them requires domain experts to label outputs, define correct behavior, and review edge cases. This takes significant time and effort, so these datasets expand slowly and never fully capture all situations the agent will encounter in production.
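
To illustrate the LLM-as-a-judge triage mentioned above, here is a minimal sketch of how outputs might be routed: weak scores go to human review, and a random slice of passing outputs is still audited by a person. The `judge_output` stub, the 0.7 threshold, and the 10% audit rate are assumptions for illustration, not values from the paper.

```python
import random

# Sketch of LLM-as-a-judge triage: the judge routes obviously weak
# outputs to human review, and a random sample of "passing" outputs
# is still audited by a human. judge_output() is a stub for a real
# model call; the threshold and audit rate are illustrative.

PASS_THRESHOLD = 0.7
AUDIT_RATE = 0.10

def judge_output(task: str, output: str) -> float:
    """Stand-in for an LLM judge returning a 0..1 quality score."""
    return 0.9  # stub

def triage(task: str, output: str) -> str:
    score = judge_output(task, output)
    if score < PASS_THRESHOLD:
        return "human_review"   # judge is unsure or unhappy
    if random.random() < AUDIT_RATE:
        return "human_audit"    # spot-check even "good" outputs
    return "auto_accept"        # ship, but keep the log

print(triage("summarize claim #123", "draft summary text"))
```

Note that nothing here replaces the human: the judge only decides how quickly a person needs to look.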
What success means in real deployments:
- Work gets done faster and scales across teams without adding headcount
- Manual effort and repetitive tasks are reduced, freeing people to focus on higher value work
- Outcomes improve key business metrics such as cost, throughput, or customer experience, not just accuracy scores
- Users trust the system because it behaves consistently, safely, and aligns with established rules and expectations
Evaluation is not pass/fail. It is ongoing assurance of trustworthiness.
4. What Are the Top Challenges in Building Production Agents?
Reliability remains the central challenge in every stage of deployment. Teams focus less on pushing model capability and more on ensuring predictable, correct, and secure behavior in real workflows.
Technical and operational barriers:
- Performance often breaks down when the agent sees unfamiliar inputs or workflow variations
- Correctness is hard to verify when feedback arrives days or months later
- Non-deterministic outputs make traditional CI/CD and regression testing ineffective
- Integration with legacy systems and enterprise infrastructure adds significant engineering overhead
Safety and compliance constraints:
- Action permissions are tightly restricted through sandboxed environments and limited access scopes
- Human review required for decisions with financial, legal, or customer impact
- Extensive logging and monitoring needed to quickly catch failures and prevent cascading issues
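
In code, those constraints often reduce to an explicit allowlist of tools plus an audit log around every call. The sketch below is a generic illustration of that idea with made-up tool names and scopes; it is not the paper's implementation.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.audit")

# Illustrative permission scope: which tools this agent may call at all.
ALLOWED_TOOLS = {"read_ticket", "draft_reply"}  # no write/refund tools

def call_tool(agent_id: str, tool: str, args: dict):
    """Gate every tool call through a scope check and an audit log."""
    if tool not in ALLOWED_TOOLS:
        log.warning("BLOCKED %s tried %s with %s", agent_id, tool, args)
        raise PermissionError(f"{tool} is outside this agent's scope")
    log.info("%s %s called %s(%s)",
             datetime.now(timezone.utc).isoformat(), agent_id, tool, args)
    return dispatch(tool, args)  # hypothetical dispatcher to the real tool

def dispatch(tool: str, args: dict):
    """Stub for the actual tool implementations."""
    return {"tool": tool, "args": args, "result": "stub"}

call_tool("claims-agent-7", "read_ticket", {"id": 123})
```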
Organizational and user factors:
- Regulated industries have very low tolerance for mistakes, which limits how much autonomy agents are allowed to have
- Stakeholders expect outputs that can be explained, traced, and audited, not black box decisions
- Trust and accountability determine whether adoption scales beyond pilots
Companies don’t eliminate unreliability. They contain it.
What Should You Actually Do?
The paper lays out some practical recommendations:
1. Measure more than task completion
Don't just track if the agent finished the task. Measure speed, user satisfaction, how it handles edge cases, and whether it's delivering actual business value. Looking at "success rate" alone doesn't tell you much.
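
One lightweight way to do this is to log a richer record per run instead of a single pass/fail flag. The fields below are my illustration of what such a record might contain, not a schema from the paper.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Illustrative per-run record: tracks more than "did it finish".
@dataclass
class AgentRunRecord:
    run_id: str
    task_type: str
    completed: bool                  # the usual benchmark-style signal
    latency_seconds: float           # speed, even if minutes are acceptable
    escalated_to_human: bool         # how often oversight kicked in
    edge_case: bool                  # flagged unusual or ambiguous input
    user_rating: Optional[int]       # 1-5 if the operator left feedback
    business_outcome: Optional[str]  # e.g. "claim routed", "ticket closed"

record = AgentRunRecord("run-001", "claims_triage", True, 84.2,
                        escalated_to_human=False, edge_case=False,
                        user_rating=4, business_outcome="claim routed")
print(asdict(record))
```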
2. Build real-time monitoring from day one
You need to see what your agent is doing in production, not just in testing. Monitor failures, unexpected behaviors, and any decline in performance over time. I've learned this the hard way. Issues that never showed up in testing become obvious once real users start interacting with the system.
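
A minimal version of this is a rolling window over recent runs with an alert threshold; in a real deployment these counters would feed whatever observability stack you already run. The window size and threshold below are arbitrary illustrative values.

```python
from collections import deque

# Toy production monitor: alert when the failure rate over the last
# N runs crosses a threshold. Window size and threshold are arbitrary.
class FailureRateMonitor:
    def __init__(self, window: int = 200, alert_threshold: float = 0.05):
        self.outcomes = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, succeeded: bool) -> None:
        self.outcomes.append(succeeded)
        if len(self.outcomes) == self.outcomes.maxlen and self.failure_rate() > self.alert_threshold:
            self.alert()

    def failure_rate(self) -> float:
        return 1 - sum(self.outcomes) / len(self.outcomes)

    def alert(self) -> None:
        # In practice: page someone or open an incident, not just print.
        print(f"ALERT: failure rate {self.failure_rate():.1%} over last {len(self.outcomes)} runs")

monitor = FailureRateMonitor(window=10, alert_threshold=0.2)
for ok in [True] * 7 + [False] * 3:
    monitor.record(ok)
```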
3. Design for human oversight
For decisions that affect finance, healthcare, or customers, human supervision is essential. Agents should not operate independently in these high-impact situations. Clear checkpoints must exist where humans can review and approve actions before they continue.
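
In practice this often looks like a simple gate in front of high-impact actions: the agent proposes, a human approves. The sketch below uses a console prompt as a stand-in for a real review queue or approvals UI, and the action names are made up.

```python
# Sketch of a human-approval checkpoint in front of high-impact actions.
# The categories and the input() prompt are placeholders for a real
# review queue, ticketing tool, or approvals UI.

HIGH_IMPACT = {"issue_refund", "update_policy", "send_external_email"}

def execute(action: str, payload: dict) -> None:
    print(f"executing {action} with {payload}")  # stub for the real side effect

def propose_action(action: str, payload: dict) -> str:
    if action in HIGH_IMPACT:
        answer = input(f"Approve {action} {payload}? [y/N] ").strip().lower()
        if answer != "y":
            return "rejected_by_human"
        execute(action, payload)
        return "approved_and_executed"
    execute(action, payload)
    return "auto_executed"
```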
4. Create feedback loops
Your agent shouldn't stay the same after you deploy it. Build systems that learn from real usage and get better over time, while keeping things safe. This is where the real improvement happens.
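
A simple starting point is persisting every human correction next to the agent's original output, so the pairs accumulate into an evaluation and regression set over time. The file name and record fields below are illustrative assumptions.

```python
import json
from pathlib import Path
from typing import Optional

# Append human corrections to a JSONL file that doubles as a growing
# evaluation/regression set. File name and fields are illustrative.
FEEDBACK_LOG = Path("agent_feedback.jsonl")

def record_feedback(task: str, agent_output: str, human_correction: Optional[str]) -> None:
    entry = {
        "task": task,
        "agent_output": agent_output,
        "human_correction": human_correction,  # None means accepted as-is
        "accepted": human_correction is None,
    }
    with FEEDBACK_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_feedback("summarize claim #123", "draft summary text", "fixed policy number")
```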
5. Validate in the real context of use
Benchmarks rarely reflect actual workflows. Performance can change completely depending on the environment. For example, an agent that works well answering routine customer support questions may fail when applied to financial services, where strict compliance and error-sensitive decisions are required. Evaluation and testing must reflect the real systems, constraints, and users the agent will face in production.
6. Plan for operational cost and effort
Reliable deployments require investment in oversight, governance, data maintenance, and monitoring infrastructure. Teams should budget for ongoing human and compute involvement rather than treating agents as “set and forget.”
My Thoughts
This research couldn’t be more timely. As we head into 2026, more and more agents are moving from experimentation to production, where reliability, safety, and measurable business value become the ultimate success criteria.
The need for systematic evaluation and strong observability is clear. The agent is rarely the true blocker. The challenge is ensuring its behavior remains reliable, auditable, and safe once it operates inside complex workflows.
Imagine an organization running a thousand agents at once. What would give you confidence that each of them is behaving correctly?
If you’re working on bringing agents into the real world, this paper provides a clear, grounded understanding of what actually determines success. It’s a reminder that shipping an agent is not the finish line. It’s where the hard part begins.
— Royce