Most teams measuring AI agent performance are tracking the wrong things. Model accuracy on benchmarks. Prompt response latency. Token throughput. These are infrastructure metrics. They tell you whether the plumbing works. They tell you nothing about whether the agent is doing its job.

Production agent performance comes down to five metrics. Each one answers a specific operational question, has concrete benchmarks by maturity level, and changes predictably across cycles of The Recursive Loop. Here’s what to measure, what “good” looks like, and how to use the numbers to decide what to build next.

Metric 1: Task Completion Rate

What it measures: The percentage of assigned tasks the agent completes correctly without human intervention.

This is the foundational metric. If an agent can’t finish its work, nothing else matters. Task completion rate tells you whether the agent’s schemas, skills, and context are sufficient for the work it’s doing.

How to measure it: Count tasks assigned to the agent. Count tasks completed correctly (output matched expected criteria, no human override required). Divide. Track per task type, not just overall. An agent with a 90% aggregate completion rate might be at 99% on routine tasks and 40% on complex ones. The aggregate hides the problem.
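The per-type calculation above can be sketched in a few lines. This is a minimal illustration, assuming tasks are logged as (task type, completed-without-intervention) pairs; the log format and sample data are hypothetical, not from any specific system.

```python
from collections import defaultdict

# Illustrative task log: (task_type, completed_correctly_without_intervention)
task_log = [
    ("routine", True), ("routine", True), ("routine", True), ("routine", False),
    ("complex", True), ("complex", False), ("complex", False),
]

def completion_rates(log):
    """Return completion rate per task type plus the aggregate rate."""
    totals, completed = defaultdict(int), defaultdict(int)
    for task_type, ok in log:
        totals[task_type] += 1
        completed[task_type] += ok
    per_type = {t: completed[t] / totals[t] for t in totals}
    aggregate = sum(completed.values()) / sum(totals.values())
    return per_type, aggregate

per_type, aggregate = completion_rates(task_log)
# per_type shows routine at 75% and complex at 33%; the aggregate (57%) hides the gap
```

Note that the aggregate alone would mask the weak task type, which is exactly why the per-type breakdown matters.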

Benchmarks by maturity:

  • Month 1 (cycle 1): 60-70% completion rate. Normal. The agent is operating on initial schemas and skills that haven’t been refined by production data. The 30-40% that doesn’t complete becomes the pattern log for the first LEARN phase.
  • Month 3 (cycle 3): 80-85%. Three cycles of The Recursive Loop have addressed the highest-frequency gaps. Routine tasks are reliable. Complex tasks still generate flags and escalations.
  • Month 6 (cycle 6): 90-95%. The agent handles the vast majority of its domain. The remaining 5-10% are genuine edge cases: rare scenarios, novel situations, tasks that legitimately require human judgment.

Red flag: Completion rate plateaus below 80% after three cycles. This means the LEARN phase is broken; agents are surfacing patterns but nobody is encoding them. Or it means the domain is poorly bounded; the agent is being asked to do things outside its defined scope.

Metric 2: Human Intervention Rate

What it measures: The percentage of tasks where a human had to step in: to correct an agent’s output, override a decision, or complete a task the agent couldn’t finish.

This is the single best proxy for agent quality. Task completion rate tells you the agent finished the job. Intervention rate tells you the agent finished it correctly. An agent can “complete” a task by producing output that a human then has to review and fix. That’s not completion; that’s generating work.

How to measure it: Log every instance where a human modifies agent output, overrides an agent decision, or completes a task that was escalated. Categorize interventions: corrections (agent was wrong), completions (agent couldn’t finish), and overrides (agent’s decision was valid but human chose differently). Each category tells you something different.

  • Corrections mean the skills or schemas are inaccurate. Fix the encoding.
  • Completions mean the skills are incomplete. Add coverage.
  • Overrides might mean governance is miscalibrated, or the human disagreed with a correct decision. Investigate before changing anything.

Benchmarks by maturity:

  • Month 1: 25-35% intervention rate. Expected. The team is actively monitoring a new system and intervening more than strictly necessary. Some interventions are corrections; many are the team building trust.
  • Month 3: 12-18%. Trust is established for routine tasks. Interventions concentrate on complex scenarios and edge cases. The team intervenes on genuine gaps, not on anxiety.
  • Month 6: 5-10%. Interventions are rare and usually involve novel situations the agent hasn’t encountered. At this level, each intervention is a signal, a specific gap to encode in the next BUILD cycle.

Red flag: Intervention rate above 20% after month 3. Either the agent’s domain is too broad, its skills are too shallow, or the governance model requires human approval on tasks the agent should handle autonomously. Review the intervention log to identify which category dominates.

Metric 3: Time-to-Resolution

What it measures: Elapsed time from task initiation to task completion, compared against the human baseline for the same task type.

Time-to-resolution is the metric that makes the business case. If an agent handles tasks in 18 minutes that used to take 4.2 hours, that’s a number executives understand. It translates directly to throughput, capacity, and cost.

How to measure it: Timestamp task assignment and task completion. Calculate elapsed time. Compare to the historical human baseline for the same task type. Express as both absolute time (18 minutes vs. 4.2 hours) and ratio (14x faster). Track the trend over time; early cycles should show improvement as skills get refined and context gets richer.
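The absolute-time and ratio expressions above reduce to a small calculation. A minimal sketch, assuming each task carries assignment and completion timestamps; the timestamps here are invented for illustration.

```python
from datetime import datetime, timedelta

def time_to_resolution(assigned_at, completed_at, human_baseline):
    """Elapsed agent time plus the speedup ratio versus the human baseline."""
    elapsed = completed_at - assigned_at
    ratio = human_baseline / elapsed  # timedelta division yields a plain float
    return elapsed, ratio

assigned = datetime(2025, 1, 6, 9, 0)
completed = assigned + timedelta(minutes=18)
elapsed, ratio = time_to_resolution(
    assigned, completed, human_baseline=timedelta(hours=4.2)
)
# elapsed: 18 minutes against a 4.2-hour baseline, i.e. a 14x speedup
```

Tracking the ratio per task type over cycles is what surfaces the trend the red flag below depends on.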

Benchmarks by maturity:

  • Month 1: 40-60% of human time for routine tasks. Agents handle straightforward scenarios quickly but stall on anything requiring judgment or cross-referencing. Some tasks take longer than humans because the agent lacks the shortcuts experienced employees have internalized.
  • Month 3: 20-40% of human time for routine tasks. Business-as-Code artifacts are rich enough that agents handle most decisions without stalling. Complex tasks are still 60-80% of human time.
  • Month 6: 10-25% of human time for routine tasks. The agent’s accumulated context means it processes faster than even experienced employees on defined workflows. Complex tasks are at 30-50% of human time.

Red flag: Time-to-resolution isn’t improving between cycles. This means the BUILD phase isn’t encoding the right things. Agents are completing tasks, but they’re not getting faster because the skills don’t address the bottlenecks. Dig into where agents spend their time during task execution; the delays reveal which skills need refinement.

Metric 4: Cost Per Action

What it measures: The total cost to execute one agent action: token usage, API calls, compute, and amortized maintenance.

Every agent action costs money. LLM tokens, tool API calls, compute for processing, and the human time to maintain the system. Cost per action determines whether the agent is economically viable, and whether it stays viable as you scale.

How to measure it: Track per-task costs: tokens consumed (input and output), external API calls, compute time. Add amortized maintenance cost (the human hours spent on weekly LEARN cycles, schema updates, and skill refinements, divided across tasks handled). Express as a single per-action cost.

What “good” looks like: Cost per action should be 10-30% of the fully loaded human cost for the same task. A task that costs a $75/hour employee 30 minutes ($37.50 loaded) should cost the agent $3.75-$11.25 per execution, including all infrastructure and maintenance overhead.
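The worked example above can be expressed as a calculation. The cost inputs below are invented placeholders chosen to land at the bottom of the 10-30% band; only the $37.50 human cost comes from the text.

```python
def cost_per_action(token_cost, api_cost, compute_cost,
                    monthly_maintenance_hours, hourly_rate, monthly_tasks):
    """Per-task cost including amortized human maintenance time."""
    maintenance = (monthly_maintenance_hours * hourly_rate) / monthly_tasks
    return token_cost + api_cost + compute_cost + maintenance

# Illustrative per-task inputs, not benchmarks:
cost = cost_per_action(token_cost=1.50, api_cost=0.50, compute_cost=0.25,
                       monthly_maintenance_hours=8, hourly_rate=75,
                       monthly_tasks=400)
human_cost = 37.50            # the $75/hour, 30-minute task from the text
ratio = cost / human_cost     # 0.10 here, i.e. 10% of the human cost
```

Note that amortized maintenance ($1.50 per task in this sketch) is a large share of the total; leaving it out makes the agent look cheaper than it is.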

What to watch:

  • Token costs trending up means agent context is growing unbounded. Prune. Not every schema and skill needs to load for every task. Build context selection logic that loads only what’s relevant.
  • API call costs spiking means the agent is making redundant tool calls. Review the execution trace for unnecessary steps. A well-designed skill should specify which tools to use and when, not leave the agent to figure it out by trial and error.
  • Maintenance costs dominating means the LEARN phase is consuming too much human time relative to the tasks handled. This usually corrects as the system matures; early cycles require more human investment per pattern encoded.

Benchmark: By month 6, total cost per action (including maintenance) should be under 25% of human cost. If it’s above 50%, the agent isn’t delivering enough economic value to justify its operation at scale.

Metric 5: Business Outcome Impact

What it measures: The downstream business result of agent operation: revenue generated, costs saved, customer satisfaction improved, throughput increased.

This is the metric that actually matters. The previous four tell you whether the agent works. This one tells you whether it matters.

How to measure it: Define the business outcome before deploying the agent. A sales agent’s outcome is pipeline value influenced and deals progressed. A support agent’s outcome is resolution satisfaction scores and ticket backlog reduction. An operations agent’s outcome is throughput increase and error reduction.

Map agent activity to business outcomes. The sales agent handled 400 lead qualification tasks this month, resulting in 45 qualified leads that generated $380K in pipeline. The support agent resolved 850 tickets with a 94% satisfaction score, reducing average resolution time from 6 hours to 22 minutes. The operations agent processed 1,200 procurement requests, catching 23 policy violations that would have cost $47K in overspend.

Benchmarks by maturity:

  • Month 1: Establish the baseline. What were business outcomes before the agent? Document current state precisely; you can’t measure improvement without a starting point.
  • Month 3: 20-40% improvement in the primary business metric. Pipeline velocity up. Ticket resolution time down. Processing throughput up. These aren’t aspirational; they’re the typical range when The Recursive Loop is running.
  • Month 6: 50-100% improvement in the primary business metric. The compounding effect of six cycles means agents aren’t just handling more; they’re handling better. Skills are refined enough to capture nuance that humans sometimes miss.

Red flag: Strong operational metrics (high completion rate, low intervention rate) but flat business outcomes. This means the agent is doing work efficiently but the work doesn’t move the needle. Reassess the domain. The agent might be optimizing a workflow that doesn’t impact the metric you care about.

Reading the Metrics Together

Individual metrics mislead. The power is in combinations.

High completion + flat coverage = agents work but aren’t learning. The system handles what it already knows but isn’t expanding to new scenarios. The LEARN phase needs attention; agents are surfacing patterns that nobody is encoding.

Expanding coverage + rising errors = encoding too fast. New skills are being added without adequate testing. Each BUILD cycle should include validation: does the new skill conflict with existing ones? Does the schema change break downstream agents?

Declining time-to-resolution + expanding coverage = the system is compounding. This is the combination that signals scaling is working. Agents are getting faster at existing tasks while also handling new ones. The Recursive Loop is functioning as designed.

Low intervention rate + high cost per action = over-engineered governance. The agent rarely needs human help, but it’s burning tokens on unnecessary context loading or redundant tool calls. Simplify the execution path.

Low cost + declining business impact = diminishing domain. The agent is cheap to run but the domain it operates in matters less than it used to. Business conditions changed. Reassign the agent to a higher-impact domain or retire it.

Using Metrics to Decide What to Build Next

Every LEARN cycle produces more patterns than you can encode in one BUILD cycle. Metrics tell you which patterns to prioritize.

Prioritize by intervention rate impact. If encoding a new skill would eliminate 15% of current interventions, it moves to the top of the BUILD queue. Intervention rate is the most expensive metric; every intervention costs human time and breaks the autonomy that makes agents valuable.

Prioritize by time-to-resolution bottleneck. If agents spend 60% of their task time on one step (say, cross-referencing customer history), a skill that pre-loads relevant history could cut time-to-resolution by 40%. Time savings compound across every task.

Prioritize by business outcome leverage. If the sales agent’s qualification skill is at 90% accuracy but the 10% it misses are disproportionately high-value deals, refining that skill has outsized business impact relative to its effort.

Deprioritize by frequency. A pattern that appeared three times doesn’t warrant a new skill. A pattern that appeared 50 times does. Encoding rare edge cases creates maintenance burden without proportional value.
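The triage logic above can be sketched as a simple scoring pass. The pattern names, fields, and the frequency threshold are all hypothetical; the point is the shape of the decision, not the specific numbers.

```python
# Illustrative BUILD-queue triage: filter out rare patterns, then rank the
# rest by how many interventions encoding them would eliminate.
patterns = [
    {"name": "missing-history-lookup", "frequency": 50, "interventions_avoided": 30},
    {"name": "rare-edge-case",         "frequency": 3,  "interventions_avoided": 3},
    {"name": "stale-pricing-rule",     "frequency": 22, "interventions_avoided": 12},
]

MIN_FREQUENCY = 10  # the cutoff is a judgment call, not a fixed rule

build_queue = sorted(
    (p for p in patterns if p["frequency"] >= MIN_FREQUENCY),
    key=lambda p: p["interventions_avoided"],
    reverse=True,
)
# "rare-edge-case" is filtered out; the highest-intervention pattern leads the queue
```

A real scoring function might also weight business-outcome leverage, as the qualification-skill example above suggests; this sketch keeps only the two signals that are easiest to count.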

The metrics framework isn’t a scorecard you review once a quarter. It’s the decision engine for The Recursive Loop. Every BUILD cycle starts with the same question: which encoding would improve which metric by how much? The answer comes from the numbers, not from roadmap planning sessions.

Measure these five things. Read them together. Build based on what they tell you. That’s how agent systems reach Escape Velocity: not through ambition, but through disciplined, metric-driven compounding.

Frequently Asked Questions

What's a good task completion rate?

For well-defined tasks in a mature domain, 85-95%. For complex tasks requiring judgment, 70-80% with the remainder escalated to humans. Below 60% means the agent's context is incomplete or its governance is too restrictive. Above 95% on everything means you're probably not measuring edge cases.

How do I calculate the ROI of an agent?

Time saved × cost of that time, minus agent operating costs (compute, API calls, maintenance). A sales agent that saves each rep 10 hours/week at a $75/hour loaded cost generates $3,000/month in value per rep. If the agent costs $500/month to operate, it returns six times its operating cost. Most NimbleBrain deployments show positive ROI within the first Recursive Loop cycle.
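That arithmetic fits in a short helper. A minimal sketch, assuming a four-week month; the figures match the worked example in the answer above and nothing else.

```python
def monthly_roi(hours_saved_per_week, hourly_rate, agent_monthly_cost,
                weeks_per_month=4):
    """Net monthly value created, and the value-to-cost multiple."""
    value = hours_saved_per_week * hourly_rate * weeks_per_month
    return value - agent_monthly_cost, value / agent_monthly_cost

net, multiple = monthly_roi(hours_saved_per_week=10, hourly_rate=75,
                            agent_monthly_cost=500)
# $3,000/month in value against $500 in cost: $2,500 net, a 6x multiple
```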

How often should I review agent metrics?

Daily dashboards for operational metrics (completion rate, error rate). Weekly reviews for trend analysis (intervention rate trending up or down). Monthly for business outcome metrics (revenue impact, cost savings). Quarterly for strategic assessment (is this agent still the right investment?).

Mat Goldsborough · Founder & CEO, NimbleBrain
