You’re running The Recursive Loop. Agents are operating in production. The team is encoding new skills every week. The system is, theoretically, improving. But how do you know? How do you prove it? And more importantly, how do you know when the loop is working well, when it’s stalling, and when you should push harder versus consolidate?

You need metrics. Not vanity metrics that make a dashboard look good. Operational metrics that tell you whether the system is genuinely getting better at handling real work.

Here’s the framework.

The Four Dimensions of Recursive Improvement

Recursive improvement is multi-dimensional. Tracking a single metric (accuracy, task count, error rate) gives you a flat picture of a multi-dimensional process. These four dimensions, tracked together, give you the full picture.

1. Skill Coverage

What it measures: The ratio of scenarios the system can handle to the total scenarios you’ve identified.

Formula: Scenarios handled autonomously / Total identified scenarios

Why it matters: Skill coverage is the most direct measure of system capability. It answers the question: “Of all the things our business needs agents to do, how many can they actually do?”

What good looks like:

  • Week 1: 15-20% coverage. You started with 10 skills covering the basics. The rest is unknown territory.
  • Month 1: 35-45% coverage. Four cycles of encoding have filled the highest-frequency gaps.
  • Month 3: 65-75% coverage. The system handles most common scenarios and many exceptions.
  • Month 6: 80-90% coverage. Agents handle the vast majority of operational work.

What bad looks like:

  • Coverage stalls below 50% after month 2. This usually means the team is encoding low-value patterns instead of high-frequency ones. Reprioritize based on pattern frequency.
  • Coverage jumps to 90% in the first month. This means either the domain is trivially simple or the team is undercounting total scenarios. Audit your scenario inventory.
  • Coverage decreases. This means the business changed faster than the encoding caught up. New products, new customer types, or new processes created scenarios that weren’t in the original inventory. Update the denominator and encode the new gaps.

How to track it: Maintain a scenario registry: a living document that catalogs every distinct scenario agents might encounter. Each time agents surface a new pattern, add it to the registry. Each time the team encodes a new skill, mark the corresponding scenarios as “covered.” The ratio is your skill coverage percentage.
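The registry-to-ratio calculation above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation; the `Scenario` structure and example scenario names are assumptions.

```python
# Hypothetical sketch: skill coverage computed from a scenario registry.
# Assumes each registry entry records whether a deployed skill covers it.

from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    covered: bool  # True once a skill handles this scenario autonomously

def skill_coverage(registry: list[Scenario]) -> float:
    """Scenarios handled autonomously / total identified scenarios."""
    if not registry:
        return 0.0
    covered = sum(1 for s in registry if s.covered)
    return covered / len(registry)

registry = [
    Scenario("refund under $50", covered=True),
    Scenario("refund over $50", covered=True),
    Scenario("address change", covered=True),
    Scenario("duplicate invoice", covered=False),
    Scenario("chargeback dispute", covered=False),
]
print(f"{skill_coverage(registry):.0%}")  # 3 of 5 scenarios covered -> 60%
```

Note that adding newly discovered scenarios grows the denominator, which is why coverage can legitimately decrease when the business changes.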

2. Intervention Rate

What it measures: The percentage of tasks that require human override, correction, or completion.

Formula: Tasks requiring human intervention / Total tasks processed

Why it matters: Intervention rate is the metric that matters most to operations. It answers: “How much human labor is this system actually saving?” A system that handles 1,000 tasks per week but requires intervention on 400 of them is saving less than a system that handles 500 tasks per week with intervention on 25.

What good looks like:

  • Week 1: 60-70% intervention rate. Agents escalate most things. Normal for a new system.
  • Month 1: 35-45% intervention rate. The highest-frequency patterns have been encoded. Agents handle the routine autonomously.
  • Month 3: 15-20% intervention rate. Agents handle the routine, most exceptions, and many edge cases. Humans intervene on genuinely novel situations.
  • Month 6: 5-10% intervention rate. Agents handle nearly everything. Human intervention is reserved for policy decisions, not capability gaps.

What bad looks like:

  • Intervention rate stays above 50% after month 2. Agents aren’t getting enough new skills to handle what they encounter. Check if the LEARN phase is producing actionable patterns and if the team is encoding them.
  • Intervention rate drops to 5% in the first month. Either the domain is very narrow or agents are handling things they shouldn’t be, completing tasks with low confidence instead of escalating. Audit agent confidence scores on completed tasks.
  • Intervention rate bounces between weeks: 20% one week, 40% the next. This signals inconsistent operational load. Some weeks bring routine work, others bring exceptions. Track the trend line, not week-to-week fluctuations.

How to track it: Every task gets a disposition: autonomous completion, flagged completion (agent completed with uncertainty), or human intervention. The intervention rate is human interventions divided by total tasks. Track weekly.
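The three-disposition tally can be sketched as follows. The disposition labels (`"autonomous"`, `"flagged"`, `"intervention"`) are illustrative assumptions, not a required vocabulary.

```python
# Hypothetical sketch: weekly intervention rate from task dispositions.

from collections import Counter

def intervention_rate(dispositions: list[str]) -> float:
    """Human interventions / total tasks. Flagged completions count as
    autonomous for this ratio, but are worth tracking separately."""
    if not dispositions:
        return 0.0
    counts = Counter(dispositions)
    return counts["intervention"] / len(dispositions)

# One week of task outcomes: 70 autonomous, 10 flagged, 20 interventions.
week = ["autonomous"] * 70 + ["flagged"] * 10 + ["intervention"] * 20
print(f"{intervention_rate(week):.0%}")  # 20 of 100 tasks -> 20%
```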

3. Pattern Discovery Velocity

What it measures: The number of new, actionable patterns agents surface per week.

Formula: New unique patterns surfaced / Week

Why it matters: Pattern discovery velocity is a leading indicator. It tells you whether the loop has fuel. A high discovery velocity means agents are finding lots of gaps, which means there’s plenty to encode and the system will keep improving. A declining velocity means the system is approaching the boundary of what agents can discover through normal operation.

What good looks like:

  • Week 1-4: 10-20 new patterns per week. Agents encounter a lot of unknowns. The pattern log is full.
  • Month 2-3: 8-15 new patterns per week. The high-frequency gaps are filled. Agents now discover subtler patterns: compound scenarios, edge cases, context-dependent variations.
  • Month 4-6: 5-10 new patterns per week. The system is mature. New patterns come from business changes (new products, new customer types) rather than gaps in initial encoding.
  • Ongoing: 3-5 new patterns per week. The steady state. Patterns come from business evolution, not system immaturity.

What bad looks like:

  • Velocity drops below 3 per week before month 3. The system isn’t surfacing enough patterns, which means either the operational volume is too low (agents aren’t encountering enough variety) or agents aren’t configured to surface uncertainty (they’re failing silently instead of flagging).
  • Velocity stays above 15 per week past month 3. The team isn’t encoding fast enough. Patterns accumulate but don’t get converted to skills. The LEARN phase is working but the BUILD phase is bottlenecked.
  • Velocity is zero. Agents aren’t surfacing patterns at all. This is a system design problem: agents need explicit uncertainty-surfacing behavior. If they’re not flagging what they don’t know, the loop can’t run.

How to track it: Count distinct new patterns in the weekly pattern report. Deduplicate against previously reported patterns. A pattern is “new” if it represents a scenario category the team hasn’t seen before, not a repeat of a known gap.
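The deduplication step above can be sketched like this. Representing each pattern as a normalized scenario-category string is an assumption; real pattern matching may need fuzzier comparison.

```python
# Hypothetical sketch: count new unique patterns per week, deduplicated
# against everything previously reported.

def new_patterns(this_week: list[str], seen: set[str]) -> list[str]:
    """Return patterns not seen in any prior week, ignoring in-week repeats."""
    fresh = []
    for pattern in this_week:
        if pattern not in seen:
            seen.add(pattern)       # remember it for future weeks
            fresh.append(pattern)
    return fresh

seen: set[str] = {"partial refund", "expired card"}  # prior weeks' patterns
week3 = ["expired card", "split shipment", "split shipment", "gift order"]
velocity = len(new_patterns(week3, seen))
print(velocity)  # 2 genuinely new patterns this week
```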

4. Time-to-Encode

What it measures: The elapsed time from when a pattern is discovered to when the corresponding skill or schema change is deployed in production.

Formula: Hours from pattern flagged to skill deployed

Why it matters: Time-to-encode is the loop’s cycle speed. Faster encoding means faster improvement. It’s also the metric that best reflects team capability: a team that can go from pattern to deployed skill in 4 hours is operating at a fundamentally different level than one that takes 2 weeks.

What good looks like:

  • Month 1: 24-48 hours. The team is learning the encoding process. Writing skills takes time. Schema changes require thought.
  • Month 2-3: 8-16 hours. The team has templates. They know the format. They’ve encoded enough skills that new ones follow established patterns.
  • Month 4-6: 2-8 hours. Encoding is routine. The team can go from Friday pattern review to Monday deployment without breaking stride.
  • Mature state: 1-4 hours for standard patterns. Complex patterns (new entity types, multi-skill sequences) still take 8-16 hours.

What bad looks like:

  • Time-to-encode increases after month 2. The team hit complexity they weren’t ready for. Common cause: skill interdependencies. As the system grows, new skills interact with existing ones in ways that require more testing and validation. This is normal but needs to be managed. Invest in skill testing infrastructure.
  • Time-to-encode stays above 48 hours past month 2. The encoding process has too much friction. Too many approvals, too much review, too much ceremony. Simplify the path from pattern to deployed skill. In a well-run loop, encoding a standard skill should be a single-person, single-session task.
  • Time-to-encode is under 1 hour from the start. The team is writing shallow skills: pattern matching without real decision logic. These skills handle the surface scenario but break on variations. Audit skill quality alongside encoding speed.

How to track it: Timestamp when the pattern enters the review queue. Timestamp when the corresponding skill passes validation and deploys. The difference is your time-to-encode.
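The two-timestamp calculation is simple enough to sketch directly. The timestamp format here is an assumption; use whatever your queue and deploy tooling already record.

```python
# Hypothetical sketch: time-to-encode from two timestamps per pattern.

from datetime import datetime

def hours_to_encode(flagged_at: str, deployed_at: str) -> float:
    """Elapsed hours from pattern entering the review queue to skill deploy."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(deployed_at, fmt) - datetime.strptime(flagged_at, fmt)
    return delta.total_seconds() / 3600

# Pattern flagged Friday morning, skill deployed that evening.
print(hours_to_encode("2024-05-03 09:00", "2024-05-03 21:30"))  # 12.5
```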

The Dashboard: What to Put on the Wall

Build a weekly dashboard with these four metrics. One page. No decoration. Numbers and trend lines.

| Metric            | This Week | Last Week | Trend (4-week)  | Target                  |
|-------------------|-----------|-----------|-----------------|-------------------------|
| Skill Coverage    | 58%       | 54%       | Up 4%/week      | 80% by month 4          |
| Intervention Rate | 28%       | 33%       | Down 5%/week    | Below 15% by month 3    |
| Pattern Discovery | 12        | 14        | Steady at 12-15 | Steady or increasing    |
| Time-to-Encode    | 12 hours  | 18 hours  | Down 3 hrs/week | Under 8 hours by month 3 |

The trend column is more important than the current value. A skill coverage of 40% that’s climbing 5% per week is healthier than 70% coverage that’s been flat for a month. Trends tell you whether the loop is producing results. Current values tell you where you are. You need both.
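One simple way to compute the trend column is the average week-over-week change across the last four readings. This is one reasonable choice among several (a regression slope would also work); the sample values are illustrative.

```python
# Hypothetical sketch: the 4-week trend as average week-over-week change.

def four_week_trend(readings: list[float]) -> float:
    """Average weekly change across the most recent four weekly readings."""
    recent = readings[-4:]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    return sum(deltas) / len(deltas)

coverage = [0.46, 0.50, 0.54, 0.58]  # last four weeks of skill coverage
print(f"{four_week_trend(coverage):+.1%}/week")  # +4.0%/week
```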

When to Accelerate the Loop

Accelerate (invest more team time in encoding, increase pattern review frequency, or add encoding capacity) when you see these signals:

Pattern backlog growing. Agents surface 15 patterns per week but the team only encodes 5. The backlog means the system could improve faster than it’s currently improving. Add encoding capacity.

Time-to-encode decreasing while pattern quality stays high. The team is getting faster without cutting corners. This is the ideal time to push volume. They can handle more.

Intervention rate declining on a clear slope. The loop is working. Each cycle produces measurable improvement. Push harder while the momentum is there.

Business conditions are shifting. A new product launch, a new market, a regulatory change. These create a burst of new scenarios that agents will encounter. Get ahead of the wave by accelerating encoding during the transition period.

When to Consolidate

Consolidate (reduce the encoding pace, focus on quality and testing, clean up technical debt in skills and schemas) when you see these signals:

Skill conflicts emerging. Two skills give contradictory guidance for the same scenario. This happens when the library grows fast without coordination. Pause and resolve the conflicts before adding more.

Agent confidence on existing skills is dropping. New skills are interfering with existing ones. The system got wider but less reliable. Consolidate by testing and refining existing skills before adding new ones.

Pattern discovery velocity is below 5 per week. The system has absorbed most of the discoverable patterns in its current operational scope. Adding skills won’t improve much because there aren’t many gaps left. Shift focus to refining existing skills and extending operational scope.

Time-to-encode is increasing despite team experience. The skill library has grown complex enough that new additions require careful integration testing. This is natural at scale. Invest in tooling and testing infrastructure before pushing more volume.

The team is burning out. Weekly encoding on top of regular responsibilities can exhaust a team if the pace is too aggressive. A sustainable loop runs indefinitely. A sprint-pace loop runs for eight weeks and then the team stops doing pattern reviews entirely. Consolidate to a pace the team can maintain for a year.

What These Metrics Don’t Tell You

These four dimensions measure whether the loop is running and whether the system is improving. They don’t measure whether the improvements matter.

A system can hit 90% skill coverage while covering the wrong 90%. It can achieve a 5% intervention rate while agents handle tasks at 60% quality. Quantity metrics need a quality overlay.

Add periodic quality audits: sample 20 agent-completed tasks per week and have a domain expert grade them. Are agents making the right decisions? Are they following skills accurately? Are the outcomes correct, not just the process? Quality audits catch degradation that quantity metrics miss.
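The weekly audit draw can be sketched as a seeded random sample, so the same week's audit is reproducible. The seeding-by-week-number scheme and task IDs are assumptions for illustration.

```python
# Hypothetical sketch: draw a reproducible weekly audit sample of
# agent-completed tasks for expert grading.

import random

def audit_sample(task_ids: list[str], n: int = 20, week: int = 0) -> list[str]:
    """Sample up to n tasks; seeding by week number makes the draw repeatable."""
    rng = random.Random(week)
    return rng.sample(task_ids, min(n, len(task_ids)))

completed = [f"task-{i}" for i in range(500)]  # this week's completed tasks
sample = audit_sample(completed, n=20, week=12)
print(len(sample))  # 20 tasks for the domain expert to grade
```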

Context Engineering isn’t just about adding more skills; it’s about adding the right skills with the right precision. The metrics framework tells you how fast the loop is running. Quality audits tell you whether it’s running in the right direction.

Measure both. The four dimensions keep the loop turning. The quality audits keep it honest.

Frequently Asked Questions

What's the single most important metric to track?

Intervention rate. It captures the real question: how often do humans need to step in? A declining intervention rate means the system is genuinely improving at handling real work. All other metrics are leading indicators that predict where intervention rate is heading.

How do I know if the loop has stalled?

Three signals: pattern discovery velocity drops below 3 new patterns per week for two consecutive weeks, time-to-encode stops decreasing or starts increasing, and intervention rate plateaus for more than a month. Any one of these warrants investigation. All three together mean the loop has stopped producing value and needs to be reset.

Should I track these metrics from day one?

Track intervention rate and pattern discovery velocity from day one; they're immediately useful. Add skill coverage and time-to-encode after the second cycle, once you have enough data to calculate baselines. Don't delay deployment to build a metrics dashboard. Deploy, track what matters, and build the dashboard from operational data.
