When the Harness Trains the Model Back

I argued recently that the loop is the moat: the model is rented, and what compounds is the loop around it, where the harness improves the model and the model improves the harness. That post made the case from how the labs operate. This one makes it from the research.

Four papers put numbers on the claim. They answer the questions the strategy could not. How much does the harness actually move capability? How tight should the coupling be? And what keeps the loop from fooling itself?

Read in order, they build on each other. Each one ties the model and the harness more tightly than the last.

First, Why This Matters at All

Start with a simple, uncomfortable result.

Scaffold Effects on GAIA, a pre-registered study from June 2026, asked a narrow question. Hold the model fixed. Hold the tasks fixed. Change only the scaffold, the agent loop around the model. How much does the score move?

It tested three scaffolds (a plain ReAct loop, a multi-agent planner-actor-rater, and a planner-then-executor) across five frontier models on the GAIA benchmark.

The answer: scaffold choice alone moved measured accuracy by up to 28 points for a single model. Same weights, same questions, different harness, 28 points of difference.
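To make the measurement concrete, here is a minimal sketch of how a per-model scaffold spread could be computed. The table layout, the model and scaffold names, and every number in it are invented for illustration; none of it is the study's data or code.

```python
from collections import defaultdict

# Hypothetical accuracy table: (model, scaffold) -> accuracy in percent.
# These values are made up for illustration; they are not the study's results.
RESULTS = {
    ("model_a", "react"): 41.0,
    ("model_a", "planner_actor_rater"): 55.0,
    ("model_a", "planner_executor"): 52.0,
    ("model_b", "react"): 38.0,
    ("model_b", "planner_actor_rater"): 47.0,
    ("model_b", "planner_executor"): 44.0,
}


def scaffold_spread(results):
    """Return, per model, the gap between its best and worst scaffold score."""
    by_model = defaultdict(list)
    for (model, _scaffold), accuracy in results.items():
        by_model[model].append(accuracy)
    return {model: max(scores) - min(scores) for model, scores in by_model.items()}


spread = scaffold_spread(RESULTS)
print(spread)
```

The model and the tasks are constants here. The only thing that varies inside each group is the scaffold, so the spread is attributable to the harness alone.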

It gets sharper. The study expected stronger models to care less about the scaffold. The opposite held. The most capable model in the lineup gained the most from a better scaffold on the hard tasks. The prediction did not just miss. It pointed the wrong way.

There is a behavioral tell too. The plain ReAct loop fired roughly three times the tool calls of the structured scaffolds, yet recovered from mid-task errors least often. More flailing, less recovery.

The takeaway is blunt. A published capability score is not a property of the model. It is a property of the model and its harness, tangled together. You cannot report one without the other.

If the harness moves the number that much, then improving the harness is not housekeeping. It is capability work. That is the case for co-evolution, in one number.

Paper 1: Freeze the Model, Evolve the Harness

So how far can the harness alone take you?

AHE (Agentic Harness Engineering), from Fudan University and collaborators, answers cleanly. Keep the base model frozen. Automatically evolve the harness around it: system prompts, tools, middleware, skills, memory.

The loop will sound familiar from the first post. Run the agent, distill the traces, propose an edit, and pair every edit with a prediction the next round must confirm. AHE calls this turning each edit into a "falsifiable contract." It is the gate, built into the loop.
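A minimal sketch of that contract idea, under loud assumptions: the `Edit` type, `gated_step`, and the toy evaluator are hypothetical names of mine, not AHE's code, and the sketch checks the prediction in the same step rather than in the following round as the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Harness = Dict[str, str]  # hypothetical: component name -> configuration


@dataclass
class Edit:
    description: str
    apply: Callable[[Harness], Harness]
    predicted_gain: float  # the contract: minimum gain the edit claims


def gated_step(harness, edit, evaluate):
    """Apply an edit and keep it only if the observed gain meets its prediction."""
    baseline = evaluate(harness)
    candidate = edit.apply(harness)
    observed_gain = evaluate(candidate) - baseline
    if observed_gain >= edit.predicted_gain:
        return candidate, True   # contract confirmed: keep the edit
    return harness, False        # contract falsified: roll back


def toy_evaluate(harness):
    # Stand-in for running the agent on a task suite (invented numbers).
    return 0.70 + (0.05 if harness.get("memory") == "long_term" else 0.0)


add_memory = Edit("add long-term memory", lambda h: {**h, "memory": "long_term"}, 0.03)
rewrite_prompt = Edit("rewrite system prompt", lambda h: {**h, "prompt": "v2"}, 0.03)

harness = {"prompt": "v1"}
harness, kept_memory = gated_step(harness, add_memory, toy_evaluate)
harness, kept_prompt = gated_step(harness, rewrite_prompt, toy_evaluate)
```

The design choice worth noticing is that the edit carries its own pass condition. An edit that cannot say what it expects to improve has nothing to be checked against.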

Ten iterations took pass@1 on Terminal-Bench 2 from 69.7% to 77.0%. The model never changed. That beat the human-built Codex harness (71.9%) and two self-evolving baselines.

Two results matter more than the headline number.

The harness transfers. Frozen after evolution, it carried over to SWE-bench-Verified with the highest aggregate success while using 12% fewer tokens, and it added 5 to 10 points across three other model families. The harness learned general engineering habits, not benchmark tricks.

The gain was not in the prompt. Ablations traced the improvement to tools, middleware, and long-term memory. Evolving the system prompt alone actually made results worse. The lesson lands on the parts of the harness people tend to ignore.

Taken together, that is about as far as the harness can go on its own. It is a long way. But the model itself never got any better.

Paper 2: Co-Evolve One Piece With the Policy

Now let the model move.

INSPO, from the Cambridge Language Technology Lab, makes a small but pointed observation. When you train an agent with reinforcement learning, you give it an instruction. That instruction is usually static and hand-written. But the best instruction for a model changes as the policy improves. A prompt tuned for the model on day one is stale by day ten.

So INSPO co-evolves the instruction with the policy. It keeps a live population of instruction candidates. Reward gets attributed to each one. Weak instructions are pruned. New ones are bred by reflection on what the current policy actually does.
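The score, prune, and breed cycle can be sketched in a few lines. This is a toy under stated assumptions, not INSPO's algorithm: `evolve_instructions`, `toy_reward`, and `toy_reflect` are hypothetical, the reward is a keyword count rather than task reward from rollouts, and the policy is held fixed where the real method updates it alongside the instructions.

```python
def evolve_instructions(population, reward_fn, reflect, rounds=3, keep=2):
    """Rank instructions by reward, prune the weak, refill by reflection."""
    for _ in range(rounds):
        ranked = sorted(population, key=reward_fn, reverse=True)
        survivors = ranked[:keep]                        # prune weak candidates
        children = [reflect(ins) for ins in survivors]   # breed new candidates
        population = survivors + children
    return sorted(population, key=reward_fn, reverse=True)


HABITS = ("search", "verify", "cite")


def toy_reward(instruction):
    # Stand-in for reward attributed to an instruction under the current policy.
    return sum(word in instruction for word in HABITS)


def toy_reflect(instruction):
    # Stand-in for model reflection: add the first habit the instruction lacks.
    for word in HABITS:
        if word not in instruction:
            return instruction + " " + word
    return instruction


seed_population = ["answer the question", "search then answer", "be brief"]
best = evolve_instructions(seed_population, toy_reward, toy_reflect)[0]
print(best)
```

In a real run the reward function itself shifts as the policy trains, which is why the population has to stay live instead of being tuned once.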

On Qwen-2.5-3B it reached a 38.2% average exact-match score, beating a strong RL-with-search baseline by about 6 points, with little extra compute.

It is a narrow slice of co-evolution: one harness component, the instruction, moving in step with the model. But it proves the core point. If the model keeps learning but the harness stays fixed, the harness slowly falls out of sync with it.

Paper 3: Co-Evolve the Whole Training Harness

The deepest version lets both sides move at once.

EvoTrainer, from Alibaba's Tongyi Lab and collaborators, names a problem hiding in plain sight. Automated training usually searches for a better recipe while leaving the diagnostics fixed. But in agentic RL, a scalar reward hides diverse failure modes. If your only instrument is the score, you cannot see why the agent failed, so you cannot fix it.

EvoTrainer co-evolves the policy and the training-side harness that interprets it. When the existing metrics, analyzers, and backtests are not enough to explain a failure, it revises the diagnostics themselves. It evolves its own instruments, then keeps a memory of what worked for later runs.
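As a rough sketch of "revise the diagnostics when they cannot explain a failure": the functions, the trace fields, and the thresholds below are all hypothetical, chosen only to show the shape of the idea. EvoTrainer's actual instruments are richer than boolean checks.

```python
def explain(trace, analyzers):
    """Return the names of analyzers whose check fires on a failed trace."""
    return [name for name, check in analyzers.items() if check(trace)]


def revise_diagnostics(failed_traces, analyzers, propose_analyzer):
    """Add a new analyzer whenever the current set cannot explain a failure."""
    analyzers = dict(analyzers)  # copy so the original set is left untouched
    for trace in failed_traces:
        if not explain(trace, analyzers):
            name, check = propose_analyzer(trace)
            analyzers[name] = check
    return analyzers


# Starting instrument: only knows how to flag runs that hit a step limit.
analyzers = {"timeout": lambda t: t["steps"] >= 50}


def toy_propose(trace):
    # Stand-in for a model-written analyzer: flag repeated identical tool calls.
    return "tool_loop", lambda t: t["repeated_calls"] >= 5


failed = [
    {"steps": 60, "repeated_calls": 0},  # explained by "timeout"
    {"steps": 12, "repeated_calls": 9},  # unexplained: triggers a new analyzer
    {"steps": 20, "repeated_calls": 7},  # explained by the new analyzer
]
revised = revise_diagnostics(failed, analyzers, toy_propose)
```

The returned analyzer set is the "memory" in miniature: the third failure is explained by an instrument that did not exist before the second one was seen.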

Across math, code, and software engineering, it matched or beat human-engineered RL references under the same data and protocol. The largest gain was on long-horizon software engineering, where it exceeded the human-built baseline by 4.39 points on its SWE setting.

The line that stays with me is the framing. Autonomous training should move beyond recipe search toward joint evolution of the policy and the harness that reads it. That is the whole thesis of this post, stated as a method.

The Papers, In One View

Paper	Model weights	What evolves	The gate
Scaffold Effects (motivation)	Frozen	Nothing (measurement)	Pre-registered hypotheses
1. AHE	Frozen	Runtime harness	Each edit predicts its own result
2. INSPO	Trained	One component (the instruction)	Reward attribution, prune the weak
3. EvoTrainer	Trained	Policy plus training diagnostics	Backtests block bad branches

The coupling tightens as you read down. The discipline that holds it together never changes.

The Gate, Again

This is the same lesson the self-improving harness post ended on. None of these systems trusts the model's opinion of its own edit.

AHE forces every edit to predict its result, then checks it next round. INSPO prunes any instruction the reward does not support. EvoTrainer backtests interventions so an invalid high-scoring branch never gets promoted. Even Scaffold Effects is gated, pre-registering its hypotheses before collecting a single data point.

Co-evolution without a gate is not improvement. It is two systems drifting together and calling it progress. The gate is what makes the drift converge on something real.

One Honest Distinction

A word of caution on the word "harness," because these papers use it at two different layers.

AHE evolves the runtime harness: the prompts, tools, and memory the agent uses while it works. EvoTrainer evolves the training-side harness: the metrics and diagnostics used while the model learns. Same word, different place in the stack.

That is not nitpicking. It tells you co-evolution is happening in two places at once. The harness you ship is improving, and the harness you train with is improving, and both feed the same model.

What This Means for Builders

Report the scaffold, not just the model. A benchmark score means little on its own, since the same model can move by up to 28 points just from a different harness. So always say which harness you used. And to compare two models fairly, give them the same harness. Otherwise the harness difference leaks into the result, and you can no longer tell which model is actually better.
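One lightweight way to enforce this is to make the harness part of the score record, so a mismatched comparison fails loudly. The `Score` type, the `compare` helper, and the values below are hypothetical, a sketch of the habit rather than any benchmark's reporting format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    harness: str      # always recorded alongside the model
    benchmark: str
    accuracy: float   # percent


def compare(a: Score, b: Score) -> float:
    """Return a.accuracy - b.accuracy, refusing mismatched setups."""
    if a.harness != b.harness or a.benchmark != b.benchmark:
        raise ValueError("scores are only comparable under the same harness and benchmark")
    return a.accuracy - b.accuracy


# Invented numbers, for illustration only.
a = Score("model_a", "planner_executor", "GAIA", 50.0)
b = Score("model_b", "planner_executor", "GAIA", 47.0)
c = Score("model_b", "react", "GAIA", 38.0)

print(compare(a, b))  # same harness: a fair model-to-model comparison
# compare(a, c) raises ValueError, because the harness differs
```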

Plan for your harness to go stale. INSPO's point is that a static harness decays as the model improves. The prompt you tuned last quarter is fighting last quarter's model. Budget to evolve it, not just to write it.

Choose how tightly to couple. The three approaches are a menu, not a mandate. Freezing the model and evolving the harness (AHE) is the cheapest option, and it already transfers across models and benchmarks. Move to co-evolving weights only when the harness ceiling is in reach and you can afford to train. Start simple.

Gate before you keep. This is the rule that matters across every approach: no edit, instruction, or training branch should ship unless it passes a test that genuinely proves it improves the outcome. Without that kind of gate, the loop does not get better over time. It only drifts, while giving the false impression that progress is being made.

If you want the strategy side of this, building and owning that loop inside a company, I made that case in why the loop is the moat. This post is the evidence underneath it.

Where This Heads

The strategy and the research now point the same way. The labs run this loop because it pays. The controlled studies show why: the harness alone can move a single model's score by up to 28 points; as the model improves, the harness should evolve with it; and a gate is what ensures the gains are real, not just assumed.

So the question stops being which model you picked. It becomes how tightly your model and your harness learn from each other, and whether you can prove the loop is improving rather than just convincing itself.

Owning that loop is the moat. These papers are the reason it holds.