When the Agent Fixes Its Own Harness
I wrote earlier about harness engineering: the shift from prompt engineering to building the whole system around the model. Tools, verification loops, guardrails, memory. The line that stuck came from Mitchell Hashimoto:
"Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
That line carries a quiet assumption. The "you" is a human. A person watches the agent fail, finds the pattern, and writes the fix into the harness.
Here is the step further. What if the agent does that itself?
Two papers from June 2026 take this seriously. Both let an agent rewrite its own harness from its own mistakes. No human in the loop. The results are strong enough that this stops being a thought experiment.
Why Hand-Built Harnesses Hit a Wall
A harness is everything around the model. Prompts, tools, memory, control rules, recovery steps. It is what turns raw model output into reliable agent behavior.
The problem is that harnesses are still built by hand, and they are model-specific.
The same harness that works great for one model can be mediocre for another. Different models have different habits, different failure modes, different blind spots. One model forgets to write the output file. Another gets stuck retrying the same broken command. A third loses its environment settings between shell calls.
So every new model wants its own tuned harness. And models now ship constantly. Hand-tuning a bespoke harness for each one does not scale. The Self-Harness paper calls this "increasingly costly and untenable."
The natural question follows. If the agent already produces a detailed record of every mistake it makes, why does a human have to read those records and write the fix?
Paper One: The Agent That Patches Itself
Self-Harness, from Shanghai AI Laboratory, runs a simple loop. A fixed model improves the harness around itself, with no human engineer and no smarter external model helping out. Just the agent, its own traces, and a test gate.
The loop has three stages.
1. Weakness Mining. Run the agent on a set of tasks. Collect the traces. Group the failures by their real cause, not just the surface symptom. Two runs can both "time out" for completely different reasons, so the clustering looks at why the agent behaved the way it did, not only that it failed.
2. Harness Proposal. Feed those failure patterns back to the same model and ask it to propose small, targeted edits to its own harness. The rule is "diverse but minimal": the proposals should cover different failure mechanisms, but each edit fixes exactly one. No sweeping rewrites.
3. Proposal Validation. Test every candidate edit. Keep it only if it improves performance without making anything else worse. This is a conservative gate. An edit that helps one set of tasks but breaks another gets rejected, even if the total score goes up.
Edits that pass get merged. The new harness becomes the starting point for the next round.
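The three stages can be sketched as one loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `Trace`, `mine_weaknesses`, `improve`, and the injected `run_tasks`, `propose`, and `validate` callables are all hypothetical names, the harness is reduced to a list of text rules, and each failed trace is assumed to already carry a root-cause label.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

Harness = list[str]  # assumption: a harness reduced to a list of text rules


@dataclass(frozen=True)
class Trace:
    task_id: str
    passed: bool
    cause: str  # assumption: a root-cause label already assigned to the run


def mine_weaknesses(traces: list[Trace]) -> dict[str, list[Trace]]:
    """Stage 1: group failed runs by root cause rather than by symptom."""
    clusters: dict[str, list[Trace]] = defaultdict(list)
    for trace in traces:
        if not trace.passed:
            clusters[trace.cause].append(trace)
    return dict(clusters)


def improve(
    harness: Harness,
    run_tasks: Callable[[Harness], list[Trace]],
    propose: Callable[[str, list[Trace]], str],
    validate: Callable[[Harness, Harness], bool],
    rounds: int = 3,
) -> Harness:
    for _ in range(rounds):
        clusters = mine_weaknesses(run_tasks(harness))
        if not clusters:
            break  # nothing left to fix on this task set
        for cause, failures in clusters.items():
            # Stage 2: one small edit per failure mechanism.
            candidate = harness + [propose(cause, failures)]
            # Stage 3: merge only if the gate accepts the candidate.
            if validate(harness, candidate):
                harness = candidate
    return harness
```

In a real system, `propose` would be a call to the same model the harness wraps, and `validate` would rerun the task set. Passing them in as plain callables keeps the loop itself easy to test.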
What It Found
The team ran this on Terminal-Bench-2.0 with three models from different families. On the held-out tasks, the ones the system never saw while proposing edits, the gains were large:
| Model | Before | After |
|---|---|---|
| MiniMax M2.5 | 40.5% | 61.9% |
| Qwen3.5 | 23.8% | 38.1% |
| GLM-5 | 42.9% | 57.1% |
Same model. Same tools. Same budget. Only the harness changed. For the weakest starting point, Qwen3.5, that is a relative jump of about 60% (23.8% to 38.1%).
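The relative gain for each model can be computed directly from the table. A quick check, using only the before and after numbers above (the function name is just for illustration):

```python
# Held-out pass rates from the table above, in percent: (before, after).
results = {
    "MiniMax M2.5": (40.5, 61.9),
    "Qwen3.5": (23.8, 38.1),
    "GLM-5": (42.9, 57.1),
}


def relative_gain(before: float, after: float) -> float:
    """Relative improvement over the starting score, in percent."""
    return (after - before) / before * 100


gains = {model: relative_gain(*scores) for model, scores in results.items()}
for model, gain in gains.items():
    print(f"{model}: +{gain:.1f}% relative")
```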
The most interesting result is not the numbers. It is that each model got different fixes.
MiniMax learned to create the required output file early and stop runaway tool loops. Qwen learned to check dependencies up front and recover the artifact after a tool error instead of deleting it. GLM-5 learned to keep its environment settings alive across shell commands and move from exploring to building sooner.
The agent did not just bolt on generic "be more careful" instructions. It turned each model's specific weakness into a concrete, executable change. That is exactly what a good human harness engineer does, done automatically.
Paper Two: Evolving the Harness Like a Population
HarnessX, from the Darwin Agent Team, is more ambitious. It treats the harness as a first-class object you can compose, adapt, and evolve. Three ideas stand out.
Compose. The harness is broken into typed, swappable parts across nine dimensions: context, tools, memory, control, safety, and so on. You can drop in or pull out a part like a building block without breaking the rest. This matters because evolution needs clean edit surfaces.
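To make "typed, swappable parts" concrete, here is a small sketch. It is not HarnessX's actual API: `Part`, `StaticRule`, `Harness`, `swap`, and `build_context` are hypothetical names, and a part here only appends text to the context, which is far simpler than a real tool or memory component.

```python
from dataclasses import dataclass, field, replace
from typing import Protocol


class Part(Protocol):
    """Anything that fills one harness dimension (hypothetical interface)."""
    dimension: str

    def apply(self, context: str) -> str: ...


@dataclass(frozen=True)
class StaticRule:
    """Simplest possible part: appends one line of text to the context."""
    dimension: str
    text: str

    def apply(self, context: str) -> str:
        return f"{context}\n[{self.dimension}] {self.text}"


@dataclass(frozen=True)
class Harness:
    parts: dict[str, Part] = field(default_factory=dict)

    def swap(self, part: Part) -> "Harness":
        """Return a new harness with one dimension replaced, others untouched."""
        return replace(self, parts={**self.parts, part.dimension: part})

    def build_context(self, task: str) -> str:
        context = task
        for part in self.parts.values():
            context = part.apply(context)
        return context


base = Harness().swap(StaticRule("memory", "keep notes")).swap(
    StaticRule("safety", "confirm before deleting files")
)
edited = base.swap(StaticRule("memory", "summarize every 10 steps"))
```

Because `swap` returns a new harness instead of mutating the old one, `base` stays available as a rollback point, and an edit to the memory part cannot touch the safety part.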
Adapt. An engine called AEGIS drives the changes. Its key move is to treat harness evolution as a kind of reinforcement learning, just over text and code instead of numbers. The harness is the state. An edit is an action. The trace and its score are the reward.
That framing is useful because it predicts the failure modes. If harness evolution is RL, then it inherits RL's classic problems:
- Reward hacking. The agent edits the harness to game the scorer, like sneaking the answer into a prompt. HarnessX guards against this with a Critic stage.
- Catastrophic forgetting. A fix for one problem quietly breaks another. Guarded by a strict regression gate.
- Under-exploration. The agent only ever makes tiny safe edits and plateaus. Guarded by a Planner that deliberately considers bigger structural changes.
So the design is not ad-hoc. Each defense exists because the RL framing said that risk was coming.
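The state, action, and reward mapping, with one guard per failure mode, can be written down in a few lines. These are toy stand-ins and not how HarnessX implements its Critic, regression gate, or Planner. Every name below is hypothetical, and the critic in particular is reduced to a naive substring check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Edit:
    """The action: one proposed change to the harness."""
    description: str
    structural: bool = False  # True for larger, riskier changes


@dataclass(frozen=True)
class Outcome:
    """The reward signal: a score per task from running the harness."""
    scores: dict[str, float]


Harness = tuple[Edit, ...]  # the state: here, just the edits applied so far


def critic_rejects(edit: Edit, known_answers: set[str]) -> bool:
    """Reward-hacking guard: refuse edits that smuggle in a known answer."""
    return any(answer in edit.description for answer in known_answers)


def regresses(before: Outcome, after: Outcome) -> bool:
    """Forgetting guard: any task scoring lower than before is a regression."""
    return any(after.scores.get(task, 0.0) < score
               for task, score in before.scores.items())


def plan(candidates: list[Edit]) -> list[Edit]:
    """Exploration guard: try structural edits before small tweaks."""
    return sorted(candidates, key=lambda edit: not edit.structural)


def step(
    harness: Harness,
    edit: Edit,
    before: Outcome,
    evaluate: Callable[[Harness], Outcome],
    known_answers: set[str],
) -> tuple[Harness, Outcome]:
    """Apply one action; return the new state and reward, or the old ones."""
    if critic_rejects(edit, known_answers):
        return harness, before
    candidate = harness + (edit,)
    after = evaluate(candidate)
    if regresses(before, after):
        return harness, before
    return candidate, after
```

Note that this `step` accepts an edit that changes nothing; a stricter version would also require some task to improve.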
Evolve, together with the model. This is the part that goes furthest. HarnessX does not stop at improving the harness. It feeds the same execution traces back as training signal for the model too. The harness gets better, and the model gets better, in the same loop.
What It Found
The team tested three model families, Claude Sonnet 4.6, GPT-5.4, and Qwen3.5-9B, across five benchmarks (ALFWorld, GAIA, WebShop, τ³-Bench, and SWE-bench Verified). That gives 15 model-benchmark setups.
| Result | Value |
|---|---|
| Setups improved | 14 of 15 |
| Average gain | +14.5% |
| Peak gain | +44.0% (Qwen3.5-9B on ALFWorld) |
| Same benchmark, strongest model | +11.2% (Sonnet 4.6 on ALFWorld) |
| Near-ceiling benchmark | +1.1% (τ³-Bench) |
| Bonus from co-evolving the model | ~+4.7% |
The gains were biggest where the model was weakest. On ALFWorld, the smallest model jumped +44.0% while the strongest gained +11.2%. A good harness fills the gaps a weak model cannot fill on its own.
And the edits were not random. Most changes landed on context assembly and tools: how information reaches the model, and which tools it can reach. The harness learns to fix what is actually in the way.
The Shared Pattern
Strip away the differences and both papers describe the same discipline.
- Run the agent and record everything. The trace is the raw material.
- Find the recurring failure, not the one-off. Cluster mistakes by cause.
- Propose a small, targeted fix to the harness, tied to that specific cause.
- Test it against a regression gate. Keep it only if nothing else breaks.
- Repeat.
This is Hashimoto's loop, automated. Find a mistake, engineer a fix so it never happens again, except now the agent runs the loop on itself.
The discipline that keeps it safe is the gate. Neither paper trusts the model's opinion of its own edit. An edit ships only when a deterministic test says it did not break anything. The model gets to propose. The test decides what survives.
That split matters. It is the difference between an agent that improves and an agent that convinces itself it improved.
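A conservative gate of this kind is easy to state in code. The sketch below is an assumption about its general shape, not either paper's actual gate: it compares per-task pass/fail results before and after an edit, and `passes_gate` is a hypothetical name.

```python
def passes_gate(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """Accept an edit only if it fixes something and breaks nothing.

    `before` and `after` map task ids to pass/fail on the same task set.
    A task missing from `after` is treated as failing.
    """
    broke = [t for t, ok in before.items() if ok and not after.get(t, False)]
    fixed = [t for t, ok in before.items() if not ok and after.get(t, False)]
    return not broke and bool(fixed)


before = {"t1": True, "t2": False, "t3": False}

# Fixes two tasks but breaks one: the total rises from 1 to 2, yet it is rejected.
net_positive_but_breaking = {"t1": False, "t2": True, "t3": True}

# Fixes one task and breaks none: accepted.
clean_fix = {"t1": True, "t2": True, "t3": False}

print(passes_gate(before, net_positive_but_breaking))  # False
print(passes_gate(before, clean_fix))                  # True
```

The gate never asks the model whether its edit was good. It only looks at test results, which is what keeps the propose and decide roles separate.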
What This Means for Builders
The mindset shift compounds. Harness engineering already moved the burden from "wait for a better model" to "improve your system." Self-improving harnesses go one more step. You build the loop once, and the system improves itself from there.
The moat gets deeper. I argued before that a harness is a moat because you cannot download one. You have to build, fail, and rebuild it. A self-improving harness is a harder moat still. It is not a static asset. It is a process that keeps compounding on your own traffic and your own failures.
Co-evolution is the real prize. HarnessX's loop, where the harness and the model improve together, is where this is heading. Frontier labs already post-train models on their own harnesses. These papers show the loop can run continuously and automatically, not just at release time.
But keep the gate honest. Both papers are careful, and you should be too. These results are on benchmarks with clean verifiers. The edits are small and reversible. The authors are explicit that open-ended self-improvement on higher-stakes systems needs stronger gates than "the score went up." A self-improving system is only as safe as the test that decides what it keeps.
Where This Heads
Harness engineering moved the value from the model to the system around it. Self-improving harnesses move it again, from the system you build to the loop that keeps building it.
The best builders were never the best prompt writers. For a while, they were the best system builders. The next edge is building the system that improves its own system, and being disciplined enough to gate what it keeps.
The agent that fixes its own mistakes is no longer the goal. It is the starting point.