From Prompt Engineering to Context Engineering to Harness Engineering
Three years. Three eras. Each one redefined what it means to build with AI.
In 2023, the belief was: if I phrase this prompt right, the AI will nail it.
In 2025, it shifted to: if I give the model the right context, it will figure it out.
In 2026, we landed here: if I build the right system around the model, it will deliver reliably.
The discipline matured from wordsmithing to information design to systems engineering. Each phase moved the locus of value further from the model and closer to the system around it.
The model is no longer the differentiator. The system is.
Era One: Prompt Engineering (2022 to mid 2025)
Prompt engineering was the art of crafting the right input text to get useful output from an LLM. Zero-shot, few-shot, chain-of-thought, role prompting. The idea was simple: if you asked the question well enough, the model would give you a great answer.
How It Started
The seeds were planted in May 2020, when OpenAI launched GPT-3 with 175 billion parameters. The accompanying paper, "Language Models are Few-Shot Learners" (Brown et al.), showed that models could learn tasks from examples placed directly in the prompt, no retraining required. That was the moment prompt design became a meaningful discipline.
In January 2022, Wei et al. published the chain-of-thought prompting paper, showing that asking models to reason step by step dramatically improved performance. PaLM 540B went from 18% to 57% on the GSM8K math benchmark just by adding intermediate reasoning steps to the prompt.
Then in November 2022, ChatGPT launched. Overnight, prompt engineering went from a research technique to a mainstream skill.
Peak Hype
2023 was the peak. On Indeed, searches for "prompt engineer" went from 2 per million U.S. searches in January to 144 per million by April. "Prompt" was the runner-up to Oxford's word of the year. Salary postings for prompt engineers reached six figures.
By 2024, researchers had catalogued 58 distinct prompting techniques in a systematic survey (Schulhoff et al.). The discipline had real depth.
Why It Hit a Ceiling
For all its value, prompt engineering had structural limits.
Prompts are static and brittle. Andrew Ng described it as "writing an essay without the option to use backspace." You get one shot. The model produces a single-pass output with no ability to self-correct.
No memory, no tools, no grounding. Prompt engineering treated the model as a standalone oracle. It had no access to real-time data, no ability to use tools, no persistent memory. Everything had to fit in a single input.
Clever phrasing stopped being the bottleneck. As models improved, the gap between a "good" prompt and a "great" prompt narrowed. The real bottleneck shifted from how you ask to what information the model has access to.
Weaker models with better systems beat stronger models with better prompts. Ng demonstrated at Sequoia's AI Ascent 2024 that GPT-3.5 in an agentic workflow outperformed GPT-4 with zero-shot prompting on coding benchmarks. The system mattered more than the model.
By mid 2025, the writing was on the wall. Fast Company reported that prompt engineering as a standalone role "has all but disappeared," with strong AI prompting now treated as an expected skill rather than a dedicated job.
As Ethan Mollick put it in late 2023: "I have been saying prompt engineering is going away." He was early, but he was right.
Era Two: Context Engineering (mid 2025 to early 2026)
Context engineering is the practice of designing the full information environment that surrounds an LLM during a task. Not just the prompt, but the data, tools, memory, examples, and structure the model needs to succeed.
The shift in framing is significant. Prompt engineering asks: how do I phrase this? Context engineering asks: what does the model need to know?
One Week in June 2025
The term crystallized in a single week.
On June 19, 2025, Shopify CEO Tobi Lutke posted on X:
"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."
The post got 1.9 million views.
One week later, on June 25, Andrej Karpathy amplified the message:
"Context engineering is the delicate art and science of filling the context window with just the right information for the next step."
He added a clarification that captured the distinction perfectly: "People's use of 'prompt' tends to (incorrectly) trivialize a rather complex component. You prompt an LLM to tell you why the sky is blue. But apps build contexts (meticulously) for LLMs to solve their custom tasks."
Two days later, on June 27, Simon Willison wrote: "I think 'context engineering' is going to stick. Unlike 'prompt engineering', it has an inferred definition that's much closer to the intended meaning." He pointed out the core problem with the old term: prompt engineering "suffers from a thing where many people's inferred definition is that it's a laughably pretentious term for typing things into a chatbot."
By June 30, Phil Schmid published the first comprehensive definition: "Context Engineering is the discipline of designing and building dynamic systems that provides the right information and tools, in the right format, at the right time, to give an LLM everything it needs to accomplish a task."
In less than two weeks, the term went from Lutke's tweet to a formalized definition. By July, the first academic survey analyzing 1,400+ papers had formalized context engineering as a distinct discipline.
Karpathy's Analogy
Karpathy offered an analogy that made the concept click for engineers.
The LLM is a CPU. The context window is RAM. Your job is to be the operating system, loading working memory with exactly the right code and data for each task.
Just as an OS carefully manages what fits into RAM for efficient execution, context engineering curates the model's working memory. Too little context and the model lacks what it needs. Too much and performance degrades while costs climb.
What Changed in Practice
The focus shifted from "how you ask" to "what you provide."
RAG became standard infrastructure. Retrieval-augmented generation went from a research concept to a production requirement. Instead of hoping the model knew the answer, you fetched relevant documents and injected them into the context window.
Tool use became a first-class concern. Models gained the ability to call functions, query databases, and interact with external systems. The question was no longer just what to say to the model, but what capabilities to give it.
Memory systems emerged. Agents gained short-term memory (conversation state), long-term memory (user preferences and past interactions), and hierarchical architectures that layer memory across sessions.
Context strategies were formalized. LangChain defined four patterns: write (persist context externally), select (retrieve what is relevant via RAG), compress (summarize to stay within token limits), and isolate (separate contexts for different agents or tasks).
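What these four patterns look like in code is easier to show than to describe. Below is a minimal, framework-free sketch, not LangChain's API: `llm()` is a hypothetical stand-in for a model call, and the keyword-overlap scoring stands in for a real vector store.

```python
# Toy sketch of the four context strategies: write, select, compress, isolate.
# llm() is a hypothetical placeholder, not a real client.
import json
import pathlib

def llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"   # placeholder model call

SCRATCH = pathlib.Path("scratchpad.json")

def write(key: str, value: str) -> None:
    """Write: persist context outside the window so it survives compaction."""
    state = json.loads(SCRATCH.read_text()) if SCRATCH.exists() else {}
    state[key] = value
    SCRATCH.write_text(json.dumps(state))

def select(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Select: retrieve only the chunks relevant to this step (the RAG move)."""
    overlap = lambda d: len(set(query.lower().split()) & set(d.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def compress(history: list[str], max_chars: int = 8000) -> list[str]:
    """Compress: summarize older turns once the running context gets too large."""
    if sum(len(turn) for turn in history) <= max_chars:
        return history
    summary = llm("Summarize this conversation so far:\n" + "\n".join(history[:-5]))
    return [summary] + history[-5:]   # keep the most recent turns verbatim

def isolate(task: str, subtasks: list[str]) -> str:
    """Isolate: give each subtask its own clean context, then merge the results."""
    partials = [llm(f"Task: {task}\nSubtask: {sub}") for sub in subtasks]
    return llm("Combine these partial results:\n" + "\n".join(partials))
```

Each function is trivial on its own. The discipline is in deciding, for every step of a task, which of these moves the situation calls for.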
Industry Validation
The shift was not just a developer community trend. The institutions followed.
Gartner stated in July 2025: "Context engineering is in, and prompt engineering is out. AI leaders must prioritize context over prompts." They predicted that 40% of enterprise apps will feature task-specific AI agents by late 2026.
MIT Technology Review named context engineering one of the defining shifts of 2025 in their year end review.
LangChain's 2025 State of Agent Engineering report found that 57% of organizations now have AI agents in production, but 32% cite quality as the top barrier. The root cause in most cases was not model capability. It was poor context management.
Why It Hit a Ceiling
Context engineering solved the information problem. It did not solve the reliability problem.
Giving the model the right context is necessary but not sufficient. In production, agents also need guardrails to prevent harmful actions, verification loops to catch mistakes, error recovery to handle failures gracefully, and architectural constraints to prevent entire classes of errors.
Context is one layer. The full system needs more.
Era Three: Harness Engineering (early 2026 to present)
Harness engineering is the design and implementation of the full system that governs how an AI agent operates. Tools, architectural constraints, verification loops, feedback mechanisms, guardrails, and lifecycle management.
The harness is not the agent. It is the infrastructure that makes the agent reliable.
The Naming Moment
The concept existed before the term. Anthropic published "Effective harnesses for long-running agents" in November 2025, describing a two-agent architecture with initializer agents and coding agents, complete with permission models, hooks, context compaction, and tool dispatch. The practice was already real. The discipline just did not have a name.
That changed on February 5, 2026, when Mitchell Hashimoto, co-founder of HashiCorp and creator of Terraform, published a blog post describing his AI adoption journey. He wrote:
"I don't know if there is a broad industry accepted term for this yet, but I've grown to calling this 'harness engineering.' It is the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
Six days later, on February 11, OpenAI published "Harness engineering: leveraging Codex in an agent-first world." It documented how a three-engineer team used harness engineering to produce roughly a million lines of code with about 1,500 pull requests merged, an average throughput of 3.5 PRs per engineer per day, with zero manually typed code.
On February 18, Ethan Mollick published "A Guide to Which AI to Use in the Agentic Era", organizing his entire framework around "Models, Apps, and Harnesses." This normalized the term beyond the engineering community.
Martin Fowler's team at Thoughtworks followed with a detailed analysis. Within weeks, harness engineering had become part of the core AI engineering vocabulary.
The Analogy
Think of the model as an engine. The harness is the car.
The best engine in the world without steering, brakes, and a chassis goes nowhere useful.
What a Harness Contains
There is no single canonical list. Different practitioners frame the harness differently, and the taxonomies are still converging.
Martin Fowler and Thoughtworks propose a two-class split: guides (feedforward controls that steer the agent before it acts, like system prompts and constraint documents) and sensors (feedback controls that observe after it acts, like evals and validation loops).
OpenAI's Codex team organizes their harness around three themes: context engineering, architectural constraints, and "garbage collection" to fight entropy.
LangChain identifies seven primitives: filesystems, code execution, sandboxes, memory and search, context management, planning and self-verification, and long-horizon execution patterns.
Anthropic emphasizes tools, permission models, hooks, and multi-session memory through initializer agents and progress files.
Synthesizing across these frameworks, a production harness typically contains the following components.
1. Tools and capabilities. What the agent can access. File I/O, shell execution, code interpreters, web fetching, database queries, MCP integrations. The tool layer handles registration, schema validation, argument extraction, sandboxed execution, and result formatting.
2. Context management. Instruction files like AGENTS.md (a cross-tool open standard) and CLAUDE.md (Claude Code) that get loaded into context on agent start. RAG retrieval, context compaction, tool call offloading, and progressive disclosure via skills.
3. Memory. Short-term (the active conversation window), long-term (filesystem persistence, progress files, git history), and cross-session (auto-memory systems like Claude Code's MEMORY.md index that accumulate knowledge across runs).
4. Verification loops. Self-evaluation, pre-defined test suites, forced verification passes before exit. LangChain calls this the highest-impact pattern in harness engineering.
5. Architectural constraints. Deterministic linters, structural tests, and dependency rules. Instead of telling the agent "write good code," you mechanically enforce what good code looks like. Paradoxically, constraining the solution space makes agents more productive. When an agent can generate anything, it wastes tokens exploring dead ends. When the harness defines clear boundaries, the agent converges faster on correct solutions.
6. Permission and sandbox models. What requires human approval, isolated execution environments, reversible file edits. Claude Code's default stance is read-only until the user grants explicit approval.
7. Hooks and middleware. Inject custom logic at agent lifecycle events: before model call, after tool call, before exit. This is where loop detection, PII filtering, and pre-completion checks live. A minimal sketch of this pattern follows the list.
8. Entropy management. Agents that run periodically to find inconsistencies and fight drift. In a codebase that grows fast (as it does when agents write most of the code), contradictions accumulate. Automated cleanup agents keep the system coherent.
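To make items 4 and 7 concrete, here is a minimal sketch of a hook-based agent loop with a forced verification pass before exit. The hook registry, `model_call`, and `run_tool` are illustrative stand-ins, not any specific vendor's API.

```python
# Minimal hooks-and-verification sketch. The registry and the model/tool stubs are
# illustrative stand-ins; a real harness wires these to an actual model and tool layer.
from typing import Callable

HOOKS: dict[str, list[Callable]] = {"before_model": [], "after_tool": [], "before_exit": []}

def hook(event: str):
    """Register a function to run at a lifecycle event."""
    def register(fn: Callable) -> Callable:
        HOOKS[event].append(fn)
        return fn
    return register

def fire(event: str, state: dict) -> None:
    for fn in HOOKS[event]:
        fn(state)

def model_call(messages: list[str]) -> dict:
    # Hypothetical stand-in: a real harness calls the model API here.
    return {"type": "finish"}

def run_tool(action: dict) -> str:
    # Hypothetical stand-in: schema validation, sandboxing, and result formatting live here.
    return f"result of {action}"

@hook("before_exit")
def force_verification(state: dict) -> None:
    """Refuse the first 'done': push the agent back into the loop for one verification pass."""
    if not state["verified"]:
        state["messages"].append("Before finishing, re-read your changes and run the tests.")
        state["verified"] = True   # the next exit attempt is allowed through
        state["done"] = False      # veto this one

def agent_loop(task: str) -> dict:
    state = {"messages": [task], "done": False, "verified": False}
    while not state["done"]:
        fire("before_model", state)
        action = model_call(state["messages"])
        if action["type"] == "tool":
            state["messages"].append(run_tool(action))
            fire("after_tool", state)
        else:
            state["done"] = True
            fire("before_exit", state)   # hooks may veto the exit and continue the loop
    return state
```

The important property is that the exit path runs through the hooks: the agent cannot declare itself done until the harness agrees.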
The common thread across every framework: a harness is everything except the model. It is the infrastructure that turns raw model intelligence into a system that works reliably in production.
The Proof Point
LangChain provided the clearest empirical evidence.
Their coding agent improved from 52.8% to 66.5% on Terminal-Bench 2.0, jumping from Top 30 to Top 5. They changed nothing about the model. Same model (GPT-5.2 Codex). Different harness. Dramatically better results.
The key techniques:
- Self-verification loops. Agents would write code, re-read it, decide it looked fine, and stop. LangChain added a forced verification pass before exit. This single hook was a major factor in the 13.7-point improvement.
- Local context middleware. Agents wasted significant effort figuring out their working environment (directory structures, available tools, Python installations). LangChain now maps all of this upfront and injects it directly.
- The "reasoning sandwich." High reasoning for planning, standard reasoning for implementation, high reasoning again for verification. This strategic allocation focuses expensive compute on stages where it provides maximum value.
- Loop detection. Middleware to catch agents stuck in repetitive cycles and break them out.
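The loop-detection idea is small enough to sketch. This is not LangChain's implementation, just the shape of the pattern: watch recent tool calls, and when the same call keeps repeating, interrupt it with a corrective message instead of executing it again.

```python
# Illustrative loop detector. If the agent issues the same tool call N times in a row,
# the harness injects a nudge instead of running the tool yet again.
from collections import deque

class LoopDetector:
    def __init__(self, window: int = 3):
        self.window = window
        self.recent: deque[tuple] = deque(maxlen=window)

    def check(self, tool_name: str, args: dict) -> str | None:
        """Return a corrective message if the agent is stuck, otherwise None."""
        signature = (tool_name, repr(sorted(args.items())))
        self.recent.append(signature)
        if len(self.recent) == self.window and len(set(self.recent)) == 1:
            self.recent.clear()
            return (f"You have called {tool_name} with identical arguments "
                    f"{self.window} times in a row. Stop, summarize what you have "
                    "learned so far, and try a different approach.")
        return None

# Inside the agent loop, before dispatching a tool call (run_tool is hypothetical):
#   nudge = detector.check(action["tool"], action["args"])
#   observation = nudge if nudge else run_tool(action)
```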
Same model. Different harness. The harness was the variable that mattered.
The Mindset Shift
Hashimoto captured the core reframe:
"You stop saying 'this model is dumb.' You start saying 'my system allowed this failure mode.'"
This moves the burden away from waiting for the next model release and back to the builder. Every time the agent makes a mistake, the response is not to blame the model. It is to engineer a solution so the agent never makes that mistake again.
In practice, this produces a harness that grows organically. A missing constraint becomes a new linter rule. A forgotten step becomes a pre-completion hook. A recurring hallucination becomes a line in AGENTS.md or CLAUDE.md. A flaky action becomes a programmed tool with validation built in. Each addition is a lesson learned, encoded so the agent never has to learn it again.
The harness is not designed up front. It is accumulated, one failure at a time.
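In code, "a missing constraint becomes a new linter rule" can be as small as the sketch below. The rule itself is hypothetical (say, an agent once imported requests directly instead of the project's wrapped client); the shape is what matters: a deterministic check the harness runs after every agent edit, so the mistake is caught mechanically instead of being re-explained in a prompt.

```python
# A lesson encoded as a deterministic check rather than a prompt instruction.
# Hypothetical rule: new code must go through the project's shared HTTP wrapper,
# because an agent once bypassed it and lost retry and auth behavior.
import pathlib
import re
import sys

BANNED = re.compile(r"^\s*(import requests|from requests import)", re.MULTILINE)
ALLOWED_FILE = pathlib.Path("src/http_client.py")   # the one sanctioned wrapper

def check(changed_files: list[pathlib.Path]) -> list[str]:
    errors = []
    for path in changed_files:
        if path.suffix == ".py" and path != ALLOWED_FILE and BANNED.search(path.read_text()):
            errors.append(f"{path}: import the shared http_client wrapper, not requests directly")
    return errors

if __name__ == "__main__":
    problems = check([pathlib.Path(p) for p in sys.argv[1:]])
    for line in problems:
        print(line)
    sys.exit(1 if problems else 0)   # a non-zero exit blocks the agent from declaring 'done'
```

Run from a pre-completion hook or CI, a check like this removes the failure mode for every future session, regardless of which model is driving.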
The Pattern
Step back and the three eras form a clear arc.
| Era | Focus | Core question | Failure mode |
|---|---|---|---|
| Prompt engineering | The words you type | "How do I phrase this?" | Model misunderstands intent |
| Context engineering | The information you provide | "What does the model need to know?" | Model lacks relevant data |
| Harness engineering | The system you build | "How do I make the agent reliable?" | System allows preventable errors |
Each era moved the bottleneck further from the model and into the surrounding system. The model becomes more of a commodity with each phase. The differentiation lives in the harness.
What This Means for Builders
Prompting still matters. But it is now a subset of context engineering, which is itself a subset of harness engineering. The hierarchy is nested, not sequential. You still need to write good prompts. You also need to design good context. And you need to build good systems. Each layer contains the ones below it.
The skill set shifted toward systems engineering. The new requirement is not creative writing. It is thinking about feedback loops, verification, error recovery, and architectural constraints. It is closer to building reliable distributed systems than crafting clever sentences.
Harness engineering is a moat. Manus spent six months on five complete rewrites. LangChain spent a year on four architectures. You cannot download a harness from Hugging Face. You have to build, test, fail, learn, and rebuild. That investment creates an advantage that model improvements alone cannot overcome.
Models and harnesses are co-evolving. Frontier coding models are now post-trained on their own harnesses. Claude is trained on Claude Code. GPT-5 is trained on Codex. The harness shapes model behavior, and the model's capabilities inform harness design. This feedback loop means the gap between well-harnessed and poorly-harnessed agents will keep widening.
The mindset shift is the most important takeaway. Stop blaming the model. Start improving the system. As Hashimoto put it: engineer the harness so the agent never makes the same mistake twice.
Where This Is Heading
The value keeps moving outward. From the words in a prompt, to the information in the context, to the system around the model. What you build around the LLM now matters more than which LLM you pick.
The people who thrive in this era will not be the best prompt writers. They will be the best system builders.