The Two Techniques That Make Agentic Engineering Reliable
Everyone agrees agentic coding is getting better fast. Not everyone agrees it is getting safer. Both observations are correct.
The capability curve is steep. The risks are real. But there are two techniques that close the gap between what AI agents can do and what you can trust them to do: test-driven development and spec-driven development.
Neither idea is new. What is new is how well they work when applied to coding agents.
The Capability Curve Is Steep
The numbers tell the story.
On SWE-bench Verified, a benchmark where AI agents solve real GitHub issues, top scores climbed from roughly 20% in mid-2024 to over 80% by early 2026. That is a fourfold improvement in under two years.
According to SemiAnalysis, Claude Code now accounts for roughly 4% of all public GitHub commits, doubling in a single month. Their projection: 20% by the end of 2026. And that only counts identifiable agent commits. The real number is higher.
Inside Anthropic, CEO Dario Amodei said in October 2025 that "90% of the code written at Anthropic is written by Claude." The JetBrains 2025 developer survey found 85% of developers now regularly use AI tools for coding, up from 49% the prior year.
Andrej Karpathy coined "vibe coding" in February 2025 to describe the joy of letting AI write code while you sit back. Exactly one year later, in February 2026, he retired the term. LLMs had gotten good enough that casual prompting was no longer sufficient. He introduced a replacement: "agentic engineering," where agents write 99% of the code and the human's job is to supervise. The framing shifted from fun experiment to professional discipline.
The Concerns Are Real
The worry about agentic coding going wrong is not paranoia. It is pattern recognition.
In December 2025, researchers identified over 30 security vulnerabilities across 10+ AI coding tools, including GitHub Copilot, Cursor, Windsurf, and Gemini CLI. The result: 24 CVEs assigned and security advisories from major vendors. Every single AI coding tool tested was vulnerable to a novel attack chain combining prompt injection, tool exploitation, and IDE features.
The OWASP Top 10 for Agentic Applications, released in December 2025 and peer-reviewed by over 100 security researchers, formalized the risk categories. Agent goal hijacking, tool misuse, rogue agents, cascading failures. These are not theoretical threats. They are documented attack surfaces.
A Dark Reading poll found 48% of cybersecurity professionals identify agentic AI as the number-one attack vector for 2026. Yet only 37% of organizations have a formal policy for securely deploying AI, according to Darktrace's State of AI Cybersecurity 2026 report.
Meanwhile, the JetBrains survey revealed a trust paradox: adoption jumped from 49% to 85%, but trust in AI tools actually dropped from 40% to 29%. Developers are using tools they do not fully trust, because the productivity gains are too large to ignore.
Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.
The concerns are valid. The question is what to do about them.
TDD: The Old Technique That Becomes Essential
Test-driven development is not a new idea. Write the test first. Watch it fail. Write the code to make it pass. Refactor. Repeat.
What is new is how perfectly TDD maps onto the agentic coding workflow.
Simon Willison's Agentic Engineering Patterns guide calls red/green TDD "a fantastic fit for coding agents." The reasoning is straightforward:
- Tests are the exit criteria. Without tests, the agent writes code, declares "done," and you hope for the best. With tests, the agent has an objective, verifiable finish line.
- Tests constrain the output. They prevent hallucinated code, unnecessary abstractions, and silent regressions. The agent cannot wander if the tests define exactly what success looks like.
- Tests make iteration safe. The agent can try, fail, and try again without breaking things. Each attempt runs against the same test suite, which acts as a guardrail.
A tool called Superpowers (by Jesse Vincent) enforces this discipline strictly. It requires coding agents to follow the red-green-refactor cycle. Write a failing test first. Watch it fail. Then write the minimum code to make it pass. If the agent writes code before tests, Superpowers deletes it and forces a restart.
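Concretely, the red-green cycle looks like this. The `slugify` function and its tests below are an invented example for illustration, not taken from Superpowers or any real project:

```python
import re

# Step 1 (red): write the tests first. At this point slugify() does not
# exist, so running the suite fails -- that failure is the starting signal.
def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_strips_punctuation():
    assert slugify("Ship it!") == "ship-it"

# Step 2 (green): write the minimum code that makes both tests pass.
def slugify(text: str) -> str:
    """Lowercase text and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

# Step 3 (refactor): clean up while the tests stay green, then repeat.
```

The point of the ordering is that the agent never gets to define its own finish line: the tests exist, and fail, before any implementation does.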
The most striking example of what this produces: Dan Blanchard, the longtime solo maintainer of chardet (Python's widely used character encoding library), used Superpowers to orchestrate a complete rewrite of the project. Blanchard had maintained chardet unpaid for over twelve years. He wanted to modernize it, but the scope of work was too large for one person working in spare time.
With Superpowers guiding Claude Code through strict TDD, the agent built a new 12-stage detection pipeline, trained bigram frequency models on multilingual corpus data, and produced a test suite covering 2,161 test files across 99 encodings and 48 languages.
The result: chardet 7.0.0, shipped March 4, 2026. It was 41x faster with mypyc compilation. Accuracy rose to 96.8%, up 2.3 percentage points. It closed more open issues in a single release than the project had resolved in the prior decade.
One person. One agent. Strict TDD. A decade of backlog cleared.
Spec-Driven Development: A New Discipline
TDD answers the question "did it work?" Spec-driven development answers the question "did it build the right thing?"
Spec-driven development (SDD) is an emerging methodology where you write structured specifications before any code is written, then use those specs to guide AI agents. Requirements, acceptance criteria, architecture decisions, and constraints go into a document that the agent reads before it touches a single line of code.
Birgitta Böckeler at ThoughtWorks, writing on martinfowler.com, published the definitive analysis of SDD. She examined three tools that call themselves spec-driven and mapped out the landscape:
- Kiro (AWS): Released mid-2025 and now generally available, Kiro enforces a structured workflow: spec, design, tasks, implementation. Before writing code, it generates a `requirements.md` (what you are building and why) and a `design.md` (how it will be built). Only then does it produce implementation tasks.
- Spec-kit (GitHub): An open-source toolkit with three core commands: `/specify`, `/plan`, and `/tasks`. It turns specifications into executable artifacts.
- Tessl: The most ambitious of the three. Tessl aims for "spec-as-source," where code is fully generated from annotated spec files and the spec lives longer than any individual implementation.
Addy Osmani's guide "How to Write a Good Spec for AI Agents," published on O'Reilly Radar, lays out practical advice:
- Focus on what and why, not how. A good spec reads like acceptance criteria, not pseudocode.
- Break large tasks into smaller ones. Research shows models degrade when asked to satisfy too many requirements at once. Osmani calls this the "curse of instructions." A smarter spec beats a longer spec.
- Set clear boundaries. Osmani recommends a three-tier system: actions the agent should always take without asking, actions that require human approval first, and actions the agent must never take (like committing secrets or modifying database schemas).
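One way to make those boundaries machine-checkable is a small policy table. The action names and structure below are illustrative assumptions, not the API of any real agent tool:

```python
from enum import Enum

class Tier(Enum):
    ALWAYS_ALLOW = "proceed without asking"
    NEEDS_APPROVAL = "pause for human sign-off"
    NEVER = "refuse and report"

# Illustrative policy mapping agent actions to the three tiers.
POLICY = {
    "run_tests": Tier.ALWAYS_ALLOW,
    "edit_source_file": Tier.ALWAYS_ALLOW,
    "install_dependency": Tier.NEEDS_APPROVAL,
    "modify_db_schema": Tier.NEVER,
    "commit_secrets": Tier.NEVER,
}

def check(action: str) -> Tier:
    # Unknown actions default to the safe middle tier: ask a human.
    return POLICY.get(action, Tier.NEEDS_APPROVAL)
```

The useful design choice is the default: anything the spec did not anticipate falls into the approval tier rather than the permissive one.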
The key insight from Böckeler's analysis: in SDD, specs should live longer than the code. Code becomes a by-product of well-written specifications. This inverts the traditional relationship where code is the source of truth and documentation is an afterthought.
The Combination: Specs Define Intent. Tests Verify Output.
Each technique is powerful alone. Together, they close the loop.
Specs tell the agent what to build and why. Tests prove it built the right thing. The workflow looks like this:
- Write the spec. Define requirements, acceptance criteria, architecture constraints.
- Write the tests from the spec. Each acceptance criterion becomes one or more test cases.
- Let the agent implement until the tests pass. The agent iterates freely within the bounds you have set.
- Review. The human checks the result against both the spec and the tests.
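Step 2 is often the most mechanical part: each acceptance criterion becomes a row in a test table. The discount rule below is a hypothetical spec, sketched in Python; prices are in integer cents to keep the arithmetic exact:

```python
# Hypothetical spec criterion: "Orders of $100 or more get 10% off."
# Each acceptance case is one row: (order_total_cents, expected_cents).
ACCEPTANCE_CASES = [
    (5000, 5000),    # below the threshold: no discount
    (10000, 9000),   # at the threshold: 10% off
    (20000, 18000),  # above the threshold: 10% off
]

def apply_discount(total_cents: int) -> int:
    """The agent-implemented rule; it must satisfy every acceptance case."""
    return total_cents * 9 // 10 if total_cents >= 10000 else total_cents

def run_acceptance_suite() -> bool:
    # The agent iterates on apply_discount() until this returns True.
    return all(apply_discount(t) == want for t, want in ACCEPTANCE_CASES)
```

The spec defines the rows; the tests enforce them; the agent is free to implement `apply_discount` however it likes, as long as the suite passes.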
This is not a radical workflow. It is how disciplined software teams have always worked: requirements, then tests, then code. What has changed is that the "coder" is now an AI agent that can iterate at machine speed.
The combination solves the core tension of agentic coding. You want the agent to move fast, but you need it to stay on track. Specs narrow the scope. Tests verify the output. The agent operates freely within those constraints, and you review the result with confidence.
Addy Osmani's LLM coding workflow captures this principle: "The best results come when you apply classic software engineering discipline: design before coding, write tests, use version control, maintain standards." The tools are new. The discipline is not.
The Principles Are Old. The Application Is New.
The capability curve is not slowing down. The concerns about safety and quality will remain. More vulnerabilities will be found. More projects will fail due to inadequate controls.
The answer is not to avoid the tools. The productivity gains are too large. The answer is to use them with discipline.
Specs and tests are not overhead in the agentic era. They are the control plane. Specs give the agent direction. Tests give you verification. Together, they turn an eager but unreliable coding partner into one you can trust.
The developers who thrive will not be the best prompt engineers. They will be the ones who write the clearest specs and the most rigorous tests. These are old disciplines applied to a new paradigm.
The tooling is new. The principles are not.