Why Evals Are So Important for Specification-Driven Development
Specification-Driven Development gives us a powerful discipline: write the what before the how, and use that specification as a living contract that governs everything the AI agent does. But there is a subtle flaw in this picture if left unexamined: the quality of a specification is not self-evident. You cannot tell whether a spec is well-formed just by reading it. You need a system that can evaluate it: reliably, repeatably, and at scale.
That system is evals. And in a specification-driven workflow, evals are not an optional quality gate bolted on at the end. They are the mechanism that makes the entire workflow trustworthy.
Specifications Are Probabilistic Too
We tend to think of the specification as the stable anchor in an otherwise unpredictable AI pipeline. The LLM is non-deterministic; the spec is not. But this framing is too comfortable. Specifications are written by humans (or increasingly, co-authored with AI) and they carry all the ambiguity, incompleteness, and internal inconsistency that human language invites.
A feature spec that says "users should be notified when a rule triggers" leaves open who is notified, by what mechanism, under what conditions, and with what content. An implementation agent reading that spec will make assumptions, and those assumptions will be invisible to the reviewer. The bug is not in the code. It is in the specification. And it was there long before any code was written.
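The notification example above can be caught mechanically. The sketch below is purely illustrative: the vague-term vocabulary and the mechanism list are invented for this one sentence, not drawn from any real linting tool.

```python
# Illustrative sketch: flag requirement sentences that leave binding
# strength or delivery mechanism unspecified. VAGUE_MODALS and
# UNDERSPECIFIED are hypothetical examples, not a standard vocabulary.
import re

VAGUE_MODALS = {"should", "could", "might"}  # weaker than "must"/"shall"
UNDERSPECIFIED = {"notified": ["by email", "in-app", "via webhook"]}

def flag_ambiguities(requirement: str) -> list[str]:
    """Return human-readable warnings about ambiguous wording."""
    words = set(re.findall(r"[a-z']+", requirement.lower()))
    warnings = []
    for modal in VAGUE_MODALS & words:
        warnings.append(f'"{modal}" is non-binding; use "must" if mandatory')
    for term, mechanisms in UNDERSPECIFIED.items():
        if term in words and not any(m in requirement.lower() for m in mechanisms):
            warnings.append(f'"{term}" names no delivery mechanism')
    return warnings

warnings = flag_ambiguities("Users should be notified when a rule triggers")
# flags both the non-binding "should" and the unspecified mechanism
```

Even a trivial check like this makes the point: the assumptions an implementation agent would have silently made become visible warnings a human can resolve before any code exists.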
This is the core problem evals solve. An eval doesn't ask whether the code is correct. It asks whether the specification itself is fit for purpose, before the agent ever touches a keyboard.
What Evals Actually Measure
An eval in this context is a structured, scored assessment of a document against a rubric. Not a linter. Not a schema validator. A rubric-based quality score that surfaces the dimensions of specification quality that actually matter for agentic execution.
Good evals for specification-driven development measure things like:
- Completeness: Are the acceptance criteria present? Are they testable? Does the spec cover the unhappy paths, not just the happy path?
- Boundary clarity: Does the spec stay inside its domain? Are cross-domain references handled as IDs and events, not as embedded assumptions about another team's data model?
- Traceability: Is the spec connected to a parent capability, a data model, and an ADR? Can an agent trace intent all the way from the product vision down to this feature?
- Ambiguity exposure: Does the spec contain terms that are undefined in the domain glossary? Does it use "should" where it means "must"?
- Architectural consistency: Does the spec's implied design contradict existing platform decisions? Would implementing it as written violate a recorded ADR?
These are not stylistic preferences. They are structural properties that determine whether an implementation agent can execute the spec faithfully, or whether it will be forced to hallucinate the gaps.
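One way to make these dimensions concrete is to treat the rubric itself as data: each dimension carries a weight and a score, and the eval reduces them to a single number with a per-dimension breakdown. The dimension names, weights, and scores below are illustrative assumptions; a real rubric would be agreed per team.

```python
# A minimal sketch of a rubric as data. Weights and scores here are
# invented for illustration, not a recommended standard.
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    weight: float  # relative importance within the rubric
    score: float   # 0.0 (absent) to 1.0 (fully satisfied)

def overall_score(dimensions: list[Dimension]) -> float:
    """Weighted mean across dimensions, in [0, 1]."""
    total_weight = sum(d.weight for d in dimensions)
    return sum(d.weight * d.score for d in dimensions) / total_weight

rubric = [
    Dimension("completeness", 0.30, 0.8),
    Dimension("boundary_clarity", 0.20, 1.0),
    Dimension("traceability", 0.20, 0.5),
    Dimension("ambiguity_exposure", 0.15, 0.6),
    Dimension("architectural_consistency", 0.15, 0.9),
]
score = overall_score(rubric)  # 0.765 for the sample scores above
```

The breakdown matters as much as the aggregate: a spec scoring 0.765 overall but 0.5 on traceability tells the reviewer exactly where to push back.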
The Rubric as a Shared Contract
The power of a rubric-based eval is that it makes quality criteria explicit and shared. When a team agrees on a rubric, they are agreeing on what "good" means before the first spec is written. This is context engineering applied not to the LLM's runtime environment, but to the organisation's entire documentation practice.
A rubric creates a feedback loop that scales. A human reviewer cannot hold all the dimensions of spec quality in their head simultaneously (boundary hygiene, traceability, ADR consistency, entity model alignment), especially across hundreds of specs spanning a dozen domains. An eval can. And because it produces a numeric score with per-dimension breakdowns, it creates an objective basis for the conversation between product, architecture, and engineering about whether a spec is ready to lock.
This is not about replacing human judgment. It is about giving human judgment a stable surface to push against.
Evals as a Gate, Not a Suggestion
In a mature specification-driven workflow, an eval score is a gate condition. A spec that scores below threshold on intent quality does not proceed to domain model alignment. A domain architecture that scores below threshold on boundary discipline does not proceed to contract definition. A contract that fails its eval does not proceed to implementation.
This is the eval's real job: not to report quality after the fact, but to enforce a standard of readiness at each transition point in the delivery pipeline.
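A gate of this kind can be sketched as a simple per-dimension threshold check at each transition point. The threshold values below are invented for illustration; the design choice worth noting is that the gate reports which dimensions failed, not just a pass/fail verdict, so the author knows what to fix.

```python
# Hedged sketch of an eval gate: a stage transition is allowed only
# when every dimension clears its threshold. Thresholds are
# hypothetical, not a recommended calibration.
GATE_THRESHOLDS = {
    "completeness": 0.7,
    "boundary_clarity": 0.8,
    "traceability": 0.6,
}

def gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failing_dimensions) for one transition point."""
    failing = [dim for dim, minimum in GATE_THRESHOLDS.items()
               if scores.get(dim, 0.0) < minimum]
    return (not failing, failing)

passed, failing = gate({"completeness": 0.80,
                        "boundary_clarity": 0.75,
                        "traceability": 0.90})
# boundary_clarity misses its 0.8 threshold, so the spec does not proceed
```

Using per-dimension minimums rather than a single aggregate threshold prevents a spec from averaging its way past the gate: a strong completeness score cannot compensate for a leaky domain boundary.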
The alternative (allowing a low-quality spec to proceed on the grounds that "we'll fix it later") is precisely the pattern that erodes specification-driven development from within. Every ambiguity that passes through the gate becomes a constraint on the implementation agent, a source of rework, and a gap in the audit trail. The cost of a weak spec compounds at every downstream stage.
The Eval Closes the Feedback Loop
There is one more reason evals matter that is easy to miss. In a specification-driven workflow, the specification is not just an input to the agent. It is the primary artifact of record. It is what the organisation learns from. It is how architectural decisions accumulate into institutional knowledge.
If specifications are never evaluated, they drift. Teams develop local conventions that diverge from the platform standard. Domain models accrue inconsistencies that only become visible when two services try to integrate. The living contract becomes a collection of loosely related documents that no single agent (human or AI) can reason over coherently.
Evals keep the specification corpus honest. They surface drift early, when it is cheap to correct. They create a continuous signal about the health of the platform's knowledge base, not just the health of the code.
The Architect, Now a Rubric Designer
Just as specification-driven development promotes the engineer from line-by-line coder to system architect, it promotes the architect to rubric designer. The question is no longer only "what is the right architecture?" It is also "what criteria would an agent need to assess whether a specification reflects that architecture?"
That is a harder and more interesting question. It requires deep fluency in the domain model, the ADR record, and the intended patterns of cross-domain collaboration. It requires the ability to articulate tacit knowledge as explicit, scorable criteria. And it requires the humility to accept that a rubric is also a specification, and that it, too, needs to be evaluated, iterated, and maintained.
In a specification-driven world, the quality of the system is a function of the quality of the specifications. And the quality of the specifications is a function of the quality of the evals that govern them.
Get the evals right, and everything downstream gets better. Skip them, and you have built a very fast machine for producing very confident mistakes.
Related reading: Specification-Driven Development · Context Engineering · The Agent as Implementer