On March 30, 2026 I ran a 20-minute QA audit on software I had been building for six weeks. It turned up over 100 critical bugs. The test suite had 4,800 passing tests. None of them had caught any of it.
I had been writing specifications. Good ones, I thought. Acceptance criteria, decision records, machine-readable schemas. The agents read them, wrote code against them, wrote tests against the code, and shipped me a green pipeline on top of a broken product. The specs existed. The code existed. Tests were green. The correspondence between them was imagined.
This is the problem specification-driven development was designed to solve, and it is the one it does not solve. A specification handed to an AI agent is read once at the start of a session, referenced imperfectly in the middle, and effectively absent by the end. Agent compliance with prose decays with context length. The longer the work runs, the more confidently the agent hallucinates consistency between what it wrote and what it was supposed to write. Smarter agents do this more than dumber ones, because they are better at producing convincing output with no ground truth.
The fix is not a better specification. It is a specification in a form that cannot be silently overruled: machine-testable, mechanically enforced, cited by every procedure that depends on it. The specification becomes the truth. The code becomes one attempt at implementing it, and either passes or is wrong.
Four commitments
- Machine-testable rules over prose specifications.
- Mechanical enforcement over agent compliance.
- Source-of-truth authority over reconciliation.
- Monotonic rule sets over revisable specifications.
The items on the right are not wrong. They are insufficient for software produced by agents that are not themselves bound by the specification.
What a rule looks like
Before I argue about rules, here is one. It is canonical in the rebuild I am running, and it governs nearly every single-entity access in a multi-org system.
Cross-organization access (AUTH-5)
Origin. Resolution of audit finding HC-1 (2026-04-07 authority-hierarchy lockdown). Supersedes the earlier FORBIDDEN "Access denied" pattern. The prior pattern leaked entity existence to cross-organization probes. An attacker could distinguish “entity exists in another organization” from “entity doesn’t exist” by the error code returned, enabling enumeration attacks across organizational boundaries.
Rule text. Every single-entity fetch MUST include `organizationId = ctx.session.activeOrganizationId` in the SQL WHERE clause. If the query returns zero rows, throw `NOT_FOUND "<Entity> not found"`. Never use `FORBIDDEN "Access denied"`. That error leaks entity existence.
Testable assertion. `expect(error.code).toBe("NOT_FOUND"); expect(error.message).toBe("<Entity> not found")` for any request crossing an organization boundary, regardless of whether the entity exists, is soft-deleted, or belongs to another organization.
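A minimal sketch of the rule and its assertion in TypeScript. The entity name, error class, and in-memory rows are hypothetical stand-ins for the real schema, not the codebase's actual types:

```typescript
// Illustrative sketch of the AUTH-5 pattern (names are assumptions).
class TRPCError extends Error {
  constructor(public code: "NOT_FOUND" | "FORBIDDEN", message: string) {
    super(message);
  }
}

interface PropertyRow {
  id: string;
  organizationId: string;
  deletedAt: Date | null;
}

const rows: PropertyRow[] = [
  { id: "p1", organizationId: "org-a", deletedAt: null },
  { id: "p2", organizationId: "org-b", deletedAt: null }, // other org
  { id: "p3", organizationId: "org-a", deletedAt: new Date() }, // soft-deleted
];

// Every single-entity fetch scopes by the caller's active organization.
// Zero rows -- for ANY reason -- collapses into the same NOT_FOUND.
function getProperty(id: string, activeOrganizationId: string): PropertyRow {
  const row = rows.find(
    (r) =>
      r.id === id &&
      r.organizationId === activeOrganizationId &&
      r.deletedAt === null,
  );
  if (!row) throw new TRPCError("NOT_FOUND", "Property not found");
  return row;
}
```

A missing id, a soft-deleted row, and a row in another organization all produce byte-identical errors, which is exactly what the testable assertion checks.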
Enforcement.
- Write-time: hookify rule blocks the string `FORBIDDEN "Access denied"` in any procedure file.
- Gate-time: static-analysis rule requires `organizationId` in every WHERE clause touching an organization-scoped table.
- Runtime: middleware verifies session and active organization before procedure body executes.
Violation closed. Three distinct failure cases (entity does not exist, entity was soft-deleted, entity belongs to another organization) collapse into a single response. A probing client observes identical output for all three. Enumeration across organizational boundaries ceases to be possible. The rule turns a class of information-disclosure vulnerability into a form the attacker cannot distinguish from a correctly-operating system.
Every rule in the system has the same shape: a layer, an origin, a text statement, a testable assertion, the enforcement layers that block violations, and a summary of what failure mode is now closed. A rule missing any of the six components is not yet a rule. It is a wish.
The three layers
Rules decompose into three layers, and a domain’s specification is not complete until all three exist.
Spec defines canonical behavior. Inputs, authentication, error codes, side effects, response shape. A representative spec rule: money columns must be stored as integer cents in a PostgreSQL `bigint` column, never as `integer`. The `integer` type caps at $21,474,836.47 in cents and silently overflows on luxury properties. This is the layer most methodology writing already addresses. If you have read any API documentation, you have seen spec rules.
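The ceiling is just the 32-bit signed integer maximum expressed in cents; the arithmetic is easy to check:

```typescript
// PostgreSQL `integer` is a 32-bit signed int; its maximum, in cents:
const INT4_MAX = 2 ** 31 - 1; // 2147483647 cents
const maxDollars = INT4_MAX / 100; // 21474836.47

// `bigint` is a 64-bit signed int, which moves the ceiling far out of
// reach for any real-world money column.
const INT8_MAX = 2n ** 63n - 1n; // 9223372036854775807n cents
```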
Consistency enforces cross-procedure invariants. It is the layer that catches what individual specs cannot. Every SELECT on a soft-deletable table includes `notDeleted()`. Every mutation writes an audit entry. Every error matches a canonical string. In the rebuild I am running, one consistency rule enumerates 18 child entity types and specifies the exact application-layer behavior when a parent is soft-deleted. I call the pattern orphan-but-filter: child rows are never cascade-updated, they retain their original state, and every read path joins back through the parent with `notDeleted()`, so orphaned children become invisible without being destroyed. That rule is 70 lines long and enumerates every entity individually, because the invariant has to hold uniformly or it does not hold at all. Consistency rules are the reason most AI-generated domains subtly leak in production.
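Orphan-but-filter reduces to a small sketch. The table shapes and the `notDeleted()` helper below are illustrative, not the real schema:

```typescript
// Sketch of orphan-but-filter: soft-deleting a parent never mutates
// children; reads join back through the parent, so orphans disappear
// from output while keeping their state intact.
interface Parent { id: string; deletedAt: Date | null }
interface Child { id: string; parentId: string; status: string; deletedAt: Date | null }

const parents: Parent[] = [
  { id: "prop-1", deletedAt: null },
  { id: "prop-2", deletedAt: new Date() }, // soft-deleted parent
];
const children: Child[] = [
  { id: "u-1", parentId: "prop-1", status: "occupied", deletedAt: null },
  { id: "u-2", parentId: "prop-2", status: "occupied", deletedAt: null }, // orphan: never cascade-updated
];

const notDeleted = <T extends { deletedAt: Date | null }>(r: T) => r.deletedAt === null;

// Every read path filters the child AND joins through a live parent.
function listChildren(): Child[] {
  const liveParents = new Set(parents.filter(notDeleted).map((p) => p.id));
  return children.filter((c) => notDeleted(c) && liveParents.has(c.parentId));
}
```

The orphaned child `u-2` keeps `status: "occupied"` in storage but never appears in a read, which is the whole invariant.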
Adversarial covers what happens when the system is probed. Wildcard injection, cross-scope access, race conditions, external-service failure, prompt injection, timing channels. A domain that accepts external input and has no adversarial layer is not specified; it is an attack surface. The absence of an adversarial rule at an input boundary is the presence of a vulnerability. AUTH-5 above is an adversarial rule.
A rule set missing any of the three layers ships with a known failure class. The three are not a style preference. They map to the three categories of defect that undifferentiated specifications cannot separate.
How rules get enforced
An unenforced rule is a policy statement. Agents comply with policy statements early in a session and drift from them later. Every rule in the system is paired with the strongest mechanical enforcement its shape admits.
Write-time. Hookify rules fire on every Edit or Write tool call. If an agent tries to write `FORBIDDEN "Access denied"` into a procedure file, the edit is blocked before it lands. The incident that caused the rule is attached to the hook:

```yaml
id: block-access-denied-forbidden
matches: 'FORBIDDEN "Access denied"'
reason: AUTH-5. Leaks cross-org entity existence.
```
Gate-time. ast-grep patterns run before every commit. Where a hook catches a single tool call, gate checks catch drift accumulated across a whole session. Example: every SELECT on an organization-scoped table must include `organizationId = ctx.session.activeOrganizationId` in its WHERE predicate. If it doesn’t, `pnpm gate` fails the commit.
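A gate-time check of that shape might look like the following ast-grep rule. The `id`, the query-builder call pattern, and the `not`/`has` composition are an illustrative sketch; a real rule would need additional patterns for each query shape the codebase actually uses:

```yaml
id: require-org-scope-in-where
language: typescript
severity: error
message: "Org-scoped SELECT missing organizationId predicate (AUTH-5)."
rule:
  pattern: $DB.select($$$).from($TABLE).where($$$ARGS)
  not:
    has:
      pattern: ctx.session.activeOrganizationId
      stopBy: end
```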
Runtime. Middleware enforces what the static layers cannot see. Session validation, active-organization resolution, rate limits, idempotency. All enforced before the procedure body runs.
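A minimal sketch of that runtime layer. The session shape, helper name, and error class are hypothetical; in the real stack this would hang off procedure middleware rather than a bare function:

```typescript
// Illustrative runtime guard: no session or no active organization,
// no procedure body execution.
class TRPCError extends Error {
  constructor(public code: "UNAUTHORIZED" | "FORBIDDEN", message: string) {
    super(message);
  }
}

interface Ctx {
  session: { userId: string; activeOrganizationId: string | null } | null;
}

function withOrgScope<T>(ctx: Ctx, body: (orgId: string) => T): T {
  if (!ctx.session) throw new TRPCError("UNAUTHORIZED", "Not authenticated");
  if (!ctx.session.activeOrganizationId)
    throw new TRPCError("FORBIDDEN", "No active organization");
  return body(ctx.session.activeOrganizationId);
}
```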
Lifecycle. Shell hooks react to tool events themselves. `PreToolUse:Bash` blocks destructive git commands and test-runner bypasses. `SubagentStart` injects relevant contract registries into the subagent’s initial context so the subagent cannot hallucinate specifications. `SessionStart` reads current state and injects domain-coverage gaps.
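The destructive-command guard reduces to a few lines of shell. The function name and calling convention are assumptions (assume the hook receives the proposed command as an argument and blocks by returning non-zero); the real hook wiring differs:

```shell
#!/usr/bin/env bash
# Sketch of a PreToolUse:Bash guard. Non-zero return blocks the tool call.
guard_bash() {
  case "$1" in
    *"push --force"*|*"--no-verify"*|*"reset --hard"*)
      echo "blocked destructive command: $1" >&2
      return 2
      ;;
  esac
  return 0
}
```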
Memory. Every incident that produced a new rule is saved as a feedback memory. Future sessions read the memory on startup. The methodology does not reset between conversations.
Five layers, not one. Each layer leaks on its own, and agents find paths around any two. Five is the smallest number that has held so far in the rebuild I am running. When a layer catches something, the incident gets filed as the origin of a new rule. The rule set grows monotonically.
Related methodologies
DRDD is not TDD. In TDD the failing test is the specification; the rule is inferred from the test. In DRDD the rule precedes the test, and tests are regression locks written after QA finds a bug worth locking. Tests-first produced a very clean spec apparatus in the rebuild I was running, and nothing shipped for 10 weeks. Source-first from rules shipped.
The difference from spec-driven development is scope. SDD treats specifications as inputs to code generation; the specification is better when longer, more precise, more structured. DRDD treats specifications as the authoritative artifact the code must conform to. Source code has zero authority. If the code and the rule disagree, the code is wrong.
Design by contract attaches pre- and post-conditions to function boundaries. DRDD attaches assertions at every boundary where violation could occur: the tool call, the commit, the runtime path. DbC is a subset of the runtime layer.
Formal methods buy certainty at a cost most projects cannot absorb. DRDD buys mechanical catchability across most classes of failure, at the cost of an upfront investment in the enforcement stack. It is informal in the specification language (Markdown rules, not Coq) and mechanical in the enforcement layer. The tradeoff is intentional.
What this costs
DRDD is overkill on a small project. The enforcement stack is an upfront investment: hookify rules, ast-grep patterns, lifecycle hooks, MCP servers, and a rule corpus. On a weekend prototype, you write one hook and ship. The methodology pays off when the project is large enough that agent drift across many sessions and many domains starts silently compounding. That threshold is roughly: multi-product, multi-domain, a rebuild under acquisition-grade due-diligence pressure, or any codebase where a single silent failure costs a rewrite.
It also costs discipline. The hardest part is not writing rules. It is refusing to delete a rule when a reviewer finds it inconvenient. A rule set is monotonic. New findings add rules, refine rules, strengthen existing rules with additional layers. They do not remove rules or weaken language from MUST to SHOULD. A gap finding says the rule was incomplete. It does not say the rule was wrong and should be deleted. That asymmetry is what prevents the rationalization spiral where reviewers delete requirements to explain source defects.
Provenance
This methodology is extracted from the construction of a working specification system: 42 domain-rules files, approximately 3,500 rules across 24 domains, a five-layer enforcement stack of 48 hookify rules, 41 ast-grep patterns, 26 lifecycle hooks, 20 custom skills, and 5 custom review agents. The rule set governs authentication, persistence, identifiers, errors, audit logging, permissions, and security at the shared tier, plus every product domain that cites the shared tier. It is enforced continuously against a live codebase at 515 commits across 36 days, solo. The rule files are published on the rules explorer as I anonymize them. What’s there now is a sample. The full corpus is the rebuild.