Agentic Engineering Is Pattern Engineering
Golden values, reference calculators, and the three-way check that catches what green tests miss
Most conversations about AI agents and code focus on the wrong layer. How fast it can ship a feature. How many lines per hour. Whether it can replace a junior dev or a senior one.
That framing treats the agent like a faster human. It isn't. A human developer builds context over months, catches their own mistakes through experience, and knows when something feels wrong. An agent writes what you ask for, exactly as you ask for it, with no institutional memory and no gut check.
The real skill in agentic engineering isn't prompting. It's building patterns that make correctness structural — systems the agent operates within, not instructions the agent follows once.
Green Suite, Broken Spec
When an agent finishes a task, you get a green test suite and a summary of what it did. That feels like verification. It isn't.
The tests the agent wrote might be correct. They might also be tautologies — comparing inputs to themselves, checking that code returns what code returns. A green suite doesn't prove the agent built everything in the spec. It proves the agent built something that passes the agent's own tests.
Three specific failures hide in green suites:
Unverified spec fields. Your spec defines estimated_tax_impact. No test compares it to actual output. The field looks specified. Nothing enforces it.
Stale references. The agent renames scenario keys during a refactor. Tests that hardcode the old name load nil, compare nil to nil, and pass.
Silent skips. A verifier has code to check a field, but a conditional upstream short-circuits. The comparison never executes. "Covered" in code, never compared at runtime.
Each passes a traditional test suite. Each is a spec violation hiding in plain sight.
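A stale reference is the easiest of the three to demonstrate. A minimal sketch, with hypothetical scenario and field names: the agent renames a scenario key in the YAML, a test still looks up the old key, and the comparison degrades to nil == nil.

```ruby
require "yaml"

# Hypothetical golden-values file after a refactor: the agent renamed
# "quarterly_revenue" to "revenue_quarterly", but the test below still
# looks up the old key.
golden = YAML.safe_load(<<~YAML)
  revenue_quarterly:
    tax_category: revenue
YAML

scenario = golden["quarterly_revenue"]   # stale key: returns nil
expected = scenario&.dig("tax_category") # nil
actual   = nil                           # production output for a scenario that never ran

# nil == nil, so this "assertion" passes while verifying nothing.
passed = (expected == actual)
```

Nothing in a conventional suite flags this, which is why the pattern below makes reference resolution itself a test.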
A Pattern, Not a Practice
The fix isn't "review more carefully" or "write better prompts." It's a structural pattern the agent operates within — one that fails mechanically when the spec isn't fully enforced.
The spec lives in YAML. Expected outputs for every scenario the system handles — what I call golden values. The tests enforce the spec, not the other way around. Hardcoded expected values in tests are drift waiting to happen.
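A golden-values entry might look like this. The scenario name, fields, and numbers are illustrative, not the actual spec:

```yaml
# golden_values/accounting.yml (hypothetical entry)
consulting_invoice_paid:
  inputs:
    amount_cents: 250000
    category: consulting
  expected:
    tax_category: revenue
    estimated_tax_impact: 62500
```

Inputs and expected outputs live together in one entry, so the spec for a scenario is readable in a single place.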
The pattern has three mechanical properties:
Every field in the spec must be enforced by code. A field that exists but isn't checked by any verifier is documentation, not verification. If it's worth specifying, it's worth testing.
Every reference must resolve. Every scenario key in the codebase must point to a real entry in the YAML. Rename a scenario without updating the references — the system catches it.
Every comparison must actually execute. Not "code exists to check this field" — the comparison ran at runtime. A VerificationResult object tracks every field comparison during the test run. After the run, a meta-test asserts the tracked fields cover every registered field. Silent skip — fail.
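The third property is the subtle one, so here is a minimal sketch of the idea, assuming a VerificationResult that records each field as it compares it. The class shape and field names are illustrative, not the actual implementation:

```ruby
require "set"

# Minimal sketch: records which spec fields were actually compared at runtime.
class VerificationResult
  # In the real pattern these would come from the YAML spec.
  REGISTERED_FIELDS = %w[tax_category estimated_tax_impact].freeze

  attr_reader :compared_fields

  def initialize
    @compared_fields = Set.new
  end

  def compare(field, expected:, actual:)
    @compared_fields << field # proof the comparison executed, not just existed
    expected == actual
  end

  # The meta-check: every registered field must have been compared.
  def complete?
    Set.new(REGISTERED_FIELDS) <= @compared_fields
  end
end

result = VerificationResult.new
result.compare("tax_category", expected: "revenue", actual: "revenue")
result.complete? # false: estimated_tax_impact was silently skipped
```

After the scenario run, a meta-test asserts complete? is true. An upstream conditional that short-circuits a comparison now fails the build instead of passing silently.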
The Three-Way Check
One more constraint surfaced from building this pattern into a second app: the scenario runner was reading YAML inputs back as outputs. The spec said tax_category: revenue. The runner read that same value from the YAML and compared it to itself. Green test, zero verification.
The fix is three independent sources for every comparison:
Golden values — what the YAML says the answer is.
Reference calculator — pure Ruby, no ActiveRecord, derives the answer independently from business rules.
Production output — the actual system computing the answer for real.
All three must agree. If any two diverge, the test tells you which leg is wrong. Without the third leg, you're testing your spec against your spec.
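The three legs can be sketched in a few lines. The rule inside the reference calculator and all values here are hypothetical stand-ins; the real calculator derives answers from the actual business rules:

```ruby
# Three independent sources for one field (illustrative values and rules).

# Leg 1: golden value, read straight from the YAML spec.
golden = { "tax_category" => "revenue" }

# Leg 2: reference calculator. Pure Ruby, no ActiveRecord, rederives the
# answer from business rules (this rule is a stand-in).
module ReferenceCalculator
  def self.tax_category(amount_cents:)
    amount_cents.positive? ? "revenue" : "expense"
  end
end

# Leg 3: production output. What the real system computed (stubbed here).
production = { "tax_category" => "revenue" }

reference = ReferenceCalculator.tax_category(amount_cents: 250_000)
legs = [golden["tax_category"], reference, production["tax_category"]]

# All three must agree. A mismatch pinpoints which leg is wrong.
all_agree = legs.uniq.size == 1
```

If golden and reference agree but production diverges, the bug is in the system. If production and reference agree but golden diverges, the spec is wrong. That diagnosis is what the third leg buys you.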
The Convention
The whole pattern is a YAML spec and four Ruby files in a Rails app:

```
lib/accounting_scorecard/
  golden_values/accounting.yml  # scenarios with inputs and expected outputs
  reference_calculator.rb       # pure Ruby — derives answers from rules
  scenario_runner.rb            # creates real records, runs production code
  verification_result.rb        # tracks every field comparison
test/accounting_scorecard/
  completeness_test.rb          # the three checks above, enforced as Minitest assertions
```
About 2,000 lines. No gem, no framework. A convention.
The agent reads the golden values, writes failing tests, implements until they pass. The completeness check proves it covered everything. The human reviews the YAML — not the diff.
That's what agentic engineering actually looks like. Not faster typing. Better patterns. The spec is structured as dimensions, the scenarios are intersections, bugs map to gaps in the matrix, and the pattern applies to about 30% of your app — the 30% where getting it wrong costs the most.
This is part of a series on spec-driven development with AI agents. Next: Humans Think in Dimensions, Not Test Cases.
