Thursday, April 2, 2026

Agentic Engineering Is Pattern Engineering

Golden values, reference calculators, and the three-way check that catches what green tests miss

Most conversations about AI agents and code focus on the wrong layer. How fast it can ship a feature. How many lines per hour. Whether it can replace a junior dev or a senior one.

That framing treats the agent like a faster human. It isn't. A human developer builds context over months, catches their own mistakes through experience, and knows when something feels wrong. An agent writes what you ask for, exactly as you ask for it, with no institutional memory and no gut check.

The real skill in agentic engineering isn't prompting. It's building patterns that make correctness structural — systems the agent operates within, not instructions the agent follows once.

Green Suite, Broken Spec

When an agent finishes a task, you get a green test suite and a summary of what it did. That feels like verification. It isn't.

The tests the agent wrote might be correct. They might also be tautologies — comparing inputs to themselves, checking that code returns what code returns. A green suite doesn't prove the agent built everything in the spec. It proves the agent built something that passes the agent's own tests.

Three specific failures hide in green suites:

Unverified spec fields. Your spec defines estimated_tax_impact. No test compares it to actual output. The field looks specified. Nothing enforces it.

Stale references. The agent renames scenario keys during a refactor. Tests that hardcode the old name load nil, compare nil to nil, and pass.

Silent skips. A verifier has code to check a field, but a conditional upstream short-circuits. The comparison never executes. "Covered" in code, never compared at runtime.

Each passes a traditional test suite. Each is a spec violation hiding in plain sight.
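The stale-reference failure is easy to reproduce. A minimal sketch, with hypothetical scenario keys (`q1_revenue`, `quarterly_revenue`) and a stubbed production lookup standing in for real code:

```ruby
# Golden values after a refactor renamed "q1_revenue" to "quarterly_revenue".
GOLDEN = { "quarterly_revenue" => { "tax_category" => "revenue" } }

# Stub for the production lookup; it fails the same way the test does.
def production_tax_category(scenario_key)
  GOLDEN.dig(scenario_key, "tax_category")
end

# A test that still hardcodes the old key loads nil on both sides.
expected = GOLDEN.dig("q1_revenue", "tax_category")  # nil: key no longer exists
actual   = production_tax_category("q1_revenue")     # nil as well
puts expected == actual  # true -- the tautology passes
```

(Minitest even warns when `assert_equal` is handed a `nil` expected value, a guard against exactly this failure class.)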

A Pattern, Not a Practice

The fix isn't "review more carefully" or "write better prompts." It's a structural pattern the agent operates within — one that fails mechanically when the spec isn't fully enforced.

The spec lives in YAML. Expected outputs for every scenario the system handles — what I call golden values. The tests enforce the spec, not the other way around. Hardcoded expected values in tests are drift waiting to happen.
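Concretely, a golden-values file might look like the sketch below. The field names `tax_category` and `estimated_tax_impact` are the ones used in this article; the scenario keys, amounts, and exact shape are illustrative assumptions, not the actual schema.

```yaml
# golden_values sketch: each scenario pairs inputs with expected outputs
quarterly_revenue:
  inputs:
    amount_cents: 1250000
    entry_type: invoice
  expected:
    tax_category: revenue
    estimated_tax_impact: 262500   # illustrative: 21% of amount_cents
refund_issued:
  inputs:
    amount_cents: -40000
    entry_type: refund
  expected:
    tax_category: contra_revenue
    estimated_tax_impact: -8400
```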

The pattern has three mechanical properties:

Every field in the spec must be enforced by code. A field that exists but isn't checked by any verifier is documentation, not verification. If it's worth specifying, it's worth testing.

Every reference must resolve. Every scenario key in the codebase must point to a real entry in the YAML. Rename a scenario without updating the references — the system catches it.

Every comparison must actually execute. Not "code exists to check this field" — the comparison ran at runtime. A VerificationResult object tracks every field comparison during the test run. After the run, a meta-test asserts the tracked fields cover every registered field. Silent skip — fail.
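A minimal sketch of the runtime-coverage idea — the class and method names here are assumptions, not the author's exact code. Every verifier routes its comparison through one tracker, so a short-circuited check simply never marks its field, and the meta-test can see the gap:

```ruby
require "set"

# Tracks which registered spec fields were actually compared at runtime.
class VerificationResult
  attr_reader :registered, :compared, :failures

  def initialize(registered_fields)
    @registered = Set.new(registered_fields)
    @compared   = Set.new
    @failures   = []
  end

  # All verifiers funnel comparisons through here; a skipped
  # verifier never records its field.
  def compare(field, expected, actual)
    @compared << field
    @failures << [field, expected, actual] unless expected == actual
  end

  def skipped_fields
    @registered - @compared
  end
end

result = VerificationResult.new(%w[tax_category estimated_tax_impact])
result.compare("tax_category", "revenue", "revenue")
# estimated_tax_impact is never compared -- a silent skip.

result.skipped_fields.to_a  # => ["estimated_tax_impact"]
```

The meta-test then asserts `skipped_fields` is empty after the scenario run, turning a silent skip into a red test.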

The Three-Way Check

One more constraint surfaced while building this pattern into a second app: the scenario runner was reading YAML inputs back as outputs. The spec said tax_category: revenue. The runner read that same value from the YAML and compared it to itself. Green test, zero verification.

The fix is three independent sources for every comparison:

Golden values — what the YAML says the answer is.

Reference calculator — pure Ruby, no ActiveRecord, derives the answer independently from business rules.

Production output — the actual system computing the answer for real.

All three must agree. If any two diverge, the test tells you which leg is wrong. Without the third leg, you're testing your spec against your spec.
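A sketch of the three-legged comparison — the helper and the hash shapes are hypothetical, not the actual implementation. When two legs agree, a simple majority vote names the odd one out:

```ruby
golden     = { "tax_category" => "revenue" }  # what the YAML says
reference  = { "tax_category" => "revenue" }  # pure-Ruby calculator's answer
production = { "tax_category" => "revenue" }  # actual system output

# Compares one field across all three sources. Returns :agree, or
# :diverge plus the suspect leg (the one disagreeing with the other two).
def three_way_check(field, golden, reference, production)
  legs = { golden: golden[field], reference: reference[field], production: production[field] }
  return [:agree, legs] if legs.values.uniq.size == 1

  counts  = legs.values.tally
  suspect = legs.find { |_, value| counts[value] == 1 }&.first
  [:diverge, suspect, legs]
end

status, = three_way_check("tax_category", golden, reference, production)
# status => :agree
```

If production instead returned `"expense"`, the check would report `[:diverge, :production, ...]`, pointing straight at the broken leg.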

The Convention

The whole pattern is four files in a Rails app:

lib/accounting_scorecard/
  golden_values/accounting.yml    # scenarios with inputs and expected outputs
  reference_calculator.rb         # pure Ruby — derives answers from rules
  scenario_runner.rb              # creates real records, runs production code
  verification_result.rb          # tracks every field comparison

test/accounting_scorecard/
  completeness_test.rb            # the three checks above, as Minitest assertions

About 2,000 lines. No gem, no framework. A convention.

The agent reads the golden values, writes failing tests, implements until they pass. The completeness check proves it covered everything. The human reviews the YAML — not the diff.

That's what agentic engineering actually looks like. Not faster typing. Better patterns. The spec is structured as dimensions, the scenarios are intersections, bugs map to gaps in the matrix, and the pattern applies to about 30% of your app — the 30% where getting it wrong costs the most.


This is part of a series on spec-driven development with AI agents. Next: Humans Think in Dimensions, Not Test Cases.


Book & App — Launching September 2026

Without Expectation

Debugging Life's Complex Systems

The same systematic approach engineers use to debug complex systems — applied to the complex system of your life. Learn to observe without judgment, distinguish symptoms from root causes, and run small experiments that compound into massive change.

  • 23 chapters
  • AI prompt templates
  • iOS companion app
  • Print, digital & audio