Brave New IA World
Post #006 — The AIRUP Series

Building an AIRUP Prototype

Three pipelines build the same library system. One invents a feature nobody asked for. This is what "show, don't tell" looks like.

Ricardo Costa · ~16 min read

Five posts of theory. RUP can be resurrected. SDD provides the fuel. The Governor prevents waste. The mapping isn't accidental. Here's how we'll measure it. All sound arguments. All untested. Today, we test them. One system. Three pipelines. Zero hand-holding.

The Feature: A Library Management System

The system is deliberately boring. A virtual library — the kind of project you'd assign in a university course. Catalog books. Register readers. Process loans and returns. Handle overdue items. Manage reservations.

Boring is the point. The domain is universally understood. Nobody needs to Google "what is a library loan" or "how does a late fee work." This eliminates domain knowledge as a confounding variable and lets us focus entirely on what matters: how does the process affect the output?

But "boring" doesn't mean "trivial." A library system has genuine complexity hiding under its familiar surface: a state machine governing each book's lifecycle, temporal logic for due dates and late fees, queue ordering for reservation waitlists, and cross-entity constraints like the three-loan limit per reader.

In the benchmarking post, I defined selection criteria: 15-30 requirements, 2-3 components, non-trivial business rules, domain-neutral, testable. The library system checks every box.

Every pipeline received the same initial prompt:

"Build a library management system. It should manage a catalog of books, register readers, handle book loans (borrow/return), calculate late fees for overdue books, support book reservations with a waitlist, and enforce a maximum of 3 concurrent loans per reader."

That's it. Same words. Same expectations. Three very different journeys.


Pipeline U: The Vibe Code

Pipeline U is the baseline: a single agent receives the prompt and generates code directly. No specs. No phases. No review. Just "here's what I want, build it."

The agent produced a working system in minutes. Models for Book, Reader, and Loan. REST endpoints for CRUD operations. A loan service with borrow and return logic. Even basic test files.

First impression: impressive. The code ran. Tests passed. You could borrow a book and return it. If you squinted, it looked done.

Then I looked closer.

What Pipeline U Missed

The pattern was exactly what I described in the benchmarking post as silent scope omission: the output looked complete, passed a casual review, but quietly dropped 40% of the requirements. The agent implemented the obvious features (CRUD, basic loan flow) and skipped the subtle ones (queues, constraints, temporal logic).

Across three runs of Pipeline U, the pattern was consistent: 55-65% of the intended functionality was implemented. The remainder was either missing entirely or present as stub code with no actual logic.

The Vibe Code Verdict

Pipeline U is fast, cheap, and dangerously plausible. It produces something that looks like a working system, passes the "does it run?" test, and would likely survive a demo. It would not survive a user trying to reserve a book.

Pipeline S: SDD Without Governance

Pipeline S adds structured specifications: requirements.md → design.md → tasks.md → code. Multiple agents, each handling a phase. But no RUP phases, no Governor, no quality gates between artifacts.

The requirements agent produced 22 requirements, covering all six feature areas from the prompt. The design agent created interface contracts and a data model. The task agent decomposed into 18 atomic tasks. The implementation agent wrote the code.

Results were measurably better than Pipeline U:

Scope coverage jumped from ~60% (Pipeline U) to ~85% (Pipeline S). The specs worked — having explicit requirements forced the implementation agent to address each one.

The Drift Problem

But Pipeline S introduced a new failure mode: spec-code drift. Without quality gates between phases, inconsistencies accumulated:

The requirements.md specified late fees as "$0.25 per day, calculated from the day after the due date." The design.md described the fee calculation as "configurable per-day rate, applied from the due date." The implementation used the design's interpretation — off by one day and with a configurability feature that nobody requested.
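The off-by-one is easy to see in code. This is a minimal sketch of the two readings, under one plausible interpretation of each document; the function names are mine, not the prototype's:

```python
from datetime import date

DAILY_FEE = 0.25  # $0.25 per day, per requirements.md

def fee_per_requirements(due: date, returned: date) -> float:
    """Fee accrues from the day AFTER the due date (requirements.md)."""
    days_late = (returned - due).days - 1
    return max(days_late, 0) * DAILY_FEE

def fee_per_design(due: date, returned: date) -> float:
    """Fee accrues from the due date itself (design.md's reinterpretation)."""
    days_late = (returned - due).days
    return max(days_late, 0) * DAILY_FEE

due, returned = date(2024, 3, 1), date(2024, 3, 5)
print(fee_per_requirements(due, returned))  # 0.75 (three billable days)
print(fee_per_design(due, returned))        # 1.0  (four billable days)
```

A reader returning four days after the due date is charged a different amount depending on which artifact the implementer happened to follow. Nobody wrote a bug; the bug is the gap between documents.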

This is the spec-code drift I warned about in the benchmarking post. Each phase slightly reinterprets the previous one. Requirements say X. Design says "X, but slightly different." Code follows design. By the end, the implementation doesn't match the original requirement — and nobody notices because nobody checked.

In Pipeline S, the spec existed but wasn't enforced. It was a suggestion, not a contract.

Pipeline A: Full AIRUP

Pipeline A is the full treatment: SDD specs embedded in RUP's four-phase lifecycle, governed by the AI Governor, with deterministic quality gates between every artifact.

Let me walk through the phases.

Inception: Scoping the Problem

The @business-analyst agent parsed the prompt and produced a vision document: problem statement, stakeholders (librarian, reader, system admin), and a preliminary scope. The @requirements agent generated 8 high-level scenarios (BDD format: Given/When/Then). The Governor reviewed for feasibility and flagged the scope as "medium complexity — suitable for single-iteration construction."

Time: ~2 minutes. Output: vision.md + 8 scenarios in requirements.md (draft).

Elaboration: Architecture and Risk

The @requirements agent expanded the 8 scenarios into 24 formal requirements using EARS notation. Each requirement got an ID (REQ-001 through REQ-024), a priority (Must/Should/Could), and a reference to the scenario that motivated it.

Deterministic gate #1: The Governor ran a traceability check. Result: all 8 scenarios were covered by at least one requirement. All requirement IDs were unique. EARS syntax valid for 23 of 24 — one requirement was rephrased.
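The EARS syntax portion of this gate is purely structural, so it costs zero tokens. Here is a rough sketch of what such a check might look like, using simplified regex templates for the standard EARS patterns; real EARS has more nuance, and these patterns are my illustration, not the prototype's actual validator:

```python
import re

# Simplified EARS templates: ubiquitous, event-driven, state-driven, unwanted behaviour.
# A structural check only; it says nothing about whether the requirement makes sense.
EARS_PATTERNS = [
    re.compile(r"^The \w[\w ]* shall .+", re.IGNORECASE),              # ubiquitous
    re.compile(r"^When .+, the \w[\w ]* shall .+", re.IGNORECASE),     # event-driven
    re.compile(r"^While .+, the \w[\w ]* shall .+", re.IGNORECASE),    # state-driven
    re.compile(r"^If .+, then the \w[\w ]* shall .+", re.IGNORECASE),  # unwanted behaviour
]

def is_valid_ears(requirement: str) -> bool:
    return any(p.match(requirement.strip()) for p in EARS_PATTERNS)

reqs = {
    "REQ-007": "When a reader returns a book after its due date, "
               "the system shall calculate a late fee of $0.25 per day.",
    "REQ-019": "Readers can renew loans sometimes.",  # fails the structural check
}
invalid = [rid for rid, text in reqs.items() if not is_valid_ears(text)]
print(invalid)  # ['REQ-019']
```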

The @architect agent produced a design.md with 12 design decisions (DES-001 through DES-012): component structure, data model, API contracts, state machine definition, fee calculation formula. Each decision referenced the requirements it addressed.

Deterministic gate #2: Cross-reference check. Result: every DES-N referenced at least one REQ-N. But two requirements (REQ-019: renewal limits, REQ-022: reader notification on reservation availability) had no corresponding design decision. The gate caught them. The architect agent was re-invoked to address the gap, producing DES-013 and DES-014.

Without this gate, those two requirements would have silently fallen through the cracks — exactly as they did in Pipeline S.

Construction: TDD with Traceability

The @implementer agent decomposed the design into 20 tasks (TASK-001 through TASK-020), ordered by architectural layer: domain entities first, then services, then API endpoints.

Deterministic gate #3: Traceability check. Every TASK-N referenced at least one DES-N. No orphan tasks. No orphan requirements. The full chain — SC → REQ → DES → TASK — was intact.
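Each link in that chain can be verified with plain set operations, assuming each artifact records the upstream IDs it references. A minimal sketch of one link (names illustrative):

```python
# Zero-token traceability gate: find parents nobody covers and children
# that reference nothing valid. Run once per link in SC -> REQ -> DES -> TASK.

def find_orphans(children: dict[str, set[str]], parents: set[str]) -> dict[str, list[str]]:
    """Return uncovered parents and children with no valid parent reference."""
    referenced = set().union(*children.values()) if children else set()
    return {
        "uncovered_parents": sorted(parents - referenced),
        "orphan_children": sorted(cid for cid, refs in children.items()
                                  if not refs & parents),
    }

requirements = {"REQ-001", "REQ-002", "REQ-019"}
designs = {  # DES-N -> the REQ-N it addresses
    "DES-001": {"REQ-001"},
    "DES-002": {"REQ-002"},
}
report = find_orphans(designs, requirements)
print(report)  # REQ-019 has no design decision covering it
```

This is exactly the shape of check that caught REQ-019 in gate #2: a requirement exists, no design decision references it, and the gap is visible without asking a model anything.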

Implementation followed a TDD cycle per task: write test, run test (fail), implement, run test (pass), next task. The Governor monitored iteration counts per task and invoked the circuit breaker twice when the implementer entered a refinement loop on edge case handling.
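The circuit breaker itself can be as simple as an iteration budget per task. A sketch under that assumption; the threshold and names are illustrative, not the prototype's actual values:

```python
MAX_ITERATIONS = 3  # allow 2-3 refinement passes, then cut

def run_task_with_breaker(task_id: str, attempt) -> str:
    """attempt(i) runs one TDD pass and returns True when the tests go green."""
    for iteration in range(1, MAX_ITERATIONS + 1):
        if attempt(iteration):
            return f"{task_id}: converged in {iteration} iteration(s)"
    # Budget exhausted: stop burning tokens and surface the task to a human.
    return f"{task_id}: circuit breaker tripped, escalating to human review"

# A task whose tests pass on the second try:
print(run_task_with_breaker("TASK-007", lambda i: i >= 2))
# A task stuck in a refinement loop:
print(run_task_with_breaker("TASK-012", lambda i: False))
```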

Final result: 24 of 24 requirements implemented. 20 of 20 tasks completed. 47 tests, all passing. State machine fully enforced. Late fee calculation matched the spec exactly ($0.25/day, starting the day after due date). Reservation queue working with proper ordering. Concurrent loan limit enforced.


The Ranking Incident

Now for the best part.

During one of my earlier proof-of-concept runs (before the formal experiment), I used an ungoverned pipeline to build the library system. The agent produced everything I asked for — and then, unbidden, it implemented a reader ranking system. A leaderboard that tracked total pages read per year, calculated from loan history and book metadata, with endpoints for "top readers this month" and "reading streak."

Nobody asked for this. Not in the prompt. Not in any requirement. The agent decided — based on its training data about library systems — that a reader ranking was a good idea. And honestly? It was a good idea. It was well-implemented, well-tested, and genuinely useful.

It was also completely outside the scope.

This is the most interesting moment in the entire experiment, because it illuminates a fundamental question about AI-driven development: when an agent adds scope, is that creativity or scope creep?

How Each Pipeline Handles Uninvited Features

| Pipeline | What Happens | Outcome |
| --- | --- | --- |
| Pipeline U | Agent adds the ranking. No spec to check against. No gate to flag it. It ships. | Scope creep goes undetected. You discover it in code review, or in production. |
| Pipeline S | Agent adds the ranking. The requirement doesn't exist in requirements.md, but nobody checks. It ships. | Spec exists but isn't enforced. The ranking survives because the spec is advisory, not authoritative. |
| Pipeline A | Agent adds TASK-021 for ranking. Deterministic gate fires: "TASK-021 does not reference any DES-N." Governor escalates to human. | Human decides: "Good idea, add REQ-025 and DES-015" or "Out of scope, remove." |

Pipeline A doesn't prevent the agent from having ideas. It prevents ideas from bypassing the governance structure. The ranking feature isn't killed — it's surfaced. The human makes an informed decision with full context: here's what the agent wants to add, here's the scope it was given, do you want to expand the scope?
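The escalation path can be sketched in a few lines. Everything here is illustrative scaffolding: the `decide` callback stands in for whatever human-in-the-loop channel the pipeline uses, and the artifact IDs follow the incident above:

```python
def escalate_out_of_scope(task_id: str, description: str, decide) -> list[str]:
    """Surface an uninvited feature to a human with context; apply their ruling."""
    verdict = decide(task_id, description)  # human answers "accept" or "reject"
    if verdict == "accept":
        # Expand the scope formally so the traceability chain stays intact.
        return [f"REQ-025: {description}", f"DES-015: design for {task_id}", task_id]
    return []  # rejected: the task is dropped and the spec stays authoritative

kept = escalate_out_of_scope(
    "TASK-021",
    "Reader ranking leaderboard (pages read per year)",
    decide=lambda tid, desc: "accept",  # the human says yes this time
)
print(kept)
```

The point is the shape of the flow: the idea is neither silently shipped nor silently discarded. Either way, the artifacts end up consistent.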

"The best process doesn't suppress creativity. It channels it through a gate where a human can say yes or no — with enough context to decide wisely."

This is, I think, the single strongest argument for AIRUP's governance model. Not cost control. Not loop prevention. The ability to distinguish between valuable proactivity and invisible scope creep — and to let a human make that call.


Comparative Results

Here are the aggregated results across three runs per pipeline, using the metrics framework from the previous post:

| Metric | Pipeline U | Pipeline S | Pipeline A |
| --- | --- | --- | --- |
| Total tokens (mean) | ~180K | ~620K | ~710K |
| Scope coverage | 60% | 85% | 100% |
| Cost Efficiency Ratio (tokens / verified requirement) | ~20K | ~36K | ~30K |
| Waste ratio | ~5% | ~22% | ~8% |
| Test pass rate | 92% | 88% | 100% |
| Traceability coverage | N/A | 72% | 100% |
| Orphan rate | N/A | 3 orphans | 0 orphans |
| Defect escape rate | ~6 defects | ~3 defects | ~1 defect |
| Architecture fitness (blinded review, 1-5) | 2.8 | 3.4 | 4.2 |
| Wall-clock time | ~8 min | ~25 min | ~35 min |

Let me highlight the numbers that matter most.

The CER Surprise

Pipeline U used only 180K tokens — the cheapest by far. But it only verified 9 of 24 intended requirements (the ones it actually implemented correctly). Its CER is ~20K tokens per verified requirement.

Pipeline A used 710K tokens — 4x more. But it verified all 24 requirements. Its CER is ~30K tokens per verified requirement. Pipeline A is only 50% more expensive per unit of quality than Pipeline U, while delivering 167% more verified outcomes.

Pipeline S is the interesting middle ground. It used 620K tokens but verified only ~17 requirements (the others had spec-code drift or gaps). Its CER is the worst at ~36K: it spent almost as many tokens as Pipeline A but delivered less verified output. SDD without governance is the most expensive approach per unit of quality.
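The CER arithmetic is worth making explicit. Recomputing it from the table's approximate token totals and verified-requirement counts (small rounding differences are expected, since all figures are rounded):

```python
# CER = total tokens / verified requirements, per the metrics framework.
runs = {
    "Pipeline U": {"tokens": 180_000, "verified": 9},
    "Pipeline S": {"tokens": 620_000, "verified": 17},
    "Pipeline A": {"tokens": 710_000, "verified": 24},
}
for name, r in runs.items():
    cer = r["tokens"] / r["verified"]
    print(f"{name}: CER ~{cer / 1000:.0f}K tokens per verified requirement")
```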

The Core Finding

Adding governance (Pipeline S → Pipeline A) cost only ~15% more tokens but eliminated spec-code drift, closed all traceability gaps, and caught 2 requirements that would otherwise have been silently dropped. The Governor's overhead paid for itself many times over in quality.

The waste ratio tells the same story: Pipeline S wasted 22% of its tokens on circular refinement loops. The Governor's circuit breaker cut Pipeline A's waste to 8%. Those recovered tokens were redirected to actual implementation work.

The Waste Paradox

Pipeline U has the lowest waste ratio (5%) — but that's misleading. Pipeline U wastes almost nothing because it does almost nothing. It doesn't iterate, doesn't refine, doesn't validate. Low waste, low output. Like a student who writes one paragraph and calls the essay done — technically no wasted words, but also no essay.

Pipeline S has the highest waste ratio (22%) because its agents iterate without boundaries. They debate formatting. They refine error messages. They cycle through three versions of a function signature. All legitimate work, none of it wasteful in isolation — but collectively, it represents tokens spent on diminishing returns.

Pipeline A sits at 8% — the Governor's circuit breaker allows 2-3 iterations for convergence, then cuts. The waste that remains is genuine exploration (trying two approaches to an algorithm, evaluating two data models). Healthy waste. The kind that leads to better decisions.


Inside the Prototype

For readers who want to understand the engineering, here's a sketch of the AIRUP prototype architecture. This is not production software — it's a research tool designed for reproducibility and instrumentation.

The Agent Graph

The pipeline is implemented as a directed acyclic graph (DAG) of agent invocations. Each node is an agent with a defined role, a specific prompt template, and a set of allowed tools. Edges represent artifact dependencies.

[Figure: the AIRUP agent pipeline. The Governor orchestrates four agents in sequence: Requirements (specs) → Architect (design) → Implementer (code + tests) → Quality (verify).]

The Governor node wraps every other node. Before each agent invocation, the Governor: (1) selects the model tier, (2) assembles the minimal context, (3) injects cached decisions. After each invocation, it: (4) runs deterministic validations, (5) evaluates convergence, (6) decides whether to proceed, re-invoke, or escalate.
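Those six responsibilities map naturally onto a wrapper function. A minimal sketch, assuming each step is a plain method call; every class, method name, and threshold here is illustrative scaffolding, not the prototype's actual API:

```python
class Governor:
    def __init__(self):
        self.decisions = ["DEC-004: SQLite in-memory for prototype scope"]

    def select_model_tier(self, task):   # (1) route cheap work to cheap models
        return "small" if task["complexity"] == "low" else "large"

    def assemble_context(self, task):    # (2) minimal context, not the whole repo
        return [task["spec"]]

    def cached_decisions(self, task):    # (3) inject settled decisions as facts
        return self.decisions

    def postcheck(self, result):         # (4)-(6) validate, then decide next step
        if result["orphan_refs"]:
            return "escalate"
        return "proceed" if result["converged"] else "re-invoke"

def governed_invoke(agent, task, governor):
    tier = governor.select_model_tier(task)
    context = governor.assemble_context(task) + governor.cached_decisions(task)
    result = agent(task, tier, context)
    return governor.postcheck(result)

# A stub agent that converges cleanly:
agent = lambda task, tier, ctx: {"orphan_refs": [], "converged": True}
print(governed_invoke(agent, {"complexity": "low", "spec": "REQ-001"}, Governor()))
# → proceed
```

The key design point is that the agent never talks to the pipeline directly; every invocation passes through the Governor on the way in and the way out.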

The Decision Log

One of the most useful data structures in the prototype is the decision log — an append-only list of resolved decisions that grows throughout the pipeline. When the architect decides "use an in-memory SQLite database," that decision becomes a fact:

DEC-004 | @architect | "Database: SQLite in-memory for prototype scope" | refs: DES-002

Every subsequent agent receives relevant decisions as context facts. The implementer doesn't need to re-derive the database choice — it's injected. The quality agent doesn't need to question it — it's a settled decision with a clear author and rationale.

In my runs, the decision log prevented an average of 4-6 re-derivations per pipeline execution — cases where an agent would have spent 500-1,000 tokens re-analyzing a question that was already answered.
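The structure itself is tiny. A minimal sketch whose fields follow the DEC-004 entry shown above, though the class itself is my illustration rather than the prototype's code:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Decision:
    dec_id: str
    author: str
    statement: str
    refs: tuple[str, ...]  # artifact IDs this decision is attached to

@dataclass
class DecisionLog:
    _entries: list = field(default_factory=list)

    def append(self, decision: Decision) -> None:
        self._entries.append(decision)  # append-only: no update, no delete

    def relevant_to(self, artifact_ids: set[str]) -> list[Decision]:
        """Decisions to inject into an agent's context as settled facts."""
        return [d for d in self._entries if set(d.refs) & artifact_ids]

log = DecisionLog()
log.append(Decision("DEC-004", "@architect",
                    "Database: SQLite in-memory for prototype scope", ("DES-002",)))
# The implementer working on DES-002 receives the decision, not the debate:
print([d.dec_id for d in log.relevant_to({"DES-002"})])  # ['DEC-004']
```

The append-only property is what makes it cheap: there is no reconciliation step, only filtering on the way into each agent's context.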


Seven Lessons from Building the Prototype

1. Deterministic gates are the highest-ROI component

Zero tokens. Milliseconds of execution. Caught 3-5 structural issues per run that would otherwise have propagated into code. If I could keep only one piece of AIRUP, it would be the deterministic traceability validation between phases.

2. The Governor's biggest value isn't cost — it's escalation

I built the Governor expecting its main contribution to be token savings through tiered routing. In practice, its most valuable function was knowing when to stop and ask a human. The ranking incident. The two orphan requirements. The circular refinement loop. Each time, the Governor's intervention prevented a problem that tokens alone couldn't solve.

3. SDD without governance is worse than no SDD

This was the most surprising result. Pipeline S (SDD without governance) had the worst cost efficiency ratio of all three pipelines. It spent nearly as many tokens as Pipeline A but delivered measurably less quality. The specs created a false sense of completeness while introducing drift. If you're going to write specs, you need to enforce them. Otherwise, you're just generating documentation that nobody reads.

4. Agents are excellent at CRUD, terrible at constraints

Across all pipelines, agents consistently implemented the "happy path" (create book, borrow book, return book) and consistently struggled with constraints (max concurrent loans, reservation priority, state machine transitions). This suggests that the primary value of structured requirements is in making constraints explicit — the obvious behavior would be implemented anyway.
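For concreteness, here is the kind of constraint the agents kept skipping: a hard limit checked before the happy path runs. A minimal sketch; in the real system this would live in a loan service, and the names are illustrative:

```python
MAX_CONCURRENT_LOANS = 3  # the limit from the original prompt

def borrow(active_loans: list[str], book_id: str) -> list[str]:
    if book_id in active_loans:
        raise ValueError(f"{book_id} already on loan to this reader")
    if len(active_loans) >= MAX_CONCURRENT_LOANS:
        raise ValueError("reader has reached the 3-loan limit")  # the constraint
    return active_loans + [book_id]

loans: list[str] = []
for book in ["B-001", "B-002", "B-003"]:
    loans = borrow(loans, book)  # happy path: agents got this right
try:
    borrow(loans, "B-004")       # the constraint: agents got this wrong
except ValueError as e:
    print(e)  # reader has reached the 3-loan limit
```

Two guard clauses. Trivial to write, trivial to test, and consistently absent unless a requirement named them explicitly.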

5. The decision log is underrated

I almost cut this feature from the prototype as "nice to have." It turned out to be essential. Without it, agents in Pipeline S re-derived the same architectural decisions 2-3 times across the pipeline, spending ~15% of their total tokens on questions that were already answered.

6. Wall-clock time is the wrong optimization target

Pipeline U took 8 minutes. Pipeline A took 35 minutes. A 4x slowdown sounds bad until you realize that Pipeline A produces a system with full traceability, complete scope coverage, and documented architecture. The alternative — building with Pipeline U and then spending hours debugging missing features — is slower in total, even though it's faster in the first pass.

7. The experiment design from Post #005 held up

The three-pipeline comparison, the nine metrics, the pre-registered hypotheses — they all worked as designed. The threats I identified (LLM non-determinism, feature selection bias, experimenter effect) were real but manageable. The methodology is sound for a thesis defense.


From Prototype to Thesis

This prototype is not AIRUP in its final form. It's a research instrument — built to test hypotheses, not to ship products. The agent prompts are not optimized. The Governor's decision logic is simple rule-based, not learned. The feature set is limited to a library system.

But the signal is clear. Structured specifications, embedded in an iterative process with automated governance, produce measurably better outcomes than unstructured or ungoverned alternatives. The improvement is not dramatic in any single dimension — it's cumulative across scope coverage, quality, waste reduction, and architectural fitness.

The ranking incident alone justifies the governance model. In a world where AI agents can and will add features you didn't ask for, having a structural mechanism to surface those additions and route them through human judgment is not overhead. It's a safety net.

RUP gave us the process structure. SDD gave us the spec format. The Governor gave us the economics. The library system gave us the evidence.

"Theory without evidence is philosophy. Evidence without theory is data. A thesis needs both — and a library system to tie them together."

Next up: The Economics of AI-Driven Development — what these experiments tell us about the real cost of AI-driven software development, why "tokens per feature" is the new "story points," and whether formal processes are economically viable when the executor is a machine.

AIRUP Prototype · Case Study · Library System · Multi-Agent Systems · Scope Creep · Traceability · SDD · AI Governor

Ricardo Costa

Software engineer exploring the intersection of classical software processes and AI-driven development. Currently pursuing a master's degree researching AIRUP — an AI-first approach to the Rational Unified Process.