Over the past four posts, I've built the conceptual case for AIRUP: why RUP deserves a second chance, what SDD brings to the table, how the AI Governor controls cost, and why the SDD-RUP mapping isn't accidental. But concepts don't graduate into theses. Evidence does. This post is about how I intend to produce that evidence — and why the obvious metrics are the wrong ones.
The Benchmarking Problem
Ask anyone building multi-agent systems whether their approach works, and you'll get enthusiastic demos. An agent writes code! Another agent reviews it! They iterate! Tests pass! Ship it!
But "it works" is not a benchmark. "It works better than the alternative" is. And that requires answering uncomfortable questions:
- Better than what? A single agent? An ungoverned multi-agent pipeline? A human team?
- Better how? Faster? Cheaper? Higher quality? More consistent?
- Better at what scale? A TODO app? A 50-requirement enterprise feature? A regulated system?
Without answering these questions rigorously, any claim about AIRUP — or any other AI-driven development approach — is anecdote dressed as evidence. My thesis needs more than anecdotes.
"The most dangerous sentence in AI engineering is 'it worked in the demo.' Demos are curated experiences. Benchmarks are adversarial ones."
Why Token Count Is the Wrong Metric
Let me address the obvious metric first, because it's the one people reach for instinctively: total tokens consumed.
At first glance, it makes sense. Tokens cost money. Fewer tokens = cheaper = better. Simple. And wrong.
Here's why. Consider two pipelines implementing the same feature:
- Pipeline A uses 200,000 tokens. Produces code that passes 90% of tests. Has 3 orphan requirements that nobody noticed. No traceability. Architectural decisions undocumented.
- Pipeline B uses 800,000 tokens. Produces code that passes 100% of tests. Full traceability chain. All architectural decisions recorded. Zero orphan requirements.
Pipeline A is "4x more efficient" by token count. Pipeline B is objectively better by every quality dimension. If Pipeline A's missing requirements surface in production, the cost of fixing them will dwarf the token savings.
Token count is an input metric. It tells you how much fuel you burned, not how far you traveled. The right metric is tokens per unit of verified outcome — how many tokens did it take to produce one requirement that's implemented, tested, and traceable? That's an efficiency metric with a denominator that actually means something.
As I discussed in the Governor post, token costs have fallen by more than 99% over the past two years. Optimizing for raw token count is like optimizing for disk space in 2026 — solving a constraint that's rapidly becoming irrelevant. What matters is what the tokens produce, not how many you spent.
The Three Dimensions That Actually Matter
My experimental framework evaluates AIRUP across three orthogonal dimensions. Each captures something the others miss:
Dimension 1: Coordination
Do agents work together coherently, or do they produce contradictory artifacts? In a multi-agent pipeline, coordination failures manifest as:
- Artifact conflicts — the architect specifies PostgreSQL, the implementer uses MongoDB
- Rework cycles — agent B rejects agent A's output, A revises, B rejects again
- Orphan artifacts — requirements that no design decision addresses, or tasks that don't trace to any requirement
- Redundant work — two agents independently solving the same sub-problem
Coordination is what RUP's role-based structure is supposed to improve. If AIRUP doesn't measurably reduce coordination failures compared to ungoverned pipelines, the whole thesis falls apart.
Dimension 2: Cost Efficiency
Not raw token count, but normalized cost: resources consumed per unit of value delivered. The specific metric I'm using is:
Cost Efficiency Ratio
CER = Total Tokens / Verified Outcomes
Where Verified Outcomes = number of requirements that are (a) implemented in code, (b) covered by at least one passing test, and (c) traceable through the full chain (REQ→DES→TASK→TEST→CODE).
A pipeline that uses 1M tokens to fully implement 20 requirements (CER = 50K) is more efficient than one that uses 300K tokens to implement 4 requirements (CER = 75K) — even though the latter used 70% fewer tokens.
I also track waste ratio: the percentage of tokens spent on iterations that didn't advance the pipeline (circular debates, re-derivations of settled decisions, validations that could have been deterministic). The Governor's strategies should drive this number down significantly.
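Both ratios are simple enough to express as code. Here's a minimal sketch; the `Requirement` fields are hypothetical names for the three conditions in the verified-outcome definition, not AIRUP's actual artifact schema:

```python
from dataclasses import dataclass

@dataclass
class Requirement:
    # Hypothetical fields mirroring the three "verified outcome" conditions
    implemented: bool   # (a) implemented in code
    tested: bool        # (b) covered by at least one passing test
    traceable: bool     # (c) full REQ -> DES -> TASK -> TEST -> CODE chain

def cost_efficiency_ratio(total_tokens: int, requirements: list[Requirement]) -> float:
    """Tokens per verified outcome (lower is better)."""
    verified = sum(r.implemented and r.tested and r.traceable for r in requirements)
    if verified == 0:
        return float("inf")  # nothing verified: worst possible efficiency
    return total_tokens / verified

def waste_ratio(circular_iterations: int, total_iterations: int) -> float:
    """Share of iterations that didn't advance the pipeline."""
    return circular_iterations / total_iterations if total_iterations else 0.0

# The worked example from above: 1M tokens / 20 verified vs. 300K / 4
reqs_a = [Requirement(True, True, True)] * 20
reqs_b = [Requirement(True, True, True)] * 4
print(cost_efficiency_ratio(1_000_000, reqs_a))  # 50000.0
print(cost_efficiency_ratio(300_000, reqs_b))    # 75000.0
```

Note the `float("inf")` branch: a pipeline that verifies nothing has no meaningful efficiency, no matter how few tokens it spent.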
Dimension 3: Output Quality
The hardest dimension to measure, and the most important. Quality has multiple sub-dimensions:
| Quality Metric | How Measured | What It Indicates |
|---|---|---|
| Test pass rate | % of generated tests that pass | Code correctness against spec |
| Traceability coverage | % of requirements with full chain (REQ→DES→TASK→TEST) | Specification completeness |
| Orphan rate | # of artifacts not referenced by any downstream artifact | Scope leakage or incomplete design |
| Defect escape rate | # of defects found in post-pipeline review / total requirements | Pipeline's ability to catch its own mistakes |
| Architecture fitness | Expert review score (1-5) on design decision quality | Structural soundness (qualitative) |
| Spec-code alignment | % of spec statements that accurately describe the implemented code | Whether the spec stayed honest |
The first four metrics are quantitative and automatable. Architecture fitness and spec-code alignment require human judgment — which introduces subjectivity, but is unavoidable for quality assessment. I mitigate this by using blinded review: the reviewer sees the artifacts but doesn't know which pipeline produced them.
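To make "automatable" concrete, the orphan-rate check can be a single pass over an artifact reference graph. The ID scheme (`REQ-*`, `TEST-*`) and the upstream-reference mapping are assumptions for illustration, not AIRUP's actual format:

```python
def orphan_rate(artifacts: dict[str, list[str]]) -> float:
    """artifacts maps an artifact ID to the upstream IDs it references.

    An artifact is an orphan if no downstream artifact references it.
    Terminal artifacts (TEST-*) are exempt: nothing is expected downstream.
    """
    referenced = {ref for refs in artifacts.values() for ref in refs}
    candidates = [a for a in artifacts if not a.startswith("TEST-")]
    orphans = [a for a in candidates if a not in referenced]
    return len(orphans) / len(candidates) if candidates else 0.0

artifacts = {
    "REQ-001": [],
    "REQ-002": [],                 # orphan: no design addresses it
    "DES-001": ["REQ-001"],
    "TASK-001": ["DES-001"],
    "TEST-001": ["TASK-001"],
}
print(orphan_rate(artifacts))  # 0.25 — one orphan out of four non-terminal artifacts
```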
The Experiment
Here's the experimental design I'm using. It's not perfect — I'll address the imperfections in the threats section — but it's rigorous enough to produce meaningful results.
Three Pipelines, Same Feature
Each feature is implemented three times, once by each pipeline configuration:
| Pipeline | Configuration | What It Tests |
|---|---|---|
| Pipeline U (Unstructured) | Single prompt → agent generates code directly. No specs, no phases, no review. | Baseline: what happens with zero process |
| Pipeline S (SDD-only) | SDD pipeline: requirements.md → design.md → tasks.md → code. Multi-agent, but no phases, no Governor, no quality gates between artifacts. | Isolates SDD's contribution without RUP |
| Pipeline A (AIRUP) | Full AIRUP: SDD specs + RUP phases + AI Governor + deterministic quality gates + traceability validation. | The full thesis: SDD + RUP + Governor |
Pipeline U is the "vibe coding" baseline — the approach most developers use today when prompting AI agents. Pipeline S isolates SDD's effect. Pipeline A is the full AIRUP treatment. By comparing U→S, I can measure what structured specs add. By comparing S→A, I can measure what RUP-style governance adds on top of specs.
Why Three Pipelines, Not Two?
A two-way comparison (U vs. A) would tell me whether AIRUP is better than no process — but it wouldn't tell me which parts of AIRUP matter. Maybe the specs alone explain all the improvement, and the Governor adds nothing. Maybe the Governor is the differentiator, and the specs are just noise. The three-pipeline design lets me decompose AIRUP's effect into its constituent contributions.
Feature Selection Criteria
Not every feature is suitable for benchmarking. A TODO app doesn't need a process — any pipeline will produce a reasonable result. A distributed banking system is too complex for controlled experimentation. I need features that sit in the sweet spot: complex enough to differentiate the pipelines, simple enough to implement multiple times.
My selection criteria:
- 15-30 requirements — enough that coordination and traceability matter, not so many that a single run takes days
- 2-3 components — at least one component boundary, because cross-component coordination is where processes prove their value
- Non-trivial business rules — validation logic, state machines, or conditional behavior that creates edge cases
- Domain-neutral — no specialized domain knowledge that might advantage one pipeline configuration
- Testable — every requirement should be verifiable with an automated test
Example features I'm considering: a library reservation system (state machine: available → reserved → borrowed → returned, with overdue handling), an event ticketing system (capacity limits, waitlists, refund rules), a recipe management API (CRUD + search + ingredient scaling + dietary filtering). None of these are trivial, but all are well-understood enough that an expert can evaluate the output quality.
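To give a feel for the "non-trivial business rules" criterion, here's one way the library reservation feature's state machine might be specified. The states and transition table are my illustrative reading of the feature description, not a finalized spec:

```python
from enum import Enum, auto

class BookState(Enum):
    AVAILABLE = auto()
    RESERVED = auto()
    BORROWED = auto()
    OVERDUE = auto()
    RETURNED = auto()

# Allowed transitions for the hypothetical library reservation feature
TRANSITIONS: dict[BookState, set[BookState]] = {
    BookState.AVAILABLE: {BookState.RESERVED, BookState.BORROWED},
    BookState.RESERVED:  {BookState.BORROWED, BookState.AVAILABLE},  # pickup or cancel
    BookState.BORROWED:  {BookState.RETURNED, BookState.OVERDUE},
    BookState.OVERDUE:   {BookState.RETURNED},
    BookState.RETURNED:  {BookState.AVAILABLE},  # back on the shelf
}

def transition(current: BookState, target: BookState) -> BookState:
    """Enforce the transition table; illegal moves raise instead of silently passing."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Every edge in this table is a requirement a pipeline can silently drop, and every missing edge is an edge case a test should catch, which is exactly what makes this class of feature good at differentiating pipelines.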
Execution Protocol
For each feature × pipeline combination, the execution follows a strict protocol:
- Input standardization — Each pipeline receives the same initial prompt describing the feature. For Pipeline U, this is the only input. For S and A, the prompt seeds the requirements phase.
- No human intervention — Once started, the pipeline runs autonomously. No human fixes, no mid-run corrections. If the pipeline produces broken output, that's a data point.
- Fixed model versions — All runs use the same LLM versions, pinned to specific model IDs. This eliminates model drift as a confounding variable.
- Multiple runs — Each feature × pipeline combination runs 3 times (minimum) to account for LLM non-determinism. Results are reported as mean ± standard deviation.
- Full instrumentation — Every LLM call is logged: prompt, response, token count, model used, latency. This enables post-hoc analysis of where tokens were spent and where waste occurred.
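The instrumentation step amounts to writing one structured record per LLM call to an append-only log. A minimal sketch, with all field names hypothetical:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class LLMCallRecord:
    pipeline: str          # "U", "S", or "A"
    agent: str             # which agent made the call
    model: str             # pinned model ID, per the fixed-versions rule
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    timestamp: float

def log_call(record: LLMCallRecord, path: str = "calls.jsonl") -> None:
    """Append one call record as a JSON line for post-hoc waste analysis."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

JSON Lines keeps the log append-only and trivially parseable, so attributing tokens to pipelines, agents, and iterations afterwards is a one-liner with any dataframe library.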
The Metrics Framework
Here's the concrete measurement plan. For each run, I collect:
| Metric | Type | Collection Method | Expected AIRUP Advantage |
|---|---|---|---|
| Total tokens (input + output) | Quantitative | LLM call logs | Higher than U, similar or lower than S |
| Cost Efficiency Ratio (CER) | Quantitative | Tokens / verified outcomes | Lowest (best) of all three |
| Waste ratio | Quantitative | Circular iterations / total iterations | Lowest (Governor prevents loops) |
| Test pass rate | Quantitative | Test runner output | Highest |
| Traceability coverage | Quantitative | Chain validation script | Near 100% (deterministic gates enforce it) |
| Orphan rate | Quantitative | Graph analysis on artifact IDs | Near 0% (gates catch orphans) |
| Defect escape rate | Quantitative | Post-pipeline expert review | Lowest |
| Architecture fitness | Qualitative | Blinded expert review (1-5) | Highest (architecture phase exists) |
| Spec-code alignment | Qualitative | Blinded expert review (%) | Highest (specs are source of truth) |
| Wall-clock time | Quantitative | Start-to-finish timer | Longer than U, but acceptable trade |
Notice the "Expected AIRUP Advantage" column. I'm pre-registering my hypotheses — stating what I expect to find before running the experiments. This is important because post-hoc explanations are easy to fabricate. Pre-registration forces intellectual honesty: if the results don't match my predictions, I have to explain why, not just rationalize them.
The most interesting prediction: AIRUP will use more total tokens than Pipeline U, but have a better Cost Efficiency Ratio. In other words, AIRUP is more expensive in absolute terms but cheaper per unit of quality. If this holds, it validates the core thesis: the overhead of structured specs and governance pays for itself in output quality.
Threats to Validity (The Honest Part)
No experimental design is perfect. Here are the threats I've identified and how I'm mitigating them:
1. LLM Non-Determinism
Even with temperature set to 0, LLMs produce slightly different outputs across runs. This means two runs of the same pipeline on the same feature can yield different results.
Mitigation: Multiple runs per configuration (minimum 3, ideally 5). Report results as distributions, not single numbers. Use statistical tests (Mann-Whitney U, since distributions may not be normal) to assess whether differences between pipelines are significant or just noise.
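In practice I'd reach for `scipy.stats.mannwhitneyu` for the full test; for intuition, here's a stdlib-only sketch of the U statistic itself (rank-sum form, average ranks for ties, no p-value):

```python
def mann_whitney_u(x: list[float], y: list[float]) -> float:
    """U statistic for sample x vs. y: U = R1 - n1(n1+1)/2."""
    combined = sorted((v, 0 if i < len(x) else 1) for i, v in enumerate(x + y))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # find the run of tied values and give each the average rank
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    r1 = sum(r for r, (_, grp) in zip(ranks, combined) if grp == 0)
    n1 = len(x)
    return r1 - n1 * (n1 + 1) / 2

# Hypothetical per-run CERs: Pipeline A vs. Pipeline S
print(mann_whitney_u([50_000, 52_000, 48_000], [75_000, 80_000, 70_000]))  # 0.0
```

A U of 0 means one sample's values sit entirely below the other's, which is the pattern I'm hoping to see in the CER comparison, assuming the runs cooperate.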
2. Feature Selection Bias
If I pick features where structured processes obviously help (e.g., state machines with complex transitions), I'm stacking the deck in AIRUP's favor. If I pick features where they don't (CRUD endpoints), I'm undermining it.
Mitigation: Use a range of complexity levels. Include at least one "simple" feature where Pipeline U should perform well, one "medium" feature, and one "complex" feature. Report results per complexity tier, not just in aggregate. If AIRUP only helps at higher complexity, that's a legitimate finding — not a failure.
3. Experimenter Effect
I designed AIRUP. I wrote the prompts. I built the Governor. I'm deeply biased toward wanting it to succeed. This is the most serious threat.
Mitigation: Three strategies. First, the initial prompts for all three pipelines are derived from the same feature description — I don't give AIRUP better instructions. Second, the blinded expert review ensures that quality assessment is not influenced by knowledge of which pipeline produced the output. Third, I'm publishing the prompts, the code, and the raw data — anyone can reproduce the experiment and verify the results.
4. Model Dependency
Results obtained with GPT-4o might not hold for Claude, Gemini, or DeepSeek. AIRUP might work well with one model family and fail with another.
Mitigation: The primary experiments use a single model family (pinned version) for consistency. A secondary experiment reruns a subset of features with a different model family to test generalizability. Full cross-model testing is out of scope for the thesis but noted as future work.
5. Scale Limitation
Features with 15-30 requirements are not enterprise-scale. AIRUP's value proposition grows with complexity, but I can't test at enterprise scale within a thesis timeline.
Mitigation: Acknowledge the limitation explicitly. Frame results as "at this complexity level" and extrapolate cautiously. Use the complexity-tier analysis to show the trend — if AIRUP's advantage grows from simple to medium to complex features, it's reasonable to hypothesize that the trend continues at larger scale, even though I can't prove it directly.
"A thesis that hides its weaknesses isn't rigorous — it's fragile. A thesis that names them, bounds them, and works within their constraints is one that reviewers trust."
Early Signals
I haven't completed the full experimental run yet, but I've done enough preliminary work to share some early signals. These are observations, not results — the sample size is too small for statistical claims. But they're interesting enough to report.
Signal 1: Unstructured pipelines fail silently
Pipeline U (single prompt → code) consistently produces code that looks right but misses 20-40% of the requirements. Not dramatically — not a crash or an error. Just quiet omissions. A validation rule that wasn't implemented. An edge case that wasn't handled. A state transition that was forgotten.
This is the most dangerous failure mode: the output looks complete, passes a casual review, but silently drops scope. Without a requirements checklist to verify against, you wouldn't know anything was missing until a user hits the gap in production.
Signal 2: SDD catches the omissions, but creates new problems
Pipeline S (SDD without governance) catches most of the scope gaps — because the requirements are written down and the implementation agent can check them off. But it introduces a new problem: spec-code drift. Without quality gates between phases, the design.md sometimes contradicts the requirements.md, and the implementation follows the design (which is closer in the pipeline) rather than the requirements (which are authoritative).
In one run, the requirements specified a maximum of 5 items per user. The design document mentioned this constraint but described it as "configurable." The implementation made it configurable with a default of 10. The original requirement was effectively overridden by the design document's interpretation — a subtle but consequential drift.
Signal 3: The Governor's biggest win is not cost
I expected the AI Governor's main contribution to be cost reduction through tiered model routing and contextual pruning. In practice, its biggest impact has been catching circular reasoning.
Without the Governor, agents in Pipeline S occasionally enter a refinement loop where Agent A proposes something, Agent B requests a change, Agent A revises, Agent B finds a new issue in the revision, and so on. These loops converge eventually, but they converge on marginal improvements — spending thousands of tokens to debate whether an error message should say "invalid input" or "input validation failed."
The Governor's circuit breaker cuts these loops after 3 iterations and either accepts the current version or escalates. In preliminary runs, this alone reduced Pipeline A's token consumption by roughly 15-25% compared to Pipeline S — not by being smarter, but by being impatient with diminishing returns.
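Stripped of everything else, the circuit breaker is an iteration counter around the review loop. A sketch under stated assumptions: `review` and `revise` are hypothetical stand-ins for the two agents, and the post-cutoff policy (accept vs. escalate) is simplified to a status string:

```python
def run_review_loop(draft, review, revise, max_iterations=3):
    """Refine an artifact until approval or the iteration budget runs out.

    review(artifact)  -> (approved: bool, feedback: str)
    revise(artifact, feedback) -> new artifact
    """
    artifact = draft
    for _ in range(max_iterations):
        approved, feedback = review(artifact)
        if approved:
            return artifact, "approved"
        artifact = revise(artifact, feedback)
    # Budget exhausted: stop debating diminishing returns
    return artifact, "accepted_after_cutoff"  # or escalate to a human
```

The point isn't sophistication; it's that an unconditional cutoff converts an open-ended debate into a bounded cost, regardless of how persuasive the agents find each other.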
Signal 4: Deterministic validation is underrated
The traceability chain validation — a purely deterministic check that every REQ has a DES, every DES has a TASK, every TASK has a TEST — catches 3-5 structural issues per feature run that would otherwise propagate into code. These are the kinds of issues that an LLM reviewer might or might not notice, depending on context window and attention.
The deterministic check notices every single time, costs zero tokens, and takes milliseconds. It's the highest-ROI component in the entire pipeline, and it only works because SDD's structured format makes it possible. You can't run a traceability check on prose requirements.
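The chain check itself is a few set operations. This sketch assumes a simplified artifact store where each downstream artifact records a single upstream parent; the real format is richer, but the logic is the same:

```python
def validate_chain(reqs, des_to_req, task_to_des, test_to_task):
    """Deterministic traceability check: every REQ must reach a TEST.

    Each dict maps a downstream artifact ID to its upstream parent ID.
    Returns broken links as human-readable strings; empty list = chain intact.
    """
    issues = []
    covered_reqs = set(des_to_req.values())
    covered_des = set(task_to_des.values())
    covered_tasks = set(test_to_task.values())
    for req in reqs:
        if req not in covered_reqs:
            issues.append(f"{req} has no design decision")
    for des in des_to_req:
        if des not in covered_des:
            issues.append(f"{des} has no task")
    for task in task_to_des:
        if task not in covered_tasks:
            issues.append(f"{task} has no test")
    return issues

issues = validate_chain(
    reqs=["REQ-001", "REQ-002"],
    des_to_req={"DES-001": "REQ-001"},
    task_to_des={"TASK-001": "DES-001"},
    test_to_task={},
)
print(issues)  # ['REQ-002 has no design decision', 'TASK-001 has no test']
```

No LLM, no context window, no attention budget: the check either finds the broken link or there isn't one.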
The Emerging Picture
The early signals suggest that AIRUP's advantage isn't dramatic in any single dimension — it's cumulative across multiple small improvements. Better scope coverage (vs. U). Fewer spec-code drifts (vs. S). Less circular waste. Deterministic safety nets. None of these alone is revolutionary. Together, they compound into a meaningfully better outcome.
Whether this picture holds at full experimental scale is what the next few months will tell.
What This Experiment Doesn't Test
Intellectual honesty requires naming the questions I'm not answering:
AIRUP vs. human teams. Comparing an AI pipeline to a human team is a different experiment with different controls, different timescales, and different ethical considerations. I'm comparing AI pipeline configurations against each other — not against humans. The question isn't "can AI replace developers?" but "given that you're using AI agents, does a structured process help?"
Long-term maintainability. I can measure whether the code is correct and well-structured at time of creation. I can't measure whether it's maintainable six months later, because the experiment doesn't run for six months. Maintainability is probably AIRUP's strongest argument (documented architecture, traceable requirements, living specs), but I can't prove it within this design.
Team dynamics. AIRUP involves a human in the loop for key decisions. How well does this work with different types of humans? Does AIRUP frustrate experienced architects who "just want to code"? Does it help junior developers who need structure? These are interaction design questions, not process evaluation questions.
Regulated environments. AIRUP's traceability chain is designed for contexts where auditing matters — healthcare, finance, defense. But testing in these environments requires domain expertise and regulatory knowledge that I don't have. I can argue that the traceability chain satisfies audit requirements; I can't prove it.
From Concept to Evidence
This post marks the inflection point of the AIRUP series. Posts 1-4 built the theoretical case: what AIRUP is, why it should work, how its components fit together. Post 5 describes the empirical apparatus: how I intend to test whether it actually does.
The experiment isn't designed to prove that AIRUP is the best possible approach to AI-driven development. That would be an absurd claim. It's designed to test a more modest hypothesis: that adding structured specifications and process governance to a multi-agent pipeline produces measurably better outcomes than not adding them, at a cost that's proportionate to the benefit.
If the data supports this hypothesis, AIRUP has a case. If it doesn't — if the overhead of specs and governance outweighs the quality gains, or if the gains are statistically insignificant — then I have a different kind of thesis: one that explains why the intuitive appeal of structured processes doesn't survive contact with empirical testing. That would also be a valuable contribution.
Either way, the data wins. That's how science works.
"A good thesis isn't one that confirms the researcher's beliefs. It's one where the researcher designed an experiment that could have falsified them — and then ran it anyway."
Next up: Building an AIRUP Prototype — from experimental design to running code. The architecture of the pipeline, the agent prompts, the Governor's decision engine, and the mistakes I made along the way.