Brave New IA World
Post #007 — Behind the Scenes

The RUP AI Kit

Six agents. One pipeline. Zero meetings. A tour of the actual machinery that makes AIRUP work — the agents, their artifacts, and the protocol that keeps them from losing context.

Ricardo Costa ~14 min read

The previous posts built the case for why AIRUP works. This one shows what it looks like. Consider this a tour of the factory floor — the agents, the artifacts they produce, the directory structure they populate, and the protocol that prevents information from getting lost between phases. No theory today. Just the machinery.

The Cast

The RUP AI Kit has six agents. Five are direct mappings of classical RUP roles. One — the Governor — is new, as I explained in Post #003. Each agent has a persona, a defined scope, and clear boundaries on what it can and cannot do.

| Agent | RUP Role | Produces | Cannot Do |
|---|---|---|---|
| 👑 Governor | New role (no RUP equivalent) | progression.md, routing decisions, escalations | Write specs, code, or tests. Only orchestrates. |
| 📋 Business Analyst | Business Analyst | vision.md, glossary.md, stakeholders.md, business-rules.md, business-processes.md | Write requirements, design, or code. |
| 📋 Requirements Analyst | Requirements Specifier | requirements.md, use_cases.md | Make architecture decisions or write code. |
| 🏛️ Architect | Software Architect | architecture.md, design_decisions.md, sequence_diagrams.md, deployment.md | Write implementation code or tests. |
| 🔀 Developer | Developer | Source code, TDD scaffolds in tests/dev/, implementation notes | Write verification tests. Only dev scaffolds. |
| 🧪 QA Analyst | Tester | traceability_matrix.md, test_plan.md, verification_report.md, tests in tests/qa/ | Fix bugs or modify source code. |

The "Cannot Do" column is as important as the "Produces" column. Role separation is enforced, not suggested. The Developer writes code but cannot write verification tests — that's the QA's job. The QA can write independent tests in tests/qa/ but cannot modify the Developer's code. The Governor orchestrates but never produces artifacts directly. These boundaries prevent the kind of role confusion that plagues ungoverned multi-agent systems, where every agent tries to do everything.

"In a team of humans, role boundaries are social conventions. In a team of AI agents, they're system constraints. The latter is more reliable."

The SDD Directory Structure

Every project managed by the RUP AI Kit follows a standard directory layout. The specification artifacts live in spec/docs/, organized by RUP discipline. The code lives in src/. Tests are split between tests/dev/ (Developer's TDD scaffolds) and tests/qa/ (QA's independent verification tests).

The SDD Tree

spec/docs/
  00-overview/    → progression.md
  01-business/    → vision, glossary, stakeholders, rules, processes
  02-requirements/ → requirements.md, use_cases.md
  03-design/      → architecture, decisions, sequences
  04-implementation/ → implementation notes, conventions
  05-test/        → traceability matrix, test plan, verification report
  06-deployment/  → deployment spec, infra notes
  07-change-mgmt/ → tech debt register, evolution roadmap
  tasks/          → task decomposition for implementation

Each directory maps to a RUP discipline. Each agent knows which directories it owns and which it can only read. The Business Analyst writes to 01-business/. The Architect writes to 03-design/ and 06-deployment/. The QA writes to 05-test/ and 07-change-mgmt/. Nobody writes outside their scope.
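These ownership rules are simple enough to enforce mechanically. A minimal sketch of what such a check could look like — the agent names and directories mirror the table above, but the function and data structure are mine, not the Kit's actual implementation:

```python
# Hypothetical sketch: per-agent write scopes over the SDD tree.
# Directories follow the layout above; the API itself is illustrative.
WRITE_SCOPES = {
    "business_analyst": ["spec/docs/01-business/"],
    "requirements_analyst": ["spec/docs/02-requirements/"],
    "architect": ["spec/docs/03-design/", "spec/docs/06-deployment/"],
    "developer": ["src/", "tests/dev/", "spec/docs/04-implementation/"],
    "qa_analyst": ["spec/docs/05-test/", "spec/docs/07-change-mgmt/", "tests/qa/"],
    "governor": ["spec/docs/00-overview/"],  # progression.md only
}

def may_write(agent: str, path: str) -> bool:
    """Return True if `agent` is allowed to write to `path`."""
    return any(path.startswith(prefix) for prefix in WRITE_SCOPES.get(agent, []))
```

The point of making this a lookup rather than a convention: a write outside scope can be rejected deterministically, before any LLM judgment is involved.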

The one exception is 00-overview/progression.md — the Governor's territory. More on that in a moment.

The Pipeline: Two Directions

The Kit supports two pipeline configurations, depending on whether you're starting from scratch or reverse-engineering an existing codebase.

Greenfield: Idea → Code

Greenfield Pipeline: 👑 Gov → 📋 Business → 📋 Requirements → 🏛️ Architect → 🔀 Developer → 🧪 QA

This is the forward flow: understand the business, specify requirements, design the architecture, implement, verify. The Governor bootstraps progression.md, routes to each agent in sequence, runs quality gates between phases, and makes the final GO / CONDITIONAL_GO / NO_GO call.

Brownfield: Code → Spec

Brownfield / Reverse Engineering Pipeline: 👑 Gov → 🔀 Developer → 🏛️ Architect → 📋 Requirements → 📋 Business → 🧪 QA

The reverse flow starts with code analysis. The Developer reads the existing codebase and produces implementation notes. The Architect infers the architecture and documents decisions. The Requirements Analyst extracts functional and non-functional requirements from code behavior. The Business Analyst reconstructs the business context. QA closes the loop with a traceability audit: does every requirement map to existing code? Are there code paths that don't trace to any requirement?

This is the pipeline I used on a real brownfield project — a Java microservice with 167 classes, Spring Boot 2.7, MySQL, Kafka, and zero documentation. The Kit produced 24 specification artifacts (488 KB of structured markdown), identified 83 functional requirements, 22 technical debts (4 critical), and delivered a CONDITIONAL_GO verdict with a six-phase evolution roadmap. All from reading code.


The Progression Protocol

This is the feature that changed everything, and it wasn't in the original design. It came from a bug.

Early in the Kit's development, I noticed a pattern: the Business Analyst would identify a key constraint (e.g., "all transactions must be idempotent"). The Requirements Analyst would capture it as a requirement. But by the time the Developer started implementing, three phases later, the constraint had faded from context. The Developer would produce non-idempotent code. QA would catch it. The Governor would route back to the Developer. The Developer would fix it. Three wasted iterations because information decayed across phase boundaries.

I call this the telephone game problem — each agent passes context to the next, and each handoff introduces information loss. The specs themselves captured the what (requirements, design decisions), but they didn't capture the "why it matters," the "watch out for this," or the "I'm not sure about that."

The Progression Protocol solves this with a single artifact: progression.md.

What Is progression.md?

An append-only log maintained exclusively by the Governor. After each agent completes its phase, the Governor conducts a structured debrief (five standard questions) and writes a Handoff Entry summarizing:

• What the agent produced and key decisions made
• Unresolved Questions (UQ-NNN) — things the agent couldn't answer, inherited by downstream agents
• Assumptions (AS-NNN) — things the agent assumed but couldn't verify, flagged as mandatory verification targets for QA
• Confidence assessment (🟢 high / 🟡 medium / 🔴 low) per knowledge area
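To make the format concrete, here is what a Handoff Entry might look like. The structure follows the fields above; the IDs and content are invented for illustration:

```markdown
## Handoff Entry — Phase: Requirements (illustrative example)

**Produced:** requirements.md, use_cases.md
**Key decisions:** Adopted EARS syntax for all functional requirements.
**Unresolved Questions:**
- UQ-003: Acceptable latency for report generation was never specified.
**Assumptions:**
- AS-007: Assumed all transactions must be idempotent (QA must verify).
**Confidence:** 🟢 functional scope · 🟡 non-functional · 🔴 reporting
**Watch out for:** The payment flow has an implicit ordering constraint
not captured in any use case.
```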

Two design choices are critical: the log is append-only (entries are never rewritten), and only the Governor writes to it. Specialist agents contribute through the debrief, not by editing the file.

The Debrief Protocol asks five questions at the end of every phase:

  1. What did you produce? (artifacts list)
  2. What decisions did you make and why?
  3. What couldn't you determine? (unresolved questions)
  4. What did you assume without verifying? (assumptions)
  5. What should the next agent watch out for?

Question 5 is the secret weapon. It captures tacit knowledge — the kind of thing a human engineer would mention in a hallway conversation but never write in a spec. "The payment module has a weird race condition when two requests hit simultaneously." "The customer said they want real-time updates, but I think they'd accept 30-second polling." These insights don't belong in requirements or design documents. They belong in the progression log, where downstream agents can benefit from them.

"Specifications tell agents what to build. The progression log tells them what to watch out for while building it."

The Construction Protocol

The most recent evolution of the Kit added a structured protocol for the Construction phase — where the Developer and QA work together task by task. This was born from another observed failure: the Developer would implement all tasks in one batch, then QA would review everything at once, find issues scattered across multiple tasks, and the rework loop became chaotic.

The Construction Protocol enforces a task-by-task rhythm:

Construction Phase Flow (per task): 👑 Gov (route task) → 🔀 Dev (implement) → 🧪 QA (verify) → 👑 Gov (gate)

The Governor routes one task at a time. The Developer implements it and writes a TDD scaffold in tests/dev/. QA then verifies the implementation against four mechanical checklists, each producing typed findings.

Each finding gets a typed ID: [API-001], [DOM-003], [REQ-012], [BR-005]. If findings exist, the Governor routes back to the Developer. Maximum 3 cycles before the Governor escalates to the human. This cap prevents the infinite refinement loops I described in the Governor post.
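The per-task loop with its three-cycle cap can be sketched as follows. The cap matches the Kit's limit; the function names and callback shape are mine:

```python
# Illustrative sketch of the Governor's per-task construction loop.
# `implement`, `verify`, and `escalate_to_human` stand in for agent calls.
MAX_CYCLES = 3  # hard cap before the circuit breaker trips

def run_task(task, implement, verify, escalate_to_human):
    """Route one task through Dev → QA, re-routing on findings up to MAX_CYCLES."""
    for cycle in range(1, MAX_CYCLES + 1):
        implementation = implement(task)          # Developer phase
        findings = verify(task, implementation)   # QA checklists → typed findings
        if not findings:
            return ("PASS", cycle)                # gate passes, move to next task
        task = {**task, "findings": findings}     # route back with findings attached
    escalate_to_human(task)                       # circuit breaker: human decides
    return ("ESCALATED", MAX_CYCLES)
```

The cap is what turns a potentially infinite refinement loop into a bounded cost: either the task converges within three cycles, or a human gets the context and the findings.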

The Anti-Tautology Protocol

One of the subtler problems in AI-driven testing: agents write tests that look right but verify nothing. A test that asserts result is not None when the function always returns something is technically passing but functionally useless. I call these tautological tests.
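A minimal illustration, with an invented function under test:

```python
def apply_discount(price: float, rate: float) -> float:
    """Apply a percentage discount to a price."""
    return round(price * (1 - rate), 2)

# Tautological: passes for ANY implementation that returns something,
# including one that computes the discount backwards.
def test_discount_tautological():
    assert apply_discount(100.0, 0.2) is not None

# Meaningful: pins the actual contract and fails if the formula is wrong.
def test_discount_meaningful():
    assert apply_discount(100.0, 0.2) == 80.0
```

Both tests pass today. Only the second one would catch a regression.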

The QA agent includes an Anti-Tautology Protocol that flags seven recognizable patterns of tests that pass without verifying behavior.

When the QA finds tautological tests in the Developer's tests/dev/, it flags them. More importantly, the QA writes its own independent verification tests in tests/qa/, providing a second layer of test coverage that doesn't depend on the Developer's assumptions.


The Governor Up Close

I've described the Governor's five strategies and its role in the library system case study. Here's what it actually does in the Kit, operationally:

Bootstrap

When a new project starts, the Governor detects the mode (Greenfield, Brownfield, or Evolve) and creates the directory structure. It reads existing code if present, initializes progression.md, and determines the pipeline direction.

Routing

The Governor decides which agent to invoke next. This is usually sequential (Business → Requirements → Architect → Developer → QA), but the Governor can re-route if a quality gate fails. It also handles the Construction phase's task-by-task loop.

Quality Gates

Between every phase, the Governor runs deterministic checks: ID uniqueness, cross-reference completeness, EARS syntax validation, traceability chain coverage. These catch structural issues at zero token cost.
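A sketch of what two of these checks could look like as pure text analysis — the real Kit's gates are richer, and the ID grammar here is my assumption based on the ID prefixes used throughout the post:

```python
import re

# Assumed ID grammar: REQ/UC/BR/AS/UQ followed by a three-digit number.
ID_PATTERN = re.compile(r"\b(?:REQ|UC|BR|AS|UQ)-\d{3}\b")

def check_id_uniqueness(defined_ids):
    """Return IDs defined more than once (should be empty)."""
    seen, dupes = set(), set()
    for i in defined_ids:
        (dupes if i in seen else seen).add(i)
    return sorted(dupes)

def check_cross_references(spec_text, defined_ids):
    """Return referenced IDs that are never defined (dangling references)."""
    referenced = set(ID_PATTERN.findall(spec_text))
    return sorted(referenced - set(defined_ids))
```

Checks like these run in milliseconds and consume no tokens, which is why they sit in front of any LLM-based review.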

Progression Management

After each agent's debrief, the Governor writes the Handoff Entry. It carries forward unresolved questions and assumptions. It assesses confidence per area. It provides downstream agents with curated context — not just the raw specs, but the interpretive layer that makes the specs actionable.

Circuit Breaking

Maximum iterations per loop. Convergence detection. Cost ceiling per phase. Escalation to human when the Governor can't resolve a dispute between agents.

Final Verdict

The Governor aggregates the QA's verification report, the traceability matrix, and the unresolved questions into a final verdict:

| Verdict | Meaning | Criteria |
|---|---|---|
| GO | Ship it | All requirements traced and verified. Zero critical tech debt. No unresolved questions. |
| CONDITIONAL_GO | Ship with known risks | Requirements met but tech debt exists or assumptions unverified. Evolution roadmap required. |
| NO_GO | Do not ship | Critical requirements unmet, structural defects, or unresolvable contradictions. |
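The decision logic in the table reduces to a small function. This is my reading of the criteria, not the Kit's actual code, and the parameter names are mine:

```python
# Hypothetical encoding of the verdict criteria above.
def final_verdict(requirements_met: bool, structural_defects: bool,
                  tech_debt_items: int, unverified_assumptions: int,
                  unresolved_questions: int) -> str:
    """Aggregate QA outputs into GO / CONDITIONAL_GO / NO_GO."""
    if not requirements_met or structural_defects:
        return "NO_GO"
    if tech_debt_items or unverified_assumptions or unresolved_questions:
        return "CONDITIONAL_GO"
    return "GO"
```

Under this reading, the brownfield run described below (requirements met, 22 tech debts, open assumptions) lands on CONDITIONAL_GO, which matches the verdict the Kit actually delivered.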

A Real Brownfield Run

To give this concreteness: I ran the Brownfield pipeline on an existing Java microservice — a judicial block processing system with Spring Boot 2.7, MySQL, Kafka, and ShedLock. The codebase had 167 production classes, 63 test classes, and zero specification documents. Nobody had ever written down what the system was supposed to do.

The Kit produced 24 specification artifacts totaling 488 KB of structured markdown: among them, 83 functional requirements, a tech debt register with 22 items (4 critical), and a six-phase evolution roadmap. All extracted from code analysis. All cross-referenced with traceability IDs. All living in the repository next to the code.

The team that maintains this microservice had been working on it for over a year without any of this documentation. It took the Kit approximately two hours to produce what would have taken a human analyst weeks — and what, in practice, would never have been produced at all.

"The best documentation is the documentation that exists. The RUP AI Kit's real contribution isn't better documentation — it's making documentation economically viable."

What Building the Kit Taught Me

The Governor is the hardest agent to build

The five specialists have clear scopes: read these inputs, produce these outputs. The Governor has to judge — when to re-route, when to escalate, when to accept "good enough." Every bad judgment compounds downstream. It's also the agent whose prompt evolved the most: from 4K chars in v1 to over 16K chars in the current version, with the Progression Protocol and Construction Protocol adding significant complexity.

Role separation pays for itself

The strict "cannot do" boundaries initially felt like artificial constraints. Why prevent the Developer from writing verification tests? Because when one agent both produces and verifies its own work, it tests its implementation, not the specification. The QA, working from the spec without seeing the code first, catches a different class of bugs. The separation creates genuine independent verification.

The Progression Protocol was the biggest upgrade

Before the Protocol, agents produced excellent individual artifacts but lost context between phases. The Protocol didn't change what the agents produce — it changed what they know about what came before. That single change reduced rework loops by roughly 40% in my experiments.

Brownfield is harder than greenfield

Reverse-engineering specs from existing code requires the agents to infer intent from implementation — the opposite of the normal flow. The Business Analyst has to guess why a piece of code exists, not just what it does. Requirements are labeled as "INFERRED" or "OBSERVED" rather than "DEFINED." This creates a fundamentally different confidence profile, and the QA has to work harder to verify inferences against behavior.

488 KB of specs is not too much

When I saw the brownfield output, my first reaction was "this is too much documentation." Then I realized: the cost of producing it was near-zero (AI agents, cheap tokens). The cost of not having it was measured in developer hours spent reading code, asking colleagues, and reverse-engineering behavior. 488 KB of specs saves thousands of hours of context recovery. It's not over-documentation — it's documentation at the price point where it finally makes sense.


What's Next for the Kit

The current Kit (v2) implements three core mechanisms: the Progression Protocol, the Construction Protocol with task-by-task flow, and the Anti-Tautology Protocol. Two more waves are planned but not yet implemented:

Wave 2: Mechanical gates. Currently, quality gates are a mix of deterministic checks and LLM-based review. Wave 2 would add hard mechanical gates: compilation must pass, linter must pass, test suite must execute, coverage must exceed threshold. These run before the QA agent even looks at the code — catching basic errors at zero token cost.

Wave 3: Integration testing. Testcontainers for database integration tests, contract testing between components, end-to-end flow verification with realistic test data. This is where the Kit moves from "does it compile and pass unit tests?" to "does it actually work in a realistic environment?"

These waves are proposals — not commitments. The Kit evolves based on what the experiments reveal. If Wave 2 gates catch issues that the current Protocol already catches, they're redundant. If they catch a new category of issue, they're valuable. Evidence first, features second.

Next up: The Economics of AI-Driven Development — the series finale. What these experiments tell us about the real cost structure of AI-driven software development, why formal processes are now economically viable, and what it means when the price of intelligence reaches zero.

RUP AI Kit AIRUP Multi-Agent Systems Progression Protocol Brownfield Reverse Engineering AI Governor SDD Quality Assurance

Ricardo Costa

Software engineer exploring the intersection of classical software processes and AI-driven development. Currently pursuing a master's degree researching AIRUP — an AI-first approach to the Rational Unified Process.