The previous posts built the case for why AIRUP works. This one shows what it looks like. Consider this a tour of the factory floor — the agents, the artifacts they produce, the directory structure they populate, and the protocol that prevents information from getting lost between phases. No theory today. Just the machinery.
The Cast
The RUP AI Kit has six agents. Five are direct mappings of classical RUP roles. One — the Governor — is new, as I explained in Post #003. Each agent has a persona, a defined scope, and clear boundaries on what it can and cannot do.
| Agent | RUP Role | Produces | Cannot Do |
|---|---|---|---|
| 👑 Governor | New role (no RUP equivalent) | progression.md, routing decisions, escalations | Write specs, code, or tests. Only orchestrates. |
| 📋 Business Analyst | Business Analyst | vision.md, glossary.md, stakeholders.md, business-rules.md, business-processes.md | Write requirements, design, or code. |
| 📋 Requirements Analyst | Requirements Specifier | requirements.md, use_cases.md | Make architecture decisions or write code. |
| 🏛️ Architect | Software Architect | architecture.md, design_decisions.md, sequence_diagrams.md, deployment.md | Write implementation code or tests. |
| 🔀 Developer | Developer | Source code, TDD scaffolds in tests/dev/, implementation notes | Write verification tests. Only dev scaffolds. |
| 🧪 QA Analyst | Tester | traceability_matrix.md, test_plan.md, verification_report.md, tests in tests/qa/ | Fix bugs or modify source code. |
The "Cannot Do" column is as important as the "Produces" column. Role separation is enforced, not suggested. The Developer writes code but cannot write verification tests — that's the QA's job. The QA can write independent tests in tests/qa/ but cannot modify the Developer's code. The Governor orchestrates but never produces artifacts directly. These boundaries prevent the kind of role confusion that plagues ungoverned multi-agent systems, where every agent tries to do everything.
"In a team of humans, role boundaries are social conventions. In a team of AI agents, they're system constraints. The latter is more reliable."
The SDD Directory Structure
Every project managed by the RUP AI Kit follows a standard directory layout. The specification artifacts live in spec/docs/, organized by RUP discipline. The code lives in src/. Tests are split between tests/dev/ (Developer's TDD scaffolds) and tests/qa/ (QA's independent verification tests).
The SDD Tree
```
spec/docs/
  00-overview/       → progression.md
  01-business/       → vision, glossary, stakeholders, rules, processes
  02-requirements/   → requirements.md, use_cases.md
  03-design/         → architecture, decisions, sequences
  04-implementation/ → implementation notes, conventions
  05-test/           → traceability matrix, test plan, verification report
  06-deployment/     → deployment spec, infra notes
  07-change-mgmt/    → tech debt register, evolution roadmap
tasks/               → task decomposition for implementation
```
Each directory maps to a RUP discipline. Each agent knows which directories it owns and which it can only read. The Business Analyst writes to 01-business/. The Architect writes to 03-design/ and 06-deployment/. The QA writes to 05-test/ and 07-change-mgmt/. Nobody writes outside their scope.
The one exception is 00-overview/progression.md — the Governor's territory. More on that in a moment.
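These ownership rules are mechanical enough to enforce with a simple guard. Here's a minimal sketch in Python, assuming the directory mapping described above; the agent keys and the may_write function are hypothetical illustrations, not the Kit's actual interface.

```python
from pathlib import PurePosixPath

# Hypothetical write-scope table based on the ownership described above;
# illustrative only, not the Kit's enforcement code.
WRITE_SCOPES = {
    "governor": ["spec/docs/00-overview"],
    "business_analyst": ["spec/docs/01-business"],
    "requirements_analyst": ["spec/docs/02-requirements"],
    "architect": ["spec/docs/03-design", "spec/docs/06-deployment"],
    "developer": ["src", "tests/dev", "spec/docs/04-implementation"],
    "qa_analyst": ["spec/docs/05-test", "spec/docs/07-change-mgmt", "tests/qa"],
}

def may_write(agent: str, path: str) -> bool:
    """True only if the target path falls inside one of the agent's directories."""
    target = PurePosixPath(path)
    return any(target.is_relative_to(scope) for scope in WRITE_SCOPES.get(agent, []))
```

With a guard like this, a Developer attempting to write into tests/qa/ fails at the system level instead of relying on prompt discipline, which is the whole point of boundaries as constraints rather than conventions.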
The Pipeline: Two Directions
The Kit supports two pipeline configurations, depending on whether you're starting from scratch or reverse-engineering an existing codebase.
Greenfield: Idea → Code
This is the forward flow: understand the business, specify requirements, design the architecture, implement, verify. The Governor bootstraps progression.md, routes to each agent in sequence, runs quality gates between phases, and makes the final GO / CONDITIONAL_GO / NO_GO call.
Brownfield: Code → Spec
The reverse flow starts with code analysis. The Developer reads the existing codebase and produces implementation notes. The Architect infers the architecture and documents decisions. The Requirements Analyst extracts functional and non-functional requirements from code behavior. The Business Analyst reconstructs the business context. QA closes the loop with a traceability audit: does every requirement map to existing code? Are there code paths that don't trace to any requirement?
This is the pipeline I used on a real brownfield project — a Java microservice with 167 classes, Spring Boot 2.7, MySQL, Kafka, and zero documentation. The Kit produced 24 specification artifacts (488 KB of structured markdown), identified 83 functional requirements, 22 technical debts (4 critical), and delivered a CONDITIONAL_GO verdict with a six-phase evolution roadmap. All from reading code.
The Progression Protocol
This is the feature that changed everything, and it wasn't in the original design. It came from a bug.
Early in the Kit's development, I noticed a pattern: the Business Analyst would identify a key constraint (e.g., "all transactions must be idempotent"). The Requirements Analyst would capture it as a requirement. But by the time the Developer started implementing, three phases later, the constraint had faded from context. The Developer would produce non-idempotent code. QA would catch it. The Governor would route back to the Developer. The Developer would fix it. Three wasted iterations because information decayed across phase boundaries.
I call this the telephone game problem — each agent passes context to the next, and each handoff introduces information loss. The specs themselves captured the what (requirements, design decisions), but they didn't capture the "why it matters," the "watch out for this," or the "I'm not sure about that."
The Progression Protocol solves this with a single artifact: progression.md.
What Is progression.md?
An append-only log maintained exclusively by the Governor. After each agent completes its phase, the Governor conducts a structured debrief (five standard questions) and writes a Handoff Entry summarizing:
- What the agent produced and key decisions made
- Unresolved Questions (UQ-NNN) — things the agent couldn't answer, inherited by downstream agents
- Assumptions (AS-NNN) — things the agent assumed but couldn't verify, flagged as mandatory verification targets for QA
- Confidence assessment (🟢 high / 🟡 medium / 🔴 low) per knowledge area
The critical design choices:
- Append-only: Nobody edits previous entries. The log is a faithful record of what each agent knew and didn't know at handoff time.
- Governor-maintained: Only the Governor writes to progression.md. Agents provide raw debrief answers; the Governor synthesizes them into the Handoff Entry. This ensures consistency and prevents agents from editorializing their own performance.
- Inherited questions: An unresolved question (UQ-003) raised by the Business Analyst doesn't disappear. It gets carried forward in every subsequent Handoff Entry until an agent resolves it — or until the Governor escalates it to the human.
- Assumptions are verification targets: When the Requirements Analyst assumes "the system handles at most 1,000 concurrent users" (AS-005), that assumption is automatically added to the QA's verification checklist. If QA can't verify it, the assumption is flagged in the final verdict.
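Putting these choices together, a single Handoff Entry might look something like this (an illustrative sketch reusing the UQ-003 and AS-005 examples from this post, not the Kit's exact template):

```markdown
## Handoff: Requirements Analyst → Architect

**Produced:** requirements.md, use_cases.md
**Key decisions:** Adopted EARS notation for all functional requirements.

**Unresolved Questions (inherited + new):**
- UQ-003 (from Business Analyst): Which transactions require idempotency?

**Assumptions (QA verification targets):**
- AS-005: The system handles at most 1,000 concurrent users.

**Confidence:** requirements 🟢 · domain rules 🟡 · performance targets 🔴
```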
The Debrief Protocol asks five questions at the end of every phase:
1. What did you produce? (artifacts list)
2. What decisions did you make and why?
3. What couldn't you determine? (unresolved questions)
4. What did you assume without verifying? (assumptions)
5. What should the next agent watch out for?
Question 5 is the secret weapon. It captures tacit knowledge — the kind of thing a human engineer would mention in a hallway conversation but never write in a spec. "The payment module has a weird race condition when two requests hit simultaneously." "The customer said they want real-time updates, but I think they'd accept 30-second polling." These insights don't belong in requirements or design documents. They belong in the progression log, where downstream agents can benefit from them.
"Specifications tell agents what to build. The progression log tells them what to watch out for while building it."
The Construction Protocol
The most recent evolution of the Kit added a structured protocol for the Construction phase — where the Developer and QA work together task by task. This was born from another observed failure: the Developer would implement all tasks in one batch, then QA would review everything at once, find issues scattered across multiple tasks, and the rework loop became chaotic.
The Construction Protocol enforces a task-by-task rhythm:
route task → implement → verify → gate → (next task)
The Governor routes one task at a time. The Developer implements it and writes a TDD scaffold in tests/dev/. QA verifies the implementation against four mechanical checklists:
- API Contract — Do endpoints match the design spec? Correct methods, paths, status codes?
- Domain Model — Do entities match the architecture? Correct fields, types, constraints?
- Requirements Traceability — Does the code implement the requirement it claims to? Is any requirement logic missing?
- Business Rules — Are business rules from 01-business/business-rules.md correctly encoded?
Each finding gets a typed ID: [API-001], [DOM-003], [REQ-012], [BR-005]. If findings exist, the Governor routes back to the Developer. Maximum 3 cycles before the Governor escalates to the human. This cap prevents the infinite refinement loops I described in the Governor post.
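That rhythm reduces to a small control loop. Here's a hedged sketch, where developer, qa, and escalate are hypothetical stand-ins for the actual agent invocations:

```python
# Illustrative sketch of the task-by-task Construction loop with its
# three-cycle circuit breaker; not the Kit's actual orchestration code.
MAX_CYCLES = 3

def run_task(task, developer, qa, escalate):
    """Route one task: implement, verify, gate; escalate after MAX_CYCLES."""
    findings = []
    for cycle in range(MAX_CYCLES):
        impl = developer.implement(task, findings)  # includes tests/dev scaffold
        findings = qa.verify(impl)                  # typed IDs, e.g. ["[API-001]"]
        if not findings:
            return "PASS"                           # gate passes; Governor routes next task
    return escalate(task, findings)                 # human decides after 3 failed cycles
```

The cap is the important part: the loop cannot run forever, and the failure mode is an explicit escalation rather than silent churn.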
The Anti-Tautology Protocol
One of the subtler problems in AI-driven testing: agents write tests that look right but verify nothing. A test that asserts result is not None when the function always returns something is technically passing but functionally useless. I call these tautological tests.
The QA agent includes an Anti-Tautology Protocol that flags seven patterns:
- Assertions on `is not None` or truthiness when the return type guarantees non-null
- Assertions on `len(x) > 0` without verifying what is in the collection
- Tests that only verify the happy path with no failure/edge cases
- Mock setups that return the exact value being asserted (the test proves the mock works, not the code)
- Tests that duplicate another test with trivially different inputs
- Tests with no domain-relevant assertion (only structural checks)
- Tests where the description doesn't match what's actually verified
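To make the first two patterns concrete, here's a contrast on a hypothetical apply_discount function; both the function and the tests are illustrative, not taken from the Kit:

```python
def apply_discount(price: float, rate: float) -> float:
    """Hypothetical function under test: price reduced by rate, rounded to cents."""
    return round(price * (1 - rate), 2)

def test_tautological():
    # Flagged by the protocol: asserts non-null on a function that always
    # returns a float, and only exercises the happy path.
    assert apply_discount(100.0, 0.25) is not None

def test_meaningful():
    # Domain-relevant assertions: exact value, zero-rate boundary, zero-price edge.
    assert apply_discount(100.0, 0.25) == 75.0
    assert apply_discount(100.0, 0.0) == 100.0
    assert apply_discount(0.0, 0.5) == 0.0
```

The first test passes no matter what the function computes; the second would catch a wrong rate, a missing rounding step, or an inverted formula.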
When the QA finds tautological tests in the Developer's tests/dev/, it flags them. More importantly, the QA writes its own independent verification tests in tests/qa/, providing a second layer of test coverage that doesn't depend on the Developer's assumptions.
The Governor Up Close
I've described the Governor's five strategies and its role in the library system case study. Here's what it actually does in the Kit, operationally:
Bootstrap
When a new project starts, the Governor detects the mode (Greenfield, Brownfield, or Evolve) and creates the directory structure. It reads existing code if present, initializes progression.md, and determines the pipeline direction.
Routing
The Governor decides which agent to invoke next. This is usually sequential (Business → Requirements → Architect → Developer → QA), but the Governor can re-route if a quality gate fails. It also handles the Construction phase's task-by-task loop.
Quality Gates
Between every phase, the Governor runs deterministic checks: ID uniqueness, cross-reference completeness, EARS syntax validation, traceability chain coverage. These catch structural issues at zero token cost.
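One of those checks, ID uniqueness, is simple enough to sketch. This is a minimal illustration using the RF/NFR/UQ/AS ID shapes that appear in this series; the function is hypothetical, not the Kit's gate code:

```python
import re

# Matches the typed IDs used across the specs, e.g. RF-01, NFR-30, UQ-003, AS-005.
ID_PATTERN = re.compile(r"\b(RF|NFR|UQ|AS)-\d+\b")

def duplicate_ids(spec_text: str) -> set[str]:
    """Return IDs defined more than once (a line 'defines' an ID it starts with)."""
    seen, dupes = set(), set()
    for line in spec_text.splitlines():
        m = ID_PATTERN.match(line.strip())
        if m:
            ident = m.group(0)
            if ident in seen:
                dupes.add(ident)
            seen.add(ident)
    return dupes
```

Checks like this are pure string processing: deterministic, instant, and free, which is exactly why they run before any LLM-based review.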
Progression Management
After each agent's debrief, the Governor writes the Handoff Entry. It carries forward unresolved questions and assumptions. It assesses confidence per area. It provides downstream agents with curated context — not just the raw specs, but the interpretive layer that makes the specs actionable.
Circuit Breaking
Maximum iterations per loop. Convergence detection. Cost ceiling per phase. Escalation to human when the Governor can't resolve a dispute between agents.
Final Verdict
The Governor aggregates the QA's verification report, the traceability matrix, and the unresolved questions into a final verdict:
| Verdict | Meaning | Criteria |
|---|---|---|
| GO | Ship it | All requirements traced and verified. Zero critical tech debt. No unresolved questions. |
| CONDITIONAL_GO | Ship with known risks | Requirements met but tech debt exists or assumptions unverified. Evolution roadmap required. |
| NO_GO | Do not ship | Critical requirements unmet, structural defects, or unresolvable contradictions. |
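The criteria column reduces to a small decision function. This is a sketch under the assumption that these four signals are available at verdict time; the field names are hypothetical, not the Governor's actual data model:

```python
# Illustrative mapping of the verdict table above; structural defects and
# unresolvable contradictions are folded into all_reqs_verified for brevity.
def final_verdict(all_reqs_verified: bool, critical_debt: int,
                  unresolved_questions: int, unverified_assumptions: int) -> str:
    if not all_reqs_verified:
        return "NO_GO"
    if critical_debt == 0 and unresolved_questions == 0 and unverified_assumptions == 0:
        return "GO"
    return "CONDITIONAL_GO"   # ship with known risks; roadmap required
```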
A Real Brownfield Run
To give this concreteness: I ran the Brownfield pipeline on an existing Java microservice — a judicial block processing system with Spring Boot 2.7, MySQL, Kafka, and ShedLock. The codebase had 167 production classes, 63 test classes, and zero specification documents. Nobody had ever written down what the system was supposed to do.
The Kit produced:
- 5 business artifacts — vision (problem statement + positioning), glossary (30 domain terms + 15 acronyms), stakeholder map (15 stakeholders with influence×interest matrix), 28 business rules, 6 macro business processes with Mermaid flowcharts
- 2 requirements artifacts — 83 functional requirements (RF-01 to RF-83) in EARS notation across 13 domains, 30 non-functional requirements (NFR-01 to NFR-30), 18 use cases with main/alternative/exception flows
- 4 design artifacts — hexagonal architecture documentation, 9 ADRs, component diagrams, sequence diagrams for critical flows
- Quality verdict: CONDITIONAL_GO — 22 technical debts identified (4 critical: Spring Boot EOL, credentials in git, no authentication, no circuit breaker), 54% test coverage, evolution roadmap with 6 phases (~26 weeks estimated)
Total output: 24 artifacts, 488 KB of structured markdown. All extracted from code analysis. All cross-referenced with traceability IDs. All living in the repository next to the code.
The team that maintains this microservice had been working on it for over a year without any of this documentation. It took the Kit approximately two hours to produce what would have taken a human analyst weeks — and what, in practice, would never have been produced at all.
"The best documentation is the documentation that exists. The RUP AI Kit's real contribution isn't better documentation — it's making documentation economically viable."
What Building the Kit Taught Me
The Governor is the hardest agent to build
The five specialists have clear scopes: read these inputs, produce these outputs. The Governor has to judge — when to re-route, when to escalate, when to accept "good enough." Every bad judgment compounds downstream. It's also the agent whose prompt evolved the most: from 4K chars in v1 to over 16K chars in the current version, with the Progression Protocol and Construction Protocol adding significant complexity.
Role separation pays for itself
The strict "cannot do" boundaries initially felt like artificial constraints. Why prevent the Developer from writing verification tests? Because when one agent both produces and verifies its own work, it tests its implementation, not the specification. The QA, working from the spec without seeing the code first, catches a different class of bugs. The separation creates genuine independent verification.
The Progression Protocol was the biggest upgrade
Before the Protocol, agents produced excellent individual artifacts but lost context between phases. The Protocol didn't change what the agents produce — it changed what they know about what came before. That single change reduced rework loops by roughly 40% in my experiments.
Brownfield is harder than greenfield
Reverse-engineering specs from existing code requires the agents to infer intent from implementation — the opposite of the normal flow. The Business Analyst has to guess why a piece of code exists, not just what it does. Requirements are labeled as "INFERRED" or "OBSERVED" rather than "DEFINED." This creates a fundamentally different confidence profile, and the QA has to work harder to verify inferences against behavior.
488 KB of specs is not too much
When I saw the brownfield output, my first reaction was "this is too much documentation." Then I realized: the cost of producing it was near-zero (AI agents, cheap tokens). The cost of not having it was measured in developer hours spent reading code, asking colleagues, and reverse-engineering behavior. 488 KB of specs saves thousands of hours of context recovery. It's not over-documentation — it's documentation at the price point where it finally makes sense.
What's Next for the Kit
The current Kit (v2) implements three core mechanisms: the Progression Protocol, the Construction Protocol with task-by-task flow, and the Anti-Tautology Protocol. Two more waves are planned but not yet implemented:
Wave 2: Mechanical gates. Currently, quality gates are a mix of deterministic checks and LLM-based review. Wave 2 would add hard mechanical gates: compilation must pass, linter must pass, test suite must execute, coverage must exceed threshold. These run before the QA agent even looks at the code — catching basic errors at zero token cost.
Wave 3: Integration testing. Testcontainers for database integration tests, contract testing between components, end-to-end flow verification with realistic test data. This is where the Kit moves from "does it compile and pass unit tests?" to "does it actually work in a realistic environment?"
These waves are proposals — not commitments. The Kit evolves based on what the experiments reveal. If Wave 2 gates catch issues that the current Protocol already catches, they're redundant. If they catch a new category of issue, they're valuable. Evidence first, features second.
Next up: The Economics of AI-Driven Development — the series finale. What these experiments tell us about the real cost structure of AI-driven software development, why formal processes are now economically viable, and what it means when the price of intelligence reaches zero.