Case Study · Architecture

Multi-Agent PR Review System

Coordinated agents that review pull requests for architecture consistency, dependency impact, test coverage, and historical context — operating as a team rather than a single reviewer.

Private project

Status: BUILDER
Year: 2026
Role: Solo build
Stack: Claude, Git, GraphRAG

Problem

The quality of a code review is bounded by the reviewer's working context. A reviewer familiar with the changed service, its downstream consumers, the conventions of the surrounding module, and the historical decisions that shaped the code can produce a high-quality review in minutes. A reviewer without that context produces feedback that catches surface-level issues and misses the structural ones.

Large pull requests compound the problem. A change that touches ten files may depend on conventions, services, and prior decisions that no single reviewer holds in working memory. The author often has the most context and is the least objective. Senior reviewers have the most context and the least time. Junior reviewers have the most time and the least context.

The traditional remedy is process: required reviewers, approval checklists, architecture review boards. Process adds latency without expanding reviewer context. It scales organizational structure around a problem that is fundamentally about information retrieval and synthesis — and is therefore bounded by what each reviewer can hold in mind.

LLM-assisted review does not solve this by itself. A language model asked to review a pull request with no architectural context produces fluent, plausible, and often wrong feedback. The model does not know the system. It does not know the conventions. It does not know which decisions were deliberate.

Architecture Overview

The platform is structured as a pipeline that progressively enriches a pull request before reasoning begins. Repository analysis extracts the diff and links it into the existing knowledge graph. The graph is traversed to surface blast radius: dependencies, related services, historical decisions, and architectural rules. Retrieval augmentation combines graph traversal with vector similarity to assemble a single context package. Specialized agents review the assembled context in parallel. An aggregator synthesizes their findings into one review package.

The architectural choice is deliberate: do not ask a single language model to review a pull request directly. A single model has no guaranteed access to architectural context. The quality of its review is bounded by what it sees in its prompt — not by what it could reason about given the right context.

The pipeline trades compositional complexity for review quality. Each stage is independently debuggable, replaceable, and improvable. The knowledge graph is the canonical artifact; every other stage consumes or produces graph data. The cost is operational: more components to run, more failure modes to handle, and more latency between webhook and review.

Instead of asking a single language model to review a pull request directly, the system assembles architectural context, routes work to specialized agents, and aggregates recommendations into a single review package.

Architecture Diagram

The pipeline shows the seven stages from raw pull request to posted review. Each stage transforms the representation of the change in a specific way: from a webhook payload to a structured diff, from a diff to a context package, from a context package to per-agent findings, and from per-agent findings to a single aggregated review.

Reading left to right, the system trades a small number of well-defined transformations for the ability to reason about code at the architectural level rather than at the line level.

Linear pipeline from PR webhook to posted review. The Context Builder feeds the Agent Router; the Router fans out to specialized agents; the Aggregator merges their findings into a single review.

Each stage progressively enriches the pull request. Raw code changes enter the system as a webhook payload. The Diff Parser normalizes them into a structured representation of added, removed, and modified lines per file. This stage is intentionally narrow: it extracts facts and does not interpret meaning.
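A minimal sketch of what a fact-only diff parsing stage might look like, assuming the input is a git-style unified diff. The `FileDiff` shape and the `parse_unified_diff` name are hypothetical and not taken from the project. A unified diff records a modified line as a removal plus an addition, so this sketch does not report modifications separately.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileDiff:
    """Added and removed lines for one file (hypothetical shape)."""
    path: str
    added: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)


def _strip_prefix(path: str) -> str:
    # Git prefixes paths with "a/" and "b/"; drop that prefix if present.
    return path[2:] if path.startswith(("a/", "b/")) else path


def parse_unified_diff(text: str) -> dict[str, FileDiff]:
    """Collect added and removed lines per file. No interpretation."""
    files: dict[str, FileDiff] = {}
    current = None
    old_path = ""
    in_hunk = False
    for line in text.splitlines():
        if line.startswith("diff --git "):
            # A new file section starts; headers follow until the first hunk.
            current, in_hunk = None, False
        elif not in_hunk and line.startswith("--- "):
            old_path = line[4:]
        elif not in_hunk and line.startswith("+++ "):
            new_path = line[4:]
            # Deleted files have "/dev/null" as the new path.
            chosen = old_path if new_path == "/dev/null" else new_path
            path = _strip_prefix(chosen)
            current = files.setdefault(path, FileDiff(path))
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and current is not None:
            if line.startswith("+"):
                current.added.append(line[1:])
            elif line.startswith("-"):
                current.removed.append(line[1:])
    return files
```

Tracking whether the parser is inside a hunk keeps a removed line whose content begins with `-- ` from being mistaken for a file header.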

The Context Builder consumes the diff and links it into the repository knowledge graph, surfacing the affected files, functions, services, dependencies, conventions, and architectural rules. The output is a context package that is structured enough for agents to consume and complete enough for them to reason.

The Agent Router dispatches the assembled context to specialized agents in parallel. Each agent returns findings for its review dimension. The Aggregator deduplicates overlapping observations, reconciles conflicts, ranks findings by severity, and emits a single review package that a human reviewer can act on.

Knowledge Graph Model

The Pull Request acts as the root entity within the context graph.

Every pull request is decomposed into a collection of connected entities that describe both the changed code and the system it lives in.

The graph expands to include the following entities:

Files
Functions
Services
Dependencies
Tests
Conventions
Architecture Rules
Historical Issues

These entities form the context the PR lives in. Each one is a lens the agents use to evaluate the change.

These entities are extracted from the diff, the repository knowledge graph, the architectural rule store, and the historical issue tracker.

The graph provides the real value. Instead of reasoning about the changed lines in isolation, the agents see the files affected, the functions called, the services downstream, the tests that exercise the code, the conventions the file should follow, the architectural rules that apply, and the historical issues that the change might resemble.

This enables reasoning beyond the changed lines themselves. A reviewer — human or agent — who can see the affected service, its downstream consumers, and the conventions of similar files in the same module will catch issues that a line-level reviewer cannot.
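The entity model above can be sketched as a small typed graph. This is an illustrative in-memory structure, not the project's actual store: the `Kind` enum mirrors the entity list, while `ContextGraph`, its method names, and the relation labels are assumptions.

```python
from __future__ import annotations

from collections import defaultdict
from enum import Enum


class Kind(Enum):
    PULL_REQUEST = "pull_request"
    FILE = "file"
    FUNCTION = "function"
    SERVICE = "service"
    DEPENDENCY = "dependency"
    TEST = "test"
    CONVENTION = "convention"
    ARCHITECTURE_RULE = "architecture_rule"
    HISTORICAL_ISSUE = "historical_issue"


class ContextGraph:
    """Entities keyed by id, with labeled directed edges (hypothetical)."""

    def __init__(self) -> None:
        self.kinds: dict[str, Kind] = {}
        self.edges: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def add(self, entity_id: str, kind: Kind) -> None:
        self.kinds[entity_id] = kind

    def link(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].add((relation, dst))

    def neighbors(self, entity_id: str, kind: Kind | None = None) -> set[str]:
        # Return directly linked entities, optionally filtered by kind.
        return {
            dst
            for _, dst in self.edges.get(entity_id, set())
            if kind is None or self.kinds.get(dst) is kind
        }


# Usage: the pull request is the root; other entities hang off it.
graph = ContextGraph()
graph.add("pr-1", Kind.PULL_REQUEST)
graph.add("db.py", Kind.FILE)
graph.add("no-raw-sql", Kind.ARCHITECTURE_RULE)
graph.link("pr-1", "changes", "db.py")
graph.link("db.py", "governed_by", "no-raw-sql")
```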

Context Retrieval Flow

Pure vector search retrieves entities that are semantically similar to the diff. This works well for isolated pieces of information but struggles when the relevant context depends on relationships across multiple parts of the system.

GraphRAG combines semantic retrieval with structural traversal. When the diff is embedded, the embedding identifies candidate entities — files, functions, services, tests. The graph is then traversed outward from those candidates to surface related entities through dependencies, ownership, and historical co-occurrence.

The two result sets are merged, ranked, and assembled into a context package before the agents run.

Two parallel retrieval paths — graph traversal and vector search — rejoin at Context Assembly before the agents run.

The retrieval process expands context through connected entities. For a change to a database access layer, the traversal may surface the owning service, the upstream API that calls it, the downstream consumers of the changed method, the test suite that exercises it, and the architectural rule that governs database access patterns.
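The retrieval flow can be sketched as vector search that picks seed entities, followed by a bounded breadth-first expansion through the graph. This is a toy illustration under stated assumptions: embeddings are plain lists of floats, the graph is an adjacency dict, and the scoring formula (similarity plus a hop-distance bonus) is invented for the example rather than taken from the system.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_candidates(query, embeddings, k):
    """Return the k entity ids most similar to the query embedding."""
    ranked = sorted(
        embeddings, key=lambda e: cosine(query, embeddings[e]), reverse=True
    )
    return ranked[:k]


def expand(graph, seeds, max_hops):
    """Breadth-first traversal; returns hop distance for each reached entity."""
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def assemble_context(query, embeddings, graph, k=2, max_hops=2):
    """Merge both signals into one ranked list of entity ids."""
    seeds = vector_candidates(query, embeddings, k)
    dist = expand(graph, seeds, max_hops)

    def score(entity):
        # Entities without an embedding rely on graph proximity alone.
        sim = cosine(query, embeddings[entity]) if entity in embeddings else 0.0
        return sim + 1.0 / (1 + dist[entity])

    return sorted(dist, key=score, reverse=True)
```

In this sketch an entity with no embedding, such as an architectural rule, can still reach the context package through traversal, which is the case pure vector search misses.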

The resulting context package provides local understanding — what the changed code does — and architectural understanding — what the change means in the context of the surrounding system. Both are required before reasoning begins.

Without graph traversal, the agents see the changed lines and the immediately similar code. They can identify local issues but not structural ones. Without vector search, the traversal is bounded by what the diff directly references. It cannot surface related entities that share semantic similarity but not syntactic overlap.

Multi-Agent Workflow

The agent layer provides specialized reasoning capabilities on top of the assembled context.

Instead of asking a single language model to handle every review concern, the platform routes the context package to four specialized agents — each optimized for a different review dimension.

The four agents run in parallel. Each receives the same context package and applies its own reasoning strategy.

Four specialized agents run in parallel against the assembled context; the Aggregator resolves conflicts and produces a single review.
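One way to sketch the parallel fan-out is with `asyncio.gather`, giving every agent the same context package. The agent functions here are stubs standing in for model calls, and the choice to treat a failed agent as producing no findings is an assumption of this sketch, not a documented behavior of the system.

```python
import asyncio


async def run_agents(context, agents):
    """Run every agent concurrently against the same context package.

    `agents` maps an agent name to an async callable that takes the
    context and returns a list of findings (hypothetical interface).
    """
    names = list(agents)
    results = await asyncio.gather(
        *(agents[name](context) for name in names),
        return_exceptions=True,
    )
    findings = {}
    for name, result in zip(names, results):
        # Assumption: a failed agent contributes no findings instead of
        # aborting the whole review.
        findings[name] = [] if isinstance(result, BaseException) else result
    return findings


async def stub_agent(context):
    # Placeholder for a real model call.
    await asyncio.sleep(0)
    return [{"message": f"reviewed {len(context['files'])} file(s)"}]


async def failing_agent(context):
    raise RuntimeError("model call failed")
```

A caller would drive this with something like `asyncio.run(run_agents(context, {"quality": stub_agent, "security": failing_agent}))`.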

The Code Quality Agent focuses on implementation details: naming, structure, duplication, clarity, idiomatic patterns for the language and framework. It produces findings a human reviewer would otherwise catch on a careful read.

The Architecture Agent focuses on system boundaries: service responsibilities, dependency direction, coupling, consistency with established patterns in the module. Its findings require cross-module context — the kind a human reviewer cannot reliably hold in working memory.

The Security Agent focuses on threat surface: input validation, authentication and authorization, secrets handling, dependency vulnerabilities, and data exposure. Its findings are prioritized by severity; secret leakage and unauthenticated access rank above style issues.

The Testing Agent focuses on coverage and correctness: whether the changes are tested, whether the tests are meaningful, whether the tests cover the affected behavior, and whether adjacent code carries regression risks.

The Aggregator receives the findings from all four agents. It deduplicates overlapping observations, resolves conflicts when agents disagree, ranks findings by severity and confidence, and emits a single review package. The package is presented to a human reviewer who decides what to act on.
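The deduplicate-and-rank part of the Aggregator might look like the sketch below. The `Finding` fields, the severity labels, and the rule that two findings on the same file and line count as overlapping are all assumptions made for illustration. Conflict reconciliation is left out.

```python
from dataclasses import dataclass

# Hypothetical severity scale; higher means more urgent.
SEVERITY_RANK = {"critical": 3, "major": 2, "minor": 1}


@dataclass(frozen=True)
class Finding:
    agent: str
    file: str
    line: int
    message: str
    severity: str
    confidence: float


def _rank(finding):
    # Severity first, confidence as the tie-breaker.
    return (SEVERITY_RANK[finding.severity], finding.confidence)


def aggregate(findings):
    """Keep the strongest finding per location, then sort strongest first."""
    best = {}
    for finding in findings:
        key = (finding.file, finding.line)  # crude overlap test (assumption)
        kept = best.get(key)
        if kept is None or _rank(finding) > _rank(kept):
            best[key] = finding
    return sorted(best.values(), key=_rank, reverse=True)
```

Keying on file and line is deliberately naive; a real deduplication step would likely compare the content of the findings as well.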

Technology Decisions

The platform makes five decisions that together define the architecture. Each has a real alternative and a real tradeoff.

Decision	Choice	Rationale
Retrieval strategy	GraphRAG (vector + graph traversal)	Vector search surfaces semantically similar code but does not capture dependency direction, ownership, or architectural rules. GraphRAG combines embedding-based retrieval with structural traversal so the agents see both similarity and relationships.
Agent organization	Four specialized agents + aggregator	A single agent asked to evaluate code quality, architecture, security, and testing produces shallow coverage across all four. Specialized agents with focused evaluation criteria produce deeper coverage on their dimension. The aggregator reconciles conflicts between them.
Context store	Pre-built repository knowledge graph, updated incrementally	Constructing the graph on every PR is too slow. A pre-built graph, populated by an indexer that walks the repository and updates on commit, keeps lookups fast during review. The graph is a separate system with its own update pipeline.
Reasoning shape	Context assembly before any agent runs	Each agent receives a pre-assembled context package. This avoids redundant retrieval across agents, ensures every agent sees the same baseline, and makes the context a first-class artifact that can be logged, versioned, and audited.
Posting model	Human approval required before review is posted	Agents produce findings; humans decide what to act on. The review package is a recommendation, not a directive. This preserves accountability and lets the team calibrate agent confidence over time.

These decisions are not the only reasonable choices. They are the choices that match the system's purpose: produce a review package that an engineering team can act on, with full architectural context, while humans remain accountable for what reaches the codebase.

Challenges

Building a multi-agent review system surfaces four classes of difficulty that are not always obvious during design.

False Positives

Agents flag issues that, on inspection, are not actually problems. A convention violation that looks like a bug but is deliberate. A security warning about internal-only code. A testing recommendation for code that is deliberately untested — generated code, fixtures, migrations, scripts.

The cost of false positives is reviewer fatigue. A reviewer who learns to dismiss agent warnings eventually stops reading them. The system must tune for precision over recall, even when that means some genuine issues slip through.
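The precision-versus-recall tradeoff can be made concrete with a confidence threshold. The sketch below assumes each finding carries a confidence score and a human label saying whether it was a genuine issue; the function name and the convention that an empty result set has precision 1.0 are assumptions.

```python
def precision_recall(labeled, threshold):
    """Compute precision and recall for findings at or above a threshold.

    `labeled` is a list of (confidence, is_genuine) pairs.
    """
    flagged = [genuine for conf, genuine in labeled if conf >= threshold]
    total_genuine = sum(1 for _, genuine in labeled if genuine)
    true_positives = sum(flagged)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_genuine if total_genuine else 1.0
    return precision, recall


# Toy labels for illustration only; these are not measurements.
toy = [(0.95, True), (0.9, True), (0.7, False), (0.6, True), (0.4, False)]
low = precision_recall(toy, 0.5)
high = precision_recall(toy, 0.8)
```

On the toy data, raising the threshold from 0.5 to 0.8 removes the false positive and also drops one genuine finding, which is the trade the paragraph above describes.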

Context Overload

Context assembly can produce more entities than the agent can effectively use. Files, functions, dependencies, conventions, historical issues — all relevant, all useful, but collectively exceeding the practical attention budget of a single agent invocation.

The system must rank and truncate. Truncation can hide the very entity the agent most needed to see. Tuning the ranker is an ongoing effort that requires labeled data the platform is still building.
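A greedy rank-and-truncate step could be sketched as follows, assuming each entity comes with a relevance score and an estimated token cost. The tuple layout, the names, and the budget value are hypothetical.

```python
def truncate_context(entities, budget):
    """Pick entities by descending relevance until the budget is spent.

    `entities` is a list of (name, relevance, token_cost) tuples.
    """
    chosen, used = [], 0
    for name, _relevance, cost in sorted(
        entities, key=lambda e: e[1], reverse=True
    ):
        # Skip anything that would overflow, but keep trying smaller items.
        if used + cost <= budget:
            chosen.append(name)
            used += cost
    return chosen


candidates = [
    ("architecture_rule", 0.9, 300),
    ("owning_service", 0.8, 500),
    ("historical_issue", 0.7, 400),
    ("test_suite", 0.6, 200),
]
kept = truncate_context(candidates, budget=1000)
```

Here `historical_issue` is dropped even though it outranks `test_suite`, because it does not fit in the remaining budget. That is exactly the failure mode described above: the truncated entity may be the one the agent needed.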

Cross-Repository Impact Analysis

A change in one repository may affect another through a shared library, a generated client, a deployment dependency, or a contract. The current system models a single repository's knowledge graph and therefore cannot reason about cross-repository blast radius.

Cross-repository impact requires either a federation layer that queries multiple graphs or a unified graph that spans repositories. Both approaches add latency to retrieval and operational complexity to indexing. The current single-repository design accepts this limitation in exchange for simplicity.

Conflicting Agent Recommendations

The Architecture Agent may recommend extracting a shared module; the Code Quality Agent may recommend inlining it for readability. The Aggregator's job is to reconcile these, but the reconciliation rules are themselves a design decision.

Hand-tuned rules are brittle as the system grows and the agent set expands. Learned reconciliation requires evaluation infrastructure — labeled examples of conflicting findings paired with the resolution a human reviewer would choose — that does not yet exist.
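For illustration, the simplest hand-tuned reconciliation rule is a fixed precedence order between agents. The ordering below is an assumption invented for this sketch, and its rigidity shows why such rules become brittle.

```python
# Hypothetical precedence: earlier agents win conflicts with later ones.
PRECEDENCE = ["security", "architecture", "testing", "code_quality"]


def reconcile(conflict):
    """Pick one recommendation from conflicting (agent, recommendation) pairs."""
    return min(conflict, key=lambda item: PRECEDENCE.index(item[0]))


winner = reconcile([
    ("code_quality", "inline the helper for readability"),
    ("architecture", "extract a shared module"),
])
```

Every new agent forces a decision about where it sits in the list, and the rule ignores how confident each agent was.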

Future Roadmap

Five directions are queued for the next phase of work. Each addresses a known limitation in the current system.

Historical Learning

Future versions will learn from prior agent findings, reviewer responses, and eventual outcomes. When an agent flag is consistently overridden by reviewers, the system should learn to suppress it. When a flag is consistently acted on, the system should learn to elevate it.

This requires evaluation infrastructure — a feedback loop between the agent, the reviewer, and the eventual outcome (merged, reverted, incident) — that does not yet exist.
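Once reviewer responses are recorded, the suppress-or-elevate logic could start as simple override-rate thresholds. This sketch assumes a log of (rule id, was overridden) pairs; the thresholds, the minimum sample size, and the function name are hypothetical.

```python
from collections import Counter


def adjust_rules(history, min_samples=5, suppress_above=0.8, elevate_below=0.2):
    """Map each rule id to "suppress", "elevate", or "keep".

    `history` is an iterable of (rule_id, was_overridden) pairs.
    """
    seen, overridden = Counter(), Counter()
    for rule_id, was_overridden in history:
        seen[rule_id] += 1
        overridden[rule_id] += int(was_overridden)

    decisions = {}
    for rule_id, count in seen.items():
        if count < min_samples:
            # Too little evidence to change anything.
            decisions[rule_id] = "keep"
            continue
        rate = overridden[rule_id] / count
        if rate >= suppress_above:
            decisions[rule_id] = "suppress"
        elif rate <= elevate_below:
            decisions[rule_id] = "elevate"
        else:
            decisions[rule_id] = "keep"
    return decisions
```

The minimum sample size guards against suppressing a rule after a single dismissal.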

Repository-Specific Rules

Today, architectural rules are stored centrally. Future versions will support rules defined per repository — sometimes per directory — that override or extend the central set.

This lets each team encode its own conventions without requiring changes to the central system, and lets the rule set evolve at the speed of the team rather than the speed of the platform.
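Per-directory overrides could be resolved by layering rule sets from the repository root down to the file's own directory, with the most specific directory winning. The rule names and the use of "." as the key for the repository root are assumptions of this sketch.

```python
from pathlib import PurePosixPath


def effective_rules(central, overrides, file_path):
    """Merge central rules with per-directory overrides for one file.

    `overrides` maps a directory path (POSIX style, "." for the
    repository root) to a dict of rule values.
    """
    rules = dict(central)
    parents = [p.as_posix() for p in PurePosixPath(file_path).parents]
    # `parents` runs from the nearest directory up to "."; apply the
    # root first so deeper directories override shallower ones.
    for directory in reversed(parents):
        rules.update(overrides.get(directory, {}))
    return rules


central = {"max_function_length": 50, "require_tests": True}
overrides = {
    ".": {"max_function_length": 80},
    "services/billing": {"max_function_length": 40, "forbid_raw_sql": True},
}
billing = effective_rules(central, overrides, "services/billing/db.py")
```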

Team Conventions

Conventions that exist in code but are not documented anywhere are the hardest to enforce. Future versions will mine the repository for recurring patterns — test file naming, error handling structure, log format, comment style — and surface them as conventions the agents can apply.

Mining conventions requires a definition of what counts as a convention versus noise. This is unsolved; early experiments produce high-recall, low-precision suggestions that require human filtering.
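As a narrow example of convention mining, the sketch below checks whether one test file naming pattern dominates a repository. The two candidate patterns and the 80% share threshold are assumptions. Choosing that threshold is a small instance of the convention-versus-noise problem described above.

```python
import re
from collections import Counter

# Hypothetical candidate patterns for Python test files.
PATTERNS = {
    "test_*.py": re.compile(r"^test_.+\.py$"),
    "*_test.py": re.compile(r"^.+_test\.py$"),
}


def dominant_naming(filenames, min_share=0.8):
    """Return the dominant pattern label, or None if no pattern dominates."""
    counts = Counter()
    for name in filenames:
        for label, pattern in PATTERNS.items():
            if pattern.match(name):
                counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    label, count = counts.most_common(1)[0]
    return label if count / total >= min_share else None
```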

Cross-Repository Reasoning

The current graph models a single repository. Future versions will connect multiple repositories into a unified engineering knowledge graph, enabling architectural reasoning across services, platforms, teams, and organizational boundaries.

This unlocks questions the single-repository graph cannot answer — which services in the platform depend on a changed library, which deployment pipelines are affected by a configuration change, which teams share architectural patterns that should be aligned.
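The first of those questions, which services depend on a changed library, reduces to a reverse traversal over dependency edges once the graph spans repositories. The sketch assumes a plain mapping from each component to its direct dependencies; the component names are invented.

```python
from collections import defaultdict, deque


def dependents_of(changed, depends_on):
    """Return every component that directly or transitively depends on `changed`."""
    # Invert the edges: dependency -> set of direct consumers.
    reverse = defaultdict(set)
    for consumer, dependencies in depends_on.items():
        for dependency in dependencies:
            reverse[dependency].add(consumer)

    seen = {changed}
    queue = deque([changed])
    while queue:
        for consumer in reverse[queue.popleft()]:
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen - {changed}


depends_on = {
    "repo-a/checkout": ["shared/auth-lib"],
    "repo-b/orders": ["repo-a/checkout"],
    "repo-c/search": ["shared/logging"],
}
affected = dependents_of("shared/auth-lib", depends_on)
```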

Autonomous Remediation Suggestions

Future versions will not only flag issues but propose concrete code changes. The proposal will be a diff the human reviewer can accept, modify, or reject.

This moves the system from review to review-and-suggest, reducing the round-trip between finding and fix. The risk is autonomous changes that are technically correct but contextually wrong; human approval remains the gating mechanism for any change that touches the codebase.
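A proposed change can be packaged as a unified diff with Python's standard `difflib` module, so the reviewer sees the suggestion in a familiar format. How the suggested text is generated is out of scope here; `propose_diff` and the file path are hypothetical.

```python
import difflib


def propose_diff(path, original, suggested):
    """Render a suggested change as unified diff text for human review."""
    lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        suggested.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


patch = propose_diff("app.py", "x = 1\nprint(x)\n", "x = 2\nprint(x)\n")
```

The function only renders text. Nothing is applied to the codebase, which keeps human approval as the gating step.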