Cross-Project Auditing with Parallel AI Agents: A Methodology for Multi-Service Codebases

Posted on 19 March 2026

Why Single-PR Reviews Fall Short for Multi-Service Work

When you add multi-platform video rendering support across three codebases — a content management Rails app, an analytics Rails service, and a React video renderer — the work spans dozens of pull requests over several weeks. Each PR looks fine in isolation: tests pass, linting succeeds, the diff makes sense. But did you actually wire everything together correctly?

Traditional code review operates at the pull-request boundary. Reviewers see a single file diff, maybe a handful of related files, and make local judgements: does this change work? Single-PR review is poorly suited to catch integration-level issues because the relationships live between projects, not within them. You might implement a multi-platform render trigger in one service but never call it from the job scheduler in another. You might extract production metadata from one JSON path whilst a different code path writes to a different key in the same JSONB column. These aren't bugs in any individual file — they're architectural gaps that only become visible when you audit the entire enhancement as a connected system.

This article describes a methodology for exactly that: using parallel AI research agents to audit complete phases of work across project boundaries. Each agent cross-references plan documents against actual implementations across all repositories touched by that phase, surfacing dead code, missing integrations, and data extraction brittleness that sequential review can easily miss. In one internal audit, the approach found ten real issues — including a fully-implemented feature that was simply never invoked.

The Problem Space: Enhancement Phases Across Project Boundaries

When an enhancement initiative touches multiple repositories, the paper trail often becomes fragmented. A plan document outlines intended changes — add narration support, integrate analytics, extend platform coverage — whilst separate implementation tickets track work in the content management system, the rendering service, and the data pipeline. Each repository evolves independently, merged through separate pull requests reviewed by different engineers.

This distributed delivery model can introduce a class of defects that single-repository review is often poorly placed to catch. Dead code appears when one service implements a method but the calling service never invokes it, often because the integration point sits behind a private keyword or was simply forgotten. Missing integrations occur when a centralised service exists but individual jobs bypass it with ad-hoc implementations, fragmenting configuration and cost tracking. Brittle data extraction logic assumes production metadata lives in one JSON key, unaware that different code paths write to different keys within the same JSONB column.

These issues share a common signature: each individual change appears correct in isolation, but the cross-project contract breaks down. A method signature matches expectations, yet no caller exists. A background job functions perfectly, yet duplicates logic that a shared service already provides.

The Methodology: Parallel AI Agents as Phase Auditors

The audit methodology structures work by enhancement phase, with each phase assigned to a dedicated AI research agent that operates in parallel. Rather than reviewing code sequentially or file-by-file, this approach treats an entire phase—comprising planning documents, implementation commits, and cross-project integration points—as a single unit of analysis.

Each agent's task is explicit: cross-reference the phase plan against actual implementations across all affected repositories. For a narration enhancement spanning a Rails CMS, a Rails API, and a React rendering service, one agent would audit the TTS integration phase whilst another examines the timing alignment phase and a third reviews the rendering pipeline changes.

Phase-level scoping provides the right granularity because enhancement work naturally clusters by concern. In our implementation, a phase typically touched 3–8 files across 1–3 repositories, small enough for an AI agent to hold the entire context, yet large enough to reveal cross-project integration failures. Smaller units (individual files or commits) miss structural issues; larger units (entire epics) exceed context windows and dilute focus.

The parallelisation strategy treats each phase as an independent audit task, with agents working simultaneously rather than sequentially. This surfaces issues that sequential review misses—such as dead code in one service that was meant to be invoked by another.

Cross-Referencing Plans Against Implementations

Each agent operates as a specialised auditor for a single enhancement phase. The agent first parses the plan document to extract intended changes: new API endpoints, configuration flags, integration points, or architectural decisions. It then systematically verifies each intention against the codebase.

In our implementation, each agent had access to: the phase plan document, a compressed archive of relevant source files from all affected repositories, and standard search tooling (grep, file listing, content inspection). Agents produced structured findings as JSON, which were then aggregated and deduplicated. Human validation against the live codebases was the final step before any finding was accepted.

Verification means more than grep searches. For a new multi-platform render trigger, the agent checks:

  • Whether the method exists (def render_for_all_platforms)
  • Its access modifier (public, private, protected)
  • Whether any code actually calls it (via grep -r "render_for_all_platforms")
  • If configuration wiring exists (environment variables, feature flags)
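Those checks can be scripted. Below is a minimal sketch in Python, assuming hypothetical file paths and deliberately simplified Ruby parsing — a real audit agent would walk the actual repository trees rather than an in-memory map:

```python
import re

def audit_method(method_name: str, files: dict[str, str]) -> dict:
    """Check a method's existence, visibility, and reachability.

    `files` maps path -> source text. The visibility check is naive:
    it only looks for a bare `private` line above the definition.
    """
    report = {"defined_in": None, "private": False, "callers": []}
    def_re = re.compile(rf"^\s*def\s+{re.escape(method_name)}\b", re.M)
    call_re = re.compile(rf"\b{re.escape(method_name)}\b")
    for path, source in files.items():
        match = def_re.search(source)
        if match:
            report["defined_in"] = path
            report["private"] = bool(
                re.search(r"^\s*private\s*$", source[: match.start()], re.M)
            )
        elif call_re.search(source):
            report["callers"].append(path)
    return report

# Hypothetical sources reproducing the dead-code pattern: the trigger is
# defined behind `private`, and nothing in the post-processing job calls it.
files = {
    "renderer/app/jobs/video_job.rb": (
        "class VideoJob\n  private\n\n"
        "  def render_for_all_platforms\n  end\nend\n"
    ),
    "cms/app/jobs/post_process_job.rb": (
        "class PostProcessJob\n  def run\n"
        "    # renders a single platform only\n  end\nend\n"
    ),
}
report = audit_method("render_for_all_platforms", files)
```

An empty `callers` list on a defined method is exactly the "implemented but never invoked" signature described above.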

For a metrics pipeline, verification traces the entire usage path: from the job that triggers collection, through the service that fetches data, to the database column where results land. The agent looks for broken links — a method defined but never invoked, a service bypassed by ad-hoc implementations, or data stored in unexpected JSONB keys.

In our case study, the cross-referencing process caught 10 issues across five phases: dead code living below private keywords, batch jobs using raw API clients instead of centralised services, and brittle JSON extraction assuming a single storage location when multiple paths existed. Each finding represented a disconnect between planned architecture and actual implementation.

What the Agents Found: Real Issues Surfaced by This Approach

The audit methodology revealed three significant defect classes that highlight the value of cross-project analysis. These weren't isolated bugs—they represented systemic patterns that single-repository reviews rarely catch.

Dead code with no valid call path emerged when a multi-platform render trigger was implemented in the rendering service but never wired into the post-processing job that should have invoked it. The private access modifier was a clue, but the real issue was reachability: no code path called the method at all, meaning an entire feature sat unused in production. The fix required making the method accessible and adding the integration call—but only for universal renders with multiple accounts, not platform-specific ones.

Centralised services bypassed by ad-hoc implementations surfaced when a batch rating job used a raw API client with hardcoded model strings instead of routing through the shared LLM service. This meant model updates wouldn't propagate and cost tracking remained inconsistent. Extending the centralised service with vision analysis capabilities brought the rogue caller back into the fold.

Data extraction brittleness from schema drift appeared when analytics metadata extraction assumed a single JSON path for flags like narration enablement. Different code paths stored identical data in different JSONB keys within the same column. The solution introduced fallback logic checking both locations—defensive programming that prior reviews had missed.
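That fallback logic can be sketched as a small extractor. The key paths here are illustrative stand-ins, not the project's actual JSONB schema:

```python
def narration_enabled(metadata: dict) -> bool:
    """Read a flag that different code paths stored under different keys.

    Both key paths below are hypothetical examples of schema drift.
    """
    for path in (("narration", "enabled"), ("settings", "narration_enabled")):
        node = metadata
        for key in path:
            if not isinstance(node, dict) or key not in node:
                break
            node = node[key]
        else:  # every key in this path resolved
            return bool(node)
    return False

# Rows written by either code path now resolve to the same answer.
current_row = {"narration": {"enabled": True}}
legacy_row = {"settings": {"narration_enabled": True}}
```

The ordered tuple of paths doubles as documentation of every known storage location, which is precisely what the original single-path extraction lacked.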

False Positives Require Human Review

Of the raw findings surfaced by the AI agents, a small number were false positives, filtered out during human review before the final report was produced. Agents had misinterpreted intentional duplication — such as platform profiles that existed in both projects by design, not by neglect — as redundant or dead code.

False positives are particularly common when agents lack domain context about feature flags or deployment timing. Human validation against the actual codebases remains non-negotiable: it reduced 12 raw findings to the 10 validated issues that formed the actionable report.

Synthesising Agent Findings into Actionable Results

Once each agent completes its audit phase, the real work begins: synthesising parallel findings into a coherent action plan. In our case study, each agent emitted a structured JSON report containing issue descriptions, severity levels, affected file paths, and cross-references to plan documents. A simple Python aggregator collected these five reports and deduplicated findings by file path and issue fingerprint (a hash of the problem description). Deduplication, together with the false-positive review described earlier, reduced the 12 raw findings to the 10 validated issues.
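A sketch of that aggregation step, with illustrative finding records — the fingerprint is a hash of the normalised description, as described above:

```python
import hashlib

def fingerprint(finding: dict) -> str:
    # Normalise the description so trivial wording and whitespace
    # differences between agents still collide to the same key.
    text = " ".join(finding["description"].lower().split())
    return hashlib.sha256(text.encode()).hexdigest()[:12]

def deduplicate(reports: list[dict]) -> list[dict]:
    """Merge per-agent reports, keyed on (file path, issue fingerprint)."""
    unique = {}
    for report in reports:
        for finding in report["findings"]:
            key = (finding["path"], fingerprint(finding))
            unique.setdefault(key, finding)
    return list(unique.values())

# Two agents flagging the same unused method in the same file collapse to one.
reports = [
    {"agent": "phase-2", "findings": [
        {"path": "renderer/video_job.rb",
         "description": "Method render_for_all_platforms is never invoked"}]},
    {"agent": "phase-3", "findings": [
        {"path": "renderer/video_job.rb",
         "description": "method render_for_all_platforms is  never invoked"}]},
]
issues = deduplicate(reports)
```

Hashing a free-text description is a blunt instrument; it catches near-verbatim duplicates, while genuinely related findings phrased differently still need the cross-phase reconciliation pass below.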

Prioritisation followed a two-pass approach:

  1. Severity triage — Critical bugs (broken integrations, dead code) ranked above quality improvements (missing constants, inconsistent patterns)
  2. Cross-phase dependencies — Issues where one agent's finding referenced another agent's scope were flagged for reconciliation

For example, the "multi-platform render trigger" finding spanned two agents' domains: one identified the unused method (Phase 2: rendering service), whilst another noted the missing integration point (Phase 3: post-processing pipeline). The aggregator automatically linked these via file path overlap and flagged them for joint resolution.
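In code, the two passes might look like the following sketch; the severity labels and the file-path reconciliation rule are illustrative, not the exact scheme we used:

```python
SEVERITY_RANK = {"critical": 0, "major": 1, "quality": 2}  # illustrative labels

def prioritise(findings: list[dict]) -> list[dict]:
    """Pass 1: order by severity. Pass 2: flag findings from different
    phases that touch the same file for joint resolution."""
    ordered = sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])
    first_phase = {}  # path -> phase of the first finding touching that path
    for finding in ordered:
        prior = first_phase.get(finding["path"])
        finding["needs_reconciliation"] = (
            prior is not None and prior != finding["phase"]
        )
        first_phase.setdefault(finding["path"], finding["phase"])
    return ordered

findings = [
    {"phase": 3, "path": "cms/post_process.rb", "severity": "critical",
     "issue": "missing integration call"},
    {"phase": 2, "path": "cms/post_process.rb", "severity": "quality",
     "issue": "unused render trigger"},
    {"phase": 1, "path": "api/metrics.rb", "severity": "major",
     "issue": "raw API client bypasses shared service"},
]
ordered = prioritise(findings)
```

Here the phase-2 and phase-3 findings share a file path, so the lower-priority one is flagged for joint resolution with its critical counterpart, mirroring the render-trigger example above.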

Human review remains non-negotiable. Agents surface candidates; humans decide which findings are real and which reflect intentional design choices.

The final audit report grouped findings by service, with cross-cutting issues highlighted separately. This structure let us fix all 10 validated issues in a single focused session rather than context-switching between disconnected problems.

Keep Plans Concise

Plan documents should separate intent from implementation details. List expected integration points explicitly (“Service A should call Service B’s new endpoint after processing”) to give agents concrete checkpoints rather than vague descriptions. Brevity matters: verbose specifications cause agents to miss critical details or hallucinate connections between unrelated components.
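As a concrete, entirely hypothetical illustration, a plan entry with explicit checkpoints might read:

```markdown
## Phase 3: Post-processing integration

Intent: after a universal render completes, trigger delivery for every platform.

Expected integration points:
- `PostProcessJob` (CMS) calls `render_for_all_platforms` on the rendering service
- Results land under the `platforms` key of the video's metadata column
- A feature flag gates the new path until rollout completes
```

Each bullet is something an agent can mechanically verify: a call site, a storage key, a configuration flag.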

Practical Considerations and Limitations

This methodology carries significant context window constraints. Even with context windows now reaching 1 million tokens on top-tier models, five simultaneous research agents produce substantial combined output. Smaller, more cost-effective models have tighter limits, and in practice the bottleneck is not just one model's raw context window: it is also repository packing, token cost, orchestration overhead, and the fact that not every deployment surface exposes the maximum context. Plan-document brevity therefore remains important regardless of model choice.

Audit quality depends heavily on the clarity of plan documents. Agents can only audit what was specified. If a requirement wasn't documented (or was documented ambiguously), the audit will either miss the implementation gap or flag correct code as incorrect. We experienced this with the dead code issue: the plan stated a method should exist, but didn't specify where it should be called, leading to working-but-unused code.

The methodology excels at structural and integration checks, but it cannot assess runtime behaviour, and it requires human judgement to distinguish genuine defects from intentional patterns. Performance regressions, race conditions, and production data patterns remain invisible to it. The audit surfaced the data extraction brittleness only because the plan explicitly listed the expected data paths; runtime testing would have caught that class of issue faster.

When simpler approaches suffice: for single-service changes, or for well-tested codebases with comprehensive integration tests, traditional PR review is more efficient. In our experience, this methodology justified its overhead when enhancement work spanned 3+ repositories, involved multiple teams, or stretched over weeks where context fragments. For our five-phase audit, the 10 issues found (including production correctness gaps) validated the effort.

Adapting the Methodology to Your Own Codebase

Adopting this methodology doesn't require immediate buy-in across your entire organisation. Start with a lightweight experiment on a recent enhancement that touched 2–3 services. Create a simple plan document (Markdown works fine) that lists what changed, where, and why. Include concrete references: file paths, method names, database columns.

For the audit itself, use a research-capable model with strong long-context performance (at time of writing, models like Claude Sonnet 4.6 and GPT-5.4 support roughly 1M-token contexts via their APIs, though availability varies by surface). Provide each agent with:

  • The plan document for one phase
  • A compressed archive of relevant source files
  • A clear prompt: "Cross-reference this plan against the implementation. Flag missing integrations, unused code, and deviations from the plan."
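Packaging those inputs is mostly mechanical. A sketch using only the Python standard library follows; the paths and prompt template are assumptions, and the model call itself is elided since agent APIs vary:

```python
import tarfile
from pathlib import Path

PROMPT_TEMPLATE = (
    "Cross-reference this plan against the implementation. Flag missing "
    "integrations, unused code, and deviations from the plan.\n\n"
    "## Phase plan\n{plan}"
)

def package_phase(plan_path: str, source_paths: list[str],
                  archive_path: str) -> str:
    """Bundle one phase's source files into a compressed archive and
    return the prompt that accompanies it. All paths are caller-supplied;
    arcnames are flattened, so same-named files would need prefixing."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for src in source_paths:
            tar.add(src, arcname=Path(src).name)
    return PROMPT_TEMPLATE.format(plan=Path(plan_path).read_text())
```

The returned prompt plus the archive is then handed to whichever agent runtime you use, one pair per phase.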

Structure your plan documents for auditability along the lines described earlier: separate intent from implementation details and enumerate expected integration points, so agents have concrete checkpoints rather than vague feature descriptions.

Define phase boundaries around natural integration seams — where one service hands off to another, or where shared data formats are established. Avoid phases that span too many concerns; in our experience, 3–5 changes per phase proved manageable.

Iterate on prompts by running an audit, reviewing flagged issues, then refining the prompt to reduce false positives. In our experience, after 2–3 iterations we had a reusable template for future enhancements.

If your codebase uses private methods extensively, explicitly ask agents to verify whether new code is actually reachable. The dead code issue in the source material lived below a private keyword and was never invoked — a pattern that sequential review often misses.

Conclusion: Auditing as a Missing Practice in Multi-Service Development

Multi-service architectures have fundamentally changed how we build software, yet our auditing practices haven't caught up. When enhancement work spans three codebases and two language ecosystems, traditional code review — focused on individual pull requests within project boundaries — often misses cross-project issues. A method can be implemented but never called. A centralised service can be bypassed by direct API access. Platform-specific logic can live in the wrong layer entirely.

The parallel agent methodology described here demonstrates one approach to filling this gap. In our case study, five agents auditing five enhancement phases simultaneously surfaced 10 issues that sequential review had missed. But the structured cross-referencing itself — systematically verifying that documented intent matches actual implementation across all affected services — is the real contribution. Whether performed by AI agents, human reviewers with checklists, or automated tooling doesn't fundamentally matter.

The methodology is the insight: multi-service work requires multi-service auditing.

As architectures grow more distributed, the gap between what we plan and what we implement across project boundaries will only widen. Treating cross-project auditing as a distinct practice — with dedicated time, tooling, and methodology — may become an important complement to existing review and testing practices for multi-service development.