How We Built a System for AI Agents to Ship Real Code Across 75+ Repos [Part 1 of 2]

April 16, 2026 · 11 min read · Testing Guide

Geoff Cooney
April 8, 2026

Key Insight: Individual developers using AI agents see 2-5x gains, but scaling across teams requires infrastructure. mabl's engineering team built a four-layer system: per-repo operating manuals, validation pipelines, shared agent/human tooling, and layered governance where AI blocks bad code and humans always approve merges. This architecture resulted in impressive gains in efficiency and is offered as a reference for teams looking to operationalize agentic development at production scale.

Coming into 2026, we at mabl could sense a transformation happening. Our engineers were seeing efficiency improvements by leveraging agentic coding tools. A handful of our engineers with a deep understanding of mabl systems were seeing even more dramatic gains. PMs and other stakeholders were starting to be able to contribute to the mabl codebase. Features that followed existing patterns could be done almost entirely hands-off.

Our experience aligned with industry trends. Anthropic published a report in early December showing that employees who were using Claude on 60% of their work were also reporting productivity gains of 50%. At the same time, Claude users reported being able to fully delegate less than 20% of their work to LLMs.

The Problem We Hit: Scaling Agentic Development Beyond Individual Contributors

mabl has 25 engineers managing 100+ repositories. The team pushes 200+ PRs and 40+ production releases a month. This was before the speedup from agentic development.

As we looked into increasing our own adoption of agentic software development processes, we ran into some key scaling challenges:

Cross-repo coordination. When a feature spans three repositories (an API change, a frontend update, and a shared library), how does an agent know which repos depend on each other? How does it determine merge order so the API doesn't break when the library ships first? Who maintains the dependency graph, and how do agents access it? This is compounded by the fact that most of the existing out-of-the-box tooling in this area assumes you are starting from a single mono-repo.

Development is a lot more than writing code. mabl developers interact with customer support on Jira tickets, build the app locally to validate their changes, check PRs for feedback, and much more. These activities are even more important for agents as we scale because they ground the code changes in measurable behaviors.

Human code review at AI speed. If AI increases coding throughput by 3-5x across our team, human review becomes the constraint. We couldn't have engineers reviewing 300 PRs/week when many were low-risk changes (dependency bumps, lint fixes, test updates). We needed a way to let AI catch obvious issues before human reviewers spent time on them, freeing our developers to focus on the PRs that really demand human judgment.

Operations at scale. An increase in PRs by an order of magnitude means we hit scaling and coordination challenges in our build and release pipelines much faster than before.

Quality needs to move both up and down the stack. Increased velocity, combined with coding and PR review moving to higher-level specifications, means we need to rely even more heavily on our automation. At the same time, enabling targeted local validation allows the coding agents to better ground themselves.

By mid-2025, we were at a crossroads. In August 2025, about 10% of our commits were AI-assisted, entirely from individual contributors using tools ad hoc. By February 2026, that number hit 39% overall and 60% in infrastructure repos. These numbers don't yet include code that was co-written by coding assistants but committed by developers.

The challenge wasn't getting agents to write code. It was building the infrastructure to let 25 engineers use agents safely across 75+ repos.

Here's the architecture we built.

The Architecture

The architecture we built was an evolution, designed layer by layer as we encountered friction points, resulting in discrete components that turn agents into autonomous contributors in our ecosystem:

Cross Repo Base (Rules + Dependency Graphs): A pseudo mono-repo that sits above all our individual repos. Gives agents the same context any developer at mabl would get: repo conventions, dependencies, architectural patterns, security constraints. No tribal knowledge bottleneck.

Skill System (MCP Servers + Reusable Skills): Agents and humans get identical capabilities: query Jira tickets, run tests, verify UI changes visually, and manage environments from one unified interface.

Operations & Governance: AI code review blocks problematic PRs but explicitly cannot approve them. Auto-fix agents handle trivial CI failures with circuit breakers to prevent runaway loops. Humans always make the final merge decision.

Then we brought this all together into an automated workflow that follows a 3-stage state machine to analyze, plan, and implement individual tickets. This leverages the same rules and skills an engineer working with Claude does.

Layer 1: Cross Repo Base

The friction we hit: Agents don't receive tribal knowledge. Especially on an established code base, they often make locally optimized but architecturally poor design decisions. Tell Claude Code to create a new report in the web app and it will happily hack something together, oblivious to all the API endpoints and back-end PDF generation services it can leverage. That's a relatively simple example. We were spending more time providing context and rejecting bad approaches in PRs than we saved from agent-generated code.

What we built: A centralized "cross repo base" repository that sits at the root of all our downstream repos.

The repo base contains a set of common rules and skills:

  • Engineering guide

  • Java and TypeScript general best practices

  • Design patterns

  • Standard workflows

  • Google Cloud and internal tooling

  • A variety of skills we will detail in the next section

Every repo maintains its own CLAUDE.md file containing repo-specific rules and exceptions. The CLAUDE.md files vary in complexity depending on the repo's complexity, scope, and deviation from standards:

  • Repo purpose: What this repo does, how it fits into our system

  • Architecture: What are the core components of this repo

  • Dependencies: Which repos import this one (or are imported by it). Example: our shared-node-utils is consumed by API, UI, and mabl-cli. Any breaking change requires coordinated PRs.

  • Development conventions: Commit flags ([skip ci], [bump: patch]), Terraform patterns, test sharding config, build commands

  • Step-by-step workflows: "How to add a new API endpoint" as a 10-step guide agents can follow

Alongside this, we maintain a Repo Coordination Graph, a central registry mapping dependencies across all repos in our GitHub org. At 850+ lines, it covers 79 repositories with detailed dependency graphs, Pub/Sub topic maps, database table ownership, and release ordering. When an agent works on a cross-repo feature, it queries this graph to determine:

  • Which repos need updates

  • Merge order (upstream libraries before consumers)

  • Which reviewers to tag based on CODEOWNERS and dependency chains

  • Pub/Sub topic maps and database table ownership
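The merge-order lookup can be illustrated with a topological sort. This is a minimal sketch, not mabl's registry format: the `DEPENDS_ON` dictionary, `merge_order`, and `consumers_of` are hypothetical names, and only the relationship between shared-node-utils and its three consumers comes from the example given earlier.

```python
from graphlib import TopologicalSorter

# Hypothetical slice of a coordination graph: each repo maps to the repos it
# depends on. Only shared-node-utils and its consumers come from the article.
DEPENDS_ON = {
    "shared-node-utils": set(),
    "api": {"shared-node-utils"},
    "ui": {"shared-node-utils"},
    "mabl-cli": {"shared-node-utils"},
}


def merge_order(touched: set[str], depends_on: dict[str, set[str]]) -> list[str]:
    """Order the touched repos so upstream libraries merge before consumers."""
    # Keep only edges between repos this feature changes. Indirect
    # dependencies that pass through an untouched repo are ignored here.
    subgraph = {repo: depends_on.get(repo, set()) & touched for repo in touched}
    # static_order() raises graphlib.CycleError if the repos form a cycle.
    return list(TopologicalSorter(subgraph).static_order())


def consumers_of(repo: str, depends_on: dict[str, set[str]]) -> list[str]:
    """List the repos that import `repo` and may need coordinated PRs."""
    return sorted(name for name, deps in depends_on.items() if repo in deps)
```

Calling `merge_order({"api", "ui", "shared-node-utils"}, DEPENDS_ON)` places shared-node-utils first, followed by the two consumers in either order, since they do not depend on each other.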

We use Git Worktrees for isolated parallel execution. If an agent works on three repos simultaneously, each gets its own worktree so there's no risk of state contamination. We built worktree-related hooks that actively prevent agents from editing files in the wrong workspace. If an agent is operating in a worktree and tries to edit a file in the main workspace (or a different worktree), the hook blocks the edit and suggests the correct path. This solves a real failure mode in multi-repo agent systems: accidentally editing the wrong copy of a file.
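The path check at the heart of such a hook might look like the sketch below. It is an assumption-laden illustration: `check_edit` and `owning_root` are hypothetical names, the list of worktree roots is assumed to be supplied by the caller, and how the check is wired into the agent's tooling is not described here.

```python
from pathlib import Path
from typing import Optional


def owning_root(path: Path, roots: set[Path]) -> Optional[Path]:
    """Return the most specific known root that contains `path`, if any."""
    containing = [root for root in roots if path.is_relative_to(root)]
    # Prefer the deepest root so a worktree nested inside the main
    # workspace is not mistaken for the main workspace itself.
    return max(containing, key=lambda root: len(root.parts), default=None)


def check_edit(target: str, active_worktree: str, all_worktrees: list[str]) -> tuple[bool, str]:
    """Allow an edit only when the target file belongs to the active worktree."""
    target_path = Path(target).resolve()
    active = Path(active_worktree).resolve()
    roots = {Path(root).resolve() for root in all_worktrees} | {active}

    owner = owning_root(target_path, roots)
    if owner == active:
        return True, ""
    if owner is None:
        return False, f"Blocked: {target_path} is outside every known worktree"
    # Same relative path, but inside the checkout the agent should be using.
    suggestion = active / target_path.relative_to(owner)
    return False, f"Blocked: edit {suggestion} instead of {target_path}"
```

Returning the suggested path alongside the rejection matters as much as the block itself, because it gives the agent a concrete next step instead of a bare failure.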

We also enforce a mandatory code grounding rule that requires agents to verify all file paths and code references before citing them. No fabricated references allowed. Agents must use Glob or Read to confirm a file exists and cite specific line numbers from code they've actually read. If code is inaccessible, the agent must state that clearly rather than guessing.
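The grounding rule is enforced through agent instructions, but the same check can be expressed mechanically. The sketch below is one hypothetical way to validate a citation after the fact; `verify_citation` and its return shape are assumptions, not part of mabl's tooling.

```python
from pathlib import Path


def verify_citation(repo_root: str, rel_path: str, start: int, end: int) -> tuple[bool, str]:
    """Check that a cited file exists and the cited line range is real."""
    file_path = Path(repo_root) / rel_path
    if not file_path.is_file():
        return False, f"{rel_path}: file not found, do not cite it"

    # Count lines without loading the whole file into memory.
    with file_path.open(encoding="utf-8", errors="replace") as handle:
        line_count = sum(1 for _ in handle)

    if not (1 <= start <= end <= line_count):
        return False, f"{rel_path}: lines {start}-{end} are out of range (file has {line_count} lines)"
    return True, f"{rel_path}:{start}-{end}"
```

A check like this only proves the reference points at something that exists. Whether the cited lines actually support the agent's claim still needs a reviewer.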

What this solved: Without the ecosystem context, every agent invocation required a human to provide context ("this repo uses Terraform module naming pattern X, depends on Y, and PR reviews require Z"). With it, agents operate with the same context a senior engineer at mabl would have. Context drift (where an agent loses track of conventions mid-task) dropped from ~40% of our failures to <5%.

Layer 2: Skill System (MCP Servers as Shared Building Blocks)

The friction we hit: Developers naturally incorporate large amounts of context when working on any individual bug or feature. Without access to the same tools and capabilities that developers use, Claude was often working off of speculation based solely on code analysis. Developers will read Jira ticket details and screenshots, launch the app, work in Slack, and use internal tools to check data. Without those same skills we were handicapping Claude.

What we built: We integrated a handful of MCP servers and close to 40 unique skills that allow Claude to investigate the way a developer would investigate.

A few key servers that give agents and humans equivalent capabilities:

  1. Atlassian MCP Server: Agents and developers can query tickets, update statuses, add comments, search Confluence, and transition labels, all from Claude Code. An agent working on a ticket can automatically move it from planning-needed to implementation-needed when it submits a PR.

  2. mabl Testing MCP Server: Agents can trigger mabl test runs, parse results, and determine if a change broke existing tests. When an auto-fix agent encounters a CI failure, it queries this server to see if it's a flaky test (ignore) or a real regression (fix). The server also manages test environments and creates new mabl tests.

  3. Desktop Automation MCP Server: Agents can launch the mabl Desktop Trainer (our Electron app), execute UI interactions via Playwright, and capture screenshots. This solved the "Teaching Agents to See" problem for us. When the validation router determines that desktop changes were made, the implementer agent can visually verify the result using this MCP server, not just run unit tests.

These are configured at the workspace level using .mcp.json, so every developer using Claude Code or Cursor gets the same tool access out of the box.

Beyond the three core MCP servers, we've built 36+ reusable Skills, specialized knowledge modules that agents and developers invoke via /command:

  • Workflow skills: Worktree management, repo setup, build-and-wait-for-dependency, deploy previews, changeset creation

  • Cross-repo setup skills: Run the API, web UI, and desktop app all connected locally

  • Validation skills: Per-repo test runners (Java/Gradle, React/Vitest, TypeScript/Jest, Playwright/Electron), E2E test execution via mabl MCP, a validation router that automatically selects the correct test suite based on which files changed

  • Analysis & Planning skills: Type-specific analysis (bug/feature/debt), type-specific planning, difficulty estimation, ticket categorization

  • Operations skills: Rerun tests with custom runtime images, fetch execution logs, create mabl tests
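The validation router mentioned in the list above can be pictured as a small routing table from changed paths to test runners. This is a sketch under stated assumptions: the glob patterns, the runner labels, and the `route_validation` function are all hypothetical, and only the four runner pairings (Java/Gradle, React/Vitest, TypeScript/Jest, Playwright/Electron) come from the text.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table, checked top to bottom; first match wins.
# fnmatch's "*" also matches "/" so "*.java" covers nested paths.
ROUTES = [
    ("desktop/*", "playwright-electron"),
    ("*.java", "gradle"),
    ("*.gradle", "gradle"),
    ("*.tsx", "vitest"),
    ("*.ts", "jest"),
]


def route_validation(changed_files: list[str]) -> list[str]:
    """Pick the test runners to execute for a set of changed files."""
    suites: list[str] = []
    for path in changed_files:
        for pattern, suite in ROUTES:
            if fnmatchcase(path, pattern):
                if suite not in suites:
                    suites.append(suite)
                break  # stop at the first matching route for this file
    return suites
```

Files that match no route, such as documentation, select no suite, which keeps the agent's local feedback loop short for changes that cannot affect tests.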

Beyond the initial infrastructure, we've created a baseline that all developers can keep contributing new skills and rules to, making us organizationally smarter every day.

What this solved: MCP servers and skills turned our agents from "code generators" into "full-stack developers." An agent can now read a Jira ticket, generate an implementation plan, write code, run tests, verify UI changes visually (when the validation router determines desktop changes were made), submit a PR, and move the ticket to review-needed, all without human intervention. The human enters the loop at the approval gates between phases and always for the final PR approval.

Layer 3: Operations and Governance

The friction we hit: With AI increasing our coding throughput by 3-5x, we needed to scale quality enforcement without making human review the bottleneck. More PRs also means more conflicts, and engineers were spending more time than ever fixing build issues, resolving merge conflicts, and shepherding builds through pipelines.

What we built: We ruthlessly hunted down sources of conflict while incorporating agents into our PR review and operations processes.

AI Code Review

Every PR gets an AI code review via our claude-pr-review workflow. We pull in the relevant common rules from cross-repo-base as well as repo-specific CLAUDE.md files, then provide the review agent with explicit guidance and examples of blocking issues around security, correctness, and code standards. The reviewer can:

  • REQUEST_CHANGES (blocking): The PR cannot merge until the issues are addressed

  • COMMENT (non-blocking): Suggestions that don't block the merge

Auto-Fix Agents

When CI fails on a feature branch with an open PR, our auto-fix workflow diagnoses the failure and pushes a fix commit. We kept the scope small to begin with, focusing on simple fixes like linter errors, missing imports, simple type errors, and unused variable warnings.

We explicitly instruct the agent to avoid known thornier issues that require more judgment and oversight, such as test assertion failures, security vulnerabilities, missing dependencies, build configuration errors, and complex bugs requiring architectural changes.

The workflow also has circuit breakers to prevent fix-fail loops and battles between developers and the auto-fix agent. For example, if the last 2 commits are both tagged [auto-fix], the workflow stops. Or if a PR author previously reverted an auto-fix commit, the workflow won't attempt the same fix again. It recognizes that the author made a deliberate decision that the issue needs human attention.
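The two circuit breakers can be sketched as a single guard that runs before any fix is attempted. Everything named here is hypothetical (`Commit`, `should_attempt_auto_fix`, the commit-history shape). The sketch also simplifies the second rule: it stops all auto-fixing once the author has reverted any auto-fix commit, rather than tracking which specific fix was reverted. It relies on the fact that `git revert` by default produces a message of the form `Revert "<original subject>"`.

```python
from dataclasses import dataclass

AUTO_FIX_TAG = "[auto-fix]"


@dataclass
class Commit:
    message: str
    author: str


def is_revert(commit: Commit) -> bool:
    # Default subject line produced by `git revert`.
    return commit.message.startswith('Revert "')


def is_auto_fix(commit: Commit) -> bool:
    # A revert quotes the original subject, so exclude it explicitly.
    return AUTO_FIX_TAG in commit.message and not is_revert(commit)


def should_attempt_auto_fix(commits: list[Commit], pr_author: str,
                            max_consecutive: int = 2) -> tuple[bool, str]:
    """Decide whether to push another fix; `commits` is ordered oldest first."""
    # Breaker 1: the PR author reverted an earlier auto-fix on this branch.
    for commit in commits:
        if is_revert(commit) and AUTO_FIX_TAG in commit.message and commit.author == pr_author:
            return False, "author reverted an auto-fix; needs human attention"

    # Breaker 2: the most recent commits are all auto-fixes (fix-fail loop).
    recent = commits[-max_consecutive:]
    if len(recent) == max_consecutive and all(is_auto_fix(c) for c in recent):
        return False, f"last {max_consecutive} commits are auto-fixes; stopping"

    return True, "ok"
```

Checking the revert rule first is deliberate: a human decision to undo a fix should override any purely mechanical count of recent commits.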

PR Slash Commands (ChatOps)

A command handler workflow enables ChatOps-style commands via PR comments. Developers can trigger specific actions by commenting on a PR:

  • /claude-review and @claude: Trigger AI code review (available across all repos)

  • Repo-specific commands: /run-integration-tests, /run-all-tests, /deploy-preview, /run-functional-tests, /run-playwright-builds, /build-images, /run-browser-tests

Commands are acknowledged with emoji reactions and status comments. This gives developers fine-grained control over what runs without editing workflow files.
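Parsing a PR comment into commands could look like the following sketch. The function name `parse_commands`, the rule that a command must start its own line, and the treatment of an @claude mention as an alias for /claude-review are illustrative assumptions; the command names themselves are the ones listed above.

```python
import re

# Available in every repo, per the list above.
GLOBAL_COMMANDS = {"/claude-review"}

# A command must be the first token on a line (assumption of this sketch).
COMMAND_RE = re.compile(r"^[ \t]*(/[a-z][a-z0-9-]*)(?=\s|$)", re.MULTILINE)
# Match "@claude" as a mention, not as part of an email address.
MENTION_RE = re.compile(r"(?<!\w)@claude\b")


def parse_commands(comment_body: str, repo_commands: set[str]) -> list[str]:
    """Return the recognized commands in a PR comment, in order, deduplicated."""
    allowed = GLOBAL_COMMANDS | repo_commands
    found = [cmd for cmd in COMMAND_RE.findall(comment_body) if cmd in allowed]
    if MENTION_RE.search(comment_body):
        found.append("/claude-review")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(found))
```

Unknown commands are dropped silently here. A real handler would more likely reply with a status comment so the developer knows the command was not recognized.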

Bringing it together

We integrated this with GitHub 's native merge queue and Terraform-managed rulesets:

  • Org-level rules: One required human approval, squash-only merges, no force pushes, no branch deletion

  • Repo-level rules: Required status checks (build, test, security scanning), configurable concurrency limits

The flow for a typical agent-generated PR:

  1. Agent creates PR after passing pre-push validation gates

  2. AI code review runs automatically, can REQUEST_CHANGES or COMMENT

  3. If CI fails on lint/type issues, auto-fix agent attempts repair (with circuit breakers)

  4. Human reviewer evaluates the PR, informed by AI review comments

  5. Human approves (or requests changes)

  6. PR enters merge queue; full test suite runs

  7. PR merges
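The governance rule behind this flow (AI can block, only a human can approve) reduces to a small predicate. In practice this is enforced by GitHub rulesets rather than custom code, so the sketch below is purely illustrative: `Review` and `can_enter_merge_queue` are hypothetical, the list is assumed to hold each reviewer's latest review, and the state strings mirror the review states GitHub reports (APPROVED, CHANGES_REQUESTED, COMMENTED).

```python
from dataclasses import dataclass


@dataclass
class Review:
    reviewer_is_human: bool
    state: str  # "APPROVED", "CHANGES_REQUESTED", or "COMMENTED"


def can_enter_merge_queue(latest_reviews: list[Review], required_checks_passed: bool) -> bool:
    """Apply the rule: anyone can block, only a human approval unlocks the merge."""
    # An approval only counts when it comes from a human reviewer.
    human_approved = any(
        r.reviewer_is_human and r.state == "APPROVED" for r in latest_reviews
    )
    # A change request blocks the merge whether it came from AI or a human.
    blocked = any(r.state == "CHANGES_REQUESTED" for r in latest_reviews)
    return human_approved and not blocked and required_checks_passed
```

The asymmetry is the point of the design: an AI review appears only on the blocking side of the predicate, never on the approving side.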

What this solved: In February 2026, we're pacing at 370+ PRs/month in application repos alone (up from 230/month in August 2025). The AI code review catches issues before human reviewers spend time on them. The auto-fix agent resolves trivial CI failures (missing semicolons, unused imports, type mismatches) without human intervention. Human reviewers focus on the PRs that genuinely need judgment, like architectural decisions, security implications, and complex logic.

That's the infrastructure: shared context, shared tools, and governance that keeps humans in control without making them the bottleneck. But infrastructure is only useful if agents can actually navigate it.

In Part 2, we'll walk through how we put all three layers to work: a four-phase pipeline that takes a Jira ticket from concept to merged PR, with confidence-based gating that determines when agents proceed autonomously and when they stop to ask for help. We'll also share the results, and where agents still break.
