Rebuilding an AI Agent the Right Way: Measurement, Not Guesswork


April 14, 2026 · 11 min read · Testing Guide


Lauren Clayberg Leidal

February 25, 2026

We refactored mabl's test creation agent after nine months in production. But we didn't rebuild on gut feel. We analyzed 2,404 real customer sessions with an AI-powered review agent to measure exactly what was breaking and what to fix. Here's what production data at scale teaches you about building agents that actually work.

The hardest part of building production AI agents isn't shipping them; it's understanding whether they are getting smarter as they get used.

At mabl, we shipped our first agentic test creation system in spring 2025, well before most testing tools had AI on their roadmap. After nine months in production, we rebuilt it from the ground up. It wasn't broken (in fact, customers were using it with great success), but the original architecture was ready for an evolution.

The decision to refactor wasn't based on anecdotal feedback or internal intuition. We built an AI-powered review agent to analyze 2,404 real customer sessions across 439 accounts, measuring behavioral characteristics that traditional metrics can't capture: looping patterns, error recovery, assertion quality, decision-making consistency. The data told us exactly what to rebuild and, once we shipped the updates, proved the refactor worked.

What Production Usage Taught Us

Our first-generation agent worked. It could interpret screenshots, decide on next steps, interact with page elements, and generate assertions. Customers were creating tests with it daily.

But "working" and "working well at scale" aren't always the same thing. As usage grew, we started seeing certain patterns:

Sessions where the agent looped through the same failed action repeatedly
Click interactions that should have succeeded but didn't, with no recovery path
Tests that worked in demos but struggled with real enterprise UIs where you see nested iframes, custom components, and legacy authentication flows

Traditional metrics like "test save rate" couldn't tell us why these patterns emerged or how to fix them. The behavioral analysis we were able to glean from production usage held the key to those questions.

The V1 Architecture: Built for a Different Era

The original agent was built around a single-shot pattern: client sends state, server returns action, client executes, repeat. Every API call was stateless. The model never saw its own history. It had limited memory of what it had tried, what failed, or what it had already considered.

This was the pragmatic decision for early 2025, but as model capabilities improved and production usage revealed edge cases, the constraints started to cause pain in the testing process. The monolithic design meant that adding new context or tools would require changes across multiple files. Meanwhile, stateless execution meant the agent approached every decision in isolation, unable to learn from its own mistakes within a single session.
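To make the failure mode concrete, here is a minimal sketch of a stateless single-shot loop. It is an illustration under assumptions, not mabl's code: `ClientState`, `choose_action_stateless`, and the selector string are hypothetical, and the "model" is a stub that only ever sees the current snapshot.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    """Snapshot the client sends on every call (hypothetical shape)."""
    url: str
    last_action_failed: bool


def choose_action_stateless(state: ClientState) -> str:
    """Stand-in for a stateless model call: the decision depends only on
    the current snapshot, never on earlier turns."""
    return "click:#submit"


def run_session(max_steps: int) -> list[str]:
    """Client sends state, server returns an action, client executes, repeat."""
    actions = []
    state = ClientState(url="https://example.test/form", last_action_failed=False)
    for _ in range(max_steps):
        action = choose_action_stateless(state)
        actions.append(action)
        # Pretend the click never lands. The next call still carries no
        # record of which action was tried, so nothing pushes it elsewhere.
        state = ClientState(url=state.url, last_action_failed=True)
    return actions


actions = run_session(3)
```

With no record of what was attempted, the stub proposes the same click on every step, which is the looping pattern described above in miniature.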

The Refactored Architecture: Composable by Design

The new system replaced the monolith with a composable framework built around a few core principles.

A Shared Agent Framework

The test creation agent is now much smaller and acts as a thin layer on top of a shared base framework that handles the complete agent lifecycle: initializing context, registering tools, generating prompts, invoking the model through multi-round function calling loops, and returning results. A new agent only needs to define its system prompt, default capabilities, and any custom logic. Everything else is inherited.

This isn't just cleaner code. It's a bet on the future. We know we'll need agents for test debugging, maintenance, and planning. The framework means each new agent is a configuration layer rather than another monolith to maintain.
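One way to picture the "thin layer on a shared base" idea is a base class that owns the lifecycle while each agent supplies only its prompt and default tools. The sketch below is hypothetical throughout: `BaseAgent`, `CreationAgent`, and the stubbed `call_model` are illustrative names, and a real framework would call an LLM where the stub sits.

```python
from abc import ABC, abstractmethod
from typing import Callable

Tool = Callable[[str], str]


class BaseAgent(ABC):
    """Hypothetical shared base: owns the lifecycle so subclasses stay thin."""

    def __init__(self) -> None:
        self.tools: dict[str, Tool] = {}

    @abstractmethod
    def system_prompt(self) -> str: ...

    def default_tools(self) -> dict[str, Tool]:
        return {}

    def call_model(self, prompt: str, transcript: list[str]) -> str:
        """Placeholder for a real model call: requests each registered
        tool once, then reports that it is done."""
        used = {line.split(":", 1)[0] for line in transcript}
        for name in self.tools:
            if name not in used:
                return f"tool:{name}"
        return "done"

    def run(self, task: str, max_rounds: int = 5) -> list[str]:
        self.tools = self.default_tools()                  # register tools
        prompt = f"{self.system_prompt()}\nTask: {task}"   # generate the prompt
        transcript: list[str] = []
        for _ in range(max_rounds):                        # multi-round tool loop
            reply = self.call_model(prompt, transcript)
            if reply == "done":
                break
            name = reply.removeprefix("tool:")
            transcript.append(f"{name}:{self.tools[name](task)}")
        return transcript                                  # return results


class CreationAgent(BaseAgent):
    """A new agent defines only its prompt and default capabilities."""

    def system_prompt(self) -> str:
        return "You create end-to-end tests."

    def default_tools(self) -> dict[str, Tool]:
        return {"screenshot": lambda task: "captured"}


result = CreationAgent().run("log in")
```

Under this shape, a debugging or maintenance agent would be another small subclass rather than a second copy of the loop.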

Artifacts Replace Monolithic State

The sprawling state object was replaced with focused, schema-validated artifacts. These are self-contained pieces of context like the test outline, execution state, application state (screenshot, URL, tab), and workspace context. Each artifact knows how to describe itself for the system prompt and how to contribute to the overall context.

This makes prompt building automatic. When a new artifact is registered, the prompt updates to include it. Schema validation catches malformed context at runtime, avoiding mysterious model behavior downstream.
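A small sketch of the artifact idea, using only the standard library. The class names (`ApplicationState`, `OutlineArtifact`, `ArtifactRegistry`) and their fields are assumptions for illustration, and the `__post_init__` checks are a stand-in for whatever schema validation a real system would use.

```python
from dataclasses import dataclass
from typing import Protocol


class Artifact(Protocol):
    name: str

    def describe(self) -> str: ...


@dataclass
class ApplicationState:
    url: str
    screenshot_id: str
    name: str = "application_state"

    def __post_init__(self) -> None:
        # Minimal stand-in for schema validation: reject malformed context early.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"invalid url: {self.url!r}")

    def describe(self) -> str:
        return f"Current page: {self.url} (screenshot {self.screenshot_id})"


@dataclass
class OutlineArtifact:
    steps: list[str]
    name: str = "test_outline"

    def __post_init__(self) -> None:
        if not all(isinstance(s, str) and s for s in self.steps):
            raise ValueError("steps must be non-empty strings")

    def describe(self) -> str:
        return "Steps so far: " + "; ".join(self.steps)


class ArtifactRegistry:
    """Collects artifacts and assembles the prompt from their descriptions."""

    def __init__(self) -> None:
        self._artifacts: dict[str, Artifact] = {}

    def register(self, artifact: Artifact) -> None:
        self._artifacts[artifact.name] = artifact

    def build_prompt(self, base: str) -> str:
        sections = [a.describe() for a in self._artifacts.values()]
        return "\n".join([base, *sections])


registry = ArtifactRegistry()
registry.register(ApplicationState(url="https://example.test", screenshot_id="s1"))
registry.register(OutlineArtifact(steps=["open login page"]))
prompt = registry.build_prompt("You create end-to-end tests.")
```

Registering another artifact changes the prompt without touching `build_prompt`, and a malformed URL fails at construction time instead of surfacing later as odd model behavior.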

Provider-Agnostic LLM Integration

We replaced direct SDK calls with a provider abstraction. Each provider implements a small interface: convert the request, generate a response, convert back to a common message format. A factory wraps these into a unified provider with automatic content normalization, built-in function calling loops, streaming support, and retry logic.

Adding a new LLM provider means implementing a handful of functions. Switching between providers is a routing decision, not a rewrite. This has already paid off: we can test the same agent behavior across different models to assess which performs better for specific sub-tasks.
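The three-function interface and the wrapping factory could look roughly like this. Everything here is an assumed shape: `Provider`, `Message`, `make_unified`, and the toy `EchoProvider` are hypothetical, and the sketch covers only conversion and retries, not streaming or function calling.

```python
import time
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Message:
    role: str
    content: str


class Provider(Protocol):
    def to_request(self, messages: list[Message]) -> dict: ...
    def generate(self, request: dict) -> dict: ...
    def from_response(self, response: dict) -> Message: ...


class EchoProvider:
    """Toy provider; a real one would wrap a vendor SDK in generate()."""

    def __init__(self, failures_before_success: int = 0) -> None:
        self.remaining_failures = failures_before_success

    def to_request(self, messages: list[Message]) -> dict:
        return {"prompt": messages[-1].content}

    def generate(self, request: dict) -> dict:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("transient failure")
        return {"text": request["prompt"].upper()}

    def from_response(self, response: dict) -> Message:
        return Message(role="assistant", content=response["text"])


def make_unified(
    provider: Provider, retries: int = 2, backoff_s: float = 0.0
) -> Callable[[list[Message]], Message]:
    """Wrap any provider in the same call shape, with simple retry logic."""

    def complete(messages: list[Message]) -> Message:
        request = provider.to_request(messages)
        for attempt in range(retries + 1):
            try:
                return provider.from_response(provider.generate(request))
            except ConnectionError:
                if attempt == retries:
                    raise  # out of retries: surface the error to the caller
                time.sleep(backoff_s)

    return complete


complete = make_unified(EchoProvider(failures_before_success=1))
reply = complete([Message(role="user", content="hi")])
```

Because callers only see `complete`, swapping `EchoProvider` for another implementation is a one-line routing change.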

Smart Tool Execution

The original system ran every tool on the client. The new architecture separates client-side and server-side tool execution, which unlocks significant capabilities.

Server-side tools (like fetching available flows or analyzing workspace context) resolve without any client round-trip. Client-side tools handle actions that must happen in the browser: clicks, text entry, waits. This separation allows us to build intelligence into agents and share it across different agent types without customers having to update their client.

The practical impact goes beyond performance. We've built a server-side library of internal tools that make agents more aware of mabl-specific data and patterns. When we improve agent intelligence or add new capabilities, those upgrades happen immediately on the server. The agent gets smarter without any client-side deployment.
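The client/server split can be sketched as a dispatch step that either resolves a tool call in place or forwards it to the browser. The tool names, `ToolSpec`, and the returned dictionary shape are illustrative assumptions, not a description of mabl's protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Side(Enum):
    SERVER = "server"
    CLIENT = "client"


@dataclass
class ToolSpec:
    name: str
    side: Side
    # Only server-side tools carry an implementation in this sketch.
    run: Optional[Callable[[dict], str]] = None


TOOLS = {
    "fetch_flows": ToolSpec("fetch_flows", Side.SERVER, lambda args: "flows: login, checkout"),
    "click": ToolSpec("click", Side.CLIENT),
}


def dispatch(tool_name: str, args: dict) -> dict:
    """Resolve server tools immediately; hand client tools to the browser."""
    spec = TOOLS[tool_name]
    if spec.side is Side.SERVER and spec.run is not None:
        return {"resolved": True, "result": spec.run(args)}
    return {"resolved": False, "forward_to_client": {"tool": tool_name, "args": args}}


server_result = dispatch("fetch_flows", {})
client_result = dispatch("click", {"selector": "#submit"})
```

Adding a new server-side entry to `TOOLS` changes what the agent can do without any change on the client, which is the deployment benefit described above.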

Conversation Memory as Architecture

The most consequential change is that the agent now maintains conversation history. Instead of treating each action as an isolated request, the model can see what it's tried, what worked, and what failed.

This architectural shift directly addresses the looping problem we saw in production data. When the agent clicks an element and nothing happens, it can now reason, "I already tried this approach; let me try a different strategy," rather than repeating the same failed action over and over. The conversation becomes the state that enables learning within a session.
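A toy version of history-aware action selection, to show why keeping the conversation changes behavior. The `Turn` and `Conversation` types and the rule "skip anything that already failed" are simplifying assumptions; a real agent would pass the history to the model rather than apply a fixed rule.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    action: str
    succeeded: bool


@dataclass
class Conversation:
    turns: list[Turn] = field(default_factory=list)

    def failed_actions(self) -> set[str]:
        return {t.action for t in self.turns if not t.succeeded}


def choose_action(candidates: list[str], conversation: Conversation) -> Optional[str]:
    """Stand-in for the model: pick the first strategy not already known to fail."""
    failed = conversation.failed_actions()
    for action in candidates:
        if action not in failed:
            return action
    return None


def run(candidates: list[str], works: set[str], max_steps: int = 5) -> list[str]:
    """Try strategies until one succeeds, recording every turn in the history."""
    conversation = Conversation()
    for _ in range(max_steps):
        action = choose_action(candidates, conversation)
        if action is None:
            break  # every known strategy has already failed
        succeeded = action in works
        conversation.turns.append(Turn(action, succeeded))
        if succeeded:
            break
    return [t.action for t in conversation.turns]


attempts = run(
    ["click:#submit", "press:Enter", "click:text=Submit"],
    works={"click:text=Submit"},
)
```

Each failed action is attempted once and then avoided, so the session moves through three distinct strategies instead of repeating the first.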


Measuring What Matters: The Review Agent Framework

Most vendors can't tell you how their agents behave within sessions: Are they making good initial decisions? Are they recovering from errors effectively? Are the outputs high-quality? Instead, they continue to ship features and, at best, measure usage.

We needed more than that. Traditional metrics hide the behavioral patterns that actually matter for agent quality. We wanted a system that would let us continuously measure and improve agent behavior at scale, so we built a review agent. This AI system is designed to analyze agent sessions with the same rigor a human expert would apply, but across thousands of sessions simultaneously.

How the Review Agent Works

We designed a structured analysis framework that evaluates each test creation session across multiple dimensions:

Click and element interaction failures: Did the agent successfully interact with page elements, or did it struggle with targeting or execution?
Delete step usage and correctness: When the agent deleted steps, was it making good corrections or creating new problems?
Assertion quality: Are assertions meaningful verifications, or overly specific checks tied to implementation details?
Looping behavior: Is the agent making productive retries with different strategies, or unproductively spiraling through the same failed approach?
Missing capabilities: Are there patterns where the agent lacks the tools it needs to complete the task?
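The dimensions above suggest a structured record per session. As a sketch only, the reviewer's output could be parsed into a typed record and sanity-checked before it is aggregated; the field names and the JSON shape below are assumptions, not the actual review schema.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SessionReview:
    """One reviewed session, with one or more fields per dimension (hypothetical)."""
    session_id: str
    click_failures: int
    deletes_total: int
    deletes_problematic: int
    assertions_total: int
    assertions_overly_specific: int
    unproductive_loops: int
    missing_capabilities: tuple[str, ...]


def parse_review(raw: str) -> SessionReview:
    """Parse reviewer output and reject records that are internally inconsistent."""
    data = json.loads(raw)
    expected = {f.name for f in fields(SessionReview)}
    if set(data) != expected:
        raise ValueError(f"unexpected fields: {sorted(set(data) ^ expected)}")
    data["missing_capabilities"] = tuple(data["missing_capabilities"])
    review = SessionReview(**data)
    if review.deletes_problematic > review.deletes_total:
        raise ValueError("problematic deletes exceed total deletes")
    if review.assertions_overly_specific > review.assertions_total:
        raise ValueError("overly specific assertions exceed total assertions")
    return review


raw_review = json.dumps({
    "session_id": "example-1",
    "click_failures": 2,
    "deletes_total": 1,
    "deletes_problematic": 0,
    "assertions_total": 3,
    "assertions_overly_specific": 1,
    "unproductive_loops": 0,
    "missing_capabilities": ["tab_switching"],
})
review = parse_review(raw_review)
```

Validating each record up front keeps a malformed reviewer response from silently skewing the aggregate numbers.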

We ran this analysis across hundreds of sessions from both the old and refactored agent, excluding internal usage to ensure results reflected real customer patterns.

What the Numbers Revealed

The data told an exciting story: clean sessions (where a test was saved successfully with no looping or interaction issues) increased roughly 4.5x; severely problematic sessions (where issues compound across looping, click failures, and problematic deletes) dropped by over 90%.

Even more interesting is what we saw in the behavioral patterns:

Looping became predictive: In the old agent, looping occurred at roughly the same rate regardless of whether the test was ultimately saved. In the refactored agent, saved tests showed meaningfully fewer loops than unsaved tests. The agent was now recovering from issues rather than getting stuck in them.

Click issues became correlated with outcome: The old agent had similar click issue rates whether the test was saved or not. It couldn't differentiate between interaction problems that were recoverable and ones that were fatal. The refactored agent showed a clear gap between saved and unsaved sessions, indicating better error recovery.

Assertions got smarter, not just fewer: Assertions per session dropped significantly, but assertion quality improved by several percentage points. The agent's verifications got more focused and meaningful, instead of spamming assertions to see what sticks. Specific categories of bad assertions saw dramatic improvements: errors where the agent confused an element's display text with its underlying value dropped by over 80%.

Delete behavior told a story about decision quality: The old agent used delete steps in over half of sessions, and more than half of those deletes caused more problems. The refactored agent used deletes far less often, and when it did delete, the vast majority were things that were correct to delete. The agent was simply making better initial decisions.

This level of behavioral analysis isn't standard practice in AI development, and it's impossible without structured evaluation at scale. With our baseline behavioral metrics from thousands of sessions, we know that when architectural changes happen, we don't have to guess at whether they worked. We quantify agent behavior the same way we'd measure any production system: with data, not demos.
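To show how "looping became predictive" can be measured, here is a sketch that classifies sessions and compares loop rates between saved and unsaved tests. The `Session` fields are hypothetical, the rule that a "severe" session shows all three issues is one reading of the definition above, and the two sample lists are synthetic toy data, not mabl's results.

```python
from dataclasses import dataclass


@dataclass
class Session:
    saved: bool
    looped: bool
    click_issue: bool
    problematic_delete: bool


def is_clean(s: Session) -> bool:
    """Saved successfully with no looping or interaction issues."""
    return s.saved and not (s.looped or s.click_issue)


def is_severe(s: Session) -> bool:
    """Assumed reading of 'compounding': all three issue types in one session."""
    return s.looped and s.click_issue and s.problematic_delete


def loop_rate(sessions: list[Session], saved: bool) -> float:
    group = [s for s in sessions if s.saved == saved]
    return sum(s.looped for s in group) / len(group) if group else 0.0


def loop_gap(sessions: list[Session]) -> float:
    """Unsaved loop rate minus saved loop rate; near zero means looping
    says nothing about whether the test was saved."""
    return loop_rate(sessions, saved=False) - loop_rate(sessions, saved=True)


# Synthetic samples, invented purely to exercise the functions.
old_like = [
    Session(saved=True, looped=True, click_issue=False, problematic_delete=False),
    Session(saved=True, looped=False, click_issue=False, problematic_delete=False),
    Session(saved=False, looped=True, click_issue=True, problematic_delete=True),
    Session(saved=False, looped=False, click_issue=True, problematic_delete=False),
]
new_like = [
    Session(saved=True, looped=False, click_issue=False, problematic_delete=False),
    Session(saved=True, looped=False, click_issue=False, problematic_delete=False),
    Session(saved=False, looped=True, click_issue=True, problematic_delete=True),
    Session(saved=False, looped=False, click_issue=False, problematic_delete=False),
]

old_gap = loop_gap(old_like)
new_gap = loop_gap(new_like)
```

In the first toy sample the gap is zero because saved and unsaved sessions loop equally often; in the second, loops show up only in unsaved sessions, so the gap opens up.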

Why Traditional Metrics Can Mislead

One metric initially looked concerning: test save rate dipped after the refactor. The review agent was critical in telling the more complete story.

Sessions with zero assertions increased substantially. These were incomplete or abandoned sessions where users left before reaching the assertion phase; these were not agent failures. Meanwhile, among sessions that made it to assertions, save rate improved and assertion quality increased significantly.

This is a great example of why behavioral analysis matters: simplified, aggregate outcome data doesn't take user engagement patterns into consideration. Understanding how the agent performs within completed sessions gives you the signal you actually need to improve the system.
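The arithmetic behind this effect is easy to reproduce. In the sketch below, the counts are invented for illustration (they are not mabl's numbers) and `Outcome` is a hypothetical record; the point is only that the overall save rate can fall while the save rate among sessions that reached assertions rises.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    assertions: int
    saved: bool


def save_rate(sessions: list[Outcome]) -> float:
    return sum(s.saved for s in sessions) / len(sessions) if sessions else 0.0


def save_rate_reached_assertions(sessions: list[Outcome]) -> float:
    """Save rate restricted to sessions with at least one assertion."""
    return save_rate([s for s in sessions if s.assertions > 0])


# Toy "before": 1 abandoned session, 9 reached assertions, 6 of those saved.
before = [Outcome(0, False)] + [Outcome(2, True)] * 6 + [Outcome(2, False)] * 3
# Toy "after": 4 abandoned sessions, 6 reached assertions, 5 of those saved.
after = [Outcome(0, False)] * 4 + [Outcome(1, True)] * 5 + [Outcome(1, False)]

overall_before, overall_after = save_rate(before), save_rate(after)
reached_before = save_rate_reached_assertions(before)
reached_after = save_rate_reached_assertions(after)
```

Here the overall rate drops from 6 in 10 to 5 in 10, while the rate among sessions that reached assertions climbs from 6 in 9 to 5 in 6: more abandoned sessions in the denominator, not worse agent behavior.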

Identifying What to Build Next

The review agent also gave us a prioritized roadmap. In looking at sessions where the agent lacked the tools it needed, we were able to see the gaps and build fixes for them: tab switching capabilities (so agents can handle multi-tab workflows), access to ARIA snapshots for better accessibility tree understanding, and improved element interaction with the experimental model options in mabl Labs all came to us from these sessions.

These aren't guesses at what customers want; they come from a structured analysis of real session failures, ranked in order of importance by the number of sessions impacted by the lack of tooling. The review agent continues to surface new capability gaps as we analyze more sessions, giving us a data-driven backlog.
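Ranking gaps by sessions impacted is a simple counting exercise. In this sketch the input shape (session id mapped to the capabilities the reviewer flagged as missing), the capability labels, and the session ids are all hypothetical.

```python
from collections import Counter


def rank_capability_gaps(flagged: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Order missing capabilities by how many sessions they affected."""
    counts: Counter = Counter()
    for gaps in flagged.values():
        counts.update(set(gaps))  # count each session once per capability
    # Most-impacting first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


flagged_sessions = {
    "s1": ["tab_switching"],
    "s2": ["tab_switching", "aria_snapshot"],
    "s3": ["tab_switching", "tab_switching"],  # repeated within one session
}
backlog = rank_capability_gaps(flagged_sessions)
```

Counting sessions rather than raw mentions keeps one noisy session from pushing a capability to the top of the backlog.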

What We Learned

Tech debt in AI compounds faster than traditional software. The patterns and capabilities available to agent builders are evolving on a timeline of months. An architecture that was pragmatic in spring 2025 was constraining by early 2026. Planning for refactoring isn't a sign of poor initial design. It's a recognition of how fast the ground is shifting.

Agent observability requires more than metrics dashboards. Session-level success/failure rates hide the behavioral patterns that actually matter: whether the agent is making good initial decisions, recovering from errors effectively, and generating quality outputs. We found that a structured AI-powered review process revealed insights that no amount of aggregate data could surface.

Composability is the hedge against uncertainty. We don't know what the best model for each agent sub-task will be in six months, or what new capabilities we'll want to add. The refactored architecture (with its provider abstraction, artifact system, and shared framework) means we can adapt to whatever comes next without another ground-up rewrite.

Measure behavior, not just outcomes. The refactored agent's test save rate dipped slightly, which would have been alarming in isolation. Understanding the why through nuanced session analysis revealed that the actual agent behavior improved dramatically across every quality dimension. The outcome data made more sense when accounting for the behavioral implications.

Why This Refactor Was an Investment, Not a Reset

In traditional software, a nine-month rebuild signals architectural failure. In AI agent development in 2025, it signals discipline.

The model capabilities, function calling patterns, and production learnings available today didn't exist when we shipped V1. Companies that aren't refactoring their early AI systems are either not learning from production usage, or they're locked into architectures that can't evolve, and both of those are problems.

The composable framework, provider abstraction, and artifact system mean future improvements can be additive: new agents, new tools, new models, all without touching core architecture. This refactor bought us an architectural runway that should carry us through the next phase of AI evolution.

More importantly, we now have the measurement infrastructure to validate those improvements. The review agent framework isn't going away; it will continue to be how we evaluate every architectural change going forward.

What This Means for Testing Strategy

If you're evaluating AI-powered testing tools, the refactor story should actually be reassuring. It means we're learning in production, measuring rigorously, and willing to invest in long-term architecture.

Ask any AI testing vendor these questions:

Can they explain how they measure agent quality beyond pass/fail rates?
Do they have production behavioral data at scale, or just demo videos?
Is their architecture locked to a single model provider?
Can they show you structured failure mode analysis?

For mabl, the answers are yes. We built the measurement systems to make them yes.

The review agent framework continues to evaluate new production sessions as they happen, analyzing behavioral patterns to understand how the agent performs. This informs our architectural decisions without training on customer data: we measure agent behavior, not test content.

The composable architecture means we can adapt to new model capabilities without another rebuild. Together, these capabilities (rigorous measurement and architectural flexibility) are the foundation for building AI agents that get measurably better over time, not just different.

The AI agent space is moving fast enough that the systems we build today will need to evolve tomorrow. The goal isn't to build the perfect architecture on the first try. It's to build one that can grow, and to have the observability to know whether the changes you're making are actually working. For mabl's test creation agent, the combination of a composable architecture and AI-powered behavioral analysis gave us both.

