The Dashboard That Lies: Why Your 94% Code Coverage Means Nothing
Last Tuesday at 03:47 UTC, a payment service deployed with 94.3% line coverage, 2,400 passing E2E tests, and a pipeline that stayed green for 14 consecutive days. By 04:12, the checkout flow was returning HTTP 500s for roughly 8% of users hitting a specific edge case in the VAT calculation logic—a branch that was technically "covered" by a test asserting only that the function didn't throw, not that it returned the correct monetary value. The monitoring team didn't catch it for six hours because the error rate stayed below the 0.1% PagerDuty threshold, and the QA metrics dashboard continued displaying its usual verdant glow.
This is the observability trap: we instrument production services with OpenTelemetry and Jaeger, track P99 latencies in Prometheus 2.47, and alert on memory saturation, yet we treat the test suite like a binary switch. Green or red. Pass or fail. We don't observe the *quality* of our quality assurance, and we certainly don't measure the right things. Code coverage percentage, total test count, and "build stability" are vanity metrics that tell you more about organizational politics than software reliability. They are the QA equivalent of measuring developer productivity by lines of code committed.
What follows is a metrics schema designed for engineers who have outgrown JUnit XML reports and Allure pie charts. These are signals that predict production incidents, expose architectural rot, and reveal when your test suite has become a liability rather than a safety net. We will cover specific SQL queries for calculating flake rates, Grafana 10 dashboard configurations for tracking MTTR (Mean Time To Repair) for test failures—not production incidents—and the counter-intuitive practice of *deleting* coverage data to improve signal fidelity.
The Three Metrics That Actually Predict Production Incidents
If you can only instrument three dimensions, choose test-flake rate, MTTR for test-suite failures, and bug-escape rate by architectural layer. These correlate with production stability at r > 0.7 in longitudinal studies of CI/CD maturity, while traditional metrics like "total automated test cases" correlate at r < 0.2 (DORA State of DevOps 2023, internal validation at Google and Microsoft).
Test-Flake Rate: The Architecture Smell Index
A flaky test is one that produces both pass and fail outcomes for the same code revision, with no code changes in between. Calculate your flake rate over a 30-day sliding window:
WITH test_runs AS (
SELECT
test_case_id,
git_sha,
DATE_TRUNC('day', executed_at) as run_date,
CASE WHEN status IN ('passed', 'failed') THEN status END as final_status
FROM test_executions
WHERE executed_at >= CURRENT_DATE - INTERVAL '30 days'
),
flaky_detections AS (
SELECT
test_case_id,
git_sha,
COUNT(DISTINCT final_status) as distinct_statuses
FROM test_runs
GROUP BY test_case_id, git_sha
HAVING COUNT(DISTINCT final_status) > 1
)
SELECT
(COUNT(DISTINCT flaky_detections.test_case_id)::FLOAT /
COUNT(DISTINCT test_runs.test_case_id)) * 100 as flake_rate_percent
FROM test_runs
LEFT JOIN flaky_detections ON test_runs.test_case_id = flaky_detections.test_case_id;
Target: < 1% for unit tests, < 3% for integration tests, and < 5% for E2E suites (Selenium 4.15, Playwright 1.40, or Cypress 13). Above these thresholds, you're not measuring test stability; you're measuring race conditions, improper test isolation, or non-deterministic fixtures. In JUnit 5.10, annotate suspect classes with @Execution(ExecutionMode.SAME_THREAD) to force single-threaded execution and isolate thread-safety issues. Playwright 1.40's trace viewer can pinpoint timing issues by comparing DOM snapshots between failed and successful runs.
High flake rates predict production Heisenbugs. If your tests cannot achieve consistency in a controlled environment, your production code is likely harboring timing-dependent state mutations that will surface at 2 AM on a Saturday.
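For teams that export test results as flat records rather than warehousing them in SQL, the same detection logic can be sketched in plain Python. The record shape here (test id, git SHA, status tuples) is a hypothetical export format, not a specific tool's schema:

```python
from collections import defaultdict

def flake_rate(runs):
    """
    runs: iterable of (test_case_id, git_sha, status) tuples,
          with status in {"passed", "failed"}.
    Returns the percentage of tests that both passed and failed
    on at least one identical revision.
    """
    statuses = defaultdict(set)  # (test_id, sha) -> observed statuses
    all_tests = set()
    for test_id, sha, status in runs:
        all_tests.add(test_id)
        if status in ("passed", "failed"):
            statuses[(test_id, sha)].add(status)

    # A test is flaky if any single revision shows more than one outcome
    flaky = {test_id for (test_id, _), seen in statuses.items() if len(seen) > 1}
    return len(flaky) / len(all_tests) * 100 if all_tests else 0.0
```

The key detail, in both the SQL and this sketch, is grouping by revision: a test that failed before a fix and passed after it is not flaky, only one that disagrees with itself on the same SHA.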
MTTR for Test-Suite Failures: The Velocity Killer
Most teams track MTTR for production incidents—how long from PagerDuty alert to deployed fix. Few track how long a broken build blocks the main branch. Calculate this as the time from first red commit to green CI completion on the default branch:
import pandas as pd

def calculate_test_mttr(ci_logs: pd.DataFrame) -> float:
    """
    ci_logs: DataFrame with columns [commit_sha, status, timestamp, branch]
    Returns the mean hours from a red build on main to the next green build.
    """
    main = ci_logs[ci_logs['branch'] == 'main'].sort_values('timestamp')
    failures = main[main['status'] == 'failure']
    successes = main[main['status'] == 'success']

    repair_times = []
    for _, failure in failures.iterrows():
        # Find the first success on main after this failure
        later = successes[successes['timestamp'] > failure['timestamp']]
        if not later.empty:
            fix = later.iloc[0]
            repair_times.append(
                (fix['timestamp'] - failure['timestamp']).total_seconds() / 3600
            )
    return sum(repair_times) / len(repair_times) if repair_times else 0.0
Industry benchmark from top-quartile teams: < 2 hours for monorepos, < 45 minutes for microservices with independent pipelines. If your MTTR exceeds 4 hours, you have a "broken windows" problem—developers learn to ignore red builds, batch commits increase, and the feedback loop decays. Track this in Grafana by ingesting GitHub Actions webhook payloads (or GitLab CI Pipeline Events API v4) into Prometheus using the github_actions_workflow_run_duration_seconds metric, segmented by conclusion status.
Bug-Escape Rate by Layer: The Swiss Cheese Model
Measure defects found in production divided by total defects (production + pre-production), categorized by the architectural layer where the fix was applied:
| Layer | Detection Target | Escape Rate Threshold | Instrumentation Method |
|---|---|---|---|
| Unit | Logic errors, null pointers | < 2% | JaCoCo 0.8.11 coverage gaps mapped to production crash logs |
| Integration | API contract violations | < 5% | Pact 4.6.3 verification failures vs. production 500s |
| E2E | User journey breaks | < 10% | Cypress/Playwright failures vs. Datadog RUM error spikes |
| Autonomous | Edge-case crashes, a11y violations | < 15% | SUSA cross-session crash detection vs. Firebase Crashlytics |
Calculate weekly:
SELECT
architectural_layer,
COUNT(CASE WHEN detected_in = 'production' THEN 1 END) * 100.0 /
COUNT(*) as escape_rate_percent
FROM defects
WHERE created_at >= DATE_TRUNC('week', CURRENT_DATE)
GROUP BY architectural_layer;
An escape rate climbing above 10% in your integration layer suggests your contract tests (Pact, Spring Cloud Contract 4.1) are testing happy paths rather than edge cases. In the E2E layer, rising escapes indicate your test data is stale or your selectors are too brittle to catch UI drift.
Vanity Metrics and the Dopamine Trap
Delete these from your executive summary immediately: Total Test Count, Raw Code Coverage Percentage, and Build Green Percentage Without Flake Adjustment.
The Test Count Anti-Pattern
In 2019, Uber's engineering team published a post-mortem on their "test inflation" crisis: between 2016 and 2018, their E2E test count grew from 1,200 to 7,000, while their deployment frequency dropped by 40%. The tests weren't finding bugs; they were creating friction. They deleted 70% of them—specifically, tests that hadn't failed in 12 months and covered code paths with < 0.01% production traffic. Deployment velocity recovered, and production incident rate remained flat.
If you're rewarding teams for "test quantity," you're incentivizing the copy-paste of trivial assertions. Instead, track test half-life: the median time between meaningful failures (failures that resulted in code changes, not test fixes). A test that hasn't failed meaningfully in 6 months is likely testing invariants, not behavior. Archive it.
Code Coverage Theater
JaCoCo 0.8.11, Istanbul, and Coverage.py 7.3 all default to the same headline metric: line coverage. A method with 100% line coverage can still harbor logic errors if its branch coverage is 50%. Worse, teams game the number by testing getters and setters, or by sprinkling // istanbul ignore next over "untestable" code (which is usually exactly the code that will fail).
Shift to mutation testing using Pitest 1.15 for Java or StrykerJS 8.2 for TypeScript. Mutation testing introduces small code changes (mutants) and verifies that tests fail. A coverage metric of 90% with a mutation score of 35% reveals a test suite full of mock-heavy, assertion-light tests that execute code but don't verify behavior. Target mutation scores > 70% for domain logic, > 40% for infrastructure glue.
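To see why mutation score exposes assertion-light suites, here is a toy illustration of the idea in Python. Real tools like Pitest and Stryker mutate bytecode or the AST automatically; this sketch hand-writes one mutant of the VAT logic from the opening incident:

```python
def apply_vat(net: float, rate: float) -> float:
    """Original implementation under test."""
    return net * (1 + rate)

def mutant(net: float, rate: float) -> float:
    """A 'mutant': the multiplication is replaced by addition."""
    return net + (1 + rate)

def weak_test(fn) -> bool:
    # Executes the code (100% line coverage) but asserts almost nothing,
    # exactly like the test that only checked the function didn't throw
    try:
        fn(100.0, 0.2)
        return True
    except Exception:
        return False

def strong_test(fn) -> bool:
    # Asserts the actual monetary value
    return fn(100.0, 0.2) == 120.0
```

The weak test passes for both the original and the mutant, so the mutant "survives" and drags the mutation score down; the strong test kills it. A 90% coverage / 35% mutation-score suite is full of weak tests.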
The "Green Build" Mirage
A 99% green build rate sounds healthy until you discover that 30% of your tests are quarantined—marked with @Disabled or .skip()—and another 20% are retries that passed on the third attempt. GitHub Actions and GitLab CI both default to showing "success" if the final retry passes, masking systemic instability.
Expose this with a Flake-Adjusted Reliability metric:
True Reliability = (Passed First Attempt / Total Runs) * 100
If this drops below 85%, your CI is producing noise, not signal. Configure your pipeline to stop early on the first failure (Maven Surefire's -Dsurefire.skipAfterFailureCount=1, the cypress-fail-fast plugin), and ban retry logic without incident tickets.
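Computed from raw run records, the metric is straightforward. This sketch takes one reading of the formula, treating each pipeline (with all its retries) as a single run and crediting only first-attempt passes; the record shape is a hypothetical CI log export:

```python
def flake_adjusted_reliability(runs):
    """
    runs: list of dicts like {"pipeline_id": ..., "attempt": int, "status": str}.
    True Reliability = first-attempt passes / total pipelines, as a percent.
    Retried attempts (attempt > 1) never improve the score.
    """
    first_attempts = {}
    for run in runs:
        # Record only the first-attempt outcome of each pipeline
        if run["attempt"] == 1:
            first_attempts[run["pipeline_id"]] = run["status"]
    if not first_attempts:
        return 0.0
    passed_first = sum(1 for status in first_attempts.values() if status == "passed")
    return passed_first / len(first_attempts) * 100
```

A pipeline that went red, then green on retry, counts as a failure here, which is precisely the instability the default "success" badge hides.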
Coverage-by-Persona: The Missing Dimension
Traditional coverage tools answer: "Did we execute this line?" They don't answer: "Did we simulate the user who has 200 items in their cart, uses a screen reader, and pays with a corporate AMEX?"
Modern autonomous QA platforms, including SUSA, approach this by deploying behavioral personas—autonomous agents that explore applications (APK uploads or web URLs) with distinct user profiles: the "Power User" who navigates via keyboard shortcuts, the "Accessibility-First" user relying on TalkBack (Android 14) or VoiceOver (iOS 17), the "Low-Bandwidth" user on 3G throttling. These 10 personas generate exploration graphs that map to actual user telemetry from Firebase Analytics or Amplitude.
The metric here is Journey Coverage: the percentage of unique user flows (sequences of three or more screens) observed in production RUM (Real User Monitoring) that have been exercised in testing. Not lines of code—state transitions.
# Journey extraction from production logs
from collections import defaultdict

def extract_journeys(events, session_timeout_min=30):
    """
    events: list of {user_id, timestamp, screen_name, action} dicts,
            with timestamp as a datetime.
    Returns: set of 3-screen tuples observed within a single session.
    """
    per_user = defaultdict(list)
    for event in sorted(events, key=lambda e: e['timestamp']):
        per_user[event['user_id']].append(event)

    journeys = set()
    for user_events in per_user.values():
        session = []
        last_ts = None
        for event in user_events:
            # A gap longer than the timeout starts a new session
            gap = (event['timestamp'] - last_ts).total_seconds() if last_ts else 0
            if gap > session_timeout_min * 60:
                session = []
            session.append(event['screen_name'])
            last_ts = event['timestamp']
            # Sliding window of 3 consecutive screens
            if len(session) >= 3:
                journeys.add(tuple(session[-3:]))
    return journeys

# The coverage gap is then the set difference against your test traces:
# coverage_gap = production_journeys - tested_journeys
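Journey Coverage itself then falls out as a set ratio. Here `production_journeys` and `tested_journeys` are assumed to be the journey sets extracted from RUM logs and from test traces respectively:

```python
def journey_coverage_percent(production_journeys: set, tested_journeys: set) -> float:
    """Share of production-observed journeys that testing has exercised."""
    if not production_journeys:
        return 100.0  # nothing observed in production yet
    covered = production_journeys & tested_journeys
    return len(covered) / len(production_journeys) * 100
```

Note the denominator is production journeys, not tested ones: a suite that exercises a thousand flows nobody uses still scores zero against the flows users actually take.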
When SUSA's autonomous exploration identifies a "dead button" (a clickable element with no attached navigation event) or an ANR (Application Not Responding) on a specific device model (Samsung Galaxy S21, Android 13), it generates a Playwright or Appium script targeting that specific journey. This closes the coverage gap not by executing more lines, but by traversing under-explored state machines.
Track Coverage-by-Persona in your dashboard: what percentage of your accessibility-critical paths (those where WCAG 2.1 Level A/AA violations are at stake) have been validated by an assistive technology persona versus a standard tap-through? If the ratio is skewed toward standard interaction, you're shipping a product that works for developers but fails for the roughly 15% of users who rely on assistive or adapted interaction.
Measuring Flakiness as Architecture Smell
Flaky tests are symptoms, not diseases. Root causes cluster into three categories: Time, State, and Environment. Instrument your test runner to tag failures with these dimensions.
Time-Based Flakiness
Occurs when tests depend on System.currentTimeMillis() or new Date() without mocking. In Java 21, use java.time.Clock injection. In JavaScript, use Sinon.js 17 or Jest 29's jest.useFakeTimers(). Measure this by running your suite in a loop:
#!/bin/bash
# stress-test.sh - Run suite 20 times, capture non-determinism
for i in {1..20}; do
mvn test -Dtest=OrderServiceTest 2>&1 | tee run_$i.log
done
# Analyze divergence
grep "Tests run:" run_*.log | sort | uniq -c
If you see mixed "Tests run: 15, Failures: 0" and "Tests run: 15, Failures: 1" for identical code, you have time coupling.
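The same clock-injection discipline applies in Python: a function that reads the wall clock directly can only be tested by patching globals, while an injected clock makes the test deterministic. A minimal sketch of the pattern:

```python
from datetime import datetime, timezone

def is_expired(deadline: datetime, now_fn=lambda: datetime.now(timezone.utc)) -> bool:
    """Time-dependent logic with an injectable clock (now_fn)."""
    return now_fn() >= deadline

# Deterministic test double: a fixed clock instead of the real one
fixed_now = datetime(2024, 6, 1, tzinfo=timezone.utc)
```

Production callers use the default argument; tests pass a frozen `now_fn`, so the suite produces the same result at 03:47 UTC as it does at noon.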
State-Based Flakiness
Shared test fixtures, database sequences, or filesystem temp directories. In Spring Boot 3.2, use @DynamicPropertySource to assign random ports and database names per test class. For PostgreSQL 16, use Testcontainers 1.19 with withReuse(false) to ensure isolation. Track this via Test Pollution Rate: the percentage of test failures that pass when run in isolation but fail in the full suite.
Environment-Based Flakiness
Browser version drift, device farm instability (BrowserStack vs. Sauce Labs), or network latency to third-party sandboxes. These are infrastructure failures masquerading as test failures. Use Playwright 1.40's trace: 'retain-on-failure' setting to capture network stall logs, and separate these into a distinct metric: Infrastructure Flake Rate vs. Code Flake Rate. If infrastructure flakes exceed 1%, your device farm SLA is violated, and you're training developers to ignore legitimate failures.
The Cost of Signal Latency: From Commit to Confidence
MTTR measures recovery time, but Signal Latency measures how long it takes to know you broke something. Break this down by pipeline stage:
| Stage | Target Latency | Metric Source | Failure Mode |
|---|---|---|---|
| Pre-commit (local) | < 2 min | Husky 8.0 git hooks | Lint/unit test failures |
| PR Validation | < 8 min | GitHub Actions pull_request workflow | Integration test failures |
| Post-merge | < 15 min | GitHub Actions push to main | E2E smoke test failures |
| Autonomous Exploration | < 2 hours | SUSA CLI in CI pipeline | Crash/ANR/security findings |
| Production Canary | < 5 min | Argo Rollouts analysis | P99 latency degradation |
The 2-hour window for autonomous exploration seems slow compared to unit tests, but it compensates with fidelity depth. While JUnit tests verify calculateTotal() returns 42, autonomous agents verify that rotating the device mid-transaction doesn't trigger an IllegalStateException in the Android 14 lifecycle. The metric to track is Time-to-Deep-Confidence: how long until you know the checkout flow works on a Pixel 7 with TalkBack enabled, not just that the API returns 200.
Optimize for feedback fidelity, not just speed. A 30-second pipeline that only runs mock-heavy unit tests provides false confidence. A 12-minute pipeline that includes containerized service tests with Testcontainers provides signal. Use GitHub Actions' job matrices to parallelize by persona or device profile, keeping the critical path under 10 minutes while running comprehensive exploration in parallel.
Security and Accessibility: Hard Metrics for Soft Failures
OWASP Mobile Top 10 (M1: Improper Credential Usage, M2: Inadequate Supply Chain Security, etc.) and WCAG 2.1 AA violations are often treated as "audit checklist" items rather than continuous metrics. This is a mistake. Security flaws and accessibility barriers are bugs with legal and financial blast radii.
Security Violation Half-Life
For each OWASP category, track the time from introduction (commit timestamp) to detection (SAST/DAST scan timestamp) to remediation (fix commit timestamp). The Half-Life is the median age of open vulnerabilities:
SELECT
owasp_category,
PERCENTILE_CONT(0.5) WITHIN GROUP (
ORDER BY EXTRACT(EPOCH FROM (fixed_at - introduced_at))/3600
) as half_life_hours
FROM security_findings
WHERE status = 'fixed'
AND introduced_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY owasp_category;
Target: < 24 hours for M1-M3 (credentials, supply chain), < 72 hours for M4-M7 (auth, crypto). Tools like OWASP Dependency-Check 9.0 or the Snyk CLI provide the raw data, but you must instrument the lifecycle. SUSA's autonomous agents specifically probe for OWASP Mobile Top 10 vulnerabilities—hardcoded credentials in APK resources, insecure network configurations in AndroidManifest.xml, and weak SSL implementations—generating JUnit XML reports that fail the build immediately on any M1-M5 finding.
Accessibility Violation Decay Rate
WCAG 2.1 AA violations detected by axe-core 4.8 or Android's Accessibility Scanner should decay to zero within the sprint they are introduced. Track A11y Debt Accumulation: the count of Level A and AA violations per 1,000 lines of changed code. If this increases sprint-over-sprint, your components are becoming less accessible, not more.
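A11y Debt Accumulation normalizes violation counts by churn. A sketch, assuming per-sprint totals taken from axe-core output and git diff stats:

```python
def a11y_debt_per_kloc(violations: int, changed_lines: int) -> float:
    """Level A/AA violations per 1,000 lines of changed code in a sprint."""
    if changed_lines == 0:
        return 0.0
    return violations / changed_lines * 1000
```

Tracking this per sprint, rather than as an absolute backlog count, separates "we inherited debt" from "we are actively creating it".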
The critical metric is Screen Reader Coverage: percentage of interactive elements reachable via linear navigation (swipe-through on mobile, tab-through on web) that have been validated with actual assistive technology, not just aria-label linting. SUSA's personas include screen reader navigation paths that verify TalkBack announcements match the visual label—catching "dead buttons" that are focusable but non-functional, a WCAG 4.1.2 failure that automated linting misses.
When to Stop Measuring: The Observability Ceiling
There is a point where additional QA metrics become organizational drag. The Observability Ceiling occurs when the cost of collecting, storing, and analyzing test data exceeds the value of the insights gained. For most teams, this hits around 50 distinct metrics.
Signs you've hit the ceiling:
- Dashboards that no one views for > 30 days
- Metrics that trigger alerts but never result in code changes
- "Test coverage" meetings that consume engineering hours without changing priorities
Apply the 80/20 heuristic: 80% of your quality risk comes from 20% of your code paths (the checkout flow, the auth service, the payment webhook). Focus intense observability on these critical paths—mutation testing, autonomous exploration, chaos engineering—while accepting lower fidelity (simple unit tests) for the long tail of internal admin tools.
Delete metrics that don't lead to action. If "API Response Time in Test Environment" doesn't correlate with production performance (it usually doesn't, due to network topology differences), stop tracking it. If "Code Coverage by Module" hasn't changed a prioritization decision in six months, archive the dashboard.
Building the QA Observability Stack: A Reference Architecture
A mature QA observability pipeline requires three layers: Instrumentation (how you collect), Aggregation (where you store), and Activation (how you respond).
Instrumentation Layer
- JUnit 5.10: Custom TestExecutionListener to emit OpenTelemetry traces for each test method, tagging with git SHA, test class, and flakiness score
- Playwright 1.40 / Cypress 13: Built-in tracing to S3-compatible storage (MinIO 2024), with metadata extraction for journey mapping
- SUSA CLI: Integrated into GitHub Actions via susa-action@v3, uploading APKs or URLs, receiving JUnit XML + SARIF security reports, feeding back into the pipeline
Aggregation Layer
- Prometheus 2.47: Scraping custom metrics from test runners via /actuator/prometheus (Spring Boot) or Pushgateway for short-lived CI jobs
- Grafana 10: Dashboards combining test metrics (test_flake_rate, mttr_hours) with production metrics (error_rate, apdex_score) to expose escape rates
- ClickHouse 24.1: For high-cardinality test execution logs (per-test duration, per-device-model results), queried via Grafana
Activation Layer
- GitHub Actions: Workflow commands to block merge on flake_rate > 3% or mutation_score < 70%
- Backstage 1.20: Software catalog integration showing per-service test health scores, pulling from Prometheus
- PagerDuty: Integration for "Test Suite Down" incidents when main branch stays red > MTTR threshold
Example Prometheus metric exposition:
# HELP test_flake_rate_30d Percentage of flaky test runs over 30 days
# TYPE test_flake_rate_30d gauge
test_flake_rate_30d{team="checkout", layer="e2e"} 4.2
# HELP test_mttr_hours Mean time to repair broken main branch
# TYPE test_mttr_hours gauge
test_mttr_hours{team="checkout"} 1.5
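In practice you would emit this with the official prometheus_client library; as a dependency-free sketch of what the text format requires (HELP and TYPE headers, then labeled samples), a minimal formatter looks like this:

```python
def prometheus_gauge(name: str, help_text: str, samples) -> str:
    """
    samples: list of (labels: dict, value: float) pairs.
    Emits the Prometheus text exposition format for one gauge.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Labels are rendered as key="value", comma-separated, sorted for stability
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)
```

Pushing the resulting text to a Pushgateway (or exposing it on a scrape endpoint) is all that is needed to land these metrics next to your production dashboards in Grafana.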
SUSA's integration fits naturally here: the CLI returns exit code 1 on critical security findings (OWASP M1-M3) or accessibility violations (WCAG Level A), preventing deployment. The JUnit XML output includes tags for cross-session learning—if a crash is found on a Samsung S21 in one run, subsequent runs prioritize that device profile, optimizing the exploration budget.
The Metrics You Should Delete This Quarter
If you own the QA dashboard, schedule these deletions:
- "Total Number of Test Cases" → Replace with "Test Half-Life" (median time between meaningful failures)
- "Code Coverage %" → Replace with "Mutation Score" (Pitest/Stryker) or "Journey Coverage" (RUM correlation)
- "Build Success Rate" → Replace with "Flake-Adjusted Reliability" (first-attempt pass rate)
- "Test Execution Time" → Replace with "Signal Latency by Stage" (time-to-confidence per layer)
- "Defects Found in QA" → Replace with "Bug Escape Rate by Layer" (production/QA ratio)
Migration path: Keep the old metrics visible but grayed out for one quarter to prevent panic, while highlighting the new metrics in weekly standups. When developers notice that "Mutation Score" correlates with Friday night pages while "Code Coverage %" doesn't, the cultural shift completes itself.
Delete the dashboard widgets showing "Test Cases Written per Sprint." They incentivize quantity over quality, and you will not miss them. Replace them with a single number: the Confidence Interval—the percentage of production hotfixes that were preceded by a failing test in the previous 7 days. When this hits 90%, you know your observability is working. When it stays below 40%, your metrics are theater, and your users are the unpaid QA team.
Test Your App Autonomously
Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts.
Try SUSA Free