Flaky Tests: Why They Really Happen (And How to Fix Them)
The Illusion of Randomness: Unmasking the True Causes of Flaky Tests
Flaky tests. The bane of every CI/CD pipeline, the silent productivity killer, the phantom bug that haunts late-night debugging sessions. The conventional wisdom often points to race conditions, network latency, or insufficient waits. While these are certainly *symptoms*, they rarely represent the *root cause*. We’ve all been there: a test passes reliably for days, then suddenly fails, only to pass again on the next run. The temptation is to slap in a Thread.sleep(5000) or add a retry mechanism, a temporary bandage that often obscures the underlying disease. This approach, while seemingly expedient, perpetuates a cycle of technical debt and erodes confidence in our automated testing suites.
This isn't about silver bullets or quick fixes. It's about dissecting the complex interplay of factors that contribute to test instability, and understanding that true flakiness stems from a deeper architectural and systemic disconnect, not mere transient environmental noise. We'll delve into a taxonomy of flakiness, exploring concrete examples and providing actionable strategies that go beyond superficial fixes. We'll discuss how modern testing platforms, like SUSA, are engineered to address these systemic issues by generating more robust and deterministic test artifacts.
The Taxonomy of True Flakiness
Instead of a vague "it's flaky," we need a structured understanding. Flakiness can be broadly categorized into several interconnected areas:
- State Management and Shared Resources: How tests interact with and modify shared application state.
- Event and Interaction Ordering: The non-deterministic sequence of asynchronous events in an application.
- Environment and Infrastructure Dependencies: The subtle timing and resource contention inherent in execution environments.
- Application Logic Under Test: How the application itself introduces non-determinism.
Let's unpack each of these with specific examples and engineering-level solutions.
1. State Management and Shared Resources: The Unseen Saboteur
This is perhaps the most insidious category of flakiness. When tests aren't properly isolated, they can leave behind side effects that impact subsequent tests. This is particularly problematic in parallel execution environments or when tests are run repeatedly against the same application instance.
#### 1.1. Global State and Singleton Misuse
Applications often rely on global state or singletons to manage application-wide configurations, user sessions, or data caches. If tests don't meticulously reset these states between runs, one test can inadvertently corrupt the environment for another.
Example: Consider a mobile app that uses a singleton UserManager to hold the currently logged-in user's data.
// In your Android app
public class UserManager {
    private static UserManager instance;
    private User currentUser;

    private UserManager() {}

    public static synchronized UserManager getInstance() {
        if (instance == null) {
            instance = new UserManager();
        }
        return instance;
    }

    public void loginUser(User user) {
        this.currentUser = user;
    }

    public User getCurrentUser() {
        return currentUser;
    }

    // Crucial for testing: a way to reset state
    public void logoutUser() {
        this.currentUser = null;
    }
}
If a test logs in a user and doesn't explicitly call UserManager.getInstance().logoutUser() afterwards, a subsequent test expecting no logged-in user will fail. This failure might appear random if tests are executed in different orders or if the application is restarted between runs.
The Fix:
- Test-Aware State Reset: The most robust solution is to design your application with testability in mind. This means providing explicit methods to reset or clear critical global states. For the UserManager example, logoutUser() is essential.
- Dependency Injection (DI): Frameworks like Dagger (Java/Kotlin) or Koin (Kotlin) allow you to manage dependencies. In a test environment, you can inject mock or resettable instances of your singletons, ensuring each test gets a clean slate.
- Example with Hilt (Android DI):
@Module
@InstallIn(SingletonComponent.class)
public abstract class AppModule {
    @Provides
    @Singleton
    public static UserManager provideUserManager() {
        return new UserManager(); // Production instance
    }
}
@Module
@TestInstallIn(
        components = SingletonComponent.class,
        replaces = AppModule.class // Replace the production module in tests
)
public abstract class TestAppModule {
    @Provides
    @Singleton
    public static UserManager provideTestUserManager() {
        // Inject a test-specific, resettable instance
        return new TestUserManager();
    }
}
Your test runner would then use TestAppModule to inject a TestUserManager that has a reset() method.
- Application-Level Reset: For UI tests, ensure your test setup tears down the application or restarts the activity/fragment to clear in-memory state. AndroidX Test, for instance, provides @Rules like ActivityScenarioRule that can relaunch activities.
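As a concrete sketch, the TestUserManager referenced above might look like the following (the User class here is a simplified stand-in for the example earlier; the explicit reset() hook is the important part):

```java
// Illustrative test double: same API surface as the UserManager example,
// plus an explicit reset() hook so each test starts with no session.
class User {
    final String name;
    User(String name) { this.name = name; }
}

public class TestUserManager {
    private User currentUser;

    public void loginUser(User user) { this.currentUser = user; }

    public User getCurrentUser() { return currentUser; }

    /** Typically called from an @After / tearDown hook. */
    public void reset() { this.currentUser = null; }

    public static void main(String[] args) {
        TestUserManager mgr = new TestUserManager();
        mgr.loginUser(new User("alice"));
        mgr.reset(); // simulate the per-test teardown
        System.out.println(mgr.getCurrentUser() == null); // prints "true"
    }
}
```

Because reset() is an ordinary method rather than a static re-initialization trick, it can be called from any test framework's teardown hook without reflection.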
#### 1.2. Database and Persistent Storage Side Effects
Tests that modify databases, SharedPreferences (Android), UserDefaults (iOS), or local storage can leave data that influences subsequent tests.
Example: A test that adds a new item to a user's shopping cart. If the next test expects an empty cart, it will fail.
The Fix:
- Database Seeding and Teardown:
- Before each test: Seed the database with known, clean data.
- After each test: Delete all data or roll back transactions. For SQL databases, this often involves TRUNCATE TABLE or DELETE FROM table statements, or using in-memory SQLite databases for tests.
- Example (JUnit 4 Rule for Robolectric):
import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import androidx.test.core.app.ApplicationProvider;
import org.junit.rules.ExternalResource;

public class ClearDatabaseRule extends ExternalResource {
    @Override
    protected void before() {
        // Ensure the database is clean before each test.
        // YourDbHelper and your_table are placeholders for your app's own names.
        Context context = ApplicationProvider.getApplicationContext();
        SQLiteDatabase db = new YourDbHelper(context).getWritableDatabase();
        db.execSQL("DELETE FROM your_table");
        db.close();
    }

    @Override
    protected void after() {
        // Optional: clean up after the test, though 'before' usually suffices for isolation
    }
}
- Preferences Reset: Call SharedPreferences.edit().clear().apply() or equivalent to reset settings before each test.

#### 1.3. Concurrent Access and Thread Safety
When multiple threads or asynchronous operations modify shared data simultaneously without proper synchronization, you get classic race conditions. While this is a bug in the application, it often manifests as test flakiness because the timing of these races is unpredictable.
Example: A counter incremented by a background service and a UI button tap.
// Simplified example: unsynchronized shared counter
private int counter = 0;

// Called by UI button
public void incrementFromUI() {
    counter++; // Potential race condition: counter++ is not atomic
}

// Called by background service
public void incrementFromService() {
    counter++; // Potential race condition
}
If both operations occur concurrently, the final counter value might be 1 instead of 2. A test expecting counter == 2 would fail.
The Fix:
- Synchronization Primitives: Use synchronized blocks, Lock objects (e.g., ReentrantLock), AtomicInteger, or concurrent collections (e.g., ConcurrentHashMap) in your application code to ensure thread-safe access to shared mutable state.
- Test for Concurrency: Design specific tests that deliberately trigger concurrent operations to expose these issues. Utilities like CountDownLatch and CyclicBarrier can help orchestrate concurrent threads in tests.
- Example using CountDownLatch:
import static org.junit.Assert.assertEquals;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.junit.Test;

// ... in your test
@Test
public void testConcurrentCounterIncrement() throws InterruptedException {
    int numThreads = 10;
    ExecutorService executor = Executors.newFixedThreadPool(numThreads);
    CountDownLatch latch = new CountDownLatch(numThreads);
    Counter counter = new Counter(); // Assume Counter has a thread-safe increment

    for (int i = 0; i < numThreads; i++) {
        executor.submit(() -> {
            counter.increment();
            latch.countDown();
        });
    }

    latch.await(); // Wait for all threads to complete
    assertEquals(numThreads, counter.getValue()); // Assert final value
    executor.shutdown();
}
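The Counter assumed in the test above can be implemented without explicit locks by backing it with AtomicInteger, which makes each increment a single atomic read-modify-write:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Thread-safe counter: AtomicInteger.incrementAndGet() is a single atomic
// operation, so concurrent callers can no longer lose updates.
public class Counter {
    private final AtomicInteger value = new AtomicInteger(0);

    public void increment() { value.incrementAndGet(); }

    public int getValue() { return value.get(); }

    public static void main(String[] args) throws InterruptedException {
        Counter counter = new Counter();
        Thread a = new Thread(() -> { for (int i = 0; i < 10_000; i++) counter.increment(); });
        Thread b = new Thread(() -> { for (int i = 0; i < 10_000; i++) counter.increment(); });
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(counter.getValue()); // prints "20000"
    }
}
```

With the plain int version from the earlier snippet, the same two-thread run would intermittently print less than 20000, which is exactly the kind of race that surfaces as flakiness.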
2. Event and Interaction Ordering: The Asynchronous Maze
Modern applications are heavily reliant on asynchronous operations: network calls, background processing, animations, user gestures, and system events. The order in which these events fire can have a dramatic impact on the application's state and, consequently, on test outcomes.
#### 2.1. Network Call Timing
Tests often interact with APIs. If a test makes an API call and then immediately tries to assert on the result before the call has completed and the UI has updated, it will fail. This is often misattributed to network latency.
Example: A test that fetches user data and then checks if the user's name is displayed.
// Pseudo-code with UI testing framework
@Test
public void testUserNameDisplay() {
    // Simulate user login and data fetch
    apiClient.fetchUserData(); // Asynchronous call

    // PROBLEM: This assertion might happen BEFORE the UI updates
    assertThat(userNameTextView.getText()).isEqualTo("John Doe");
}
The Fix:
- Explicit Waits (Intelligent, Not Blind): Instead of fixed sleep calls, use condition-based waits. Most UI testing frameworks provide mechanisms for this.
- Appium (WebDriverWait):
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
// ... in your test
WebDriver driver = ...; // Your Appium driver instance
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(30)); // Wait up to 30 seconds
// Wait for the element to be visible before asserting on its text
wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("userNameTextView")));
WebElement userNameElement = driver.findElement(By.id("userNameTextView"));
assertEquals("John Doe", userNameElement.getText());
- Playwright:
// ... in your test
await page.waitForSelector('#userNameTextView');
await expect(page.locator('#userNameTextView')).toHaveText('John Doe');
- Mock the Network Layer (MockWebServer): Serve deterministic responses locally so tests never depend on a live backend.
import java.io.IOException;
import okhttp3.mockwebserver.MockResponse;
import okhttp3.mockwebserver.MockWebServer;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class NetworkMockingTest {
    private MockWebServer mockWebServer;
    private ApiClient apiClient; // Your API client

    @Before
    public void setup() throws IOException {
        mockWebServer = new MockWebServer();
        mockWebServer.start();
        // Configure your API client to use the mock server's URL
        apiClient = new ApiClient(mockWebServer.url("/").toString());
    }

    @After
    public void tearDown() throws IOException {
        mockWebServer.shutdown();
    }

    @Test
    public void testUserDataDisplayWhenMocked() throws IOException {
        // Prepare a canned response before triggering the request
        String jsonResponse = "{\"name\": \"Jane Doe\", \"id\": 123}";
        mockWebServer.enqueue(new MockResponse().setBody(jsonResponse).setResponseCode(200));

        // Trigger the action that calls your API
        apiClient.fetchUserData(); // This call now goes to MockWebServer

        // Assert based on the mocked data, without a live network:
        // use your UI assertion mechanism (e.g., a callback or LiveData observer)
        // to verify the UI shows "Jane Doe"
    }
}
#### 2.2. Animation and Transition Timing
UI animations and transitions, while improving user experience, introduce timing dependencies. A test that tries to interact with an element that is animating into view or out of view can fail if it attempts interaction too early or too late.
Example: A test trying to tap a button that slides in from the side.
The Fix:
- Wait for Animation Completion: Most UI frameworks provide ways to detect when animations are finished.
- Android (ViewPropertyAnimator):
view.animate().translationX(500).setListener(new AnimatorListenerAdapter() {
    @Override
    public void onAnimationEnd(Animator animation) {
        // Now it's safe to interact with the view:
        // trigger your test assertion or next interaction here
    }
});
- iOS (UIView.animate completion handler):
UIView.animate(withDuration: 0.5, animations: {
    self.myView.alpha = 1.0 // Fade in
}) { _ in
    // Animation completed, safe to interact
    // Perform test assertions
}
- Disable Animations on Test Devices (adb):
adb shell settings put global window_animation_scale 0
adb shell settings put global transition_animation_scale 0
adb shell settings put global animator_duration_scale 0
Remember to reset these to 1.0 after tests.
- Espresso/AndroidX Test: Set animationsDisabled = true under android.testOptions in your Gradle build to run instrumented tests with animations off.
#### 2.3. Event Queue and Message Loop Delays
The underlying operating system and application frameworks use event queues to process user input, system events, and background tasks. Delays or unexpected orderings in this queue can lead to flakiness.
Example: A test that triggers a long-running background task and then immediately tries to interact with a UI element that depends on the task's completion. The UI thread might be busy processing other events, delaying the update.
The Fix:
- Use Framework-Specific Schedulers/Dispatchers: Ensure your application logic correctly uses the appropriate threading models. For example, Android's MainThreadExecutor or Dispatchers.Main in Kotlin Coroutines for UI updates.
- Test with Deliberate Delays: Introduce small, controlled delays in your test setup to simulate real-world event queue behavior, but avoid hardcoded sleeps. Instead, use framework mechanisms that yield control and wait for specific conditions.
- Robolectric: Robolectric provides ShadowLooper, which allows you to control the main thread's message queue.
import android.os.Handler;
import android.os.Looper;
import java.util.concurrent.TimeUnit;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.robolectric.RobolectricTestRunner;
import org.robolectric.Shadows;
import org.robolectric.shadows.ShadowLooper;

@RunWith(RobolectricTestRunner.class)
public class EventQueueTest {
    @Test
    public void testDelayedUIUpdate() {
        // ... setup your activity or view ...
        ShadowLooper looper = Shadows.shadowOf(Looper.getMainLooper());

        // Schedule a task that updates the UI after a delay
        new Handler(Looper.getMainLooper()).postDelayed(() -> {
            // Update UI element
        }, 5000); // 5-second delay

        // Advance the looper by less than 5 seconds
        looper.idleFor(3000, TimeUnit.MILLISECONDS);
        // Assert that the UI is NOT updated yet

        // Advance the looper past the scheduled delay
        looper.idleFor(2000, TimeUnit.MILLISECONDS); // Total of 5000ms advanced
        // Assert that the UI IS updated
    }
}
3. Environment and Infrastructure Dependencies: The Unseen Variables
The environment where tests run – emulators, simulators, real devices, cloud grids – introduces its own set of variables that can contribute to flakiness.
#### 3.1. Emulator/Simulator Timing and Resource Contention
Emulators and simulators are not perfect replicas of real hardware. Their performance can vary significantly based on the host machine's resources, background processes, and the emulator's own internal scheduling. This variability can lead to timing differences that break tests.
Example: A test that relies on a specific animation duration or a quick response to a system dialog might fail on a slower emulator instance.
The Fix:
- Use Real Devices: For critical end-to-end tests, running on real devices (either physical or cloud-based device farms like BrowserStack, Sauce Labs, or SUSA's offering) is often more reliable than emulators/simulators.
- Consistent Emulator/Simulator Configuration: Ensure all developers and CI agents use identical emulator/simulator images and configurations. Lock down versions.
- Resource Allocation: On CI, ensure sufficient CPU and RAM are allocated to the machines running emulators.
- Emulator Performance Tuning: Some emulators allow performance tuning. For instance, Android Emulators can be configured with specific CPU cores and RAM.
- SUSA's Autonomous Exploration: Platforms like SUSA can run on a variety of environments, including real device clouds. By abstracting away the direct management of emulators and simulators, they reduce the impact of individual environment inconsistencies. Their agents are designed to be resilient to minor timing variations.
#### 3.2. Network Stubbing and Mocking (Beyond API)
While API mocking is crucial, flakiness can also stem from other network-related factors: DNS resolution delays, intermittent connectivity drops (especially on mobile), or slow loading of static assets.
Example: A test that loads a webpage and expects certain images to be present. If the CDN hosting those images is slow or intermittently unavailable, the test might fail.
The Fix:
- Network Virtualization: Tools like Charles Proxy, mitmproxy, or network virtualization services can intercept and control network traffic. You can simulate slow connections, packet loss, or redirect requests to local stubs.
- Offline Testing: Where possible, design tests to run in an offline mode, using pre-cached assets or local mocks.
- Dedicated Test Networks: For critical infrastructure, consider using dedicated, high-bandwidth network connections for your CI/CD agents.
#### 3.3. CI/CD Environment Resource Saturation
CI/CD agents often run multiple tests or build jobs concurrently. If an agent is overloaded with CPU, memory, or disk I/O, it can cause tests to run slower, time out, or behave erratically.
Example: A build job that starts a long-running database migration and then tries to run UI tests on a machine that's already struggling.
The Fix:
- Monitor CI Agent Resources: Implement monitoring for your CI agents to detect resource bottlenecks.
- Isolate Test Suites: Run different types of tests (unit, integration, E2E) on separate, dedicated agents or at different times.
- Optimize Build Processes: Ensure your build and test scripts are efficient and don't consume excessive resources unnecessarily.
- Auto-Scaling CI Infrastructure: For cloud-based CI, leverage auto-scaling capabilities to ensure sufficient resources are available during peak times.
4. Application Logic Under Test: The Unintended Complexity
Sometimes, the application code itself is the source of non-determinism, leading to seemingly random test failures.
#### 4.1. Randomness and Non-Determinism in App Logic
Some applications intentionally incorporate randomness, for example, in game mechanics, personalized content delivery, or A/B testing features. While this might be desired for production, it's a nightmare for deterministic testing.
Example: A game test that relies on a specific sequence of random events to pass a level.
The Fix:
- Seeding Random Number Generators (RNGs): If your application uses RNGs, provide a way to seed them with a fixed value during test runs. This ensures that the "random" sequence is identical every time.
- Java: java.util.Random random = new java.util.Random(testSeed);
- Python: random.seed(test_seed)
- Feature Flags for Testability: Implement feature flags that can disable or control random behavior during testing.
- Mocking Randomness: Mock the RNG itself to return predictable sequences of numbers.
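One way to combine seeding and mocking is to inject the Random instance rather than constructing it internally: production code passes an unseeded generator, tests pass a fixed seed. A sketch, where LootDropper and its method names are illustrative:

```java
import java.util.Random;

// Illustrative: the RNG is injected, so tests can pass a seeded instance
// and get an identical "random" sequence on every run.
public class LootDropper {
    private final Random random;

    public LootDropper(Random random) { this.random = random; }

    /** Returns true on roughly a 10% "rare drop" roll. */
    public boolean rollRareDrop() { return random.nextInt(100) < 10; }

    public static void main(String[] args) {
        // Production: new LootDropper(new Random());
        // Test: a fixed seed makes the whole sequence deterministic.
        LootDropper dropper = new LootDropper(new Random(42L));
        System.out.println(dropper.rollRareDrop());
    }
}
```

Two instances seeded identically always produce identical roll sequences, so a "random" game flow becomes fully replayable in tests.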
#### 4.2. Complex Asynchronous Workflows and Event Handlers
Applications with deeply nested asynchronous operations, complex event bubbling, or intricate state machines can be inherently difficult to test deterministically. A slight variation in timing can lead to different execution paths.
Example: A multi-step wizard where user input triggers background data validation, which in turn updates UI elements and enables further steps, all asynchronously.
The Fix:
- Simplify Asynchronous Logic: Refactor complex asynchronous workflows into more manageable, observable units.
- Event Sourcing/Replay: For complex state management, consider using event sourcing. During testing, you can replay a known sequence of events to reach a specific state deterministically.
- State Machine Testing: If your application logic can be modeled as a state machine, use state machine testing tools and techniques. Ensure transitions are predictable and testable.
- SUSA's Autonomous Exploration: SUSA's agents explore the application by interacting with it as a user would, but their exploration is guided by AI to cover significant paths. This means they encounter complex asynchronous flows. Crucially, SUSA then *translates* these successful exploration paths into structured, deterministic regression scripts (Appium, Playwright). This process inherently de-flakifies the observed behavior by creating code that explicitly waits for UI states and sequences actions predictably, rather than relying on implicit, timing-dependent behavior.
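The event sourcing/replay idea above can be sketched as follows (the CartState class and event names are illustrative): because state is derived only from an ordered list of events, a test can replay a known sequence and always land in exactly the same state.

```java
import java.util.List;

// Minimal event-replay sketch: state is a pure function of the event list,
// so replaying the same events is fully deterministic.
public class CartState {
    private int itemCount = 0;

    public void apply(String event) {
        switch (event) {
            case "ITEM_ADDED":   itemCount++; break;
            case "ITEM_REMOVED": itemCount--; break;
            default: throw new IllegalArgumentException("unknown event: " + event);
        }
    }

    /** Rebuild state deterministically from an ordered event log. */
    public static CartState replay(List<String> events) {
        CartState state = new CartState();
        for (String e : events) state.apply(e);
        return state;
    }

    public int getItemCount() { return itemCount; }

    public static void main(String[] args) {
        CartState s = CartState.replay(List.of("ITEM_ADDED", "ITEM_ADDED", "ITEM_REMOVED"));
        System.out.println(s.getItemCount()); // prints "1"
    }
}
```

In a real system the events would arrive asynchronously; the point is that tests bypass the asynchronous delivery entirely and assert on the replayed state.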
#### 4.3. Third-Party SDKs and Integrations
External SDKs (analytics, ad networks, payment gateways) can introduce their own non-determinism, especially if they perform network operations or background tasks that aren't easily controlled by your tests.
Example: An analytics SDK that fires events asynchronously. A test might complete before all analytics events are sent, leading to discrepancies in reported data that might be checked by a subsequent test or monitoring system.
The Fix:
- Mocking SDK Behavior: Mock the SDK's internal behavior or its network calls. This allows you to control what it reports or does.
- SDK Configuration for Testing: Some SDKs offer specific configurations for testing environments (e.g., disabling actual network calls, using test endpoints).
- Isolate SDK Interactions: If possible, create adapter layers around third-party SDKs. This makes it easier to mock or stub their behavior during tests without altering the core application logic.
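A sketch of such an adapter layer (all names here are illustrative): the application depends on a narrow interface, and tests substitute a synchronous recording stub for the real SDK, so no asynchronous network calls are in play.

```java
import java.util.ArrayList;
import java.util.List;

// The app depends on this small interface instead of the vendor SDK directly.
interface AnalyticsTracker {
    void track(String eventName);
}

// A production adapter would delegate to the real SDK, e.g.:
// class SdkAnalyticsTracker implements AnalyticsTracker {
//     public void track(String eventName) { VendorSdk.logEvent(eventName); }
// }

/** Test double: records events synchronously instead of firing network calls. */
class RecordingTracker implements AnalyticsTracker {
    final List<String> events = new ArrayList<>();
    public void track(String eventName) { events.add(eventName); }
}

public class CheckoutFlow {
    private final AnalyticsTracker analytics;

    public CheckoutFlow(AnalyticsTracker analytics) { this.analytics = analytics; }

    public void completePurchase() {
        // ... business logic ...
        analytics.track("purchase_completed");
    }

    public static void main(String[] args) {
        RecordingTracker tracker = new RecordingTracker();
        new CheckoutFlow(tracker).completePurchase();
        System.out.println(tracker.events); // prints "[purchase_completed]"
    }
}
```

The assertion on recorded events is deterministic because nothing leaves the process; the real SDK's batching and retry behavior never enters the test.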
The Systemic Approach: Beyond Individual Fixes
While the categories above provide a taxonomy for understanding flakiness, the real solution lies in a systemic approach to test design and development.
#### 1. Test Infrastructure as Code (IaC)
Treat your testing environment configuration, device farms, and emulator setups with the same rigor as your production infrastructure. Use tools like Docker, Terraform, or Kubernetes to define and manage your testing environments consistently. This minimizes environment-related flakiness.
#### 2. Data-Driven Testing and Test Data Management
Ensure that test data is well-managed, isolated, and consistent.
- Generate Test Data: Use scripts to generate specific, known test data before test runs.
- Clean Data: Implement robust data cleanup mechanisms after tests.
- Data Isolation: For tests that modify data, ensure they operate on isolated datasets or that changes are fully reversible.
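One lightweight way to implement these points is a test-data builder with known defaults and unique IDs, so every test gets fully specified, non-colliding data (the Order class and its fields are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Test-data builder sketch: known defaults, explicit overrides, and
// unique IDs so parallel tests never collide on the same record.
public class OrderBuilder {
    private static final AtomicLong NEXT_ID = new AtomicLong(1);

    private final long id = NEXT_ID.getAndIncrement();
    private String customer = "test-customer";
    private int quantity = 1;

    public OrderBuilder customer(String customer) { this.customer = customer; return this; }
    public OrderBuilder quantity(int quantity) { this.quantity = quantity; return this; }

    public Order build() { return new Order(id, customer, quantity); }

    public static void main(String[] args) {
        Order a = new OrderBuilder().build();                        // all defaults
        Order b = new OrderBuilder().customer("alice").quantity(3).build();
        System.out.println(a.id != b.id); // prints "true" -- IDs never collide
    }
}

class Order {
    final long id;
    final String customer;
    final int quantity;
    Order(long id, String customer, int quantity) {
        this.id = id; this.customer = customer; this.quantity = quantity;
    }
}
```

Each test states only the fields it cares about; everything else is a known default, which keeps tests readable and independent of leftover data.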
#### 3. Observability in Testing
Just as you monitor production applications, monitor your test execution.
- Logging: Implement detailed logging within your tests to capture the exact sequence of actions and any errors.
- Metrics: Track test execution times, failure rates, and identify patterns of flakiness.
- Screenshots/Videos: Capture screenshots or record videos on test failures. This is invaluable for debugging.
- SUSA's autonomous agents: SUSA automatically captures screenshots and records videos for every exploration run, which are then used to generate more robust scripts. These artifacts are also available for failed scripted runs, aiding debugging.
#### 4. Shifting Left with Testability in Mind
Testability shouldn't be an afterthought.
- Design for Testability: Architects and developers should consider how components will be tested during the design phase. This includes providing hooks for state reset, dependency injection, and clear asynchronous communication patterns.
- Code Reviews: Include testability considerations in code reviews.
#### 5. Leveraging Intelligent Test Generation
Manually writing and maintaining robust test scripts for complex applications is a significant undertaking. Flakiness can creep in as tests become brittle. Platforms that can intelligently generate tests from observed user behavior offer a powerful way to create a baseline of stable regression tests.
Example: SUSA's autonomous QA platform takes an APK or URL, deploys it to its exploration environment, and uses AI-powered personas to interact with the application. During this exploration, it identifies crashes, ANRs, accessibility violations (WCAG 2.1 AA), security vulnerabilities (OWASP Mobile Top 10), and UX friction points. Crucially, for every successful exploration path that covers critical user flows, SUSA auto-generates regression scripts in formats like Appium or Playwright. These generated scripts are inherently more robust because they are derived from actual successful interactions, incorporating necessary waits and state management implicitly discovered by the AI. This significantly reduces the manual effort of creating deterministic tests and provides a strong starting point for a stable regression suite.
Furthermore, SUSA’s cross-session learning means that as the platform explores your application over time, it becomes more intelligent about your app's specific behaviors and potential failure points, leading to even more targeted and effective script generation. This continuous improvement loop helps combat the gradual introduction of flakiness that often occurs with manual script maintenance.
Conclusion: From Ephemeral Failures to Deterministic Confidence
Flaky tests are not a sign of bad luck; they are a symptom of underlying architectural issues, environmental inconsistencies, or a lack of proper test design. The common advice to simply add more waits or retries is a superficial fix that often masks deeper problems, leading to a false sense of security and increased technical debt.
By understanding the true root causes – shared state, event ordering, environmental factors, and application logic non-determinism – and by adopting systemic solutions like IaC, test observability, and designing for testability, we can move from a cycle of chasing phantom bugs to building a robust and reliable automated testing foundation. Platforms that can intelligently explore applications and generate deterministic test artifacts, like SUSA, offer a powerful strategy for creating and maintaining stable regression suites, freeing up valuable engineering time and boosting confidence in our release cycles. The goal is not to eliminate all transient failures, but to ensure that when a test fails, it fails for a clear, reproducible reason, allowing us to fix it and move forward with certainty.
Test Your App Autonomously
Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts.
Try SUSA Free