January 14, 2026

The Hidden Cost of Flaky Tests: How to Build Deterministic E2E Test Suites

Master the art of eliminating unreliable tests that pass sometimes, fail randomly, and destroy team confidence in automation


You've seen it before: a test passes on your machine, fails in CI, passes when you retry it, then fails again tomorrow morning. Your team starts ignoring test failures. Pull requests get merged with "known flaky test" comments. Before you know it, your entire test suite becomes a coin flip instead of a safety net.

Flaky tests are more than annoying—they're expensive. Teams waste hours debugging false positives, developers lose trust in automation, and worst of all, real bugs slip through when everyone assumes failures are "just flakiness." Today, we're diving deep into the root causes of test flakiness and the architectural patterns that eliminate it permanently.

The Four Horsemen of Test Flakiness

After analyzing thousands of flaky tests across Playwright, Cypress, and Selenium codebases, four patterns emerge repeatedly:

1. Race Conditions and Timing Issues

The most common culprit. Your test clicks a button before the JavaScript finishes loading, or checks a value before the API response arrives. The deadly pattern looks like this:

// ❌ FLAKY: Race condition waiting to happen
await page.click('#submit-button');
await page.click('#next-step'); // May execute before submit completes

// ✅ DETERMINISTIC: Wait for the operation to complete
// Register the wait BEFORE clicking so a fast response can't slip past it
const submitResponse = page.waitForResponse(resp =>
  resp.url().includes('/api/submit') && resp.status() === 200
);
await page.click('#submit-button');
await submitResponse;
await expect(page.locator('#next-step')).toBeVisible();
await page.click('#next-step');

The fix isn't just adding arbitrary sleep(5000) calls—that makes tests slower and still unreliable. Instead, wait for specific conditions that prove the system is ready.
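When there is no single request to wait on, you can still avoid sleeps by polling a condition. Here's a minimal sketch using Playwright's expect.poll (the /api/orders/123/status endpoint is hypothetical):

// Poll the backend until it reports the terminal state, instead of sleeping
import { test, expect } from '@playwright/test';

test('order reaches processed state', async ({ page }) => {
  await expect
    .poll(async () => {
      const res = await page.request.get('/api/orders/123/status');
      return (await res.json()).status;
    }, { timeout: 10_000 })
    .toBe('processed');
});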

2. State Pollution Between Tests

Tests pass when run individually but fail when run as a suite because they share state. Classic symptoms include:

  • Database records created by test A interfere with test B's assertions
  • LocalStorage or cookies persist across tests
  • Global JavaScript variables leak between test contexts
  • Test execution order determines pass/fail outcomes

// ❌ FLAKY: State leaks between tests
test('create user', async () => {
  await createUser({ email: 'test@example.com' });
  expect(await getUsers()).toHaveLength(1);
});

test('user list is empty', async () => {
  expect(await getUsers()).toHaveLength(0); // FAILS if previous test ran
});

// ✅ DETERMINISTIC: Isolate each test
test.beforeEach(async ({ page }) => {
  await resetDatabase();
  await page.context().clearCookies();
  await page.evaluate(() => localStorage.clear());
});

test('create user', async () => {
  await createUser({ email: 'test@example.com' });
  expect(await getUsers()).toHaveLength(1);
});

test('user list is empty', async () => {
  expect(await getUsers()).toHaveLength(0); // ALWAYS passes
});

3. Environment Dependencies

Your tests make assumptions about the environment that aren't always true. Network latency varies between dev machines and CI runners. Date/time calculations break across timezones. File paths differ between Windows and Linux.

// ❌ FLAKY: Depends on system timezone
test('shows correct date', async () => {
  const date = new Date('2026-01-14T10:00:00Z');
  expect(formatDate(date)).toBe('January 14, 2026'); // Breaks in non-UTC zones
});

// ✅ DETERMINISTIC: Pin the timezone (timezoneId: 'UTC' in playwright.config.ts)
// and freeze the browser clock so rendering never depends on the host machine
test('shows correct date', async ({ page }) => {
  await page.clock.setFixedTime(new Date('2026-01-14T10:00:00Z'));
  await page.goto('/schedule');
  await expect(page.locator('[data-testid="date"]')).toHaveText('January 14, 2026');
});

4. External Service Volatility

Tests that hit real third-party APIs, payment gateways, or authentication providers inherit all their flakiness. Service downtime, rate limits, and network hiccups turn your deterministic code into a probabilistic nightmare.

// ❌ FLAKY: Depends on external API availability
test('fetches weather data', async () => {
  const response = await fetch('https://api.weather.com/current');
  expect(response.status).toBe(200); // Fails when API is down/slow
});

// ✅ DETERMINISTIC: Mock the dependency at the network boundary
test('shows weather data', async ({ page }) => {
  // page.route intercepts requests made by the page itself
  await page.route('**/api.weather.com/**', route => {
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ temp: 72, conditions: 'sunny' })
    });
  });

  await page.goto('/weather');
  await expect(page.locator('[data-testid="temp"]')).toHaveText('72');
});

Debugging Techniques: Finding the Root Cause

When you encounter a flaky test, resist the urge to immediately add retry logic or increase timeouts. Those are band-aids. Here's the systematic debugging approach:

1. Reproduce Consistently

Run the test 50-100 times to understand failure frequency. Playwright makes this easy:

# Run test 100 times and see failure rate
for i in {1..100}; do 
  npx playwright test flaky-test.spec.ts --reporter=line
done | grep -c "failed"

# Or use Playwright's repeat-each flag
npx playwright test --repeat-each=100 flaky-test.spec.ts

If it fails roughly 5% of the time, you have classic flakiness. If it fails closer to 50%, the outcome is a near coin flip and you likely have a genuine race condition. If it never fails locally but fails in CI, suspect an environment dependency.

2. Enable Video and Trace Recording

Configure your test runner to capture artifacts only on failures:

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
    trace: 'retain-on-failure',
  },
});

Watch the video of a failed run. You'll often see exactly what happened: a button click that occurred before it was clickable, a network request that timed out, or an element that never appeared.
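Traces go further than videos: they capture a DOM snapshot, network activity, and console output for every action. Assuming Playwright's default test-results output directory, open the trace from a failed run with:

# Inspect a failed run step by step in the trace viewer
npx playwright show-trace test-results/<failed-test-dir>/trace.zip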

3. Analyze Network Timing

Most E2E flakiness is network-related. Enable detailed network logging:

// Log all network activity during test
test('flaky test', async ({ page }) => {
  page.on('request', request => {
    console.log('>>>', request.method(), request.url());
  });
  page.on('response', response => {
    console.log('<<<', response.status(), response.url());
  });
  
  // Your test code here
});

Look for patterns: Do requests complete out of order? Are responses slow? Does the failure correlate with specific API calls?
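Outright request failures deserve their own listener. Playwright's requestfailed event surfaces the error text (aborts, DNS failures, CORS) that plain request/response logs hide:

// Log hard network failures separately from ordinary traffic
page.on('requestfailed', request => {
  console.log('FAILED:', request.url(), request.failure()?.errorText);
});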

Architectural Patterns for Deterministic Tests

Once you've debugged the issues, apply these architectural patterns to prevent flakiness permanently:

Pattern 1: Smart Wait Strategies

Replace arbitrary timeouts with condition-based waits:

import { Page, expect } from '@playwright/test';

// Wait for specific network activity to complete
async function waitForDataLoaded(page: Page) {
  await page.waitForLoadState('networkidle');
  await page.waitForSelector('[data-testid="content"]', { state: 'visible' });
  await page.waitForFunction(() => {
    const spinner = document.querySelector<HTMLElement>('.loading-spinner');
    return spinner === null || spinner.style.display === 'none';
  });
}

// Wait for API-driven state changes
async function submitFormAndWait(page: Page) {
  // Register the wait before the click that triggers the request
  const responsePromise = page.waitForResponse(
    resp => resp.url().includes('/api/submit') && resp.ok()
  );

  await page.click('[data-testid="submit"]');
  await responsePromise;

  // Verify the UI updated based on the response
  await expect(page.locator('.success-message')).toBeVisible();
}

Pattern 2: Test Data Builders

Create isolated, predictable test data for each test:

// Seed unique test data per test run
class TestDataBuilder {
  // Unique suffix so parallel runs never collide
  private testId = `${Date.now()}-${Math.floor(Math.random() * 1e6)}`;

  createUser(overrides = {}) {
    return {
      email: `user-${this.testId}@test.com`,
      username: `testuser-${this.testId}`,
      ...overrides
    };
  }

  async seedDatabase() {
    // No blanket cleanup here: deleting every *@test.com record would
    // race against parallel workers. Unique IDs make stale rows harmless;
    // purge them in a global teardown instead.
    return {
      user: await db.users.create(this.createUser()),
      admin: await db.users.create(this.createUser({ role: 'admin' }))
    };
  }
}

test('user can update profile', async ({ page }) => {
  const builder = new TestDataBuilder();
  const { user } = await builder.seedDatabase();
  
  await page.goto(`/profile/${user.id}`);
  // Test uses isolated data, no conflicts
});

Pattern 3: Service Layer Mocking

Mock external dependencies at the network boundary:

// Create reusable mock handlers
class MockServiceLayer {
  constructor(private page: Page) {}
  
  async mockPaymentSuccess() {
    await this.page.route('**/api/stripe/**', route => {
      route.fulfill({
        status: 200,
        contentType: 'application/json',
        body: JSON.stringify({
          id: 'pi_test_123',
          status: 'succeeded',
          amount: 1000
        })
      });
    });
  }
  
  async mockAuthService(userRole: string) {
    await this.page.route('**/api/auth/**', route => {
      route.fulfill({
        status: 200,
        body: JSON.stringify({ user: { role: userRole, id: '123' } })
      });
    });
  }
}

test('checkout flow', async ({ page }) => {
  const mocks = new MockServiceLayer(page);
  await mocks.mockPaymentSuccess();
  
  // Test is deterministic—no real payment API called
});

Pattern 4: Idempotent Test Setup

Make setup operations safe to run multiple times:

// Idempotent setup: safe to run regardless of current state
async function ensureTestUserExists(email: string) {
  const existing = await db.users.findOne({ email });
  if (existing) {
    await db.users.update({ email }, { verified: true, loginCount: 0 });
    return existing;
  }
  return await db.users.create({ email, verified: true });
}

test.beforeEach(async () => {
  // Always returns same result, no matter how many times it runs
  const user = await ensureTestUserExists('test@example.com');
  await loginAs(user);
});

Metrics: Tracking and Preventing Flakiness

You can't improve what you don't measure. Implement flakiness tracking:

// Track test outcomes over time
interface TestResult {
  testName: string;
  status: 'passed' | 'failed';
  duration: number;
  timestamp: Date;
  commitSha: string;
}

// Calculate flakiness rate (groupBy comes from lodash)
import { groupBy } from 'lodash';

function calculateFlakiness(results: TestResult[]) {
  const byTest = groupBy(results, r => r.testName);
  
  return Object.entries(byTest).map(([testName, runs]) => {
    const totalRuns = runs.length;
    const failures = runs.filter(r => r.status === 'failed').length;
    const flakiness = (failures / totalRuns) * 100;
    
    return {
      testName,
      flakiness: `${flakiness.toFixed(1)}%`,
      shouldInvestigate: flakiness > 0 && flakiness < 100
    };
  });
}

Set up CI monitoring to automatically flag tests with failure rates between 1% and 99%: these are your flaky tests. A test that fails 100% of the time has a legitimate bug; a test that passes 100% of the time is reliable.
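As a sketch, the calculateFlakiness output above can gate a CI monitoring job (loading the historical results is left abstract here):

// Fail the monitoring job when any test lands in the 1-99% band
const flaky = calculateFlakiness(results).filter(r => r.shouldInvestigate);
if (flaky.length > 0) {
  console.table(flaky); // testName + flakiness rate for triage
  process.exit(1);      // surface the alert in CI
}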

Quarantine Strategy

When you identify flaky tests but can't fix them immediately, quarantine them:

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      name: 'stable',
      testMatch: /^(?!.*\.quarantine\.spec\.ts$).*\.spec\.ts$/,
    },
    {
      name: 'quarantine',
      testMatch: /.*\.quarantine\.spec\.ts$/,
      retries: 3, // Allow retries only for quarantined tests
    },
  ],
});

Run stable tests on every commit. Run quarantined tests nightly with retry logic. This prevents flaky tests from blocking PRs while still alerting you to real regressions.
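With the projects split as above, the CI wiring is two commands (project names must match the config):

# Every commit: stable tests only, no retries
npx playwright test --project=stable

# Nightly: quarantined tests, with the retries configured above
npx playwright test --project=quarantine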

The Flakiness Elimination Checklist

Before marking any test complete, verify it passes this checklist:

  • Runs 100 times consecutively without failure on your local machine
  • Uses explicit waits for all async operations (network, animations, state changes)
  • Cleans up its own state in beforeEach/afterEach hooks
  • Mocks external services that aren't under your control
  • Uses unique test data that won't collide with parallel test runs
  • Passes in CI environment with same success rate as local
  • Works regardless of execution order when the runner randomizes or reorders tests
  • Handles timing variations between fast dev machines and slower CI runners

Framework-Specific Best Practices

Playwright

  • Use page.waitForLoadState('networkidle') after navigation
  • Leverage auto-waiting: Playwright waits for elements to be actionable before clicking
  • Use test.describe.configure() to run tests in serial when needed (sketch below)
  • Enable fullyParallel: true to catch state pollution issues early
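
For the serial-mode bullet above, a minimal sketch:

test.describe('checkout flow', () => {
  // Run these order-dependent tests one after another, even when the
  // rest of the suite is fully parallel
  test.describe.configure({ mode: 'serial' });

  test('adds item to cart', async ({ page }) => { /* ... */ });
  test('completes payment', async ({ page }) => { /* ... */ });
});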

Cypress

  • Chain commands properly: cy.get().should().click() retries the entire chain
  • Use cy.intercept() to stub network requests reliably (sketch below)
  • Avoid cy.wait(5000)—use cy.wait('@apiAlias') instead
  • Clear cookies/localStorage in beforeEach() hook
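
A sketch of the cy.intercept() bullet above (the /api/users endpoint and selector are hypothetical):

// Stub the request, then wait on the alias rather than a fixed sleep
cy.intercept('GET', '/api/users', {
  statusCode: 200,
  body: [{ id: 1, name: 'Ada' }],
}).as('getUsers');

cy.visit('/users');
cy.wait('@getUsers');
cy.get('[data-testid="user-list"]').should('be.visible');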

Selenium

  • Use WebDriverWait with explicit conditions, never implicit waits (sketch below)
  • Implement custom ExpectedConditions for complex scenarios
  • Always call driver.quit() in teardown to prevent session leaks
  • Use headless mode in CI but keep headed mode for debugging
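
And a sketch of the explicit-wait bullet, using the Node selenium-webdriver bindings to match the rest of this post:

import { Builder, By, until } from 'selenium-webdriver';

const driver = await new Builder().forBrowser('chrome').build();
try {
  await driver.get('https://app.example.com');
  // Explicit waits: block until the element exists AND is visible, up to 10s
  const el = await driver.wait(
    until.elementLocated(By.css('[data-testid="content"]')), 10_000
  );
  await driver.wait(until.elementIsVisible(el), 10_000);
} finally {
  await driver.quit(); // always release the session
}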

Conclusion: The Trust Equation

Deterministic tests aren't just about eliminating false positives; they're about building trust. When your test suite is reliable, developers stop ignoring failures. When tests are fast and stable, teams run them before every commit instead of after. When automation catches real bugs consistently, QA becomes a force multiplier instead of a bottleneck.

The path from flaky to deterministic isn't quick, but it's systematic. Identify root causes using video traces and network logs. Apply architectural patterns that eliminate timing dependencies and state pollution. Track metrics to prevent regression. Most importantly, never accept "it's just flaky" as a permanent state—every unreliable test is a solvable problem.

Start today: pick your flakiest test, run it 100 times, watch the failure recording, and apply the patterns above. Once you see a test go from 5% failure rate to 0%, you'll never accept flakiness again.

Need Help Building Deterministic Test Suites?

Desplega AI helps teams eliminate flaky tests and build reliable CI/CD pipelines. Our agentic workflow orchestration ensures your test automation is deterministic, fast, and trustworthy.

Explore Desplega AI →