Flaky test audit#7418

Draft
CharlieTLe wants to merge 30 commits into cortexproject:master from CharlieTLe:flaky-test-audit
Conversation

@CharlieTLe
Member

Summary

  • Systematic audit to discover flaky tests in the Cortex test suite
  • This branch is identical to master with no test code changes — any test failure is by definition a flaky test
  • Tracking files in flaky-tests/ document each flaky test with build logs, job links, and occurrence count
  • Tests that flake 3+ times will be auto-skipped with t.Skip()
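
The threshold rule in the last bullet can be sketched as a small helper. This is only an illustration: `skipIfFlaky`, `knownFlakes`, `flakeThreshold`, the `skipper` interface, and the test names are hypothetical and do not come from this branch. `*testing.T` satisfies `skipper`, so a real test would pass `t` directly; the `recorder` type exists only so the sketch runs outside `go test`.

```go
package main

import "fmt"

// skipper is the subset of *testing.T used here (Helper and Skipf).
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// flakeThreshold mirrors the "3+ times" rule from the summary.
const flakeThreshold = 3

// knownFlakes is a hypothetical occurrence table; the PR tracks
// occurrences in flaky-tests/<TestName>.md instead.
var knownFlakes = map[string]int{
	"TestExampleFlaky": 3,
	"TestExampleRare":  1,
}

// skipIfFlaky skips the calling test once it has flaked often enough.
func skipIfFlaky(t skipper, name string) {
	t.Helper()
	if n := knownFlakes[name]; n >= flakeThreshold {
		t.Skipf("flaky: %s failed %d times on an unmodified branch", name, n)
	}
}

// recorder stands in for *testing.T so the sketch runs as a plain program.
type recorder struct{ msgs []string }

func (r *recorder) Helper() {}
func (r *recorder) Skipf(format string, args ...any) {
	r.msgs = append(r.msgs, fmt.Sprintf(format, args...))
}

func main() {
	r := &recorder{}
	skipIfFlaky(r, "TestExampleFlaky") // at the threshold: recorded as skipped
	skipIfFlaky(r, "TestExampleRare")  // below the threshold: not skipped
	for _, m := range r.msgs {
		fmt.Println(m)
	}
}
```

Unlike the recorder, the real `Skipf` also stops the calling test, so the helper would be called as the first statement of a test function.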

Tracking

  • flaky-tests/audit-log.md — timestamped log of every CI run result
  • flaky-tests/<TestName>.md — one file per flaky test with failure details

This PR will never be merged. It exists as a living audit trail.

Add flaky-tests/audit-log.md to track CI runs on this branch.
Any test failure here is a flaky test since no test logic
has been modified from master.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Detected flaky test in CI run 24314068155. The subtest
maxT_well_after_lookback_boundary failed under -race on amd64 but
passed on arm64 and without -race. The root cause is a timing
sensitivity: time.Now() drifts between test setup and the code under
test.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CharlieTLe and others added 5 commits April 12, 2026 12:47
CI run 24314518781 completed with all jobs passing.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Detected flaky test in CI run 24314927948. TestQueueConcurrency in
pkg/scheduler/queue timed out after 30m on arm64 with -race. Root
cause is a deadlock where dequeueRequest blocks forever on a channel
when the queue is drained or deleted by concurrent goroutines.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
… occurrence #2

CI run 24315645679: same timing-sensitive test failed again on amd64
with -race. This is occurrence #2 of 3 before auto-skip.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…ndary and TestQueueConcurrency

Auto-skip flaky tests after first occurrence:

- TestDistributorQuerier_QueryIngestersWithinBoundary: timing-sensitive
  test where time.Now() drifts between test setup and code under test
  (2 occurrences on amd64 with -race)

- TestQueueConcurrency: deadlock where dequeueRequest blocks forever
  when queue is drained/deleted by concurrent goroutines (1 occurrence
  on arm64 with -race, 30m timeout)

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…aryMode

New flaky test from CI run 24316060467: integration test failed on
arm64 because the Docker container (e2e-cortex-test-consul)
disappeared mid-test. Transient CI infrastructure issue, not a code bug.

Also updated audit log with run #5 and #6 results.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 12, 2026
CharlieTLe and others added 18 commits April 13, 2026 13:32
New flaky test from CI run 24316771541: integration test failed on
amd64 due to widespread Docker container disappearance mid-test.
Same transient CI infrastructure pattern as
TestQuerierWithBlocksStorageRunningInSingleBinaryMode.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…rsWithinBoundary

Upstream fix (cortexproject#7419) injected a clock to eliminate timing drift.
Removing our t.Skip since the root cause is properly fixed.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
New flaky tests from CI run 24365433548:

- TestRuler_rules_limit: alert state race — test expects "unknown"
  but ruler evaluates to "inactive" before assertion under -race

- TestParquetFuzz: non-deterministic fuzz test, 1 of N random queries
  failed

Also: requires_docker job failed due to Docker install infra issue
(not a test failure).

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…entionPeriod and TestRuler_rules

New flaky tests from CI run 24366476213:

- TestBlocksCleaner_ShouldRemoveBlocksOutsideRetentionPeriod:
  assertion failures on arm64 no-race in pkg/compactor

- TestRuler_rules: same alert state race as TestRuler_rules_limit,
  "state":"unknown" vs "state":"inactive" in configs-db job

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…one job

CI run 24524134384: all test jobs passed. Only failure was
integration_overrides Docker install hitting Docker Hub rate limit
(toomanyrequests). Not a test issue.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CI run 24525138727: all test jobs passed. Only failure was
requires_docker Docker install hitting Docker Hub rate limit.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
New flaky test from CI run 24526116820: integration test on arm64
expected HTTP 422 but got 500, likely because of a race in
limit-checking initialization.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CI run 24527178425: all jobs passed including all integration tests.
No flaky tests detected, no infra failures.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CI run 24528187016: all jobs passed. Two consecutive fully clean runs
with all 8 skipped flaky tests and no new failures.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CI run 24529191249: all jobs passed. Three consecutive fully clean
runs. The flaky test audit has stabilized with 9 flaky tests
identified (1 fixed upstream, 8 skipped).

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
New flaky test from CI run 24534699120: token spread distance error
0.01097 barely exceeded the 0.01 threshold. Non-deterministic due to
randomized token generation.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
New flaky test from CI run 24536561113: non-deterministic fuzz test
on arm64 with Docker container disappearance.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CharlieTLe and others added 5 commits April 16, 2026 18:02
New flaky test from CI run 24537394343: race condition in concurrent
push with zero timestamp out-of-bounds error on arm64.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CI run 24576043912: all jobs passed after merging upstream/master
(cortexproject#7424 regex resolver fix, cortexproject#7429 memberlist WatchPrefix fix).

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…WithSharding

New flaky test from CI run 24577873547: ring membership race on arm64
with -race. Instance goes missing and gets re-added, causing assertion
failure.

Signed-off-by: Charlie Le <charlie.le@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>