Objective
The current benchmark system only compares against hardcoded thresholds (e.g., cold start ≤ 20000ms). Add historical result storage so benchmarks can detect relative regressions — when performance degrades compared to recent runs, even if absolute thresholds are still met.
Context
Currently in `scripts/ci/benchmark-performance.ts`:

```ts
const THRESHOLDS: Record<string, { target: number; critical: number }> = {
  "container_startup_cold": { target: 15000, critical: 20000 },
  // ...
};
```
These static thresholds will never catch a 50% slowdown that stays under the critical limit, and they don't reflect actual baseline performance of the current hardware/environment. Historical comparison would surface gradual regressions.
With only 5 iterations per metric (see #1864), P95 equals the max value — statistical significance is low. Historical trending across runs compensates for this by building a larger effective sample size over time.
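To see why 5 iterations is too few, here is a minimal sketch assuming the script computes P95 with the common nearest-rank method (an assumption about the actual implementation, not a quote from it):

```ts
// Nearest-rank P95: with n = 5 samples the index is
// ceil(0.95 * 5) - 1 = 4, i.e. the largest sample — P95 === max.
function p95(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[idx];
}

console.log(p95([12, 15, 11, 14, 30])); // 30 — the single outlier becomes the P95
```

One noisy run therefore dominates the reported P95, which is exactly what a rolling history smooths out.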
Approach
Use GitHub Actions Cache to store a rolling history of benchmark results:
- After each successful benchmark run, append the new result to a `benchmark-history.json` file stored in the Actions Cache (keyed by branch, e.g., `benchmark-history-main`).
- Before running benchmarks, restore the history cache to get the rolling baseline.
- In the benchmark script or a new `compare-benchmarks.ts` script, compare the current P95 against the rolling mean of the last N (e.g., 10) runs. Flag a regression if the current P95 exceeds `rollingMean * 1.25` (25% slower).
- Update the workflow to include cache save/restore steps. (Actions caches are immutable, so the save step uses a unique key suffixed with `github.run_id`; the restore step then picks up the newest matching entry via `restore-keys` prefix matching.)

```yaml
- name: Restore benchmark history
  uses: actions/cache/restore@v4
  with:
    path: benchmark-history.json
    key: benchmark-history-${{ github.ref_name }}
    restore-keys: benchmark-history-main
- name: Run benchmarks
  # ...
- name: Update benchmark history
  run: |
    npx tsx scripts/ci/update-benchmark-history.ts benchmark-results.json benchmark-history.json
- name: Save benchmark history
  uses: actions/cache/save@v4
  with:
    path: benchmark-history.json
    key: benchmark-history-${{ github.ref_name }}-${{ github.run_id }}
```
- Add to Step Summary: show trend arrows (↑↓) comparing current results to historical average.
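The comparison and trend-arrow logic above could be sketched as follows. The `HistoryEntry` shape, the 10-run window, and the ±5% band for the arrows are illustrative assumptions, not the actual script:

```ts
interface HistoryEntry {
  runId: string;
  results: Record<string, { p95: number }>; // metric name -> stats (assumed shape)
}

const WINDOW = 10;      // rolling window of recent runs
const THRESHOLD = 1.25; // flag if >25% slower than the rolling mean

// Compare a metric's current P95 against the rolling mean of recent runs.
// Returns null when there is no history yet, so the first run cannot fail.
function checkRegression(
  history: HistoryEntry[],
  metric: string,
  currentP95: number,
): { rollingMean: number; regressed: boolean } | null {
  const recent = history
    .slice(-WINDOW)
    .map((e) => e.results[metric]?.p95)
    .filter((v): v is number => typeof v === "number");
  if (recent.length === 0) return null;
  const rollingMean = recent.reduce((a, b) => a + b, 0) / recent.length;
  return { rollingMean, regressed: currentP95 > rollingMean * THRESHOLD };
}

// Trend arrow for the Step Summary; runs within ±5% of the mean count as flat.
function trendArrow(currentP95: number, rollingMean: number): string {
  if (currentP95 > rollingMean * 1.05) return "↑"; // slower than baseline
  if (currentP95 < rollingMean * 0.95) return "↓"; // faster than baseline
  return "→";
}
```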
Files to Create/Modify
- Create:
  - `scripts/ci/update-benchmark-history.ts` — merges new results into history, trims to last 30 runs
- Modify:
  - `scripts/ci/benchmark-performance.ts` — accept optional baseline file for relative regression check
  - `.github/workflows/performance-monitor.yml` — add cache save/restore steps and historical comparison
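A minimal sketch of what `update-benchmark-history.ts` might do; the history entry shape, CLI argument order, and the `GITHUB_RUN_ID` fallback are assumptions about a reasonable design, not the final implementation:

```ts
import { existsSync, readFileSync, writeFileSync } from "node:fs";

const MAX_ENTRIES = 30; // acceptance criterion: history never grows unbounded

interface HistoryEntry { runId: string; timestamp: string; results: unknown }

// Append the latest results and drop the oldest entries beyond the cap.
function updateHistory(history: HistoryEntry[], entry: HistoryEntry): HistoryEntry[] {
  return [...history, entry].slice(-MAX_ENTRIES);
}

// CLI: update-benchmark-history.ts <results.json> <history.json>
const [resultsPath, historyPath] = process.argv.slice(2);
if (resultsPath && historyPath) {
  // First run: no history file restored from cache yet — start from [].
  const history: HistoryEntry[] = existsSync(historyPath)
    ? JSON.parse(readFileSync(historyPath, "utf8"))
    : [];
  const entry: HistoryEntry = {
    runId: process.env.GITHUB_RUN_ID ?? "local",
    timestamp: new Date().toISOString(),
    results: JSON.parse(readFileSync(resultsPath, "utf8")),
  };
  writeFileSync(historyPath, JSON.stringify(updateHistory(history, entry), null, 2));
}
```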
Acceptance Criteria
- Benchmark history is persisted across workflow runs via Actions Cache
- Relative regressions (>25% slower than rolling mean) are detected and reported
- Step Summary shows trend comparison (current vs. historical average)
- History is trimmed to avoid unbounded growth (max 30 entries)
- First run (no history) works correctly without errors
- Epic: [Long-term] Add performance benchmarking suite #240
- Iterations: Bump benchmark iterations from 5 to 30 and wire up workflow input #1864
- Daily schedule: Run performance benchmarks daily instead of weekly #1865
- Startup analysis: Container startup benchmarks fail: 15+ seconds of idle wait in warm startup path #1850