From 3abadc45744ca77a9f43f3f5f72956f392fd473a Mon Sep 17 00:00:00 2001
From: Alex Gorevski <algore@microsoft.com>
Date: Wed, 18 Feb 2026 21:30:09 -0800
Subject: [PATCH] remove cost optimization analysis doc

---
 docs/ci-cost-optimization.md | 295 -----------------------------------
 1 file changed, 295 deletions(-)
 delete mode 100644 docs/ci-cost-optimization.md

diff --git a/docs/ci-cost-optimization.md b/docs/ci-cost-optimization.md
deleted file mode 100644
index 2485483..0000000
--- a/docs/ci-cost-optimization.md
+++ /dev/null
@@ -1,295 +0,0 @@
-# CI Cost Optimization — February 2026
-
-> **Date:** 2026-02-18
-> **Status:** Implemented
-> **Impact:** ~60-65% reduction in estimated monthly GitHub Actions billable minutes
-
----
-
-## Executive Summary
-
-On February 17, 2026, the ZeroClaw repository triggered **400+ workflow runs** in a single day, consuming an estimated **398 billable minutes** (~6.6 hours). At this rate, monthly usage was projected at **~200 hours/month** (~12,000 billable minutes). This document describes the analysis performed, the optimizations implemented, and the revised CI/CD architecture.
-
----
-
-## Analysis Methodology
-
-A Python script (`scripts/ci/fetch_actions_data.py`) was written to fetch and analyze all GitHub Actions workflow runs for February 17, 2026. The script:
-
-1. Fetched all completed workflow runs for the date via the GitHub REST API
-2. Grouped runs by workflow name
-3. Sampled job-level timing (up to 3 runs per workflow) to compute per-job durations
-4. Extrapolated the sampled durations across all runs to estimate total billable minutes per workflow
-
-### Raw Data Summary (February 17, 2026)
-
-| Rank | Workflow | Runs/Day | Est. Minutes/Day | Primary Trigger |
-|------|----------|----------|-------------------|-----------------|
-| 1 | Rust Package Security Audit | 57 | 102 | Every PR + push |
-| 2 | CI Run | 57 | 70 | Every PR + push |
-| 3 | Performance Benchmarks | 15 | 63 | Every push to main |
-| 4 | Docker | 20 | 63 | PR + push |
-| 5 | PR Labeler | 69 | 20 | Every PR event |
-| 6 | Feature Matrix | 3 | 19 | Push to main |
-| 7 | Integration / E2E Tests | 15 | 17 | Every push to main |
-| 8 | Workflow Sanity | 31 | 16 | Push + PR |
-| 9 | Copilot Code Review | 6 | 14 | Dynamic |
-| 10 | PR Intake Checks | 70 | 7 | Every PR event |
-| 11 | PR Auto Responder | 47 | 4 | PR + issues |
-| | **Total** | **400+** | **~398** | |
-
-### Key Findings
-
-- **15 pushes to main in ~2 hours** on Feb 17, each triggering 6-8 parallel workflows
-- **Security Audit** was the single largest cost driver (102 min/day) with no path filtering
-- **PR Auto Responder** had an **81% failure rate** (38/47 runs failing) — wasting runner time
-- **CodeQL** ran twice daily on a schedule, adding ~3.5h/week (schedule-only, so not captured in the Feb 17 data)
-- **Benchmarks** ran on every push to main (15x in one day) despite being regression-focused
-- **Dependabot** could generate up to 11 PRs/week, each triggering the full CI cascade
-
----
-
-## Changes Implemented
-
-### 1. Security Audit — Path Filters Added
-
-**File:** `.github/workflows/sec-audit.yml`
-
-**Before:** Ran on every PR and every push to main, regardless of what files changed.
-
-**After:** Only runs when dependency or source files change:
-- `Cargo.toml`, `Cargo.lock`, `src/**`, `crates/**`, `deny.toml`
-
-**Weekly schedule retained** as a safety net for advisory database updates.
-
-**Estimated savings:** ~60-70% of security audit runs eliminated (~30-35 hours/month)
-
-### 2. Performance Benchmarks — Moved to Weekly Schedule
-
-**File:** `.github/workflows/test-benchmarks.yml`
-
-**Before:** Ran on every push to main (15x/day on Feb 17).
-
-**After:** Runs weekly (Monday 3am UTC) + on-demand via `workflow_dispatch`.
-
-**Artifact retention** reduced from 30 days to 7 days to lower storage costs.
-
-**Rationale:** Benchmark regressions don't need per-commit detection. Weekly cadence catches regressions within one development cycle.
-
-**Estimated savings:** ~90% reduction (~28 hours/month)
-
-### 3. Docker PR Smoke Builds — Tightened Path Filters
-
-**File:** `.github/workflows/pub-docker-img.yml`
-
-**Before:** PR smoke builds triggered on any change to `src/**`, `crates/**`, `benches/**`, `firmware/**`, etc.
-
-**After:** PR smoke builds only trigger on Docker-specific files:
-- `Dockerfile`, `.dockerignore`, `docker-compose.yml`, `rust-toolchain.toml`, `dev/config.template.toml`, `.github/workflows/pub-docker-img.yml`
-
-**Push-to-main triggers unchanged** — production Docker images still rebuild on source changes.
-
-**Estimated savings:** ~40-50% fewer Docker smoke builds (~12-15 hours/month)
-
-### 4. CodeQL — Reduced from Twice-Daily to Weekly
-
-**File:** `.github/workflows/sec-codeql.yml`
-
-**Before:** Ran twice daily at 6am and 6pm UTC (14 runs/week), each performing a full `cargo build --workspace --all-targets`.
-
-**After:** Runs weekly (Monday 6am UTC) + on-demand.
-
-**Rationale:** CodeQL for Rust is still maturing, and a weekly scan is a common cadence for scheduled security analysis. On-demand dispatch remains available for urgent scans.
-
-**Estimated savings:** ~12 hours/month
-
-### 5. CI Run — Merged Lint Jobs + Dropped `--release` Build
-
-**File:** `.github/workflows/ci-run.yml`
-
-**Changes:**
-1. **Merged `lint` and `lint-strict-delta` into a single job:** Previously these were two separate parallel jobs, each requiring a full runner spin-up, Rust toolchain install, and cache restore. Now they run sequentially in one job.
-2. **Dropped `--release` flag from smoke build:** `cargo build --release` is 2-3x slower than a debug build because of optimizations. For a smoke check that only validates compilation, a debug build is sufficient.
-
-**Estimated savings:** ~1 runner job per CI invocation + faster build times
-
-### 6. Feature Matrix — Weekly-Only + Check-Only
-
-**File:** `.github/workflows/feature-matrix.yml`
-
-**Before:** Ran on every push to main touching `src/**` (3x on Feb 17) with 4 matrix entries, each running both `cargo check` AND `cargo test`.
-
-**After:**
-1. **Removed push trigger** — Now weekly-only (Monday 4:30am UTC) + on-demand
-2. **Removed `cargo test`** — Only runs `cargo check --locked` per feature combination. Tests are already covered by the main CI Run workflow.
-
-**Estimated savings:** ~50-75% of feature matrix compute eliminated
-
-### 7. Lightweight Jobs Moved to `ubuntu-latest`
-
-**Files affected:**
-- `.github/workflows/pr-check-stale.yml`
-- `.github/workflows/pr-check-status.yml`
-- `.github/workflows/pr-auto-response.yml`
-- `.github/workflows/pr-intake-checks.yml`
-- `.github/workflows/pr-labeler.yml`
-- `.github/workflows/sync-contributors.yml`
-
-**Before:** All jobs used `blacksmith-2vcpu-ubuntu-2404` runners, even for lightweight API-only operations (labeling, stale checks, greetings).
-
-**After:** Moved to `ubuntu-latest` (GitHub-hosted runners). These jobs only make API calls and run JavaScript scripts — they don't need Rust toolchains or specialized runners.
-
-**Additional change:** `pr-check-status.yml` schedule reduced from every 12 hours to once daily (8:15am UTC).
-
-### 8. Dependabot — Reduced Frequency and PR Limits
-
-**File:** `.github/dependabot.yml`
-
-**Before:**
-- Cargo: weekly, 5 open PRs max
-- GitHub Actions: weekly, 3 open PRs max
-- Docker: weekly, 3 open PRs max
-- Total: up to 11 Dependabot PRs/week, each triggering full CI
-
-**After:**
-- Cargo: **monthly**, 3 open PRs max, all deps grouped into single PR
-- GitHub Actions: **monthly**, 1 open PR max, all grouped
-- Docker: **monthly**, 1 open PR max, all grouped
-- Total: up to 5 Dependabot PRs/month
-
-**Rationale:** Each Dependabot PR triggers the full CI pipeline. Moving from weekly to monthly updates and grouping them into fewer PRs cuts the worst case from ~44 to 5 Dependabot PRs per month while still keeping dependencies current.
-
----
-
-## Known Issues to Investigate
-
-### PR Auto Responder — 81% Failure Rate
-
-The `pr-auto-response.yml` workflow had 38 failures out of 47 runs on Feb 17. The `contributor-tier-issues` job fires on every issue `labeled`/`unlabeled` event, even when the label is not contributor-tier related. While the JavaScript handler exits early for non-tier labels, the runner still spins up and checks out the repository.
-
-**Recommendations for further investigation:**
-1. Add more specific event filtering at the workflow level to reduce unnecessary runs
-2. Check if the failures are related to GitHub API rate limiting on the search endpoint
-3. Consider whether `continue-on-error: true` should be added to non-critical jobs
-
----
-
-## Revised Workflow Architecture
-
-### Workflow Frequency Overview
-
-| Workflow | Trigger | Runner |
-|----------|---------|--------|
-| **CI Run** | Push to main + PR | Blacksmith |
-| **Sec Audit** | Push/PR (path-filtered) + weekly schedule | Blacksmith |
-| **Sec CodeQL** | Weekly schedule | Blacksmith |
-| **Test E2E** | Push to main | Blacksmith |
-| **Test Benchmarks** | Weekly schedule | Blacksmith |
-| **Test Fuzz** | Weekly schedule | Blacksmith |
-| **Feature Matrix** | Weekly schedule | Blacksmith |
-| **Docker Publish** | Push to main (broad paths) + PR (Docker-only paths) | Blacksmith |
-| **Release** | Tag push only | GitHub-hosted |
-| **Workflow Sanity** | Push/PR (workflow paths only) | Blacksmith |
-| **Label Policy** | Push/PR (policy paths only) | Blacksmith |
-| **PR Labeler** | PR events | **ubuntu-latest** |
-| **PR Intake Checks** | PR events | **ubuntu-latest** |
-| **PR Auto Responder** | PR + issue events | **ubuntu-latest** |
-| **PR Check Stale** | Daily schedule | **ubuntu-latest** |
-| **PR Check Status** | Daily schedule | **ubuntu-latest** |
-| **Sync Contributors** | Weekly schedule | **ubuntu-latest** |
-
-### Weekly Schedule Summary
-
-| Day | Time (UTC) | Workflow |
-|-----|-----------|----------|
-| Monday | 03:00 | Test Benchmarks |
-| Monday | 04:30 | Feature Matrix |
-| Monday | 06:00 | Sec Audit (schedule) |
-| Monday | 06:00 | Sec CodeQL |
-| Sunday | 00:00 | Sync Contributors |
-| Sunday | 02:00 | Test Fuzz |
-| Daily | 02:20 | PR Check Stale |
-| Daily | 08:15 | PR Check Status |
-
-### CI Run Job Dependency Graph
-
-```
-changes ──┬── lint (Format + Clippy + Strict Delta)
-           │     └── test
-           ├── build (Smoke, debug mode)
-           ├── docs-only (fast path)
-           ├── non-rust (fast path)
-           ├── docs-quality
-           └── workflow-owner-approval
-
-All above ──── ci-required (final gate)
-```
-
-### Push-to-Main Trigger Cascade
-
-When code is pushed to `main`, the following workflows trigger:
-
-1. **CI Run** — Always (change-detection gates individual jobs)
-2. **Sec Audit** — Only if `Cargo.toml`, `Cargo.lock`, `src/**`, `crates/**`, or `deny.toml` changed
-3. **Test E2E** — Always
-4. **Docker Publish** — Only if broad source paths changed
-5. **Workflow Sanity** — Only if workflow files changed
-
-**No longer triggered on push:**
-- ~~Performance Benchmarks~~ → Weekly only
-- ~~Feature Matrix~~ → Weekly only
-
----
-
-## Estimated Impact
-
-| Metric | Before | After | Savings |
-|--------|--------|-------|---------|
-| Daily workflow runs | 400+ | ~150-180 | ~55-60% |
-| Daily billable minutes | ~400 min | ~120-150 min | ~60-65% |
-| Monthly billable hours | ~200 hours | ~60-75 hours | ~60-65% |
-| Dependabot PRs/month | ~44 | ~5 | ~89% |
-| CodeQL runs/week | 14 | 1 | ~93% |
-| Benchmark runs/day | ~15 | 0 (weekly: ~1) | ~99% |
-
----
-
-## Rollback Strategy
-
-Each change is isolated to its own workflow or configuration file(s). To roll back any specific optimization:
-
-1. **Revert the specific file** using `git checkout <commit>^ -- <file-path>`
-2. Changes are backward-compatible — no downstream code or configuration depends on the CI schedule/trigger changes
-3. All workflows retain `workflow_dispatch` triggers for manual invocation when needed
-
----
-
-## Validation Checklist
-
-- [ ] Verify CI Run workflow passes on next PR with Rust changes
-- [ ] Verify Security Audit skips docs-only PRs
-- [ ] Verify Docker smoke build only triggers on Docker-specific file changes in PRs (`Dockerfile`, `.dockerignore`, `docker-compose.yml`, etc.)
-- [ ] Verify weekly schedules fire correctly (check after first Monday)
-- [ ] Monitor PR Auto Responder failure rate after switching to `ubuntu-latest`
-- [ ] Verify Dependabot respects new monthly schedule and limits
-
----
-
-## Files Modified
-
-| File | Change Summary |
-|------|---------------|
-| `.github/workflows/sec-audit.yml` | Added path filters for push and PR triggers |
-| `.github/workflows/test-benchmarks.yml` | Changed to weekly schedule; reduced artifact retention to 7 days |
-| `.github/workflows/pub-docker-img.yml` | Tightened PR path filters to Docker-specific files |
-| `.github/workflows/sec-codeql.yml` | Changed from twice-daily to weekly schedule |
-| `.github/workflows/ci-run.yml` | Merged lint jobs; dropped `--release` from smoke build |
-| `.github/workflows/feature-matrix.yml` | Removed push trigger; removed `cargo test` step |
-| `.github/workflows/pr-check-stale.yml` | Switched to `ubuntu-latest` |
-| `.github/workflows/pr-check-status.yml` | Switched to `ubuntu-latest`; reduced to daily schedule |
-| `.github/workflows/pr-auto-response.yml` | Switched all jobs to `ubuntu-latest` |
-| `.github/workflows/pr-intake-checks.yml` | Switched to `ubuntu-latest` |
-| `.github/workflows/pr-labeler.yml` | Switched to `ubuntu-latest` |
-| `.github/workflows/sync-contributors.yml` | Switched to `ubuntu-latest` |
-| `.github/dependabot.yml` | Changed to monthly schedule; reduced PR limits; grouped all deps |
-| `scripts/ci/fetch_actions_data.py` | New: cost analysis script for GitHub Actions runs |