fix(opencode): close two false-green test loopholes and the orchestrator-as-implementer escape hatch

A workflow run on a Bevy weaving feature exposed two compounding
failures:

1. @test wrote 8 structural-only Rust tests that never invoked
   weave_enemies or trigger_weaving. Every test passed against the
   stub-first @make pre-pass because none of them called the
   stubbed symbols, so todo!() never fired. The body-pass committed
   code that "passed" the suite and silently broke trigger_weaving
   in special stages.

2. @check found the trigger_weaving regression at Phase 8 (final
   review) and the orchestrator decided to "fix them directly"
   rather than dispatching @make — taking the license offered by
   the existing review-loop wording.

Test-quality fixes:

- Phase 3 Test Design now requires each behavior to be expressed as
  an action + observable outcome. Structural facts ("enum has 3
  variants", "struct has these fields") are explicitly disqualified.
- Phase 6 stub-first flow gains a mandatory Panic-coverage check:
  after @test returns, the orchestrator re-runs the test command and
  rejects the output unless every test panics on todo!() (i.e. every
  test exercises at least one stubbed symbol). Any passing test is
  structural-only and routes back to @test.
- Phase 6 decision table gets a "Stub-first run: tests pass with zero
  todo!() panics" row covering the same case.
- @test's Test Philosophy gains an explicit Do-NOT-write list of
  structural-only patterns (variant_count, type ascriptions,
  Box::new(my_fn), struct-literal-only flows, all-pass-on-stubs)
  plus a positive rule: every test must call a function and assert
  on observable outcome, or return NOT_TESTABLE rather than pad the
  suite.

Orchestrator boundary fix:

- Phase 8 review loop replaces "fix them directly (no need to
  re-dispatch @make for small fixes)" with the principle "the
  orchestrator does not write production code; @make does". BLOCK,
  behavioral, correctness, and test-quality findings round-trip
  through @make. Only AST-preserving cosmetic edits (typos in
  comments, trailing newlines) may be applied directly. Compiler-
  detected issues (unused imports, dead code) go through @make.
This commit is contained in:
Harald Hoyer 2026-05-07 09:07:41 +02:00
parent 91e8aab383
commit 91ba5bd272
2 changed files with 20 additions and 4 deletions

View file

@ -149,6 +149,14 @@ This constraint is enforced by a post-step file gate. Violations cause your outp
- Trivial tests (constructor creates object, getter returns value) - Trivial tests (constructor creates object, getter returns value)
- Tests that assert on mock behavior rather than real behavior - Tests that assert on mock behavior rather than real behavior
- Tests requiring excessive mocking (>2 mocks suggests design problem — report it) - Tests requiring excessive mocking (>2 mocks suggests design problem — report it)
- **Structural-only tests** that never invoke the function/method under test. Forbidden patterns:
- `assert_eq!(std::mem::variant_count::<X>(), N)` — variant count is a refactor-tripwire, not behavior.
- `let _: TypeName = …;` / `let _: fn(…) -> _ = my_fn;` — a type ascription that compiles tells you the symbol exists, not what it does.
- `Box::new(my_fn)` / `&my_fn as &dyn Fn(…)` — coercing a function pointer is not calling it.
- Struct-literal construction (`Foo { a: 1, b: 2 }`) followed only by field re-reads — that exercises field access, not the methods that mutate or read state.
- Tests in a stub-first scenario where every test passes without a `todo!()` panic — by definition no test actually called the stub.
**Positive rule — every test MUST exercise behavior.** Each test body must call at least one function or method that is the subject of the task and assert on an *observable outcome* (return value, mutated state, raised error, side effect). If the only thing you can write is a structural assertion, the task is "no test needed" — report it back to the caller as `NOT_TESTABLE` (with a clear reason) rather than padding the suite with type-only tests that produce false-green coverage.
**Follow existing codebase patterns** (per language): **Follow existing codebase patterns** (per language):

View file

@ -57,8 +57,8 @@ The plan should include:
- New files to create - New files to create
- Risks and open questions - Risks and open questions
- **Test Design (conditional — include for non-trivial tasks):** - **Test Design (conditional — include for non-trivial tasks):**
- Key behaviors to verify (what tests should assert) - Key behaviors to verify, expressed as **action + observable outcome** (e.g. *"call `weave_enemies` with t=0.5 → enemy `Transform.translation.x` differs from initial position"*). A structural fact like *"enum has 3 variants"* or *"struct has these fields"* is **not** a behavior — it cannot fail meaningfully and does not exercise the code under test.
- Edge cases and error conditions worth testing - Edge cases and error conditions worth testing (also expressed as actions, not structure)
- What explicitly should NOT be tested (prevents bloat) - What explicitly should NOT be tested (prevents bloat)
- Testability concerns (heavy external deps, GPU-only paths, etc.) - Testability concerns (heavy external deps, GPU-only paths, etc.)
@ -257,6 +257,7 @@ If any non-matching file appears, or any anti-pattern matches: discard `@test` o
| `NOT_TESTABLE` reason `Rust unit-only` | See "Rust unit-only routing" below. **Do not** include test code in the `@make` spec; pass test specs only. | | `NOT_TESTABLE` reason `Rust unit-only` | See "Rust unit-only routing" below. **Do not** include test code in the `@make` spec; pass test specs only. |
| `BLOCKED` | Investigate. May need to revise task spec or plan. | | `BLOCKED` | Investigate. May need to revise task spec or plan. |
| Test passes immediately | Investigate — behavior may already exist. Task spec may be wrong. | | Test passes immediately | Investigate — behavior may already exist. Task spec may be wrong. |
| Stub-first run: tests pass with zero `todo!()` panics | **Structural-only tests.** Every test is asserting type/struct/enum facts without calling any stubbed symbol. Reject the test output and route back to `@test` with a "must exercise the stubbed symbols by calling them" note. Do not let these tests gate Phase 7 — they cannot RED→GREEN, so the body-pass `@make` would commit code with false-green coverage. |
### Rust unit-only routing ### Rust unit-only routing
@ -287,7 +288,11 @@ Rust integration tests live in a separate test crate (`tests/<feature>.rs`) that
- **Dispatch Hygiene still applies:** the stub pass is small and finalized — no draft bodies, no contradictory signatures. - **Dispatch Hygiene still applies:** the stub pass is small and finalized — no draft bodies, no contradictory signatures.
2. Verify `cargo check` passed in `@make`'s output. If not, fix and re-dispatch the stub pass before continuing. 2. Verify `cargo check` passed in `@make`'s output. If not, fix and re-dispatch the stub pass before continuing.
3. Dispatch `@test` as normal. The integration test now compiles; running it panics on `todo!()` at runtime, which is a clean `MISSING_BEHAVIOR` RED with a stack trace — far better than the build-error-RED form. 3. Dispatch `@test` as normal. The integration test now compiles; running it panics on `todo!()` at runtime, which is a clean `MISSING_BEHAVIOR` RED with a stack trace — far better than the build-error-RED form.
4. Continue to Phase 7's body pass (`@make` in TDD mode), where the same files are revisited and the `todo!()` bodies are replaced. 4. **Panic-coverage check (MANDATORY).** After `@test` returns, re-run the test command in the orchestrator and verify that **every test in the new file panics on `todo!()`** (i.e. every test exercises at least one of the stubbed symbols). The decision rule:
- If the test output shows N panics for N tests → proceed to body pass.
- If any test passes without a `todo!()` panic → that test is structural-only (asserting type / variant-count / field facts without calling the stubbed code). **Reject** `@test`'s output and route back with the "Stub-first run: tests pass with zero `todo!()` panics" decision-table verdict. Require `@test` to rewrite each non-panicking test so it actually invokes the stubbed function/method.
- This check is the only thing standing between false-green coverage and the body-pass commit. Skipping it has produced regressions like a system that compiles, "passes" tests, and silently no-ops in production.
5. Continue to Phase 7's body pass (`@make` in TDD mode), where the same files are revisited and the `todo!()` bodies are replaced.
**This routing is mandatory** for new public API in Rust. It is **not** required when the integration test exercises an existing public API (e.g. a behavior fix where the function already exists) — in that case `@test` runs directly and `@make` modifies the body in Phase 7. **This routing is mandatory** for new public API in Rust. It is **not** required when the integration test exercises an existing public API (e.g. a behavior fix where the function already exists) — in that case `@test` runs directly and `@make` modifies the body in Phase 7.
@ -353,7 +358,10 @@ Provide reviewers with:
1. Send implementation to both reviewers 1. Send implementation to both reviewers
2. Merge findings (same precedence rules as Phase 4) 2. Merge findings (same precedence rules as Phase 4)
3. If ACCEPTABLE: proceed to Phase 9 3. If ACCEPTABLE: proceed to Phase 9
4. If issues found: fix them directly (no need to re-dispatch `@make` for small fixes), then re-review 4. If issues found, route per the kind of finding — **the orchestrator does not write production code; `@make` does**:
- **`BLOCK`, behavioral, correctness, or test-quality findings:** build a new `@make` task spec from the finding (apply Dispatch Hygiene, finalized text, no draft answer). Dispatch `@make`. Do **not** fix directly. Every `BLOCK` is by definition behavioral and must round-trip through `@make`.
- **Strictly cosmetic findings** (typo in a comment, missing trailing newline, formatting that does not change the AST or behavior): the orchestrator may fix directly, then re-review. Anything compiler-detected (unused import, dead code) goes through `@make`, since removing it is still a code change.
- When in doubt, dispatch `@make`.
5. **Convergence detection:** same findings twice = stop loop early 5. **Convergence detection:** same findings twice = stop loop early
6. If unresolved after 3 cycles: document blockers, proceed to commit anyway 6. If unresolved after 3 cycles: document blockers, proceed to commit anyway