feat(opencode): give @make a concrete test-smell checklist

Real-world observation: @make struggles when @test sets up tests
incorrectly because the existing escalate: test_design trigger is
described abstractly ("test seems to demand wrong thing"). When @make
sees an unfamiliar smell, it tends to attempt implementation, fail,
attempt again, and only escalate after burning 2-3 cycles. The
protocol exists; the recognition criteria don't.

Restructure Entry Validation step 5 into a named "Test triage" step
with a concrete checklist that fires *before* any implementation
attempt. Four categories of smells:

- **Mocking smells:** mocks the SUT, >2 mocks, mock-call-as-primary
  assertion, internal-boundary mocking
- **Structural-only smells:** variant counts, type ascriptions,
  function-pointer coercion, struct-literal-with-field-reads,
  stub-first no-panics (mirrors @test.md's anti-patterns)
- **Wrong-target smells:** asserts on private state / log strings,
  demands contradicting spec, physically impossible demands
- **Setup smells:** fixtures bypassing production validation,
  wrong-module imports, references to nonexistent infrastructure

Iteration Limits step 5 now cross-references the same checklist
instead of restating abstract criteria, so both gates apply the same
recognition rules with a single source of truth.

A "NOT for" caveat prevents over-eager escalation: when the test is
fine but the implementation is just hard, that's not a smell, that's
the test doing its job.

The checklist is inlined (not pulled from @test.md at runtime) because
subagents have separate contexts. Periodic manual sync between
@make.md's checklist and @test.md's anti-patterns is acceptable —
they shouldn't drift much in practice.

Refs: config/opencode/agents/test.md (anti-patterns + structural-only
list it mirrors), config/opencode/workflow-design.md ADR-19 (unified
Implementation Incomplete diagnosis path)

@@ -331,7 +331,7 @@ If tests fail or verification doesn't pass:
2. **Context/spec issues** — Stop immediately and report; don't guess
3. **Code issues** — Attempt fix (max 2-3 attempts if making progress)
4. **Flaky/infra issues** — Stop and report with diagnostics
5. **Test-design suspicion** — If after 1-2 attempts the test seems to demand production code that contradicts the spec, asserts on internal state that shouldn't be observable, mocks an internal boundary instead of the external one, or otherwise looks like it's testing the wrong thing — **stop and report with `escalate: test_design`** in the Blocking Issue section. Do not modify the test file yourself; the caller will route to `@check` for diagnosis and `@test` for redesign per the workflow's Phase 7 escalation.
5. **Test-design suspicion** — If you reach this step *after* implementation attempts, you missed the smell at Entry Validation. Re-read the test against the **Test-design smell checklist** in the TDD Mode → Entry Validation section above. If any smell now matches (often it's the "wrong-target" or "setup" categories that only become visible once you've tried to satisfy them), **stop and report with `escalate: test_design`** in the Blocking Issue section, naming the specific smell. Do not modify the test file yourself; the caller will route to `@check` for diagnosis and `@test` for redesign per the workflow's Phase 7 escalation.
6. **Task-scope suspicion** — If after 1-2 attempts you find that the AC realistically require modifying files not listed in your "Files to Modify," or the AC mix multiple distinct concerns that don't fit one coherent change (e.g. a new type *and* its registration site *and* a new system using it), the task is over-scoped — **stop and report with `escalate: split_needed`** in the Blocking Issue section. State concretely which file(s) outside your modify list you'd need, or which concerns the task is mixing. Do not silently expand scope; the caller will route to `@check` for diagnosis and (per the workflow's ADR-21) the run will abort to the Failure Handler so the user can re-plan from scratch.
The `escalate:` flag is a *hint* to the caller's diagnosis routing — `@check` is the authority that confirms or rejects it. Reporting `escalate: split_needed` doesn't guarantee the run aborts; if `@check` decides the task is sound and the issue is in tests or production code, the diagnosis will route back to a normal `test_design` or `production_logic` verdict.
@@ -418,11 +418,41 @@ When the caller provides pre-written failing tests from `@test`:
### Entry Validation
1. Run the provided tests using the exact command from the handoff.
2. Confirm they fail (RED). Compare against the expected failing tests and failure codes from the handoff.
3. If tests PASS before implementation: STOP. Report anomaly to caller — behavior already exists, task spec may be wrong.
4. If tests fail for wrong reason (TEST_BROKEN): STOP. Report to caller for test fixes.
5. If there are test-quality concerns (wrong assertions, testing mocks, missing edge cases): report with details. Caller routes to `@check` for diagnosis, then to `@test` for fixes.
3. **PASS-before-implementation** — If tests pass without any production-code change: STOP. Report anomaly to caller — behavior already exists, task spec may be wrong.
4. **Wrong-reason failure** — If tests fail for the wrong reason (TEST_BROKEN — e.g. import error, syntax error, fixture exception unrelated to the AC): STOP. Report to caller for test fixes.
5. **Test triage** (do this *before* attempting any implementation) — read each test file and evaluate setup quality against the checklist below. Any single match is sufficient to escalate. **Stop and report with `escalate: test_design`** in the Blocking Issue section, naming the specific smell and which test exhibits it. Do not start implementing.
**Escalation ownership:** You diagnose and report test issues. You do NOT edit test files. The caller routes to `@check` (diagnosis) → `@test` (fixes) → back to you.
#### Test-design smell checklist
Recognize these patterns by reading the test file before you write any production code. If you spot one, the test is set up wrong — escalate; the caller routes to `@check` → `@test` for redesign.
**Mocking smells:**
- Mocks the system-under-test itself (the function/method/module the test claims to verify).
- Asserts on mock-call counts or argument matchers as the *primary* assertion, with no real-behavior assertion to back it up. ("`mock.foo.assert_called_with(x)`" is a means, not an end.)
- More than 2 mocks in a single test — usually means the production code's collaborator graph has been mocked rather than the external boundary.
- Mocks an internal boundary (a private helper, a same-crate module) instead of the external one (network, filesystem, time, RNG).
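For concreteness, a minimal Rust sketch of the first two mocking smells in a single test. It is an illustration, not content from the checklist: the `mockall` crate and the names `PriceCalc` and `checkout` are assumptions made for the example.

```rust
// Hypothetical sketch, not taken from the repo; `PriceCalc`, `checkout`, and
// the `mockall` crate are assumptions made for illustration only.
use mockall::automock;

#[automock]
trait PriceCalc {
    fn total(&self, items: &[u32]) -> u32;
}

// Production code under test.
fn checkout(calc: &dyn PriceCalc, items: &[u32]) -> u32 {
    calc.total(items)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn checkout_calls_the_calculator() {
        let mut calc = MockPriceCalc::new();
        // Smell: the mock expectation is the only assertion; the value that
        // checkout() returns is never checked, and the mocked trait is an
        // internal collaborator rather than an external boundary.
        calc.expect_total().times(1).returning(|_| 0);
        checkout(&calc, &[1, 2, 3]);
    }
}
```

A behavior-first version would assert on the value `checkout` returns and reserve mocks for a genuinely external boundary.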
**Structural-only smells (the test compiles but doesn't exercise behavior):**
- `assert_eq!(std::mem::variant_count::<X>(), N)` or similar enum/struct shape checks — refactor-tripwire, not behavior.
- `let _: TypeName = …;` / `let _: fn(…) -> _ = my_fn;` — type ascriptions tell you the symbol exists, not what it does.
- `Box::new(my_fn)` / `&my_fn as &dyn Fn(…)` — coercing a function pointer is not calling it.
- Struct-literal construction (`Foo { a: 1, b: 2 }`) followed only by field re-reads — exercises field access, not methods.
- In stub-first runs (Rust ADR-7): tests that pass without panicking on `todo!()` — by definition no test actually called the stub.
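A hypothetical sketch (invented `Config`, `ConfigError`, and `parse_retries` names) of tests that stay green while the stub is still `todo!()`, combining the type-ascription and struct-literal smells:

```rust
// Hypothetical sketch with invented names; the point is that neither assertion
// below ever calls parse_retries, so the todo!() stub is never exercised.
#[derive(Debug, PartialEq)]
struct Config {
    retries: u8,
}

#[derive(Debug, PartialEq)]
enum ConfigError {
    Invalid,
}

fn parse_retries(_raw: &str) -> Result<Config, ConfigError> {
    todo!() // stub-first: still unimplemented
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parse_retries_has_expected_shape() {
        // Type ascription: proves the symbol and signature exist, nothing more.
        let _: fn(&str) -> Result<Config, ConfigError> = parse_retries;
        // Struct literal plus field re-read: exercises field access, not parsing.
        let cfg = Config { retries: 3 };
        assert_eq!(cfg.retries, 3);
    }
}
```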
**Wrong-target smells:**
- Asserts on internal/private state that the production code shouldn't expose (`assert_eq!(obj._private_counter, 5)`).
- Asserts on log-output strings as a stand-in for behavior (use the actual return value or side effect).
- Tests demand production code that contradicts the task spec — the test wants a return type, signature, or side effect different from what the AC describes.
- Tests demand production code that is *physically impossible* (e.g. requires reading a value before it's been written, or accessing a field that was never declared).
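A sketch of the private-state variant, again with invented names (`Cache`, `hits`). The observable contract here is what `get` returns, which the test never checks:

```rust
// Hypothetical sketch: `hits` is private bookkeeping, not part of the
// observable contract, yet it is the only thing the test asserts on.
use std::collections::HashMap;

struct Cache {
    store: HashMap<String, String>,
    hits: u32,
}

impl Cache {
    fn new() -> Self {
        Cache { store: HashMap::new(), hits: 0 }
    }

    fn insert(&mut self, key: &str, value: &str) {
        self.store.insert(key.to_string(), value.to_string());
    }

    fn get(&mut self, key: &str) -> Option<&String> {
        let hit = self.store.get(key);
        if hit.is_some() {
            self.hits += 1;
        }
        hit
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn get_bumps_internal_counter() {
        let mut cache = Cache::new();
        cache.insert("a", "1");
        cache.get("a");
        // Smell: asserting on private state instead of the returned value.
        assert_eq!(cache.hits, 1);
    }
}
```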
**Setup smells:**
- Fixtures construct state in a way that doesn't match how production code expects to receive it (e.g. test inserts a row directly bypassing the validation the production code requires).
- Test imports refer to symbols at paths that don't match where the production code lives (the test is testing the wrong module).
- Test file uses fixtures or helpers that don't exist anywhere in the codebase — the test relies on infrastructure that was never built.
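And a sketch of the first setup smell, with invented names; `Account::open` stands in for whatever validation the production code actually enforces:

```rust
// Hypothetical sketch: the fixture bypasses the validating constructor and
// tests a state that production code can never produce.
struct Account {
    balance: i64,
}

impl Account {
    // Production path: enforces the non-negative-balance invariant.
    fn open(balance: i64) -> Result<Account, String> {
        if balance < 0 {
            return Err("opening balance must be non-negative".to_string());
        }
        Ok(Account { balance })
    }

    fn withdraw(&mut self, amount: i64) -> Result<(), String> {
        if amount > self.balance {
            return Err("insufficient funds".to_string());
        }
        self.balance -= amount;
        Ok(())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn withdraw_from_overdrawn_account() {
        // Smell: the struct literal creates a balance that Account::open()
        // rejects, so the scenario is unreachable through the production path.
        let mut acct = Account { balance: -50 };
        assert!(acct.withdraw(10).is_err());
    }
}
```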
**One thing this list is NOT for:** legitimate cases where the test exposes a *production-code* gap (the implementation needs to be different to make the test pass). That's not a test smell — that's the test doing its job. Escalate `test_design` only when the test setup itself is wrong, not when the implementation is just hard.
**Escalation ownership:** You diagnose and report test issues. You do NOT edit test files. The caller routes to `@check` (diagnosis — `@check` confirms or rejects your `test_design` hint) → `@test` (fixes) → back to you for fresh entry validation.
### Implementation
6. Write minimal code to make the failing tests pass.