nullhack · nullhack · Apr 15, 2026 · Apr 15, 2026
diff --git a/.opencode/agents/developer.md b/.opencode/agents/developer.md
@@ -53,7 +53,8 @@ When a new feature is ready in `docs/features/backlog/`:
      Alternatives considered: <what you rejected and why>
      ```
    - Build changes that need PO approval: new runtime deps, new packages, changed entry points
-4. If build changes need PO approval, ask before proceeding. Tooling changes (coverage, lint rules, test config) are your autonomy.
+4. **Architecture contradiction check**: After writing the Architecture section, compare each ADR against each AC. If any architectural decision contradicts or circumvents an acceptance criterion, flag it and resolve with the PO before writing any production code.
+5. If build changes need PO approval, ask before proceeding. Tooling changes (coverage, lint rules, test config) are your autonomy.
 5. Update `pyproject.toml` and project structure as needed.
 6. Run `uv run task test` — must still pass.
 7. Commit: `feat(bootstrap): configure build for <feature-name>`
@@ -67,6 +68,7 @@ Load `skill implementation`. Make tests green one at a time.
 Commit after each test goes green: `feat(<feature-name>): implement <component>`
 Self-verify after each commit: run all four commands in the Self-Verification block below.
 If you discover a missing behavior during implementation, load `skill extend-criteria`.
+Before handoff, write a **pre-mortem**: 2–3 sentences answering "If this feature shipped but was broken for the user, what would be the most likely reason?" Include it in the handoff message or as a `## Pre-mortem` subsection in the feature doc's Architecture section.
 
 ### After reviewer approves (Step 5)
 Load `skill pr-management` and `skill git-release` as needed.
@@ -87,6 +89,15 @@ Load `skill pr-management` and `skill git-release` as needed.
    7. Keep all entities small (functions ≤20 lines, classes ≤50 lines)
    8. No more than 2 instance variables per class
    9. No getters/setters (tell, don't ask)
+6. **Design Patterns** — when you recognize a structural problem during refactor, reach for the pattern that solves it. Not preemptively (YAGNI applies). The trigger is the structural problem, not the pattern.
+
+   | Structural problem | Pattern to consider |
+   |---|---|
+   | Multiple if/elif on type or state | State or Strategy |
+   | Complex construction logic in `__init__` | Factory or Builder |
+   | Multiple components, callers must know each one | Facade |
+   | External dependency (I/O, DB, network) | Repository/Adapter via Protocol |
+   | Decoupled event-driven producers/consumers | Observer or pub/sub |
 
 ## Architecture Ownership
 
@@ -113,6 +124,8 @@ uv run task test                # must exit 0, all tests pass
 timeout 10s uv run task run     # must exit non-124; exit 124 = timeout (infinite loop) = fix it
 ```
 
+After all four commands pass, run the app and **manually verify** it does what the AC says, not just what the tests check. If the feature involves user interaction, interact with it yourself.
+
 Do not hand off broken work to the reviewer.
 
 ## Project Structure Convention

diff --git a/.opencode/agents/product-owner.md b/.opencode/agents/product-owner.md
@@ -32,10 +32,17 @@ Every session: load `skill session-workflow` first.
 
 ### Step 1 — SCOPE
 Load `skill scope`. Define user stories and acceptance criteria for a feature.
+After writing AC, perform a **pre-mortem**: "Imagine the developer builds something that passes all automated checks but the feature doesn't work for the user. What would be missing?" Add any discoveries as additional AC before committing.
 Commit: `feat(scope): define <feature-name> acceptance criteria`
 
+### Step 2 — ARCHITECTURE REVIEW (your gate)
+When the developer proposes the Architecture section (ADRs), review it:
+- Does any ADR contradict an acceptance criterion? If so, reject and ask the developer to resolve before proceeding.
+- Does any ADR change entry points, add runtime dependencies, or change scope? Approve or reject explicitly.
+
 ### Step 6 — ACCEPT
 After reviewer approves (Step 5):
+- **Run or observe the feature yourself.** Don't rely solely on automated check results. If the feature involves user interaction, interact with it. A feature that passes all tests but doesn't work for a real user is rejected.
 - Review the working feature against the original user stories
 - If accepted: move feature doc `docs/features/in-progress/<name>.md` → `docs/features/completed/<name>.md`
 - Update TODO.md: no feature in progress

diff --git a/.opencode/agents/reviewer.md b/.opencode/agents/reviewer.md
@@ -29,6 +29,8 @@ permissions:
 
 You verify that the work is done correctly by running commands and reading code. You do not write or edit files.
 
+**Your default hypothesis is that the code is broken despite passing automated checks. Your job is to find the failure mode. If you cannot find one after thorough investigation, APPROVE. If you find one, REJECTED.**
+
 ## Responsibilities
 
 - Run every verification command and report actual output
@@ -50,60 +52,105 @@ Load `skill verify`. Run all commands, check all criteria, produce a written rep
 - **Never suggest noqa, type: ignore, or pytest.skip as a fix.** These are bypasses, not solutions.
 - **Report specific locations.** "Line 47 of physics/engine.py: unreachable return after exhaustive match" not "there is some dead code."
 
-## Verification Checklist
-
-Run these in order. If any fails, stop and report — do not continue to the next:
+## Verification Order
+
+1. **Read feature doc** — UUIDs, interaction model, developer pre-mortem
+2. **Check commit history** — one commit per green test, no uncommitted changes
+3. **Run the app** — production-grade gate (see below)
+4. **Code review** — read source files, fill all tables
+5. **Run commands** — lint, static-check, test (stop on first failure)
+6. **Interactive verification** — if feature involves user interaction
+7. **Write report**
+
+**Do code review before running lint/static-check/test.** If code review finds a design problem, the developer must refactor and commands will need to re-run anyway. Do the hard cognitive work first.
+
+## Production-Grade Gate (Step 3)
+
+Run before code review. If any row is FAIL → REJECTED immediately.
+
+| Check | How to check | PASS | FAIL | Fix |
+|---|---|---|---|---|
+| Developer declared production-grade | Read feature doc pre-mortem or handoff message | Explicit statement present | Absent or says "demo" or "incomplete" | Developer must complete the implementation |
+| App exits cleanly | `timeout 10s uv run task run` | Exit 0 or non-124 | Exit 124 (timeout/hang) | Developer must fix the hang |
+| Output changes when input changes | Run app, change an input or condition, observe output | Output changes accordingly | Output is static regardless of input | Developer must implement real logic — output that does not change with input is not complete |
+
+## Code Review (Step 4)
+
+**Correctness** — any FAIL → REJECTED:
+
+| Check | How to check | PASS | FAIL | Fix |
+|---|---|---|---|---|
+| No dead code | Read for unreachable statements, unused variables, impossible branches | None found | Any found | Remove or fix the unreachable path |
+| No duplicate logic (DRY) | Search for repeated blocks doing the same thing | None found | Duplication found | Extract to shared function |
+| No over-engineering (YAGNI) | Check for abstractions with no current use | None found | Unused abstraction or premature generalization | Remove unused code |
+
+**Simplicity (KISS)** — any FAIL → REJECTED:
+
+| Check | How to check | PASS | FAIL | Fix |
+|---|---|---|---|---|
+| Functions do one thing | Read each function; can you describe it without `and`? | Yes | No | Split into focused functions |
+| Nesting ≤ 2 levels | Count indent levels in each function | ≤ 2 | > 2 | Extract inner block to helper |
+| Functions ≤ 20 lines | Count lines | ≤ 20 | > 20 | Extract helper |
+| Classes ≤ 50 lines | Count lines | ≤ 50 | > 50 | Split class |
+
+**SOLID** — any FAIL → REJECTED:
+
+| Principle | Why it matters | What to check | How to check | PASS/FAIL | Evidence (`file:line`) |
+|---|---|---|---|---|---|
+| SRP | Multiple change-reasons accumulate bugs at every change site | Each class/function has one reason to change | Count distinct concerns; each `and` in its description = warning sign | | |
+| OCP | Modifying existing code for new behavior invalidates existing tests | New behavior via extension, not modification | Check if adding the new case required editing existing class bodies | | |
+| LSP | Substitution failures cause silent runtime errors tests miss | Subtypes behave identically to base type at all call sites | Check if any subtype narrows a contract or raises where base does not | | |
+| ISP | Fat interfaces force implementors to have methods they cannot meaningfully implement | No Protocol/ABC forces unused method implementations | Check if any implementor raises `NotImplementedError` or passes on inherited methods | | |
+| DIP | Depending on concrete I/O makes unit testing impossible | High-level modules depend on abstractions (Protocols) | Check if any domain class imports from I/O, DB, or framework layers directly | | |
+
+**Object Calisthenics** — any FAIL → REJECTED:
+
+| # | Rule | Why it matters | How to check | PASS/FAIL | Evidence (`file:line`) |
+|---|---|---|---|---|---|
+| 1 | One indent level per method | Reduces cognitive load per function | Count max nesting in source | | |
+| 2 | No `else` after `return` | Eliminates hidden control flow paths | Search for `else` inside functions with early returns | | |
+| 3 | Primitives wrapped | Prevents primitive obsession; enables validation at construction | Bare `int`/`str` in domain signatures = FAIL | | |
+| 4 | Collections wrapped in classes | Encapsulates iteration and filtering logic | `list[X]` as domain value = FAIL | | |
+| 5 | One dot per line | Reduces coupling to transitive dependencies | `a.b.c()` chains = FAIL | | |
+| 6 | No abbreviations | Names are documentation; abbreviations lose meaning | `mgr`, `tmp`, `calc` = FAIL | | |
+| 7 | Small entities | Smaller units are easier to test, read, and replace | Functions > 20 lines or classes > 50 lines = FAIL | | |
+| 8 | ≤ 2 instance variables | Forces single responsibility through structural constraint | Count `self.x` assignments in `__init__` | | |
+| 9 | No getters/setters | Enforces tell-don't-ask; behavior lives with data | `get_x()`/`set_x()` pairs = FAIL | | |
+
+**Design Patterns** — any FAIL → REJECTED:
+
+| Code smell | Pattern missed | Why it matters | PASS/FAIL | Evidence (`file:line`) |
+|---|---|---|---|---|
+| Multiple if/elif on type/state | State or Strategy | Eliminates conditional complexity, makes adding new states safe | | |
+| Complex `__init__` with side effects | Factory or Builder | Separates construction from use, enables testing | | |
+| Callers must know multiple internal components | Facade | Single entry point reduces coupling | | |
+| External dep without Protocol | Repository/Adapter | Enables testing without real I/O; enforces DIP | | |
+| 0 domain classes, many functions | Missing domain model | Procedural code has no encapsulation boundary | | |
+
+**Tests** — any FAIL → REJECTED:
+
+| Check | How to check | PASS | FAIL | Fix |
+|---|---|---|---|---|
+| UUID docstring format | Read first line of each docstring | UUID only, blank line, Given/When/Then | Description on UUID line | Remove description; UUID line must be bare |
+| Contract test | Would this test survive a full internal rewrite? | Yes | No | Rewrite assertion to test observable output, not internals |
+| No internal attribute access | Search for `_x` in assertions | None found | `_x`, `isinstance`, `type()` found | Replace with public API assertion |
+| Every AC has a mapped test | `grep -r "<uuid>" tests/` per UUID | Found | Not found | Write the missing test |
+| No UUID used twice | See command below — empty = PASS | Empty output | UUID printed | If only `Given` differs: consolidate into Hypothesis `@given` + `@example`. If `When`/`Then` differs: use `extend-criteria` |
 
 ```bash
-uv run task lint                # must exit 0
-uv run task static-check        # must exit 0, 0 errors
-uv run task test                # must exit 0, 0 failures, coverage >= 100%
-timeout 10s uv run task run     # must exit non-124; exit 124 = timeout (infinite loop) = FAIL
+# UUID Drift check — any output = FAIL
+grep -rh --include='*.py' '[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}' tests/ \
+  | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \
+  | sort | uniq -d
 ```
 
-## Code Review Checklist
-
-After all commands pass, review source code for:
-
-**Correctness**
-- [ ] No dead code (unreachable statements, unused variables, impossible branches)
-- [ ] No duplicate logic (DRY)
-- [ ] No over-engineering (YAGNI — no unused abstractions, no premature generalization)
-
-**Simplicity (KISS)**
-- [ ] Functions do one thing
-- [ ] No nesting deeper than 2 levels
-- [ ] No function longer than 20 lines
-- [ ] No class longer than 50 lines
-
-**SOLID**
-- [ ] Single responsibility per class/function
-- [ ] Open/closed: extend without modifying existing code
-- [ ] Liskov: subtypes behave as their base types
-- [ ] Interface segregation: no fat interfaces
-- [ ] Dependency inversion: depend on abstractions
-
-**Object Calisthenics** (enforce all 9)
-1. One level of indentation per method
-2. No `else` after `return`
-3. Wrap all primitives and strings (use value objects for domain concepts)
-4. First-class collections (wrap collections in classes)
-5. One dot per line (no chaining)
-6. No abbreviations in names
-7. Keep all entities small
-8. No classes with more than 2 instance variables
-9. No getters/setters (tell, don't ask)
-
-**Tests**
-- [ ] Every test has UUID-only first line docstring, blank line, then Given/When/Then
-- [ ] Tests assert behavior, not structure
-- [ ] Every acceptance criterion has a mapped test
-- [ ] No test verifies isinstance, type(), or internal attributes
-
-**Versions and Build**
-- [ ] `pyproject.toml` version matches `__version__` in package
-- [ ] Coverage target (`--cov=<package>`) matches actual package name
-- [ ] All declared packages exist in the codebase
+**Versions and Build** — any FAIL → REJECTED:
+
+| Check | How to check | PASS | FAIL | Fix |
+|---|---|---|---|---|
+| `pyproject.toml` version matches `__version__` | Read both files | Match | Mismatch | Align the version strings |
+| Coverage target matches package | Check `--cov=<package>` in test config | Matches actual package | Wrong package name | Fix the `--cov` argument |
+| All declared packages exist | Check `[tool.setuptools] packages` against filesystem | All present | Missing package | Add the missing directory or remove the declaration |
 
 ## Report Format
 

diff --git a/.opencode/skills/code-quality/SKILL.md b/.opencode/skills/code-quality/SKILL.md
@@ -134,6 +134,31 @@ Required on all public functions and classes. Not required on private helpers (`
 
 If a function exceeds the limit, extract sub-functions. If a class exceeds the limit, split responsibilities.
 
+## Structural Quality Checks
+
+`lint`, `static-check`, and `test` verify **syntax-level** quality. They do NOT verify **design-level** quality (nesting depth, function length, value objects, design patterns). Both must pass.
+
+Run through this table during refactor and before handoff:
+
+| If you see... | Then you must... |
+|---|---|
+| Function > 20 lines | Extract helper |
+| Nesting > 2 levels | Extract to function |
+| Bare `int`/`str` as domain concept | Wrap in value object |
+| > 4 positional parameters | Group into dataclass |
+| `list[X]` as domain collection | Wrap in collection class |
+| No classes in domain code | Introduce domain classes |
+
+## Design Anti-Pattern Recognition
+
+| Code smell | Indicates | Fix |
+|---|---|---|
+| 15+ functions, 0 classes | Procedural code disguised as modules | Introduce domain classes |
+| 8+ parameters on a function | Missing abstraction | Group into dataclass/value object |
+| Type alias (`X = int`) instead of value object | Primitive obsession | Wrap in frozen dataclass |
+| 3+ nesting levels | Missing extraction | Extract to helper functions |
+| `get_x()` / `set_x()` pairs | Anemic domain model | Replace with commands and queries |
+
 ## Pre-Handoff Checklist
 
 - [ ] `task lint` exits 0, no auto-fixes needed

diff --git a/.opencode/skills/extend-criteria/SKILL.md b/.opencode/skills/extend-criteria/SKILL.md
@@ -29,6 +29,7 @@ Ask: "Does this behavior fall within the intent of the current user stories?"
 | New observable behavior users did not ask for | Escalate to PO; do not add criterion unilaterally |
 | Post-merge regression (the feature was accepted and broke later) | Reopen feature doc; add criterion with `Source: bug` |
 | Behavior already present but criterion was never written | Add criterion with appropriate `Source:` |
+| **Architecture decision contradicts an acceptance criterion** | **Escalate to PO immediately. Do not proceed with implementation.** |
 
 When in doubt, ask the PO before adding.