root a81291e38c proof harness Phase A: scaffolding + canary case green
Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier
above the smoke chain. This commit lays the scaffolding and proves
the orchestrator end-to-end with one canary case (00_health).

What landed:

  tests/proof/
    README.md             how to read a report, layout, modes
    claims.yaml           24 claims enumerated (GOLAKE-001..100)
    run_proof.sh          orchestrator with --mode {contract|integration|performance}
                          and --no-bootstrap / --regenerate-{rankings,baseline}
    lib/
      env.sh              service URLs, report dir, mode, git context
      http.sh             curl wrappers writing per-probe JSON + body + headers
      assert.sh           proof_assert_{eq,ne,contains,lt,gt,status,json_eq} +
                          proof_skip — each emits one JSONL record per call
      metrics.sh          start/stop timers, value capture, RSS sampling,
                          percentile compute (for Phase D)
    cases/
      00_health.sh        canary — gateway + 6 services /health → 200,
                          body identifies service, latency < 500ms (21 assertions)
    fixtures/
      csv/workers.csv     spec's 5-row deterministic CSV
      text/docs.txt       4 deterministic vector docs
      expected/queries.json  expected results for the 5 SQL assertions

Wired into the task runner:

  just proof contract       # canary only this commit
  just proof integration    # Phase C
  just proof performance    # Phase D

.gitignore: /tests/proof/reports/* with !.gitkeep — same pattern as
reports/scrum/_evidence/. Per-run output is a runtime artifact.

Specs landed alongside (J's drops):
  docs/TEST_PROOF_SCOPE.md           the harness contract this implements
  docs/CLAUDE_REFACTOR_GUARDRAILS.md process discipline this harness obeys

Verified end-to-end (cached binaries):
  just proof contract        wall < 2s, 21 pass / 0 fail / 0 skip
  just verify                wall 31s, vet + test + 9 smokes still green

Two bugs fixed during canary run, both in run_proof.sh aggregation:
- grep -c exits 1 on zero matches; the `|| echo 0` form concatenated
  "0\n0" and broke jq --argjson + integer comparison. Fixed via a
  _count helper that captures count-or-zero cleanly.
- per-case table iterated case scripts (filename-based) but cases
  write evidence under CASE_ID. Switched to JSONL-file iteration so
  multi-case scripts work and the mapping is faithful.

Phase B (contract cases) lands next: 05_embedding, 06_vector_add,
08_gateway_contracts, 09_failure_modes. Each sources the same lib
helpers and writes to the same report shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:08:51 -05:00


# Claude Refactor Guardrails — Go Lakehouse
## Mission
Continue the Go refactor without recreating the Rust-era complexity.
The Go rewrite exists to make the lakehouse operationally legible:
- small binaries
- clear service boundaries
- gateway-fronted APIs
- smoke-testable behavior
- fast build/run/debug loop
- no accidental framework bureaucracy
The Rust repo had maturity, but also accumulated control-plane weight: validator, auditor, provider routing, iteration loops, UI, truth layers, and agent-era scaffolding. Do not blindly port all of that back.
## Prime Directive
Preserve the Go spine:
```text
gateway
→ storaged
→ catalogd
→ ingestd
→ queryd
→ vectord
→ embedd
```
Only add complexity when there is measured evidence, a failing smoke, or a documented feature-parity requirement.
## Refactor Rules
### 1. No silent behavior changes
Before changing any service behavior, identify:
- current route
- current request schema
- current response schema
- current status-code behavior
- current smoke test covering it

If no smoke exists, add one before or with the refactor.
### 2. Keep service boundaries hard
Do not let services reach into each other's internals.
Allowed:
- HTTP client calls
- shared request/response structs when stable
- small internal packages for config/secrets/logging

Avoid:
- importing another service's implementation package
- hidden global state
- “just for now” shared mutable registries
- circular service knowledge
### 3. Go is not Rust
Do not imitate Rust patterns mechanically.
Prefer Go-native clarity:
- simple structs
- explicit errors
- small interfaces at the consumer side
- context-aware HTTP handlers
- table-driven tests
- boring package names
- no abstraction tax unless repeated 3+ times
### 4. Validation replaces the borrow checker
Rust caught many problems at compile time. Go will not.
Therefore every refactor must preserve or improve:
- input validation
- dimension checks
- duplicate handling
- restart persistence checks
- schema drift detection
- error status mapping
- smoke coverage
### 5. Performance work must be measured
Do not optimize by vibes.
For each performance change, record:
- baseline command
- baseline result
- changed code path
- new result
- regression risk
- rollback plan

Current known bottleneck:
- vectord Add is RWMutex-serialized.
- 500K vectors: ~35m36s, ~234/sec avg.
- GPU utilization is around 65%, so embedding is not the only bottleneck.

Do not claim concurrency improvements unless the HNSW library's thread safety is audited or writes are safely batched/sharded.
## File/Package Expectations
### cmd/
One binary per service. Keep main files thin.
Main should only:
- load config
- construct dependencies
- wire routes
- start server
- handle shutdown
### internal/
Shared code belongs here only when it is genuinely shared.
Good internal packages:
- config
- secrets
- storeclient
- catalogclient
- gateway routing helpers
- logging
- request/response contracts

Bad internal packages:
- vague “utils”
- giant “common”
- cross-service god objects
- hidden dependency containers
### scripts/
Every major behavior needs a runnable smoke.
Smokes are not decoration. They are the replacement nervous system.
The existing smoke pattern must remain:
- d1: skeleton/health/gateway
- d2: storaged
- d3: catalogd
- d4: ingestd
- d5: queryd
- d6: full ingest/query
- g1: vectord
- g1p: vectord persistence
- g2: embed → vector add → search

New functionality needs a new smoke or an extension to the closest existing one.
## Refactor Checklist
Before editing:
- Read README.md
- Read docs/PRD.md
- Read docs/SPEC.md
- Read docs/DECISIONS.md
- Read docs/PHASE_G0_KICKOFF.md
- Identify affected services
- Identify affected smokes

During editing:
- Keep public API stable unless explicitly changing it
- Keep errors explicit
- Keep logs useful but not noisy
- Avoid package sprawl
- Avoid premature generic interfaces
- Preserve restart behavior
- Preserve gateway-only acceptance path

After editing:
- Run `go test ./...`
- Run relevant smoke script
- Run full smoke loop when service contracts changed
- Record evidence in a short refactor note

Full smoke loop:
```bash
for s in scripts/{d1,d2,d3,d4,d5,d6,g1,g1p,g2}_smoke.sh; do
  "$s" || break
done
```
## Refactor Note Format
Create or update `docs/refactor-notes/YYYYMMDD-<short-name>.md`.
Use this structure:
```md
# Refactor Note: <name>
## Goal
## Files changed
## Behavior changed
## Behavior preserved
## Tests run
## Smoke results
## Performance before/after
## Risks
## Rollback
```
## Anti-Patterns To Reject
Reject these unless specifically requested:
- porting Rust modules 1:1
- adding orchestration before service parity
- adding AI/agent logic inside core services
- making gateway business-aware
- hiding failures behind retries
- swallowing errors
- “temporary” global maps
- changing route contracts without smoke updates
- adding dependencies for trivial code
- optimizing vector ingestion without measurement
- rebuilding the Rust bureaucracy in Go clothing
## Preferred Next Targets
Prioritize in this order:
1. Stability of existing Go service contracts
2. Better smoke coverage
3. Persistence limits and large object handling
4. vectord ingestion bottleneck analysis
5. gateway observability
6. feature parity with Rust only where needed
7. UI/agent/auditor layers later, not now
## Architectural Position
The Go rewrite should remain the production spine.
The Rust system remains a historical reference and a possible source for:
- validation ideas
- audit semantics
- provider-routing lessons
- prior acceptance criteria
- edge cases

But Rust is not the shape to copy.
Go owns the clean operational path.
Rust owns historical scar tissue and high-performance lessons.
Do not confuse the two.
When handing this document to Claude Code, also include this shorter command prompt:
```md
Read `docs/CLAUDE_REFACTOR_GUARDRAILS.md` first.
Then inspect the current Go lakehouse repo and produce a refactor plan only. Do not edit code yet.
Your plan must identify:
1. affected services
2. affected routes
3. affected request/response contracts
4. affected smoke scripts
5. risks of accidentally reintroducing Rust-era complexity
6. exact tests/smokes you will run after changes
Do not port Rust structure blindly. Preserve the Go service spine.
```