root b2e45f7f26 playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)

The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
   rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
   wrong domain for staffing queries; replaced with ethereal_workers
   (10K rows, real staffing schema, "e-" id prefix to avoid collision
   with workers' "w-"). staffing_workers driver gains -index-name +
   -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:

- cmd/matrixd/main_test.go (new) — playbook record drift detector +
  score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries              21 (staffing-domain, 7 categories)
  Discoveries          8 (judge ≠ cosine top-1)
  Lifts                7/8 (87.5%)
  Boosts triggered     9
  Mean Δ distance      -0.053 (warm closer than cold)
  OOD honesty          dental/RN/SWE rated 1, no fake matches
  Cross-corpus boosts  confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 06:22:21 -05:00


STATE OF PLAY — Lakehouse-Go

Last verified: 2026-04-30 ~05:50 CDT
Verified by: live probes + just verify PASS + reality test PASS (7/8 lift), not memory.

Read this FIRST. When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at /home/profit/lakehouse/ is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.

VERIFIED WORKING RIGHT NOW

Substrate (G0 + G1 family)

13 service binaries under cmd/ plus 2 driver scripts under scripts/staffing_* build into bin/. 18 smoke scripts all PASS. just verify (vet + 30 packages × short tests + 9 core smokes) green in ~31s wall.

Binary	Port	What
`gateway`	3110	reverse proxy, single OpenAI-compat-style edge
`storaged`	3211	S3 GET/PUT/LIST/DELETE w/ per-prefix PUT cap (ADR-002)
`catalogd`	3212	Parquet manifests, ADR-020 idempotent register
`ingestd`	3213	CSV → Parquet → catalogd, content-addressed keys
`queryd`	3214	DuckDB SELECT over Parquet via httpfs
`vectord`	3215	HNSW indexes (coder/hnsw), persistence to storaged
`embedd`	3216	Ollama-backed embedder w/ LRU cache
`pathwayd`	3217	Mem0 ops (Add/Update/Revise/Retire/History/Search)
`matrixd`	3218	Multi-corpus retrieve+merge + relevance + downgrade + playbook
`observerd`	3219	Witness loop, workflow runner with DAG executor
`chatd`	3220	LLM dispatcher: ollama / ollama_cloud / openrouter / opencode / kimi
`mcpd`	—	MCP SDK port (Bun mcp-server replacement)
`fake_ollama`	—	Test fixture (used by `g2_smoke_fixtures.sh`)

Matrix indexer — all 5 SPEC §3.4 components shipped

Corpus builders (internal/corpusingest)
Multi-corpus retrieve+merge (matrixd /matrix/search)
Relevance filter (internal/matrix/relevance.go 376 LoC + 289 LoC test)
Strong-model downgrade gate (internal/matrix/downgrade.go, reads cfg.Models.WeakModels after Phase 2)
Playbook memory + boost (internal/matrix/playbook.go, learning loop)

Pathway memory (Mem0 substrate)

Full ADR-004 surface shipped. Cycle-detection + retired-trace exclusion proven by tests: TestHistory_CycleDetected, TestRetire_ExcludedFromSearch, TestRevise_ChainOfThree_BackwardWalk. JSONL append-only persistence with corruption tolerance.
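The two properties those tests name (backward walk over a revise chain with cycle detection, and retired traces excluded from search) can be sketched as follows. This is a minimal illustration, not the code in internal/pathway/: the `Trace` fields, `History`, `Searchable`, and `ErrCycle` are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Trace is a hypothetical, simplified pathway trace. The real type in
// internal/pathway/ carries opaque json.RawMessage content and more metadata.
type Trace struct {
	ID       string
	Replaces string // ID of the trace this one revises; "" for the first version
	Retired  bool
}

// ErrCycle is returned when a revise chain loops back on itself.
var ErrCycle = errors.New("history: cycle detected")

// History walks backward from id through Replaces links, newest first.
// A visited set turns a corrupted (cyclic) chain into an error instead of
// an infinite loop.
func History(store map[string]Trace, id string) ([]string, error) {
	seen := map[string]bool{}
	var chain []string
	for cur := id; cur != ""; {
		if seen[cur] {
			return nil, ErrCycle
		}
		t, ok := store[cur]
		if !ok {
			return nil, fmt.Errorf("history: unknown trace %q", cur)
		}
		seen[cur] = true
		chain = append(chain, cur)
		cur = t.Replaces
	}
	return chain, nil
}

// Searchable returns the IDs that a search may consider: everything that
// has not been retired. Sorted so the output is deterministic.
func Searchable(store map[string]Trace) []string {
	var ids []string
	for id, t := range store {
		if !t.Retired {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	store := map[string]Trace{
		"v1": {ID: "v1", Retired: true},
		"v2": {ID: "v2", Replaces: "v1"},
		"v3": {ID: "v3", Replaces: "v2"},
	}
	chain, err := History(store, "v3")
	fmt.Println(chain, err)
	fmt.Println(Searchable(store))
}
```

Retired traces stay in the history chain in this sketch and only drop out of search, which matches the append-only framing of ADR-004; whether the real implementation does the same is not stated here.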

Observer + workflow runner

observerd ring buffer + JSONL persistence
Workflow DAG executor (Archon-style) with 5 native modes wired: matrix.relevance, matrix.downgrade, matrix.search, distillation.score, drift.scorer. Plus fixture.echo / fixture.upper for runner mechanics smokes.

Distillation + drift

E (partial) at 57d0df1 — scorer + contamination firewall ported from Rust v1.0.0 (logic only per ADR-001 §1.4; not bit-identical).
F (first slice) at be65f85 — drift quantification, scorer drift first.

chatd — Phase 4 (shipped 2026-04-30, scrum-hardened same day)

Multi-provider LLM dispatcher routing /v1/chat by model-name prefix or :cloud suffix:

Prefix / suffix	Provider	Auth
`ollama/<m>` or bare	`ollama` (local)	none
`ollama_cloud/<m>` or `<m>:cloud`	`ollama_cloud`	Bearer (OLLAMA_CLOUD_KEY)
`openrouter/<v>/<m>`	`openrouter`	Bearer (OPENROUTER_API_KEY)
`opencode/<m>`	`opencode`	Bearer (OPENCODE_API_KEY)
`kimi/<m>`	`kimi`	Bearer (KIMI_API_KEY)

The 4 keys live in /etc/lakehouse/{ollama_cloud,openrouter,opencode,kimi}.env files (mode 0600); local ollama needs none. An empty or missing file leaves that provider unregistered, so the first call to it returns 404 rather than 503. Test request: POST /v1/chat {"model":"opencode/claude-opus-4-7","messages":[{"role":"user","content":"hi"}],"max_tokens":8}.
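The prefix/suffix routing in the table above can be sketched like this. It is an illustration only: `resolveProvider` is a hypothetical helper, not chatd's resolver, and the handling of the `:cloud` suffix (passed through unchanged to the upstream) is an assumption the document does not confirm.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveProvider maps a model string to a provider name and the model name
// to send upstream, following the prefix/suffix table. Hypothetical helper.
func resolveProvider(model string) (provider, upstream string) {
	// Explicit "<provider>/" prefixes win. The prefix is stripped before the
	// upstream call (compare scrum finding B-3 for ollama_cloud/).
	for _, p := range []string{"ollama_cloud", "openrouter", "opencode", "kimi", "ollama"} {
		if rest, ok := strings.CutPrefix(model, p+"/"); ok {
			return p, rest
		}
	}
	// "<m>:cloud" routes to ollama_cloud. Assumption: the suffix is kept as
	// part of the upstream model name.
	if strings.HasSuffix(model, ":cloud") {
		return "ollama_cloud", model
	}
	// Bare model names go to local ollama.
	return "ollama", model
}

func main() {
	for _, m := range []string{
		"qwen3.5:latest",
		"kimi-k2.6:cloud",
		"opencode/claude-opus-4-7",
		"openrouter/moonshotai/kimi-k2-0905",
	} {
		p, up := resolveProvider(m)
		fmt.Printf("%-40s provider=%-12s upstream=%s\n", m, p, up)
	}
}
```

Checking prefixes before the suffix keeps a string such as `openrouter/<v>/<m>` from being captured by the `:cloud` rule if a vendor ever ships a tag with that name.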

Request.Temperature is *float64 (a pointer): Anthropic 4.7 deprecates temperature entirely, so we omit the field when the caller doesn't set it.
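Why a pointer matters here: with a plain `float64`, "caller did not set temperature" and "caller set temperature to 0" are indistinguishable after decoding. A minimal sketch, assuming JSON field names along OpenAI-compatible lines (only `Request.Temperature` and its type come from this document; the other fields and the `encode` helper are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a trimmed-down, hypothetical version of the chat request.
// Only the Temperature field's type is taken from the document.
type Request struct {
	Model       string   `json:"model"`
	MaxTokens   int      `json:"max_tokens,omitempty"`
	Temperature *float64 `json:"temperature,omitempty"`
}

// encode marshals a request the way it would be forwarded upstream.
func encode(r Request) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Caller never set temperature: the nil pointer is dropped by omitempty.
	fmt.Println(encode(Request{Model: "opencode/claude-opus-4-7", MaxTokens: 8}))

	// Caller explicitly asked for 0: the pointer is non-nil, so it is sent.
	zero := 0.0
	fmt.Println(encode(Request{Model: "qwen3.5:latest", Temperature: &zero}))
}
```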

Model tier registry

lakehouse.toml [models] names model IDs by tier so swaps are 1-line:

local_fast       = "qwen3.5:latest"
local_judge      = "qwen2.5:latest"   # reverted from qwen3.5:latest (~30s/judge), see harness fix 7
cloud_judge      = "kimi-k2.6:cloud"
cloud_review     = "qwen3-coder:480b"
frontier_review  = "openrouter/anthropic/claude-opus-4-7"
frontier_arch    = "openrouter/moonshotai/kimi-k2-0905"
frontier_free    = "opencode/claude-opus-4-7"
weak_models      = ["qwen3.5:latest", "qwen3:latest"]   # matrix.downgrade bypass

Callers read cfg.Models.LocalJudge etc. instead of literal strings. playbook_lift harness, matrix.downgrade, and observerd's MatrixDowngradeWithWeakList factory all migrated.

Code health

go vet ./... → 0 warnings, 0 errors
go test -short ./... → all green, 349 test functions
just verify → PASS (vet + tests + 9 smokes) in ~31s
18 smoke scripts (9 core gating verify + 9 domain smokes for new daemons)

Latest scrum: 2026-04-30 cross-lineage wave

Composite 50/60 at scrum2 head c7e3124 (was 35 baseline → 43 R1 → 50 R2). Today's chatd wave was reviewed by Opus + Kimi + Qwen3-coder via chatd's own /v1/chat; 2 BLOCKs + 2 WARNs landed as fixes (0efc736); the reusable driver is at scripts/scrum_review.sh.

Reality test PASSED — `playbook_lift_001` (2026-04-30 ~05:50 CDT)

The 5-loop substrate's load-bearing gate (per project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for") is verified.

Metric	Value
Queries	21 (staffing-domain, 7 categories)
Cold-pass discoveries (judge-best ≠ top-1)	8
Warm-pass lifts (recorded playbook → top-1)	7 / 8 (87.5%)
Boosts triggered	9
Mean Δ top-1 distance	-0.053 (warm consistently closer)
OOD honesty (dental/RN/SWE queries)	rated 1, no fake matches
Cross-corpus boosts	confirmed (e- ↔ w- swaps in lifts)

Evidence: reports/reality-tests/playbook_lift_001.{json,md}. Per the report's rubric (lift ≥ 50% = matrix doing real work), 87.5% means we're well past validation.
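The headline numbers reduce to a small calculation. The sketch below assumes a discovery is a query where the judge's pick differs from the cold top-1, and a lift is a discovery whose judge pick becomes top-1 on the warm pass; `QueryResult` and `liftRate` are hypothetical names, and the IDs in `main` are made up to reproduce the report's counts, not copied from it.

```go
package main

import "fmt"

// QueryResult holds the three IDs that matter per query (hypothetical shape).
type QueryResult struct {
	ColdTop1  string // top-1 by cosine before any playbook is recorded
	JudgeBest string // result the judge rated best on the cold pass
	WarmTop1  string // top-1 after the playbook boost is applied
}

// liftRate counts discoveries and lifts and returns lifts/discoveries.
func liftRate(rs []QueryResult) (discoveries, lifts int, rate float64) {
	for _, r := range rs {
		if r.JudgeBest == r.ColdTop1 {
			continue // cosine already agreed with the judge; nothing to lift
		}
		discoveries++
		if r.WarmTop1 == r.JudgeBest {
			lifts++
		}
	}
	if discoveries > 0 {
		rate = float64(lifts) / float64(discoveries)
	}
	return discoveries, lifts, rate
}

func main() {
	// Synthetic data shaped like the report: 21 queries, 8 discoveries, 7 lifts.
	var rs []QueryResult
	for i := 0; i < 13; i++ {
		rs = append(rs, QueryResult{ColdTop1: "w-1", JudgeBest: "w-1", WarmTop1: "w-1"})
	}
	for i := 0; i < 7; i++ {
		rs = append(rs, QueryResult{ColdTop1: "w-1", JudgeBest: "e-2", WarmTop1: "e-2"})
	}
	rs = append(rs, QueryResult{ColdTop1: "w-1", JudgeBest: "e-2", WarmTop1: "w-1"})

	d, l, rate := liftRate(rs)
	// The rubric's pass line is lift >= 50% of discoveries.
	fmt.Printf("discoveries=%d lifts=%d rate=%.1f%% pass=%v\n", d, l, rate*100, rate >= 0.5)
}
```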

Harness expansion (2026-04-30 ~05:30 CDT)

scripts/playbook_lift.sh rewritten from a 5-daemon stripped harness to the full 10-daemon prod-realistic stack (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:

#	Fix	Lock
1	driver→matrixd: `query` → `query_text` field name	`cmd/matrixd/main_test.go` TestPlaybookRecord_OldFieldNameRejected
2	harness toml missing `[s3]` block	inline comment in `scripts/playbook_lift.sh`
3	harness→queryd: `q` → `sql` field name	`cmd/queryd/main_test.go` TestHandleSQL_WrongFieldName_400
4	5→10 daemon boot order	inline comment + dep-ordered launch
5	SQL surface probe (3-row CSV → COUNT=3)	`[lift] ✓ SQL surface probe passed` assertion
6	`candidates` corpus was SWE-tech, not staffing	swapped to `ethereal_workers.parquet` (10K rows, real staffing schema, "e-" id prefix)
7	`qwen3.5:latest` is vision-SSM 256K-ctx → 30s/judge	reverted `local_judge` to `qwen2.5:latest` (1s/judge, 30× faster)

R-005 closed (2026-04-30 ~05:35 CDT)

Three new cmd/<bin>/main_test.go files plus one extended (queryd), all chi router-level contract tests:

cmd/matrixd/main_test.go (123 lines) — playbook record drift detector + score bounds + 6 routes mounted
cmd/queryd/main_test.go (extended) — wrong-field-name drift detector
cmd/pathwayd/main_test.go (102 lines) — 9 routes + add round-trip + retire-nonexistent
cmd/observerd/main_test.go (98 lines) — 4 routes + invalid-op 400 + unknown-mode 400

go test ./cmd/{matrixd,queryd,pathwayd,observerd} all green. R-005 from prior STATE OPEN list is closed.
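A wrong-field-name drift detector of the kind listed above can be sketched with the standard library alone. This is not the code in cmd/queryd/main_test.go: the real tests run against the chi router, while this sketch uses a bare handler, and `handleSQL`, `sqlRequest`, `statusFor`, and the `/sql` path are hypothetical names. The idea is that a strict decoder turns a renamed field into a 400 instead of a silently empty query.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// sqlRequest is the assumed request contract: the query lives under "sql".
type sqlRequest struct {
	SQL string `json:"sql"`
}

// handleSQL rejects bodies that use any other field name (for example the
// old "q") by disallowing unknown fields and requiring a non-empty query.
func handleSQL(w http.ResponseWriter, r *http.Request) {
	dec := json.NewDecoder(r.Body)
	dec.DisallowUnknownFields()
	var req sqlRequest
	if err := dec.Decode(&req); err != nil || req.SQL == "" {
		http.Error(w, `body must be {"sql": "..."}`, http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// statusFor posts body to the handler in-process and returns the status code.
func statusFor(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/sql", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handleSQL(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("old field name:", statusFor(`{"q":"SELECT 1"}`))
	fmt.Println("new field name:", statusFor(`{"sql":"SELECT 1"}`))
}
```

In a real `_test.go` file the two calls in `main` would become assertions, so a caller that drifts back to the old field name fails in `go test` rather than in a reality run.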

DO NOT RELITIGATE

Ratified ADRs (`docs/DECISIONS.md`)

ADR-001: DuckDB via cgo, HTMX UI, Gitea hosting, distillation rebuilt-not-ported, pathway memory clean start, auditor longitudinal signal restarts. 6 sub-decisions, all final.
ADR-002: storaged per-prefix PUT cap (4 GiB for _vectors/, 256 MiB elsewhere), implemented at 423a381. If 4 GiB ever proves insufficient, the documented path is an operator-config bump rather than a constant change.
ADR-003: Inter-service auth = Bearer + IP allowlist, opt-in via cfg.Auth.Token. Wiring deferred to Sprint 1 but the design is locked — alternatives (mTLS, JWT, OAuth2, IP-only) all considered + rejected.
ADR-004: Pathway memory = Mem0 versioned traces, JSONL append-only persistence, opaque json.RawMessage content. Implemented in internal/pathway/.
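As an illustration of the ADR-002 rule, a per-prefix PUT cap check might look like the sketch below. The two limits and the `_vectors/` prefix come from the ADR summary; `putCaps`, `defaultCaps`, and `checkPut` are hypothetical names, and the caps sit in a struct because the ADR treats them as operator configuration rather than constants.

```go
package main

import (
	"fmt"
	"strings"
)

// putCaps holds the per-prefix size limits in bytes (hypothetical shape).
type putCaps struct {
	Vectors int64 // applies to keys under _vectors/
	Default int64 // applies to every other key
}

// defaultCaps mirrors the values stated in ADR-002.
var defaultCaps = putCaps{
	Vectors: 4 << 30,   // 4 GiB
	Default: 256 << 20, // 256 MiB
}

// checkPut returns an error when size exceeds the cap for the key's prefix.
func checkPut(caps putCaps, key string, size int64) error {
	limit := caps.Default
	if strings.HasPrefix(key, "_vectors/") {
		limit = caps.Vectors
	}
	if size > limit {
		return fmt.Errorf("PUT %q: %d bytes exceeds cap of %d", key, size, limit)
	}
	return nil
}

func main() {
	oneGiB := int64(1 << 30)
	fmt.Println(checkPut(defaultCaps, "_vectors/workers.hnsw", oneGiB))
	fmt.Println(checkPut(defaultCaps, "datasets/workers.parquet", oneGiB))
}
```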

Today's scrum dispositions (2026-04-30)

Verbatim verdicts at reports/scrum/_evidence/2026-04-30/verdicts/. Disposition table: reports/scrum/_evidence/2026-04-30/disposition.md.

Real findings, all fixed in 0efc736:

B-1 (Opus+Kimi convergent): ResolveKey 3-arg API → 2-arg
B-2 (Opus+Kimi convergent): handleProviders direct map lookup, drop synthesis-via-Resolve
B-3 (Opus single, trace-verified): OllamaCloud.Chat strips ollama_cloud/ prefix correctly
B-4 (Opus single): Ollama done_reason surfaced to FinishReason

False positives dismissed (3, documented):

FP-A1: Kimi misread TestMaybeDowngrade_WithConfigList assertion
FP-A2: Qwen claimed nil-deref in MaybeDowngrade that doesn't exist
FP-C1: Opus claimed qwen3.5:latest doesn't exist on Ollama hub (it does on this box's local install)

Session frame (don't redo)

The Rust legacy is maintenance-only until Go reaches feature parity. Don't propose ports of components already shipped here.
The matrix indexer 5/5 components are shipped. Don't propose to "build the matrix indexer" — it's done.
qwen3.5:latest IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
temperature is omitted for Anthropic 4.7 (handled by Request.Temperature *float64); don't re-add it.
chatd-smoke runs with all cloud providers disabled intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).

OPEN — what's not done yet

Item	What	When to act
Reality test v2: paraphrase queries	The 21 verbatim queries in `tests/reality/playbook_lift_queries.txt` exercise verbatim replay only. The interesting case is similar but not identical queries hitting a recorded playbook: does the cosine on `query_text` find the playbook hit? Add a paraphrase pass and measure.	When J wants to push the harness past the v1 baseline.
Q15 boost-math edge case	"Engaged warehouse associate with strong safety compliance" — judge picked rank-9 result; score=1.0 boost halves distance but rank-9 was >2× top-1 distance, so not promoted. Documented in caveat #2. Either (a) accept the math limit, or (b) tier scores so judge-best-found-deep gets score>1.0. Open design call.	When a second reality run shows the same edge case persisting.
Sprint 4 — deployment	No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan.	When G5 cutover is on the table.
ADR-005 — observer fail-safe semantics	Observer ported but the upstream "verdict:accept on crash" anti-pattern still has no Go-side decision locked. Doc-only, ~30 min.	Before observer is wired into production paths.
ADR-006 — auth posture for non-loopback deploy	Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr.	Required before any Go binary binds non-loopback in prod.
chatd fixture-mode storage half	`g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully.	When CI box without MinIO is needed.
Distillation full port	`57d0df1` shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported.	When distillation is needed for production.
Drift full quantification	`be65f85` is "scorer drift first." Full distribution-drift signal underspecified everywhere — research gap, not a port.	Open research item.
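The Q15 edge case in the table above is plain arithmetic and can be checked with a short sketch. The only fact taken from the document is that a score of 1.0 halves the distance; the linear form `distance * (1 - score/2)` for other scores, the choice to leave the current top-1 unboosted, and all distances in `main` are assumptions made for illustration.

```go
package main

import "fmt"

// boosted applies a playbook boost to a distance. At score=1.0 this halves
// the distance, as the document states; other scores follow an assumed
// linear rule.
func boosted(distance, score float64) float64 {
	return distance * (1 - score/2)
}

// promoted reports whether a boosted candidate would overtake the current
// top-1 (lower distance wins). The top-1 itself is assumed unboosted.
func promoted(top1, candidate, score float64) bool {
	return boosted(candidate, score) < top1
}

func main() {
	top1 := 0.20 // made-up top-1 distance

	// Candidate more than 2x the top-1 distance: halving is not enough.
	fmt.Println(promoted(top1, 0.45, 1.0))

	// Candidate under 2x the top-1 distance: halving promotes it.
	fmt.Println(promoted(top1, 0.38, 1.0))

	// Option (b): a tiered score above 1.0 gives the deep candidate more pull.
	fmt.Println(promoted(top1, 0.45, 1.2))
}
```

Under this assumed rule, a score of 1.0 can promote a candidate only when its raw distance is below twice the top-1 distance, which is the math limit option (a) would accept.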

RECENT VERIFIED WAVE (2026-04-30)

05273ac..e4ee002 — 4 phases + scrum + tooling, all gate-tested.

SHA	What
`ec1d031`	Phase 1: `[models]` tier config (additive, no callers migrate)
`622e124`	Phase 2: `matrix.downgrade` reads `cfg.Models.WeakModels`
`848cbf5`	Phase 3: `playbook_lift` harness defaults from config
`05273ac`	Phase 4: chatd + 5 providers (1,624 LoC)
`0efc736`	Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review
`e4ee002`	`scripts/scrum_review.sh` — reusable 3-lineage driver

Plus on Rust side (8de94eb, 3d06868): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).

RUNTIME CHEATSHEET

# Verify everything green
cd /home/profit/golangLAKEHOUSE
just verify                                            # vet + tests + 9 core smokes (~31s)
just doctor                                            # dep probe (go/gcc/minio/ollama/secrets)

# Boot the chat dispatcher (Phase 4)
nohup ./bin/chatd   -config lakehouse.toml > /tmp/chatd.log   2>&1 & disown
nohup ./bin/gateway -config lakehouse.toml > /tmp/gateway.log 2>&1 & disown
curl -sf http://127.0.0.1:3110/v1/chat/providers | jq  # all 5 providers should report true

# Test a chat call to each lineage
for m in "qwen3.5:latest" "opencode/claude-opus-4-7" "openrouter/moonshotai/kimi-k2-0905"; do
  curl -sS -X POST http://127.0.0.1:3110/v1/chat \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$m\",\"messages\":[{\"role\":\"user\",\"content\":\"reply: OK\"}],\"max_tokens\":8}" \
    | jq -c '{model,provider,content}'
done

# Run the scrum on a diff
./scripts/scrum_review.sh path/to/bundle.diff bundle_label
ls reports/scrum/_evidence/$(date +%Y-%m-%d)/verdicts/

# Domain smokes (not in `just verify`)
for s in chatd matrix observer pathway playbook relevance downgrade workflow; do
  bash scripts/${s}_smoke.sh > /tmp/${s}.log 2>&1 && echo "$s ✓" || echo "$s ✗"
done

VISION — what we're actually building

J's framing (canonical at /root/.claude/projects/-home-profit/memory/project_small_model_pipeline_vision.md): a small-model-driven autonomous pipeline that gets better with each run. Frontier APIs (Opus, Kimi, GPT-5) are too expensive + rate-limited for the inner loop — they live in audit/oversight via frontier_* tier. The hot path runs on local qwen3.5:latest given:

Pathway memory — what we tried before, how it went (Mem0 substrate ✓)
Matrix indexer — multi-corpus retrieve+merge giving the small model the right slice for this task (5/5 components ✓)
Observer — watches each run, refines configs (not prompts) toward good pathways

Successful runs get rated and distilled back into the playbook. Each iteration the playbook gets denser, runs get cheaper, results get better. Drift in the distilled playbook is a measured signal, not vibes.

The single load-bearing gate: "the playbook + matrix indexer must give the results we're looking for." Throughput, scaling, code elegance are all secondary. The playbook_lift reality test is the regression gate before Enterprise cutover (where real contracts + live profile updates land).

When evaluating any Go workstream, ask: which of the 5 loops does this advance? Strong workstreams advance ≥1; weak workstreams sit in infra-for-its-own-sake.

SIBLING TOOLS (separate repos, intentional integration target later)

local-review-harness at git.agentview.dev/profit/local-review-harness (also SMB-mounted at /home/profit/share/local-review-harness-full-md/). Local-first code review harness — 12 evidence-bearing static analyzers, Scrum-style reports, no cloud deps. Phase A + B (MVP) shipped 2026-04-30. Phases C–E (Ollama LLM review, validation, memory) pending.

Cross-pollination plan when both stabilize:

Replace harness's internal/llm/ollama.go with a chatd /v1/chat client → frontier judges via config toggle
Feed harness findings into Lakehouse pathway memory as a drift signal
Treat harness's .memory/known-risks.json as a matrix-indexer corpus

Detail at docs/SPEC.md §3.10. Don't re-port harness functionality into Lakehouse-Go — the standalone tool is the design.
