golangLAKEHOUSE/STATE_OF_PLAY.md
root 87cbd10090 STATE_OF_PLAY: v4 split-threshold result + adjacent-query observation
- Reality test table extends from #001-#003 to #001-#004; v4 row marked
  as "the honest configuration" because OOD cross-pollination is gone.
- Shape B section gains the split-threshold rationale (boost safe at
  loose, inject structurally riskier so tighter).
- Verbatim drop framing rewritten — v3→v4 is configuration evolution,
  not regression.
- OPEN: closed "Shape B cap/decay" + the conditional Q15 boost-math
  item (Shape B + split threshold addressed both). Replaced with two
  finer-grained follow-ups: adjacent-query Q6↔Q7 swap (might be
  correct, verify with v4 re-judge metric) and liberal-paraphrase
  recovery loss (Q9/Q15 missed because qwen2.5 drifted >0.20).
- RECENT VERIFIED WAVE adds 94fc3b6 + 67d1957.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:26:23 -05:00


STATE OF PLAY — Lakehouse-Go

Last verified: 2026-04-30 ~07:25 CDT
Verified by: live probes + just verify PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.

Read this FIRST. When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at /home/profit/lakehouse/ is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.


VERIFIED WORKING RIGHT NOW

Substrate (G0 + G1 family)

13 service binaries under cmd/ plus 2 driver scripts under scripts/staffing_* build into bin/. 18 smoke scripts all PASS. just verify (vet + 30 packages × short tests + 9 core smokes) green in ~31s wall.

| Binary | Port | What |
| --- | --- | --- |
| gateway | 3110 | reverse proxy, single OpenAI-compat-style edge |
| storaged | 3211 | S3 GET/PUT/LIST/DELETE w/ per-prefix PUT cap (ADR-002) |
| catalogd | 3212 | Parquet manifests, ADR-020 idempotent register |
| ingestd | 3213 | CSV → Parquet → catalogd, content-addressed keys |
| queryd | 3214 | DuckDB SELECT over Parquet via httpfs |
| vectord | 3215 | HNSW indexes (coder/hnsw), persistence to storaged |
| embedd | 3216 | Ollama-backed embedder w/ LRU cache |
| pathwayd | 3217 | Mem0 ops (Add/Update/Revise/Retire/History/Search) |
| matrixd | 3218 | Multi-corpus retrieve+merge + relevance + downgrade + playbook |
| observerd | 3219 | Witness loop, workflow runner with DAG executor |
| chatd | 3220 | LLM dispatcher: ollama / ollama_cloud / openrouter / opencode / kimi |
| mcpd | | MCP SDK port (Bun mcp-server replacement) |
| fake_ollama | | Test fixture (used by g2_smoke_fixtures.sh) |

Matrix indexer — all 5 SPEC §3.4 components shipped

  1. Corpus builders (internal/corpusingest)
  2. Multi-corpus retrieve+merge (matrixd /matrix/search)
  3. Relevance filter (internal/matrix/relevance.go 376 LoC + 289 LoC test)
  4. Strong-model downgrade gate (internal/matrix/downgrade.go, reads cfg.Models.WeakModels after Phase 2)
  5. Playbook memory: boost + Shape B inject (internal/matrix/playbook.go, learning loop). Shape B (InjectPlaybookMisses, 154a72e) appends recorded answers to results when regular retrieval misses them — closed the paraphrase recovery gap exposed by run #002 (0/2) and validated by run #003 (6/6).

Pathway memory (Mem0 substrate)

Full ADR-004 surface shipped. Cycle-detection + retired-trace exclusion proven by tests: TestHistory_CycleDetected, TestRetire_ExcludedFromSearch, TestRevise_ChainOfThree_BackwardWalk. JSONL append-only persistence with corruption tolerance.
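
The backward walk with cycle detection that the tests above lock in can be pictured roughly as follows. This is a minimal sketch, not the code in internal/pathway/: the `trace` type, its `Previous` field, and the `history` helper are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// trace is a hypothetical, stripped-down version record.
type trace struct {
	ID       string
	Previous string // ID of the trace this one revises; "" for the first version
}

var errCycle = errors.New("cycle detected in revision chain")

// history walks backward from id through Previous links and returns the
// chain newest-first. A visited set turns a corrupted (cyclic) chain into
// an error instead of an infinite loop.
func history(traces map[string]trace, id string) ([]string, error) {
	seen := map[string]bool{}
	var chain []string
	for cur := id; cur != ""; {
		if seen[cur] {
			return nil, errCycle
		}
		t, ok := traces[cur]
		if !ok {
			return nil, fmt.Errorf("unknown trace %q", cur)
		}
		seen[cur] = true
		chain = append(chain, cur)
		cur = t.Previous
	}
	return chain, nil
}

func main() {
	ok := map[string]trace{
		"a": {ID: "a"},
		"b": {ID: "b", Previous: "a"},
		"c": {ID: "c", Previous: "b"},
	}
	fmt.Println(history(ok, "c"))

	bad := map[string]trace{
		"a": {ID: "a", Previous: "b"},
		"b": {ID: "b", Previous: "a"},
	}
	fmt.Println(history(bad, "a"))
}
```

The visited set costs one map entry per version, which is cheap next to the alternative of trusting that an append-only log never contains a bad back-pointer.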

Observer + workflow runner

  • observerd ring buffer + JSONL persistence
  • Workflow DAG executor (Archon-style) with 5 native modes wired: matrix.relevance, matrix.downgrade, matrix.search, distillation.score, drift.scorer. Plus fixture.echo / fixture.upper for runner mechanics smokes.
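
A DAG executor needs a run order in which every step follows its dependencies. The sketch below uses Kahn's algorithm for that; it is an illustration of the general technique, not the observerd runner. The node names reuse the mode names listed above, but the dependency edges between them are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// topoOrder returns an execution order for a workflow where deps[n] lists
// the nodes that must finish before n starts. It fails if the graph has a
// cycle. Ready nodes are sorted so the order is deterministic.
func topoOrder(deps map[string][]string) ([]string, error) {
	indeg := map[string]int{}
	dependents := map[string][]string{}
	for n, ds := range deps {
		if _, ok := indeg[n]; !ok {
			indeg[n] = 0
		}
		for _, d := range ds {
			if _, ok := indeg[d]; !ok {
				indeg[d] = 0
			}
			indeg[n]++
			dependents[d] = append(dependents[d], n)
		}
	}
	var ready []string
	for n, k := range indeg {
		if k == 0 {
			ready = append(ready, n)
		}
	}
	sort.Strings(ready)

	var order []string
	for len(ready) > 0 {
		n := ready[0]
		ready = ready[1:]
		order = append(order, n)
		for _, m := range dependents[n] {
			indeg[m]--
			if indeg[m] == 0 {
				ready = append(ready, m)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(indeg) {
		return nil, errors.New("workflow graph has a cycle")
	}
	return order, nil
}

func main() {
	// Hypothetical edges between the native modes.
	deps := map[string][]string{
		"matrix.relevance":   {"matrix.search"},
		"matrix.downgrade":   {"matrix.relevance"},
		"distillation.score": {"matrix.downgrade"},
	}
	fmt.Println(topoOrder(deps))
}
```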

Distillation + drift

  • E (partial) at 57d0df1 — scorer + contamination firewall ported from Rust v1.0.0 (logic only per ADR-001 §1.4; not bit-identical).
  • F (first slice) at be65f85 — drift quantification, scorer drift first.

chatd — Phase 4 (shipped 2026-04-30, scrum-hardened same day)

Multi-provider LLM dispatcher routing /v1/chat by model-name prefix or :cloud suffix:

| Prefix / suffix | Provider | Auth |
| --- | --- | --- |
| ollama/<m> or bare | ollama (local) | none |
| ollama_cloud/<m> or <m>:cloud | ollama_cloud | Bearer (OLLAMA_CLOUD_KEY) |
| openrouter/<v>/<m> | openrouter | Bearer (OPENROUTER_API_KEY) |
| opencode/<m> | opencode | Bearer (OPENCODE_API_KEY) |
| kimi/<m> | kimi | Bearer (KIMI_API_KEY) |

The 4 Bearer keys live in /etc/lakehouse/{ollama_cloud,openrouter,opencode,kimi}.env files (mode 0600); local ollama needs none. Empty/missing files leave that provider unregistered (404 at first call instead of 503). Test request: POST /v1/chat {"model":"opencode/claude-opus-4-7","messages":[{"role":"user","content":"hi"}],"max_tokens":8}.
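
The prefix/suffix routing rule can be sketched as a small pure function. This is an assumption-laden illustration, not chatd's resolver: `resolveProvider` is a hypothetical name, and whether the `:cloud` suffix is kept or stripped before the upstream call is not stated in this file, so the sketch leaves the name untouched in that case.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveProvider maps a model string to a provider name and the model name
// passed upstream. Explicit "<provider>/" prefixes win, then the ":cloud"
// suffix, then the bare-name default of local ollama.
func resolveProvider(model string) (provider, upstream string) {
	for _, p := range []string{"ollama_cloud", "openrouter", "opencode", "kimi", "ollama"} {
		if rest, ok := strings.CutPrefix(model, p+"/"); ok {
			return p, rest
		}
	}
	if strings.HasSuffix(model, ":cloud") {
		// Assumption: the suffix is left in place for the upstream call.
		return "ollama_cloud", model
	}
	return "ollama", model
}

func main() {
	for _, m := range []string{
		"qwen3.5:latest",
		"kimi-k2.6:cloud",
		"opencode/claude-opus-4-7",
		"openrouter/moonshotai/kimi-k2-0905",
	} {
		p, up := resolveProvider(m)
		fmt.Printf("%-36s -> %-12s %s\n", m, p, up)
	}
}
```

Checking prefixes before the suffix keeps a name like `kimi-k2.6:cloud` (no slash) from being confused with the `kimi/` provider, and keeps an explicitly prefixed model from being rerouted by a trailing tag.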

Request.Temperature is *float64 (pointer) — Anthropic 4.7 deprecates temperature entirely, so we omit the field when caller doesn't set it.
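
The reason for the pointer is visible in how encoding/json treats `omitempty`: a nil pointer is dropped from the output, while a pointer to 0 is still written. A minimal sketch, using a hypothetical `chatRequest` type rather than the real Request struct:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatRequest is a hypothetical stand-in for the real request type.
type chatRequest struct {
	Model       string   `json:"model"`
	MaxTokens   int      `json:"max_tokens,omitempty"`
	Temperature *float64 `json:"temperature,omitempty"`
}

// encode marshals r and returns the JSON text ("" on error).
func encode(r chatRequest) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Caller did not set a temperature: the field is absent from the body.
	fmt.Println(encode(chatRequest{Model: "m", MaxTokens: 8}))

	// Caller explicitly asked for 0: the field is present.
	zero := 0.0
	fmt.Println(encode(chatRequest{Model: "m", MaxTokens: 8, Temperature: &zero}))
}
```

With a plain `float64` and `omitempty`, an explicit 0 would be indistinguishable from "not set" and would be dropped as well.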

Model tier registry

lakehouse.toml [models] names model IDs by tier so swaps are 1-line:

local_fast       = "qwen3.5:latest"
local_judge      = "qwen2.5:latest"   # NOT qwen3.5 — vision-SSM 256K-ctx ran 30s/judge in lift loop
cloud_judge      = "kimi-k2.6:cloud"
cloud_review     = "qwen3-coder:480b"
frontier_review  = "openrouter/anthropic/claude-opus-4-7"
frontier_arch    = "openrouter/moonshotai/kimi-k2-0905"
frontier_free    = "opencode/claude-opus-4-7"
weak_models      = ["qwen3.5:latest", "qwen3:latest"]   # matrix.downgrade bypass

Callers read cfg.Models.LocalJudge etc. instead of literal strings. playbook_lift harness, matrix.downgrade, and observerd's MatrixDowngradeWithWeakList factory all migrated.
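
As a rough picture of what "read the tier, not the literal" looks like on the caller side, here is a sketch of a config struct with a weak-list lookup. Only `LocalJudge` and `WeakModels` are names taken from this file; the other fields, the `toml` tags, and the `IsWeak` helper are assumptions made for the example.

```go
package main

import "fmt"

// Models mirrors a subset of the [models] table. The struct tags are plain
// strings here; a TOML decoder of your choice would consume them.
type Models struct {
	LocalFast  string   `toml:"local_fast"`
	LocalJudge string   `toml:"local_judge"`
	CloudJudge string   `toml:"cloud_judge"`
	WeakModels []string `toml:"weak_models"`
}

// IsWeak reports whether model appears in the configured weak list.
func (m Models) IsWeak(model string) bool {
	for _, w := range m.WeakModels {
		if w == model {
			return true
		}
	}
	return false
}

func main() {
	cfg := Models{
		LocalFast:  "qwen3.5:latest",
		LocalJudge: "qwen2.5:latest",
		CloudJudge: "kimi-k2.6:cloud",
		WeakModels: []string{"qwen3.5:latest", "qwen3:latest"},
	}
	// Callers name the tier; swapping the judge is a one-line config edit.
	fmt.Println(cfg.LocalJudge, cfg.IsWeak(cfg.LocalFast), cfg.IsWeak(cfg.CloudJudge))
}
```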

Code health

  • go vet ./... → 0 warnings, 0 errors
  • go test -short ./... → all green, 349 test functions
  • just verify → PASS (vet + tests + 9 smokes) in ~31s
  • 18 smoke scripts (9 core gating verify + 9 domain smokes for new daemons)

Latest scrum: 2026-04-30 cross-lineage wave

Composite 50/60 at scrum2 head c7e3124 (was 35 baseline → 43 R1 → 50 R2). Today's chatd wave reviewed by Opus + Kimi + Qwen3-coder via the chatd's own /v1/chat; 2 BLOCKs + 2 WARNs landed as fixes (0efc736); reusable driver at scripts/scrum_review.sh.

Reality tests #001–#003 — load-bearing gate verified (2026-04-30 ~05:50–07:05 CDT)

The 5-loop substrate's load-bearing gate (per project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for") is verified for both verbatim replay and paraphrase queries.

| Run | Stance | Verbatim lift | Paraphrase recovery | What it proved |
| --- | --- | --- | --- | --- |
| playbook_lift_001 | boost-only | 7/8 (87.5%) | not tested | Cosine + boost re-rank works for verbatim replay. Substrate live. |
| playbook_lift_002 | boost-only | 2/2 | 0/2 | Boost can't promote answers OUT of regular top-K — paraphrase gap exposed. |
| playbook_lift_003 | Shape B (loose 0.5) | 2/6 | 6/6 → top-1 | Shape B injects, but cross-pollinates: w-4435 surfaces as warm top-1 for unrelated OOD queries (dental/RN/SWE). |
| playbook_lift_004 | Shape B + split threshold (0.5 boost / 0.20 inject) | 6/8 (75%) | 6/8 (75%) | OOD cross-pollination GONE; system refuses to inject when it's not confident. The honest configuration. |

Shape B (InjectPlaybookMisses in internal/matrix/playbook.go): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = playbook_hit_distance × BoostFactor. Caller re-sorts + truncates. Documented at playbook.go:22-27 since v0; v3 shipped the implementation. v4 added the split-threshold defense (DefaultPlaybookMaxInjectDistance = 0.20 while boost stays at 0.50) — boost is safe at loose thresholds because it only re-ranks results already in retrieval; inject is structurally riskier so its threshold is tighter.
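
The inject step described above can be sketched as follows. This is not the code in internal/matrix/playbook.go: the `Result` and `PlaybookHit` shapes, the `injectMisses` name, and the `boostFactor` value of 0.5 are assumptions; only the 0.20 inject threshold and the "distance × BoostFactor" rule come from this file. The sketch also folds the caller's re-sort and truncate into the same function for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

type Result struct {
	ID       string
	Distance float64
}

// PlaybookHit is a recorded answer plus the cosine distance between the
// current query and the query it was recorded for.
type PlaybookHit struct {
	AnswerID string
	Distance float64
}

const (
	maxInjectDistance = 0.20 // mirrors DefaultPlaybookMaxInjectDistance
	boostFactor       = 0.5  // hypothetical value
)

// injectMisses appends recorded answers that retrieval missed, but only when
// the playbook hit is close enough to count as a paraphrase, then re-sorts
// and truncates to k.
func injectMisses(results []Result, hits []PlaybookHit, k int) []Result {
	present := map[string]bool{}
	for _, r := range results {
		present[r.ID] = true
	}
	out := append([]Result(nil), results...)
	for _, h := range hits {
		if present[h.AnswerID] || h.Distance > maxInjectDistance {
			continue // already retrieved, or too far to trust
		}
		out = append(out, Result{ID: h.AnswerID, Distance: h.Distance * boostFactor})
		present[h.AnswerID] = true
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Distance < out[j].Distance })
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	results := []Result{{"e-1", 0.30}, {"e-2", 0.35}} // IDs other than w-4435 are made up
	hits := []PlaybookHit{
		{"w-4435", 0.12}, // close paraphrase: injected
		{"w-9", 0.45},    // loosely related query: refused
	}
	fmt.Println(injectMisses(results, hits, 2))
}
```

Under a single 0.5 threshold the second hit would also have been appended, which is the cross-pollination pattern run #003 showed.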

OOD honesty (dental hygienist / RN / software engineer queries) holds across all three runs — judge rates them 1, system doesn't fabricate matches. Cross-corpus boosts (e- ↔ w- swaps) confirmed in v1 + v3.

Evidence: reports/reality-tests/playbook_lift_{001,002,003}.{json,md}. Per the report's rubric (lift ≥ 50% = matrix doing real work), 6/6 paraphrase recovery is the validation that matters — verbatim replay is structurally the easy case.
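
For reference, the lift figure in the table is a simple ratio: the share of queries whose warm top-1 equals the cold pass's judge-best answer, compared against the 50% rubric. A minimal sketch with hypothetical type and field names and made-up IDs:

```go
package main

import "fmt"

// queryOutcome holds the two IDs the metric compares for one query.
type queryOutcome struct {
	ColdJudgeBest string // answer the judge preferred on the cold pass
	WarmTop1      string // answer ranked first on the warm (playbook) pass
}

// lift counts queries where warm top-1 matches the cold judge-best answer
// and returns the count plus the percentage.
func lift(outcomes []queryOutcome) (hits int, pct float64) {
	for _, o := range outcomes {
		if o.WarmTop1 == o.ColdJudgeBest {
			hits++
		}
	}
	if len(outcomes) == 0 {
		return 0, 0
	}
	return hits, 100 * float64(hits) / float64(len(outcomes))
}

func main() {
	// Eight queries, six of which agree (IDs are made up).
	outcomes := []queryOutcome{
		{"w-1", "w-1"}, {"w-2", "w-2"}, {"w-3", "w-3"}, {"w-4", "w-4"},
		{"w-5", "w-5"}, {"w-6", "w-6"}, {"w-7", "e-9"}, {"w-8", "e-3"},
	}
	hits, pct := lift(outcomes)
	fmt.Printf("%d/%d = %.1f%% (rubric met: %v)\n", hits, len(outcomes), pct, pct >= 50)
}
```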

v3 → v4 is the configuration evolution. v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
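
Why the boost side can stay at 0.5 is easiest to see in code: a boost only rescales results that retrieval already returned, so a loosely matching playbook hit whose answer is absent is a no-op. The sketch below is an illustration under assumptions, not ApplyPlaybookBoost itself: the types, the `applyBoost` name, the multiply-by-`boostFactor` rule, and all IDs except w-4435 are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type Result struct {
	ID       string
	Distance float64
}

type PlaybookHit struct {
	AnswerID string
	Distance float64 // cosine distance between current and recorded query
}

const (
	maxBoostDistance = 0.50 // boost threshold from the text
	boostFactor      = 0.5  // hypothetical value
)

// applyBoost re-ranks in place: it shrinks the distance of results that a
// playbook hit vouches for. It never adds a result.
func applyBoost(results []Result, hits []PlaybookHit) {
	byID := map[string]int{}
	for i, r := range results {
		byID[r.ID] = i
	}
	for _, h := range hits {
		i, ok := byID[h.AnswerID]
		if !ok || h.Distance > maxBoostDistance {
			continue // not in retrieval, or hit too far: nothing to re-rank
		}
		results[i].Distance *= boostFactor
	}
	sort.SliceStable(results, func(i, j int) bool { return results[i].Distance < results[j].Distance })
}

func main() {
	results := []Result{{"e-1", 0.30}, {"e-2", 0.40}, {"e-3", 0.45}}
	hits := []PlaybookHit{
		{"e-2", 0.45},    // already retrieved: promoted
		{"w-4435", 0.45}, // not retrieved: ignored by boost
	}
	applyBoost(results, hits)
	fmt.Println(results)
}
```

The same 0.45 hit that is harmless here would add a new top-ranked row if the inject path shared the 0.5 threshold, which is the asymmetry the split addresses.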

Harness expansion (2026-04-30 ~05:30 CDT)

scripts/playbook_lift.sh rewritten from a 5-daemon stripped harness to the full 10-daemon prod-realistic stack (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:

| # | Fix | Lock |
| --- | --- | --- |
| 1 | driver→matrixd: query → query_text field name | cmd/matrixd/main_test.go TestPlaybookRecord_OldFieldNameRejected |
| 2 | harness toml missing [s3] block | inline comment in scripts/playbook_lift.sh |
| 3 | harness→queryd: q → sql field name | cmd/queryd/main_test.go TestHandleSQL_WrongFieldName_400 |
| 4 | 5→10 daemon boot order | inline comment + dep-ordered launch |
| 5 | SQL surface probe (3-row CSV → COUNT=3) | [lift] ✓ SQL surface probe passed assertion |
| 6 | candidates corpus was SWE-tech, not staffing | swapped to ethereal_workers.parquet (10K rows, real staffing schema, "e-" id prefix) |
| 7 | qwen3.5:latest is vision-SSM 256K-ctx → 30s/judge | reverted local_judge to qwen2.5:latest (1s/judge, 30× faster) |

R-005 closed (2026-04-30 ~05:35 CDT)

Four new cmd/<bin>/main_test.go files — chi router-level contract tests:

  • cmd/matrixd/main_test.go (123 lines) — playbook record drift detector + score bounds + 6 routes mounted
  • cmd/queryd/main_test.go (extended) — wrong-field-name drift detector
  • cmd/pathwayd/main_test.go (102 lines) — 9 routes + add round-trip + retire-nonexistent
  • cmd/observerd/main_test.go (98 lines) — 4 routes + invalid-op 400 + unknown-mode 400

go test ./cmd/{matrixd,queryd,pathwayd,observerd} all green. R-005 from prior STATE OPEN list is closed.
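
A field-name drift detector of this kind boils down to "a request with the wrong JSON key must get a 400, not a silent empty query." The sketch below shows the idea with net/http and httptest only; the real tests run against the chi router, and the `handleSQL` body, the `/sql` path, and the `sql` vs `q` field names here are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type sqlRequest struct {
	SQL string `json:"sql"`
}

// handleSQL rejects bodies with unknown keys or an empty statement, so a
// client that drifts to a different field name fails loudly.
func handleSQL(w http.ResponseWriter, r *http.Request) {
	dec := json.NewDecoder(r.Body)
	dec.DisallowUnknownFields()
	var req sqlRequest
	if err := dec.Decode(&req); err != nil || req.SQL == "" {
		http.Error(w, `body must be {"sql": "..."}`, http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// statusFor posts body to the handler in-process and returns the status code.
func statusFor(body string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/sql", strings.NewReader(body))
	handleSQL(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(`{"sql":"SELECT 1"}`)) // accepted
	fmt.Println(statusFor(`{"q":"SELECT 1"}`))   // stale field name: rejected
}
```

Without `DisallowUnknownFields`, the stale key would decode into a zero-value struct; the empty-statement check is the second line of defense for that case.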


DO NOT RELITIGATE

Ratified ADRs (docs/DECISIONS.md)

  • ADR-001: DuckDB via cgo, HTMX UI, Gitea hosting, distillation rebuilt-not-ported, pathway memory clean start, auditor longitudinal signal restarts. 6 sub-decisions, all final.
  • ADR-002: storaged per-prefix PUT cap (4 GiB for _vectors/, 256 MiB elsewhere) — implemented at 423a381. If 4 GiB ever proves insufficient, the documented path is an operator-config bump rather than a constant change.
  • ADR-003: Inter-service auth = Bearer + IP allowlist, opt-in via cfg.Auth.Token. Wiring deferred to Sprint 1 but the design is locked — alternatives (mTLS, JWT, OAuth2, IP-only) all considered + rejected.
  • ADR-004: Pathway memory = Mem0 versioned traces, JSONL append-only persistence, opaque json.RawMessage content. Implemented in internal/pathway/.
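
To make the ADR-002 rule concrete, a per-prefix cap check can be as small as the sketch below. The two limits come from the ADR summary above; the function names and example keys are hypothetical, and the limits are hard-coded here even though the ADR makes them operator-configurable.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	vectorsCap = 4 << 30   // 4 GiB for keys under _vectors/
	defaultCap = 256 << 20 // 256 MiB for everything else
)

// putCap returns the maximum object size allowed for key.
func putCap(key string) int64 {
	if strings.HasPrefix(key, "_vectors/") {
		return vectorsCap
	}
	return defaultCap
}

// checkPut returns an error when size exceeds the cap for key's prefix.
func checkPut(key string, size int64) error {
	if limit := putCap(key); size > limit {
		return fmt.Errorf("PUT %q: %d bytes exceeds cap of %d", key, size, limit)
	}
	return nil
}

func main() {
	const oneGiB = 1 << 30
	fmt.Println(checkPut("_vectors/idx.hnsw", oneGiB))  // within the 4 GiB cap
	fmt.Println(checkPut("datasets/a.parquet", oneGiB)) // over the 256 MiB cap
}
```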

Today's scrum dispositions (2026-04-30)

Verbatim verdicts at reports/scrum/_evidence/2026-04-30/verdicts/. Disposition table: reports/scrum/_evidence/2026-04-30/disposition.md.

Real findings, all fixed in 0efc736:

  • B-1 (Opus+Kimi convergent): ResolveKey 3-arg API → 2-arg
  • B-2 (Opus+Kimi convergent): handleProviders direct map lookup, drop synthesis-via-Resolve
  • B-3 (Opus single, trace-verified): OllamaCloud.Chat strips ollama_cloud/ prefix correctly
  • B-4 (Opus single): Ollama done_reason surfaced to FinishReason

False positives dismissed (3, documented):

  • FP-A1: Kimi misread TestMaybeDowngrade_WithConfigList assertion
  • FP-A2: Qwen claimed nil-deref in MaybeDowngrade that doesn't exist
  • FP-C1: Opus claimed qwen3.5:latest doesn't exist on Ollama hub (it does on this box's local install)

Session frame (don't redo)

  • The Rust legacy is maintenance-only until Go reaches feature parity. Don't propose ports of components already shipped here.
  • The matrix indexer 5/5 components are shipped. Don't propose to "build the matrix indexer" — it's done.
  • The 5-loop substrate's load-bearing gate is PASSED. v3 (154a72e) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
  • Shape B is the playbook stance now. When use_playbook=true, both ApplyPlaybookBoost (re-rank in place) AND InjectPlaybookMisses (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
  • local_judge = "qwen2.5:latest" for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
  • qwen3.5:latest IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
  • temperature is omitted for Anthropic 4.7 (handled by Request.Temperature *float64); don't re-add it.
  • chatd-smoke runs with all cloud providers disabled intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).

OPEN — what's not done yet

| Item | What | When to act |
| --- | --- | --- |
| Reality test v4: re-judge warm results | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
| Adjacent-query cross-pollination | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
| Liberal-paraphrase recovery loss | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair paraphrase_max_drift measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
| Sprint 4 — deployment | No REPLICATION.md, secrets-go.toml.example, deploy/systemd/<bin>.service, Dockerfile. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
| ADR-006 — auth posture for non-loopback deploy | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
| chatd fixture-mode storage half | g2_smoke_fixtures.sh closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
| Distillation full port | 57d0df1 shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported. | When distillation is needed for production. |
| Drift full quantification | be65f85 is "scorer drift first." Full distribution-drift signal underspecified everywhere — research gap, not a port. | Open research item. |

RECENT VERIFIED WAVE (2026-04-30)

05273ac..e4ee002 — 4 phases + scrum + tooling, all gate-tested.

| SHA | What |
| --- | --- |
| ec1d031 | Phase 1: [models] tier config (additive, no callers migrate) |
| 622e124 | Phase 2: matrix.downgrade reads cfg.Models.WeakModels |
| 848cbf5 | Phase 3: playbook_lift harness defaults from config |
| 05273ac | Phase 4: chatd + 5 providers (1,624 LoC) |
| 0efc736 | Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review |
| e4ee002 | scripts/scrum_review.sh — reusable 3-lineage driver |
| b2e45f7 | playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%) |
| 6c02c90 | scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast) |
| 2c71d1c | ADR-005: observer fail-safe semantics |
| 9ce067b | observerd: test that locks ADR-005 5.3 (provenance recorded post-run) |
| e9822f0 | playbook_lift v2: paraphrase pass — exposed boost-only limit (0/2 paraphrase recovery) |
| 154a72e | matrix: Shape B (InjectPlaybookMisses) — 6/6 paraphrase recovery in run #003 |
| 94fc3b6 | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
| 67d1957 | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |

Plus on Rust side (8de94eb, 3d06868): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


RUNTIME CHEATSHEET

# Verify everything green
cd /home/profit/golangLAKEHOUSE
just verify                                            # vet + tests + 9 core smokes (~31s)
just doctor                                            # dep probe (go/gcc/minio/ollama/secrets)

# Boot the chat dispatcher (Phase 4)
nohup ./bin/chatd   -config lakehouse.toml > /tmp/chatd.log   2>&1 & disown
nohup ./bin/gateway -config lakehouse.toml > /tmp/gateway.log 2>&1 & disown
curl -sf http://127.0.0.1:3110/v1/chat/providers | jq  # all 5 providers should report true

# Test a chat call to each lineage
for m in "qwen3.5:latest" "opencode/claude-opus-4-7" "openrouter/moonshotai/kimi-k2-0905"; do
  curl -sS -X POST http://127.0.0.1:3110/v1/chat \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$m\",\"messages\":[{\"role\":\"user\",\"content\":\"reply: OK\"}],\"max_tokens\":8}" \
    | jq -c '{model,provider,content}'
done

# Run the scrum on a diff
./scripts/scrum_review.sh path/to/bundle.diff bundle_label
ls reports/scrum/_evidence/$(date +%Y-%m-%d)/verdicts/

# Domain smokes (not in `just verify`)
for s in chatd matrix observer pathway playbook relevance downgrade workflow; do
  bash scripts/${s}_smoke.sh > /tmp/${s}.log 2>&1 && echo "$s ✓" || echo "$s ✗"
done

VISION — what we're actually building

J's framing (canonical at /root/.claude/projects/-home-profit/memory/project_small_model_pipeline_vision.md): a small-model-driven autonomous pipeline that gets better with each run. Frontier APIs (Opus, Kimi, GPT-5) are too expensive + rate-limited for the inner loop — they live in audit/oversight via frontier_* tier. The hot path runs on local qwen3.5:latest given:

  1. Pathway memory — what we tried before, how it went (Mem0 substrate ✓)
  2. Matrix indexer — multi-corpus retrieve+merge giving the small model the right slice for this task (5/5 components ✓)
  3. Observer — watches each run, refines configs (not prompts) toward good pathways

Successful runs get rated and distilled back into the playbook. Each iteration the playbook gets denser, runs get cheaper, results get better. Drift in the distilled playbook is a measured signal, not vibes.

The single load-bearing gate: "the playbook + matrix indexer must give the results we're looking for." Throughput, scaling, code elegance are all secondary. The playbook_lift reality test is the regression gate before Enterprise cutover (where real contracts + live profile updates land).

When evaluating any Go workstream, ask: which of the 5 loops does this advance? Strong workstreams advance ≥1; weak workstreams sit in infra-for-its-own-sake.


SIBLING TOOLS (separate repos, intentional integration target later)

local-review-harness at git.agentview.dev/profit/local-review-harness (also SMB-mounted at /home/profit/share/local-review-harness-full-md/). Local-first code review harness — 12 evidence-bearing static analyzers, Scrum-style reports, no cloud deps. Phase A + B (MVP) shipped 2026-04-30. Phases C–E (Ollama LLM review, validation, memory) pending.

Cross-pollination plan when both stabilize:

  • Replace harness's internal/llm/ollama.go with a chatd /v1/chat client → frontier judges via config toggle
  • Feed harness findings into Lakehouse pathway memory as a drift signal
  • Treat harness's .memory/known-risks.json as a matrix-indexer corpus

Detail at docs/SPEC.md §3.10. Don't re-port harness functionality into Lakehouse-Go — the standalone tool is the design.