Lift suite run #004 left two unresolved tail issues:

- Q6 ("Forklift loader") ↔ Q7 ("Hazmat warehouse, cold storage") swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. The distance gate can't tell them apart.
- Q9 + Q15 lose paraphrase recovery when qwen2.5 rephrases past the 0.20 threshold. Distance says "drift too far"; sometimes the drift is real (skip), sometimes the paraphrase is still on-domain (don't want to skip).

Multi-coord run #008's judge re-rating proved the LLM can distinguish: the Q3 crane case landed at distance 0.23 (looks tight) but rating 1 (irrelevant). The judge sees domain mismatch the embedder doesn't.

This commit lifts that pattern into the matrix substrate. Shape B inject now optionally routes every candidate through a judge gate before the rank insert lands. Distance + judge BOTH have to approve.

internal/matrix/playbook.go:
- InjectPlaybookMisses signature gains a query string + an optional InjectGate. A nil gate preserves pre-judge-gating behavior (current tests already pass with nil).
- New InjectGate interface + InjectGateFunc adapter for tests and non-LLM callers.
- Per-candidate gate.Approve(query, hit) call inserted between the dedup and the inject. Rejected candidates skip silently; the injected count reflects the post-gate decision.

internal/matrix/judge.go (new, ~140 lines):
- LLMJudgeGate calls an Ollama-shape /api/chat endpoint with the same 1-5 staffing-rubric prompt that worked in multi_coord run #008. Fail-closed on HTTP/JSON errors (don't inject if the judge can't speak — better a miss than a wrong-domain inject).
- NewLLMJudgeGate returns nil when URL or Model is empty, matching InjectGate's nil-means-no-judge semantics.

internal/matrix/retrieve.go:
- SearchRequest gains JudgeURL, JudgeModel, JudgeMinRating fields. Run() builds an LLMJudgeGate when set; passes nil otherwise. Backward compatible — existing callers see no behavior change.
Tests:
- TestInjectPlaybookMisses_GateRejectsCandidate (rejectAll → 0 injected, even with tight distance)
- TestInjectPlaybookMisses_GateApprovesCandidate (approveAll → same as nil-gate behavior)
- TestInjectPlaybookMisses_GateSeesCorrectQuery (gate receives the CURRENT query + the RECORDED query separately, so it can score the (current, candidate) pair)
- All 5 existing inject tests updated to the new signature

go test ./internal/matrix → all 8 inject tests pass. go test ./internal/matrix ./internal/shared ./cmd/{matrixd,queryd,pathwayd,observerd} → all green.

STATE_OF_PLAY:
- OPEN item #1 (judge-gated injection) closed.
- DO NOT RELITIGATE adds the substrate-level judge-gate lock.
- OPEN list now 5 rows (was 6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
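The gate surface the commit describes can be sketched as follows. The names `InjectGate`, `InjectGateFunc`, and `Approve` come from the commit message; the `PlaybookHit` fields, the bool return, and `shouldInject` are illustrative assumptions, not the real `internal/matrix` signatures.

```go
package main

import "fmt"

// PlaybookHit is an illustrative stand-in for the recorded candidate
// that InjectPlaybookMisses considers for injection.
type PlaybookHit struct {
	RecordedQuery string
	Answer        string
	Distance      float64
}

// InjectGate approves or rejects a candidate before the rank insert.
// A nil gate means distance-only behavior (pre-judge-gating semantics).
type InjectGate interface {
	Approve(query string, hit PlaybookHit) bool
}

// InjectGateFunc adapts a plain function to the interface (the
// http.HandlerFunc pattern), for tests and non-LLM callers.
type InjectGateFunc func(query string, hit PlaybookHit) bool

func (f InjectGateFunc) Approve(query string, hit PlaybookHit) bool {
	return f(query, hit)
}

// shouldInject shows the combined rule: distance AND judge must approve.
func shouldInject(gate InjectGate, query string, hit PlaybookHit, maxDist float64) bool {
	if hit.Distance > maxDist {
		return false // distance gate fails first
	}
	if gate == nil {
		return true // nil gate preserves the old behavior
	}
	return gate.Approve(query, hit)
}

func main() {
	rejectAll := InjectGateFunc(func(string, PlaybookHit) bool { return false })
	hit := PlaybookHit{RecordedQuery: "forklift loader", Distance: 0.15}
	fmt.Println(shouldInject(nil, "forklift operator", hit, 0.20))       // true
	fmt.Println(shouldInject(rejectAll, "forklift operator", hit, 0.20)) // false
}
```

The rejectAll adapter is exactly the shape TestInjectPlaybookMisses_GateRejectsCandidate needs: tight distance, zero injections.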
STATE OF PLAY — Lakehouse-Go
Last verified: 2026-04-30 ~16:42 CDT
Verified by: live probes + just verify PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.
Read this FIRST. When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at /home/profit/lakehouse/ is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.
VERIFIED WORKING RIGHT NOW
Substrate (G0 + G1 family)
13 service binaries under cmd/ plus 2 driver scripts under scripts/staffing_* build into bin/. 18 smoke scripts all PASS. just verify (vet + 30 packages × short tests + 9 core smokes) green in ~31s wall.
| Binary | Port | What |
|---|---|---|
| gateway | 3110 | reverse proxy, single OpenAI-compat-style edge |
| storaged | 3211 | S3 GET/PUT/LIST/DELETE w/ per-prefix PUT cap (ADR-002) |
| catalogd | 3212 | Parquet manifests, ADR-020 idempotent register |
| ingestd | 3213 | CSV → Parquet → catalogd, content-addressed keys |
| queryd | 3214 | DuckDB SELECT over Parquet via httpfs |
| vectord | 3215 | HNSW indexes (coder/hnsw), persistence to storaged |
| embedd | 3216 | Ollama-backed embedder w/ LRU cache |
| pathwayd | 3217 | Mem0 ops (Add/Update/Revise/Retire/History/Search) |
| matrixd | 3218 | multi-corpus retrieve+merge + relevance + downgrade + playbook |
| observerd | 3219 | witness loop, workflow runner with DAG executor |
| chatd | 3220 | LLM dispatcher: ollama / ollama_cloud / openrouter / opencode / kimi |
| mcpd | — | MCP SDK port (Bun mcp-server replacement) |
| fake_ollama | — | test fixture (used by g2_smoke_fixtures.sh) |
Matrix indexer — all 5 SPEC §3.4 components shipped
- Corpus builders (`internal/corpusingest`)
- Multi-corpus retrieve+merge (`matrixd` `/matrix/search`)
- Relevance filter (`internal/matrix/relevance.go`, 376 LoC + 289 LoC test)
- Strong-model downgrade gate (`internal/matrix/downgrade.go`, reads `cfg.Models.WeakModels` after Phase 2)
- Playbook memory: boost + Shape B inject (`internal/matrix/playbook.go`, learning loop). Shape B (`InjectPlaybookMisses`, `154a72e`) appends recorded answers to results when regular retrieval misses them — closed the paraphrase recovery gap exposed by run #002 (0/2) and validated by run #003 (6/6).
Pathway memory (Mem0 substrate)
Full ADR-004 surface shipped. Cycle detection + retired-trace exclusion are proven by tests: TestHistory_CycleDetected, TestRetire_ExcludedFromSearch, TestRevise_ChainOfThree_BackwardWalk. JSONL append-only persistence with corruption tolerance.
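The backward revise-chain walk with cycle detection (the property TestHistory_CycleDetected locks) can be sketched like this. The `Trace` shape and `historyWalk` signature are illustrative assumptions; the real types live in `internal/pathway/`.

```go
package main

import "fmt"

// Trace is an illustrative shape for a versioned pathway entry;
// Revises points at the trace this one supersedes ("" for originals).
type Trace struct {
	ID      string
	Revises string
}

// historyWalk follows Revises pointers backward from id, returning the
// chain visited and whether a cycle was detected along the way.
func historyWalk(traces map[string]Trace, id string) (chain []string, cyclic bool) {
	seen := map[string]bool{}
	for id != "" {
		if seen[id] {
			return chain, true // revisiting an ID means a revise cycle
		}
		seen[id] = true
		t, ok := traces[id]
		if !ok {
			break // dangling pointer: stop, tolerate corruption
		}
		chain = append(chain, t.ID)
		id = t.Revises
	}
	return chain, false
}

func main() {
	traces := map[string]Trace{
		"t3": {ID: "t3", Revises: "t2"},
		"t2": {ID: "t2", Revises: "t1"},
		"t1": {ID: "t1"},
	}
	chain, cyclic := historyWalk(traces, "t3")
	fmt.Println(chain, cyclic) // [t3 t2 t1] false
}
```

A chain of three walked backward is exactly the TestRevise_ChainOfThree_BackwardWalk scenario; pointing t1's Revises back at t3 would trip the `seen` check.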
Observer + workflow runner
- `observerd` ring buffer + JSONL persistence
- Workflow DAG executor (Archon-style) with 5 native modes wired: `matrix.relevance`, `matrix.downgrade`, `matrix.search`, `distillation.score`, `drift.scorer`. Plus `fixture.echo` / `fixture.upper` for runner mechanics smokes.
Distillation + drift
- E (partial) at `57d0df1` — scorer + contamination firewall ported from Rust v1.0.0 (logic only per ADR-001 §1.4; not bit-identical).
- F (first slice) at `be65f85` — drift quantification, scorer drift first.
chatd — Phase 4 (shipped 2026-04-30, scrum-hardened same day)
Multi-provider LLM dispatcher routing /v1/chat by model-name prefix or :cloud suffix:
| Prefix / suffix | Provider | Auth |
|---|---|---|
| `ollama/<m>` or bare | ollama (local) | none |
| `ollama_cloud/<m>` or `<m>:cloud` | ollama_cloud | Bearer (`OLLAMA_CLOUD_KEY`) |
| `openrouter/<v>/<m>` | openrouter | Bearer (`OPENROUTER_API_KEY`) |
| `opencode/<m>` | opencode | Bearer (`OPENCODE_API_KEY`) |
| `kimi/<m>` | kimi | Bearer (`KIMI_API_KEY`) |
The four provider keys live in /etc/lakehouse/{ollama_cloud,openrouter,opencode,kimi}.env files (mode 0600); local ollama needs none. An empty or missing file leaves that provider unregistered (404 at first call instead of 503). Test request: POST /v1/chat {"model":"opencode/claude-opus-4-7","messages":[{"role":"user","content":"hi"}],"max_tokens":8}.
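The "empty key ⇒ unregistered" posture can be sketched as a registration pass over the key set. This is a minimal illustration of the described behavior, not chatd's actual code; the map shapes and function name are assumptions.

```go
package main

import "fmt"

// registerProviders sketches the posture: a provider whose key file was
// empty or missing never enters the route table, so its first call 404s
// instead of the whole daemon refusing to start (503).
func registerProviders(keys map[string]string) map[string]bool {
	registered := map[string]bool{"ollama": true} // local ollama needs no key
	for _, p := range []string{"ollama_cloud", "openrouter", "opencode", "kimi"} {
		if keys[p] != "" {
			registered[p] = true
		}
	}
	return registered
}

func main() {
	// Only openrouter has a key on this hypothetical box.
	got := registerProviders(map[string]string{"openrouter": "sk-example"})
	fmt.Println(got["ollama"], got["openrouter"], got["kimi"]) // true true false
}
```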
Request.Temperature is *float64 (pointer) — Anthropic 4.7 deprecates temperature entirely, so we omit the field when caller doesn't set it.
Model tier registry
lakehouse.toml [models] names model IDs by tier so swaps are 1-line:
```toml
[models]
local_fast      = "qwen3.5:latest"
local_judge     = "qwen2.5:latest"   # NOT qwen3.5 — vision-SSM 256K-ctx ran 30s/judge in lift loop
cloud_judge     = "kimi-k2.6:cloud"
cloud_review    = "qwen3-coder:480b"
frontier_review = "openrouter/anthropic/claude-opus-4-7"
frontier_arch   = "openrouter/moonshotai/kimi-k2-0905"
frontier_free   = "opencode/claude-opus-4-7"
weak_models     = ["qwen3.5:latest", "qwen3:latest"]  # matrix.downgrade bypass
```
Callers read cfg.Models.LocalJudge etc. instead of literal strings. playbook_lift harness, matrix.downgrade, and observerd's MatrixDowngradeWithWeakList factory all migrated.
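The tier-registry pattern can be sketched as a struct a TOML decoder fills plus a lookup helper. The field names follow the `[models]` table above; the struct tags and `isWeak` helper are illustrative (the real config types and decoder are in the repo, not shown here).

```go
package main

import "fmt"

// Models mirrors the [models] tier table; tag style is illustrative and
// depends on the TOML decoder in use.
type Models struct {
	LocalFast  string   `toml:"local_fast"`
	LocalJudge string   `toml:"local_judge"`
	CloudJudge string   `toml:"cloud_judge"`
	WeakModels []string `toml:"weak_models"`
}

// isWeak is how a caller like matrix.downgrade might consult the
// registry instead of hard-coding model strings.
func isWeak(m Models, name string) bool {
	for _, w := range m.WeakModels {
		if w == name {
			return true
		}
	}
	return false
}

func main() {
	cfg := Models{
		LocalJudge: "qwen2.5:latest",
		WeakModels: []string{"qwen3.5:latest", "qwen3:latest"},
	}
	// A model swap is a 1-line TOML edit; callers never see literals.
	fmt.Println(cfg.LocalJudge, isWeak(cfg, "qwen3.5:latest")) // qwen2.5:latest true
}
```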
Code health
- `go vet ./...` → 0 warnings, 0 errors
- `go test -short ./...` → all green, 349 test functions
- `just verify` → PASS (vet + tests + 9 smokes) in ~31s
- 18 smoke scripts (9 core gating verify + 9 domain smokes for new daemons)
Latest scrum: 2026-04-30 cross-lineage wave
Composite 50/60 at scrum2 head c7e3124 (was 35 baseline → 43 R1 → 50 R2). Today's chatd wave was reviewed by Opus + Kimi + Qwen3-coder via chatd's own /v1/chat; 2 BLOCKs + 2 WARNs landed as fixes (0efc736); reusable driver at scripts/scrum_review.sh.
Reality tests #001–#003 — load-bearing gate verified (2026-04-30 ~05:50–07:05 CDT)
The 5-loop substrate's load-bearing gate (per project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for") is verified for both verbatim replay and paraphrase queries.
| Run | Stance | Verbatim lift | Paraphrase recovery | What it proved |
|---|---|---|---|---|
| playbook_lift_001 | boost-only | 7/8 (87.5%) | not tested | Cosine + boost re-rank works for verbatim replay. Substrate live. |
| playbook_lift_002 | boost-only | 2/2 | 0/2 | Boost can't promote answers OUT of regular top-K — paraphrase gap exposed. |
| playbook_lift_003 | Shape B (loose 0.5) | 2/6 | 6/6 → top-1 | Shape B injects, but cross-pollinates: w-4435 surfaces as warm top-1 for unrelated OOD queries (dental/RN/SWE). |
| playbook_lift_004 | Shape B + split threshold (0.5 boost / 0.20 inject) | 6/8 (75%) | 6/8 (75%) | OOD cross-pollination GONE; system refuses to inject when it's not confident. The honest configuration. |
Shape B (InjectPlaybookMisses in internal/matrix/playbook.go): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = playbook_hit_distance × BoostFactor. Caller re-sorts + truncates. Documented at playbook.go:22-27 since v0; v3 shipped the implementation. v4 added the split-threshold defense (DefaultPlaybookMaxInjectDistance = 0.20 while boost stays at 0.50) — boost is safe at loose thresholds because it only re-ranks results already in retrieval; inject is structurally riskier so its threshold is tighter.
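The Shape B mechanics above can be sketched in a few lines. The `Result` shape, the concrete `boostFactor` value, and the `injectMiss` helper are illustrative assumptions; only the synthetic-distance formula (hit distance × BoostFactor), the re-sort + truncate, and the 0.20 inject threshold come from the doc.

```go
package main

import (
	"fmt"
	"sort"
)

// Result is an illustrative retrieval result; the real fields live in
// internal/matrix.
type Result struct {
	ID       string
	Distance float64
}

const (
	boostFactor       = 0.5  // assumed value, for illustration only
	maxInjectDistance = 0.20 // DefaultPlaybookMaxInjectDistance
)

// injectMiss appends a synthetic result for a playbook hit the warm
// pass missed, at distance hit×BoostFactor, then re-sorts + truncates.
func injectMiss(results []Result, hitID string, hitDist float64, topK int) []Result {
	if hitDist > maxInjectDistance {
		return results // split threshold: inject is gated tighter than boost
	}
	for _, r := range results {
		if r.ID == hitID {
			return results // already retrieved: the boost path handles it
		}
	}
	results = append(results, Result{ID: hitID, Distance: hitDist * boostFactor})
	sort.Slice(results, func(i, j int) bool { return results[i].Distance < results[j].Distance })
	if len(results) > topK {
		results = results[:topK]
	}
	return results
}

func main() {
	out := injectMiss([]Result{{"e-1", 0.30}, {"e-2", 0.40}}, "w-4435", 0.18, 3)
	fmt.Println(out[0].ID) // w-4435 — synthetic 0.18×0.5 = 0.09 ranks first
}
```

The early return on `hitDist > maxInjectDistance` is the v4 defense: a 0.30-distance hit still gets boosted if retrieved, but is never forced in.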
OOD honesty (dental hygienist / RN / software engineer queries) holds across all three runs — judge rates them 1, system doesn't fabricate matches. Cross-corpus boosts (e- ↔ w- swaps) confirmed in v1 + v3.
Evidence: reports/reality-tests/playbook_lift_{001,002,003}.{json,md}. Per the report's rubric (lift ≥ 50% = matrix doing real work), 6/6 paraphrase recovery is the validation that matters — verbatim replay is structurally the easy case.
v3 → v4 is the configuration evolution. v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end
Reality test #2 catalog. New harness scripts/multi_coord_stress.{sh,go} simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0–48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.
| Capability | Verified | Where |
|---|---|---|
| Per-coordinator playbook isolation | ✓ | playbook_alice / playbook_bob / playbook_carol corpora |
| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% of workers differ per region | Phase 1 baseline |
| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |
Substrate gains added by this wave:
- `internal/matrix/playbook.go` — Shape B + split inject threshold (commit `67d1957` from the earlier wave; verified in multi-coord too)
- `internal/matrix/retrieve.go` — `ExcludeIDs` field on `SearchRequest`; filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
- `internal/observer/types.go` — `SourceInbox` taxonomy alongside `SourceMCP` / `SourceScenario` / `SourceWorkflow`
- `cmd/observerd` — `POST /observer/inbox` route; accepts `{type, sender, subject, body, priority, tag}` and records an `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks the calling path; same fail-open semantics as ADR-005 Decision 5.1).
- embedd `default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts diversity went from 0.080 → 0.000 with the upgrade.
- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
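The two-tier query itself is just a merge over both corpora. A minimal sketch, assuming an illustrative `hit` shape (the real result types live in `internal/matrix`; only the corpus names come from the doc):

```go
package main

import (
	"fmt"
	"sort"
)

// hit is an illustrative result shape carrying which corpus answered.
type hit struct {
	ID       string
	Corpus   string
	Distance float64
}

// mergeTiers sketches the two-tier pattern: search both the big main
// index and the small fresh "hot" corpus, merge by distance, truncate.
func mergeTiers(mainHits, freshHits []hit, topK int) []hit {
	all := append(append([]hit{}, mainHits...), freshHits...)
	sort.Slice(all, func(i, j int) bool { return all[i].Distance < all[j].Distance })
	if len(all) > topK {
		all = all[:topK]
	}
	return all
}

func main() {
	workers := []hit{{"w-1", "workers", 0.30}, {"w-2", "workers", 0.35}}
	fresh := []hit{{"f-9", "fresh_workers", 0.10}}
	fmt.Println(mergeTiers(workers, fresh, 2)[0].ID) // f-9 — fresh resume lands top-1
}
```

Because the fresh corpus is tiny and freshly built, its hits never suffer the poorly-connected-region recall loss, which is why the 3/3 fresh-resume top-1 result holds.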
Harness expansion (2026-04-30 ~05:30 CDT)
scripts/playbook_lift.sh rewritten from a 5-daemon stripped harness to the full 10-daemon prod-realistic stack (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:
| # | Fix | Lock |
|---|---|---|
| 1 | driver→matrixd: `query` → `query_text` field name | `cmd/matrixd/main_test.go` TestPlaybookRecord_OldFieldNameRejected |
| 2 | harness toml missing `[s3]` block | inline comment in `scripts/playbook_lift.sh` |
| 3 | harness→queryd: `q` → `sql` field name | `cmd/queryd/main_test.go` TestHandleSQL_WrongFieldName_400 |
| 4 | 5→10 daemon boot order | inline comment + dep-ordered launch |
| 5 | SQL surface probe (3-row CSV → COUNT=3) | `[lift] ✓ SQL surface probe passed` assertion |
| 6 | candidates corpus was SWE-tech, not staffing | swapped to `ethereal_workers.parquet` (10K rows, real staffing schema, "e-" id prefix) |
| 7 | `qwen3.5:latest` is vision-SSM 256K-ctx → 30s/judge | reverted local_judge to `qwen2.5:latest` (1s/judge, 30× faster) |
R-005 closed (2026-04-30 ~05:35 CDT)
Four new cmd/<bin>/main_test.go files — chi router-level contract tests:
- `cmd/matrixd/main_test.go` (123 lines) — playbook record drift detector + score bounds + 6 routes mounted
- `cmd/queryd/main_test.go` (extended) — wrong-field-name drift detector
- `cmd/pathwayd/main_test.go` (102 lines) — 9 routes + add round-trip + retire-nonexistent
- `cmd/observerd/main_test.go` (98 lines) — 4 routes + invalid-op 400 + unknown-mode 400
go test ./cmd/{matrixd,queryd,pathwayd,observerd} all green. R-005 from prior STATE OPEN list is closed.
DO NOT RELITIGATE
Ratified ADRs (docs/DECISIONS.md)
- ADR-001: DuckDB via cgo, HTMX UI, Gitea hosting, distillation rebuilt-not-ported, pathway memory clean start, auditor longitudinal signal restarts. 6 sub-decisions, all final.
- ADR-002: storaged per-prefix PUT cap (4 GiB for `_vectors/`, 256 MiB elsewhere) — implemented at `423a381`. An operator-config bump, rather than a constant change, is the documented path if 4 GiB is ever insufficient.
- ADR-003: Inter-service auth = Bearer + IP allowlist, opt-in via `cfg.Auth.Token`. Wiring deferred to Sprint 1 but the design is locked — alternatives (mTLS, JWT, OAuth2, IP-only) all considered + rejected.
- ADR-004: Pathway memory = Mem0 versioned traces, JSONL append-only persistence, opaque `json.RawMessage` content. Implemented in `internal/pathway/`.
Today's scrum dispositions (2026-04-30)
Verbatim verdicts at reports/scrum/_evidence/2026-04-30/verdicts/. Disposition table: reports/scrum/_evidence/2026-04-30/disposition.md.
Real findings, all fixed in 0efc736:
- B-1 (Opus+Kimi convergent): `ResolveKey` 3-arg API → 2-arg
- B-2 (Opus+Kimi convergent): `handleProviders` direct map lookup, drop synthesis-via-Resolve
- B-3 (Opus single, trace-verified): `OllamaCloud.Chat` strips the `ollama_cloud/` prefix correctly
- B-4 (Opus single): Ollama `done_reason` surfaced to FinishReason
False positives dismissed (3, documented):
- FP-A1: Kimi misread the `TestMaybeDowngrade_WithConfigList` assertion
- FP-A2: Qwen claimed a nil-deref in `MaybeDowngrade` that doesn't exist
- FP-C1: Opus claimed `qwen3.5:latest` doesn't exist on the Ollama hub (it does on this box's local install)
Session frame (don't redo)
- The Rust legacy is maintenance-only until Go reaches feature parity. Don't propose ports of components already shipped here.
- The matrix indexer 5/5 components are shipped. Don't propose to "build the matrix indexer" — it's done.
- The 5-loop substrate's load-bearing gate is PASSED. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
- Shape B is the playbook stance now. When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
- Boost / inject use SEPARATE thresholds. Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — it only re-ranks results already in retrieval). Inject uses the tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so a loose match cross-pollinates wrong-domain). Don't merge them.
- Multi-coord product theory is empirically VALIDATED by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end, including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
- Auth posture is locked per ADR-006. Non-loopback bind requires `auth.token` (mechanical gate at `shared.Run`). Operators set the token via `token_env` (defaults to `AUTH_TOKEN`) loaded by systemd `EnvironmentFile=/etc/lakehouse/auth.env`, NOT in the committed TOML. Internal services use `AllowedIPs`; the external boundary uses Bearer. Token rotation is dual-token via `secondary_tokens`. TLS terminates at the edge (nginx/Caddy), not in-process. Don't re-litigate.
- Shape B inject has a judge-gate substrate. `InjectPlaybookMisses` takes an optional `InjectGate` (interface) that approves each candidate before the rank insert. `LLMJudgeGate` (Ollama-shape /api/chat client) is the default impl; a nil gate preserves the pre-judge-gating distance-only behavior for backward compat. Callers wire it via `SearchRequest.{JudgeURL, JudgeModel, JudgeMinRating}`. Closes the lift-suite tail issues (Q6↔Q7 adjacent-query swap + Q9/Q15 paraphrase drift) at the substrate level.
- Fresh content uses two-tier indexing. Fresh resumes go to the `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
- `embedd.default_model = "nomic-embed-text-v2-moe"` (475M MoE, 768d). Don't drop back to `nomic-embed-text` (137M) "for speed" — same-role-across-contracts diversity worsened from 0.000 to 0.080 on the smaller model. The cost (~5× slower ingest) is acceptable for once-per-deploy work.
- Inbox flow: parse + search + judge + trace. `/v1/observer/inbox` records the body; the coordinator/driver parses it via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, and emits Langfuse spans throughout. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
- `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness"; a different model in the same tier has different cost geometry.
- `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
- `temperature` is omitted for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
- chatd-smoke runs with all cloud providers disabled intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture; see the Sprint 0 follow-up).
- The Langfuse Go-side client lives at `internal/langfuse/` with best-effort fail-open posture. URL + creds come from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.
OPEN — what's not done yet
The list is intentionally short. Items move to closed when the work demands them, not on a calendar. Ordered by leverage on the active product theory (multi-coord staffing co-pilot via the 5-loop substrate), not by effort.
| # | Item | When to act |
|---|---|---|
| 1 | Wider Langfuse instrumentation across daemons — `internal/langfuse/middleware.go` that auto-emits one span per HTTP request from every daemon's `shared.Run`. Production traffic gets free trace visibility without per-handler wiring. | When production traffic actually starts hitting the gateway. |
| 2 | Periodic fresh→main index merge — the two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
| 3 | Distillation full port — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
| 4 | Drift quantification — `be65f85` is "scorer drift first." The full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
| 5 | Operational nice-to-haves — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
RECENT VERIFIED WAVE (2026-04-30)
05273ac..e4ee002 — 4 phases + scrum + tooling, all gate-tested.
| SHA | What |
|---|---|
| `ec1d031` | Phase 1: [models] tier config (additive, no callers migrate) |
| `622e124` | Phase 2: matrix.downgrade reads cfg.Models.WeakModels |
| `848cbf5` | Phase 3: playbook_lift harness defaults from config |
| `05273ac` | Phase 4: chatd + 5 providers (1,624 LoC) |
| `0efc736` | Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review |
| `e4ee002` | scripts/scrum_review.sh — reusable 3-lineage driver |
| `b2e45f7` | playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%) |
| `6c02c90` | scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast) |
| `2c71d1c` | ADR-005: observer fail-safe semantics |
| `9ce067b` | observerd: test that locks ADR-005 5.3 (provenance recorded post-run) |
| `e9822f0` | playbook_lift v2: paraphrase pass — exposed boost-only limit (0/2 paraphrase recovery) |
| `154a72e` | matrix: Shape B (InjectPlaybookMisses) — 6/6 paraphrase recovery in run #003 |
| `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
| `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
| `84a32f0` | multi-coord stress Phase 2 — ExcludeIDs, 200-worker swap, fresh-resume |
| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
| `e7fc63b` | observerd /observer/inbox + multi-coord stress phase 1c (priority-ordered events) |
| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |
Plus on Rust side (8de94eb, 3d06868): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
RUNTIME CHEATSHEET
```sh
# Verify everything green
cd /home/profit/golangLAKEHOUSE
just verify   # vet + tests + 9 core smokes (~31s)
just doctor   # dep probe (go/gcc/minio/ollama/secrets)

# Boot the chat dispatcher (Phase 4)
nohup ./bin/chatd -config lakehouse.toml > /tmp/chatd.log 2>&1 & disown
nohup ./bin/gateway -config lakehouse.toml > /tmp/gateway.log 2>&1 & disown
curl -sf http://127.0.0.1:3110/v1/chat/providers | jq   # all 5 providers should report true

# Test a chat call to each lineage
for m in "qwen3.5:latest" "opencode/claude-opus-4-7" "openrouter/moonshotai/kimi-k2-0905"; do
  curl -sS -X POST http://127.0.0.1:3110/v1/chat \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$m\",\"messages\":[{\"role\":\"user\",\"content\":\"reply: OK\"}],\"max_tokens\":8}" \
    | jq -c '{model,provider,content}'
done

# Run the scrum on a diff
./scripts/scrum_review.sh path/to/bundle.diff bundle_label
ls reports/scrum/_evidence/$(date +%Y-%m-%d)/verdicts/

# Domain smokes (not in `just verify`)
for s in chatd matrix observer pathway playbook relevance downgrade workflow; do
  bash scripts/${s}_smoke.sh > /tmp/${s}.log 2>&1 && echo "$s ✓" || echo "$s ✗"
done
```
VISION — what we're actually building
J's framing (canonical at /root/.claude/projects/-home-profit/memory/project_small_model_pipeline_vision.md): a small-model-driven autonomous pipeline that gets better with each run. Frontier APIs (Opus, Kimi, GPT-5) are too expensive + rate-limited for the inner loop — they live in audit/oversight via frontier_* tier. The hot path runs on local qwen3.5:latest given:
- Pathway memory — what we tried before, how it went (Mem0 substrate ✓)
- Matrix indexer — multi-corpus retrieve+merge giving the small model the right slice for this task (5/5 components ✓)
- Observer — watches each run, refines configs (not prompts) toward good pathways
Successful runs get rated and distilled back into the playbook. Each iteration the playbook gets denser, runs get cheaper, results get better. Drift in the distilled playbook is a measured signal, not vibes.
The single load-bearing gate: "the playbook + matrix indexer must give the results we're looking for." Throughput, scaling, code elegance are all secondary. The playbook_lift reality test is the regression gate before Enterprise cutover (where real contracts + live profile updates land).
When evaluating any Go workstream, ask: which of the 5 loops does this advance? Strong workstreams advance ≥1; weak workstreams sit in infra-for-its-own-sake.
SIBLING TOOLS (separate repos, intentional integration target later)
local-review-harness at git.agentview.dev/profit/local-review-harness (also SMB-mounted at /home/profit/share/local-review-harness-full-md/). Local-first code review harness — 12 evidence-bearing static analyzers, Scrum-style reports, no cloud deps. Phase A + B (MVP) shipped 2026-04-30. Phases C–E (Ollama LLM review, validation, memory) pending.
Cross-pollination plan when both stabilize:
- Replace the harness's `internal/llm/ollama.go` with a chatd `/v1/chat` client → frontier judges via config toggle
- Feed harness findings into Lakehouse pathway memory as a drift signal
- Treat the harness's `.memory/known-risks.json` as a matrix-indexer corpus
Detail at docs/SPEC.md §3.10. Don't re-port harness functionality into Lakehouse-Go — the standalone tool is the design.