The original OPEN #2 line called for "SFT export pipeline + audit_baselines lineage." Commit 7bb432f shipped the SFT export. This commit ports the audit_baselines half — the longitudinal drift signal that distinguishes "metrics shifted because the world changed" from "metrics shifted because we broke something."

Mirrors Rust scripts/distillation/audit_full.ts's substrate:

- LoadLastBaseline(path) reads the most recent entry from data/_kb/audit_baselines.jsonl. Returns (nil, nil) on missing file (first run), errors on truncated last line (partial-write detection — operators don't lose drift signal silently).
- AppendBaseline(path, baseline) appends one entry as a JSON line. Atomic at the line level via bufio + O_APPEND. Creates the parent directory if missing.
- BuildAuditDriftTable(prior, current, threshold) computes per-metric drift. Flag values mirror Rust exactly: first_run, ok, warn. DefaultDriftWarnThreshold = 0.20 = Rust's 20%.
- FormatAuditDriftTable renders a fixed-width text grid for stdout dumps in audit-full runs.

Edge cases handled:

- Zero-baseline: prior=0 means no division — PctChange stays nil. current=0 → ok (no change). current>0 → warn (zero→nonzero is always notable, never silently fine).
- New metric in current: flagged first_run, not "0%-change". Operators see "this is a new signal we haven't tracked before."
- Sort: stable by metric name for deterministic JSON output and clean CI diffs.

Generic on metric name (vs Rust's pinned p2_evidence_rows etc.): the Rust phase numbering doesn't translate to Go directly. The AuditBaselineRustCompat constant pins the Rust names so operators running both runtimes use the same labels, which makes drift comparison meaningful across the two pipelines.

13 new tests covering: missing file, last-line-wins, blank-line tolerance, malformed-line errors, append round-trip, append-to-existing, schema validation, first-run, threshold boundary, zero-baseline, new-metric-in-current, sort-by-metric stability, formatter output rendering.

OPEN #2's "audit_baselines lineage" half now closed. The distillation package surface is at parity with the Rust pipeline: scorer, scored runs, SFT export, audit baselines all available on the Go side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
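The drift-flag rules in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the shipped `BuildAuditDriftTable`: the `DriftRow` shape, the lower-case function name, and the metric names in `main` are hypothetical, and the strict `>` comparison at the threshold is an assumption.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// DriftRow is a hypothetical row shape used only for this sketch.
type DriftRow struct {
	Metric    string
	Current   float64
	PctChange *float64 // nil on first_run and on a zero baseline
	Flag      string   // "first_run", "ok", or "warn"
}

const defaultDriftWarnThreshold = 0.20

// buildDriftTable flags each current metric against the prior baseline.
// A nil prior map models the first run: every metric is first_run.
func buildDriftTable(prior, current map[string]float64, threshold float64) []DriftRow {
	rows := make([]DriftRow, 0, len(current))
	for name, cur := range current {
		row := DriftRow{Metric: name, Current: cur, Flag: "ok"}
		p, tracked := prior[name]
		switch {
		case !tracked:
			row.Flag = "first_run" // new signal, not a "0% change"
		case p == 0:
			// No division on a zero baseline; PctChange stays nil.
			if cur != 0 {
				row.Flag = "warn" // zero -> nonzero is always notable
			}
		default:
			pct := (cur - p) / math.Abs(p)
			row.PctChange = &pct
			if math.Abs(pct) > threshold { // assumed strict comparison
				row.Flag = "warn"
			}
		}
		rows = append(rows, row)
	}
	// Sort by metric name so output is deterministic across runs.
	sort.Slice(rows, func(i, j int) bool { return rows[i].Metric < rows[j].Metric })
	return rows
}

func main() {
	prior := map[string]float64{"evidence_rows": 100, "quarantined": 0}
	current := map[string]float64{"evidence_rows": 121, "quarantined": 3, "sft_rows": 5}
	for _, r := range buildDriftTable(prior, current, defaultDriftWarnThreshold) {
		pct := "n/a"
		if r.PctChange != nil {
			pct = fmt.Sprintf("%+.1f%%", *r.PctChange*100)
		}
		fmt.Printf("%-15s %8.1f %8s %s\n", r.Metric, r.Current, pct, r.Flag)
	}
}
```

Keeping `PctChange` as a pointer lets a formatter distinguish "no percentage exists" from "0% change", which is the distinction the zero-baseline rule relies on.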
# STATE OF PLAY — Lakehouse-Go

**Last verified:** 2026-04-30 ~16:42 CDT
**Verified by:** live probes + `just verify` PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.

> **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.

---
## VERIFIED WORKING RIGHT NOW

### Substrate (G0 + G1 family)

13 service binaries under `cmd/` plus 2 driver scripts under `scripts/staffing_*` build into `bin/`. **18 smoke scripts all PASS.** `just verify` (vet + 30 packages × short tests + 9 core smokes) green in ~31s wall.

| Binary | Port | What |
|---|---|---|
| `gateway` | 3110 | reverse proxy, single OpenAI-compat-style edge |
| `storaged` | 3211 | S3 GET/PUT/LIST/DELETE w/ per-prefix PUT cap (ADR-002) |
| `catalogd` | 3212 | Parquet manifests, ADR-020 idempotent register |
| `ingestd` | 3213 | CSV → Parquet → catalogd, content-addressed keys |
| `queryd` | 3214 | DuckDB SELECT over Parquet via httpfs |
| `vectord` | 3215 | HNSW indexes (coder/hnsw), persistence to storaged |
| `embedd` | 3216 | Ollama-backed embedder w/ LRU cache |
| `pathwayd` | 3217 | Mem0 ops (Add/Update/Revise/Retire/History/Search) |
| `matrixd` | 3218 | Multi-corpus retrieve+merge + relevance + downgrade + playbook |
| `observerd` | 3219 | Witness loop, workflow runner with DAG executor |
| `chatd` | 3220 | LLM dispatcher: ollama / ollama_cloud / openrouter / opencode / kimi |
| `mcpd` | — | MCP SDK port (Bun mcp-server replacement) |
| `fake_ollama` | — | Test fixture (used by `g2_smoke_fixtures.sh`) |
### Matrix indexer — all 5 SPEC §3.4 components shipped

1. **Corpus builders** (`internal/corpusingest`)
2. **Multi-corpus retrieve+merge** (`matrixd /matrix/search`)
3. **Relevance filter** (`internal/matrix/relevance.go` 376 LoC + 289 LoC test)
4. **Strong-model downgrade gate** (`internal/matrix/downgrade.go`, reads `cfg.Models.WeakModels` after Phase 2)
5. **Playbook memory: boost + Shape B inject** (`internal/matrix/playbook.go`, learning loop). Shape B (`InjectPlaybookMisses`, `154a72e`) appends recorded answers to results when regular retrieval misses them — closed the paraphrase recovery gap exposed by run #002 (0/2) and validated by run #003 (6/6).
### Pathway memory (Mem0 substrate)

Full ADR-004 surface shipped. **Cycle-detection + retired-trace exclusion proven by tests:** `TestHistory_CycleDetected`, `TestRetire_ExcludedFromSearch`, `TestRevise_ChainOfThree_BackwardWalk`. JSONL append-only persistence with corruption tolerance.
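As a rough illustration of what a backward history walk with cycle detection might look like: the `Trace` fields, the `history` function, and the in-memory map store below are all hypothetical and are not taken from `internal/pathway/`.

```go
package main

import (
	"errors"
	"fmt"
)

// Trace is a hypothetical versioned-trace shape: each revision points
// back at the trace it replaced via PrevID ("" for the original).
type Trace struct {
	ID     string
	PrevID string
}

var errCycle = errors.New("history: cycle detected")

// history walks backward from id to the original trace and returns the
// chain newest-first. A visited set turns a corrupted PrevID loop into
// an error instead of an infinite walk.
func history(store map[string]Trace, id string) ([]string, error) {
	seen := make(map[string]bool)
	var chain []string
	for id != "" {
		if seen[id] {
			return nil, errCycle
		}
		t, ok := store[id]
		if !ok {
			return nil, fmt.Errorf("history: unknown trace %q", id)
		}
		seen[id] = true
		chain = append(chain, t.ID)
		id = t.PrevID
	}
	return chain, nil
}

func main() {
	store := map[string]Trace{
		"v1": {ID: "v1"},
		"v2": {ID: "v2", PrevID: "v1"},
		"v3": {ID: "v3", PrevID: "v2"},
	}
	chain, err := history(store, "v3")
	fmt.Println(chain, err)
}
```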
### Observer + workflow runner

- `observerd` ring buffer + JSONL persistence
- Workflow DAG executor (Archon-style) with 5 native modes wired: `matrix.relevance`, `matrix.downgrade`, `matrix.search`, `distillation.score`, `drift.scorer`. Plus `fixture.echo` / `fixture.upper` for runner mechanics smokes.

### Distillation + drift

- **E (partial)** at `57d0df1` — scorer + contamination firewall ported from Rust v1.0.0 (logic only per ADR-001 §1.4; not bit-identical).
- **F (first slice)** at `be65f85` — drift quantification, scorer drift first.
### chatd — Phase 4 (shipped 2026-04-30, scrum-hardened same day)

Multi-provider LLM dispatcher routing `/v1/chat` by model-name prefix or `:cloud` suffix:

| Prefix / suffix | Provider | Auth |
|---|---|---|
| `ollama/<m>` or bare | `ollama` (local) | none |
| `ollama_cloud/<m>` or `<m>:cloud` | `ollama_cloud` | Bearer (OLLAMA_CLOUD_KEY) |
| `openrouter/<v>/<m>` | `openrouter` | Bearer (OPENROUTER_API_KEY) |
| `opencode/<m>` | `opencode` | Bearer (OPENCODE_API_KEY) |
| `kimi/<m>` | `kimi` | Bearer (KIMI_API_KEY) |

The 4 provider keys live in `/etc/lakehouse/{ollama_cloud,openrouter,opencode,kimi}.env` files (mode 0600); local `ollama` needs none. An empty or missing file leaves that provider unregistered (404 at first call instead of 503). Test request: `POST /v1/chat {"model":"opencode/claude-opus-4-7","messages":[{"role":"user","content":"hi"}],"max_tokens":8}`.

`Request.Temperature` is `*float64` (pointer) — Anthropic 4.7 deprecates `temperature` entirely, so we omit the field when the caller doesn't set it.
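The routing table and the pointer-typed temperature can be sketched together. This is an illustration under assumptions, not chatd's code: `resolveProvider` and `encode` are hypothetical names, the `Request` field set is trimmed to what the test request shows, and whether the `:cloud` suffix is kept on the upstream model name is a guess.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// Request mirrors the documented test request. Temperature is a pointer
// so that "not set" (nil) is omitted from the JSON entirely.
type Request struct {
	Model       string    `json:"model"`
	Messages    []Message `json:"messages"`
	MaxTokens   int       `json:"max_tokens,omitempty"`
	Temperature *float64  `json:"temperature,omitempty"`
}

// resolveProvider maps a model name to (provider, upstream model) using
// the prefix/suffix rules from the table. Bare names fall back to ollama.
func resolveProvider(model string) (provider, upstream string) {
	for _, p := range []string{"ollama_cloud", "openrouter", "opencode", "kimi", "ollama"} {
		if strings.HasPrefix(model, p+"/") {
			return p, strings.TrimPrefix(model, p+"/")
		}
	}
	if strings.HasSuffix(model, ":cloud") {
		return "ollama_cloud", model // assumption: suffix is passed through
	}
	return "ollama", model
}

// encode marshals a request; it returns "" on a marshal error.
func encode(r Request) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	for _, m := range []string{"qwen3.5:latest", "kimi-k2.6:cloud", "openrouter/moonshotai/kimi-k2-0905"} {
		p, up := resolveProvider(m)
		fmt.Printf("%-36s -> %s (%s)\n", m, p, up)
	}
	msgs := []Message{{Role: "user", Content: "hi"}}
	fmt.Println(encode(Request{Model: "opencode/claude-opus-4-7", Messages: msgs, MaxTokens: 8}))
}
```

With `omitempty` on a pointer field, only a nil pointer is dropped; a caller who explicitly sets `0` still gets `"temperature":0` on the wire.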
### Model tier registry

`lakehouse.toml [models]` names model IDs by tier so swaps are 1-line:

```toml
local_fast = "qwen3.5:latest"
local_judge = "qwen2.5:latest" # NOT qwen3.5 — vision-SSM 256K-ctx ran 30s/judge in lift loop
cloud_judge = "kimi-k2.6:cloud"
cloud_review = "qwen3-coder:480b"
frontier_review = "openrouter/anthropic/claude-opus-4-7"
frontier_arch = "openrouter/moonshotai/kimi-k2-0905"
frontier_free = "opencode/claude-opus-4-7"
weak_models = ["qwen3.5:latest", "qwen3:latest"] # matrix.downgrade bypass
```

Callers read `cfg.Models.LocalJudge` etc. instead of literal strings. `playbook_lift` harness, `matrix.downgrade`, and observerd's `MatrixDowngradeWithWeakList` factory all migrated.
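A minimal sketch of how a caller might consume the tier registry instead of literal model strings. The `Models` struct below is an assumed shape (only `LocalJudge` and `WeakModels` are named in this document), the `toml` tags are illustrative, and `IsWeak` is a hypothetical helper; no TOML parsing is shown.

```go
package main

import "fmt"

// Models is an assumed shape for the [models] table; field names other
// than LocalJudge and WeakModels are guesses from the TOML keys.
type Models struct {
	LocalFast  string   `toml:"local_fast"`
	LocalJudge string   `toml:"local_judge"`
	CloudJudge string   `toml:"cloud_judge"`
	WeakModels []string `toml:"weak_models"`
}

// IsWeak reports whether model appears in the configured weak list.
// Hypothetical helper: the real downgrade gate may consult the list differently.
func (m Models) IsWeak(model string) bool {
	for _, w := range m.WeakModels {
		if w == model {
			return true
		}
	}
	return false
}

func main() {
	models := Models{
		LocalFast:  "qwen3.5:latest",
		LocalJudge: "qwen2.5:latest",
		CloudJudge: "kimi-k2.6:cloud",
		WeakModels: []string{"qwen3.5:latest", "qwen3:latest"},
	}
	// Callers name the tier, so swapping the judge is a one-line config change.
	fmt.Println("judge:", models.LocalJudge)
	fmt.Println("local_fast is weak:", models.IsWeak(models.LocalFast))
}
```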
### Code health

- `go vet ./...` → **0 warnings, 0 errors**
- `go test -short ./...` → **all green**, 349 test functions
- `just verify` → PASS (vet + tests + 9 smokes) in ~31s
- 18 smoke scripts (9 core gating verify + 9 domain smokes for new daemons)

### Latest scrum: 2026-04-30 cross-lineage wave

Composite **50/60** at scrum2 head `c7e3124` (was 35 baseline → 43 R1 → 50 R2). Today's chatd wave reviewed by Opus + Kimi + Qwen3-coder via the chatd's own `/v1/chat`; **2 BLOCKs + 2 WARNs landed as fixes** (`0efc736`); reusable driver at `scripts/scrum_review.sh`.
### Reality tests #001–#003 — load-bearing gate verified (2026-04-30 ~05:50–07:05 CDT)

The 5-loop substrate's load-bearing gate (per `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for"*) is verified for both **verbatim replay** and **paraphrase queries**.

| Run | Stance | Verbatim lift | Paraphrase recovery | What it proved |
|---|---|---|---|---|
| `playbook_lift_001` | boost-only | **7/8 (87.5%)** | not tested | Cosine + boost re-rank works for verbatim replay. Substrate live. |
| `playbook_lift_002` | boost-only | 2/2 | **0/2** | Boost can't promote answers OUT of regular top-K — paraphrase gap exposed. |
| `playbook_lift_003` | Shape B (loose 0.5) | 2/6 | 6/6 → top-1 | Shape B injects, but cross-pollinates: w-4435 surfaces as warm top-1 for unrelated OOD queries (dental/RN/SWE). |
| `playbook_lift_004` | **Shape B + split threshold (0.5 boost / 0.20 inject)** | **6/8 (75%)** | **6/8 (75%)** | OOD cross-pollination GONE; system refuses to inject when it's not confident. The honest configuration. |

**Shape B** (`InjectPlaybookMisses` in `internal/matrix/playbook.go`): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = `playbook_hit_distance × BoostFactor`. Caller re-sorts + truncates. Documented at `playbook.go:22-27` since v0; v3 shipped the implementation. v4 added the split-threshold defense (`DefaultPlaybookMaxInjectDistance = 0.20` while boost stays at 0.50) — boost is safe at loose thresholds because it only re-ranks results already in retrieval; inject is structurally riskier so its threshold is tighter.

OOD honesty (dental hygienist / RN / software engineer queries) holds across all three runs — judge rates them 1, system doesn't fabricate matches. Cross-corpus boosts (e- ↔ w- swaps) confirmed in v1 + v3.

Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the report's rubric (lift ≥ 50% = matrix doing real work), 6/6 paraphrase recovery is the validation that matters — verbatim replay is structurally the easy case.

**v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
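The split-threshold behavior described in this section can be sketched as one function. This is an illustration only: `applyPlaybook`, the `Result` and `PlaybookHit` shapes, and the `boostFactor` value of 0.5 are hypothetical, and modeling boost as a distance multiplication is an assumption. Only the two thresholds (0.50 boost, 0.20 inject) and the inject rule `distance = hit distance × BoostFactor` come from the text above.

```go
package main

import (
	"fmt"
	"sort"
)

type Result struct {
	ID       string
	Distance float64
}

// PlaybookHit is a hypothetical shape: Distance is how far the incoming
// query is from the recorded query whose answer was AnswerID.
type PlaybookHit struct {
	AnswerID string
	Distance float64
}

const (
	maxBoostDistance  = 0.50 // loose: boost only re-ranks existing results
	maxInjectDistance = 0.20 // tight: inject forces a result in
	boostFactor       = 0.5  // hypothetical value for illustration
)

// applyPlaybook boosts hits already present in results, injects hits that
// are missing but close enough, then re-sorts and truncates to k.
func applyPlaybook(results []Result, hits []PlaybookHit, k int) []Result {
	out := append([]Result(nil), results...)
	index := make(map[string]int, len(out))
	for i, r := range out {
		index[r.ID] = i
	}
	for _, h := range hits {
		if i, present := index[h.AnswerID]; present {
			if h.Distance <= maxBoostDistance {
				out[i].Distance *= boostFactor // assumed boost form
			}
			continue
		}
		if h.Distance <= maxInjectDistance {
			out = append(out, Result{ID: h.AnswerID, Distance: h.Distance * boostFactor})
			index[h.AnswerID] = len(out) - 1
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Distance < out[j].Distance })
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	cold := []Result{{"w-1", 0.30}, {"w-2", 0.35}}
	fmt.Println(applyPlaybook(cold, []PlaybookHit{{"w-9", 0.15}}, 2)) // close paraphrase
	fmt.Println(applyPlaybook(cold, []PlaybookHit{{"w-9", 0.40}}, 2)) // too far to inject
}
```

In the second call the hit sits inside the boost threshold but outside the inject threshold, so the missing answer is not forced into the results, which is the refusal behavior run #004 was checking for.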
### Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end

Reality test #2 catalog. New harness `scripts/multi_coord_stress.{sh,go}` simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0–48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.

| Capability | Verified | Where |
|---|---|---|
| Per-coordinator playbook isolation | ✓ | `playbook_alice` / `playbook_bob` / `playbook_carol` corpora |
| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% workers differ per region | Phase 1 baseline |
| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |

**Substrate gains added by this wave:**
- `internal/matrix/playbook.go` Shape B + split inject threshold (commit `67d1957` from earlier wave; verified in multi-coord too)
- `internal/matrix/retrieve.go` `ExcludeIDs` field on `SearchRequest` — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
- `internal/observer/types.go` `SourceInbox` taxonomy alongside `SourceMCP / SourceScenario / SourceWorkflow`
- `cmd/observerd` `POST /observer/inbox` route — accepts `{type, sender, subject, body, priority, tag}` and records as `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
- `embedd default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts Jaccard overlap went from 0.080 → 0.000 with the upgrade (lower overlap = more diverse results).
- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
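To make the `ExcludeIDs` swap and the Jaccard figures in the table concrete, here is a small sketch. `filterExcluded` and `jaccard` are hypothetical helpers written for this illustration; the real filter lives inside retrieve, boost, and inject and is not reproduced here. Returning 0 for two empty sets is an assumption.

```go
package main

import "fmt"

type Result struct {
	ID       string
	Distance float64
}

// filterExcluded drops every result whose ID is in excludeIDs. The same
// check would need to run wherever results can enter the list.
func filterExcluded(results []Result, excludeIDs []string) []Result {
	if len(excludeIDs) == 0 {
		return results
	}
	skip := make(map[string]struct{}, len(excludeIDs))
	for _, id := range excludeIDs {
		skip[id] = struct{}{}
	}
	kept := make([]Result, 0, len(results))
	for _, r := range results {
		if _, drop := skip[r.ID]; !drop {
			kept = append(kept, r)
		}
	}
	return kept
}

// jaccard returns |A ∩ B| / |A ∪ B| over two ID lists, treating each as
// a set. Two empty sets yield 0 (an assumption for this sketch).
func jaccard(a, b []string) float64 {
	inA := make(map[string]struct{}, len(a))
	for _, id := range a {
		inA[id] = struct{}{}
	}
	union, inter := len(inA), 0
	seenB := make(map[string]struct{}, len(b))
	for _, id := range b {
		if _, dup := seenB[id]; dup {
			continue
		}
		seenB[id] = struct{}{}
		if _, shared := inA[id]; shared {
			inter++
		} else {
			union++
		}
	}
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	placed := []string{"w-1", "w-2"}
	candidates := []Result{{"w-1", 0.10}, {"w-2", 0.20}, {"w-3", 0.30}, {"w-4", 0.40}}
	replacements := filterExcluded(candidates, placed)
	ids := make([]string, 0, len(replacements))
	for _, r := range replacements {
		ids = append(ids, r.ID)
	}
	fmt.Println(ids, jaccard(placed, ids))
}
```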
### Harness expansion (2026-04-30 ~05:30 CDT)

`scripts/playbook_lift.sh` rewritten from a 5-daemon stripped harness to the **full 10-daemon prod-realistic stack** (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:

| # | Fix | Lock |
|---|---|---|
| 1 | driver→matrixd: `query` → `query_text` field name | `cmd/matrixd/main_test.go` TestPlaybookRecord_OldFieldNameRejected |
| 2 | harness toml missing `[s3]` block | inline comment in `scripts/playbook_lift.sh` |
| 3 | harness→queryd: `q` → `sql` field name | `cmd/queryd/main_test.go` TestHandleSQL_WrongFieldName_400 |
| 4 | 5→10 daemon boot order | inline comment + dep-ordered launch |
| 5 | SQL surface probe (3-row CSV → COUNT=3) | `[lift] ✓ SQL surface probe passed` assertion |
| 6 | `candidates` corpus was SWE-tech, not staffing | swapped to `ethereal_workers.parquet` (10K rows, real staffing schema, "e-" id prefix) |
| 7 | `qwen3.5:latest` is vision-SSM 256K-ctx → 30s/judge | reverted `local_judge` to `qwen2.5:latest` (1s/judge, 30× faster) |
### R-005 closed (2026-04-30 ~05:35 CDT)

Four new `cmd/<bin>/main_test.go` files — chi router-level contract tests:

- `cmd/matrixd/main_test.go` (123 lines) — playbook record drift detector + score bounds + 6 routes mounted
- `cmd/queryd/main_test.go` (extended) — wrong-field-name drift detector
- `cmd/pathwayd/main_test.go` (102 lines) — 9 routes + add round-trip + retire-nonexistent
- `cmd/observerd/main_test.go` (98 lines) — 4 routes + invalid-op 400 + unknown-mode 400

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green. R-005 from prior STATE OPEN list is closed.

---
## DO NOT RELITIGATE

### Ratified ADRs (`docs/DECISIONS.md`)

- **ADR-001**: DuckDB via cgo, HTMX UI, Gitea hosting, distillation rebuilt-not-ported, pathway memory clean start, auditor longitudinal signal restarts. **6 sub-decisions, all final.**
- **ADR-002**: storaged per-prefix PUT cap (4 GiB for `_vectors/`, 256 MiB elsewhere) — implemented at `423a381`. Operator-config bump rather than constant change is the documented path if 4 GiB ever insufficient.
- **ADR-003**: Inter-service auth = Bearer + IP allowlist, opt-in via `cfg.Auth.Token`. Wiring deferred to Sprint 1 but **the design is locked** — alternatives (mTLS, JWT, OAuth2, IP-only) all considered + rejected.
- **ADR-004**: Pathway memory = Mem0 versioned traces, JSONL append-only persistence, opaque `json.RawMessage` content. Implemented in `internal/pathway/`.
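For ADR-002, the per-prefix PUT cap amounts to a prefix lookup followed by a size check. The sketch below only illustrates that rule: `putCapFor` and `checkPut` are hypothetical names, the caps are hard-coded here although the ADR says the documented path is an operator-config bump, and how storaged actually measures body size is not shown.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	defaultPutCap int64 = 256 << 20 // 256 MiB for every other prefix
	vectorsPutCap int64 = 4 << 30   // 4 GiB for _vectors/
)

// putCapFor returns the byte cap that applies to an object key.
func putCapFor(key string) int64 {
	if strings.HasPrefix(key, "_vectors/") {
		return vectorsPutCap
	}
	return defaultPutCap
}

// checkPut rejects a PUT whose declared size exceeds the cap for its prefix.
func checkPut(key string, contentLength int64) error {
	if limit := putCapFor(key); contentLength > limit {
		return fmt.Errorf("PUT %q: %d bytes exceeds cap of %d bytes", key, contentLength, limit)
	}
	return nil
}

func main() {
	fmt.Println(checkPut("_vectors/workers.hnsw", 1<<30))    // within the 4 GiB cap
	fmt.Println(checkPut("datasets/workers.parquet", 1<<30)) // over the 256 MiB cap
}
```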
### Today's scrum dispositions (2026-04-30)

Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition table: `reports/scrum/_evidence/2026-04-30/disposition.md`.

**Real findings, all fixed in `0efc736`:**
- B-1 (Opus+Kimi convergent): `ResolveKey` 3-arg API → 2-arg
- B-2 (Opus+Kimi convergent): `handleProviders` direct map lookup, drop synthesis-via-Resolve
- B-3 (Opus single, trace-verified): `OllamaCloud.Chat` strips `ollama_cloud/` prefix correctly
- B-4 (Opus single): Ollama `done_reason` surfaced to FinishReason

**False positives dismissed (3, documented):**
- FP-A1: Kimi misread `TestMaybeDowngrade_WithConfigList` assertion
- FP-A2: Qwen claimed nil-deref in `MaybeDowngrade` that doesn't exist
- FP-C1: Opus claimed `qwen3.5:latest` doesn't exist on Ollama hub (it does on this box's local install)
### Session frame (don't redo)

- The Rust legacy is **maintenance-only** until Go reaches feature parity. Don't propose ports of components already shipped here.
- The matrix indexer **5/5 components** are shipped. Don't propose to "build the matrix indexer" — it's done.
- The 5-loop substrate's load-bearing gate is **PASSED**. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
- **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
- **Auth posture is locked per ADR-006.** Non-loopback bind requires `auth.token` (mechanical gate at `shared.Run`). Operators set the token via `token_env` (defaults to `AUTH_TOKEN`) loaded by systemd `EnvironmentFile=/etc/lakehouse/auth.env`, NOT in the committed TOML. Internal services use `AllowedIPs`; external boundary uses Bearer. Token rotation is dual-token via `secondary_tokens`. TLS terminates at the edge (nginx/Caddy), not in-process. Don't re-litigate.
- **Shape B inject has a judge-gate substrate.** `InjectPlaybookMisses` takes an optional `InjectGate` (interface) that approves each candidate before the rank insert. `LLMJudgeGate` (Ollama-shape /api/chat client) is the default impl; nil gate = pre-judge-gating distance-only behavior preserved for backward compat. Caller wires via `SearchRequest.{JudgeURL, JudgeModel, JudgeMinRating}`. Closes the lift-suite tail issues (Q6↔Q7 adjacent-query swap + Q9/Q15 paraphrase drift) at substrate level.
- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't revert to `nomic-embed-text` (137M) "for speed": same-role-across-contracts Jaccard overlap rose from 0.000 → 0.080 on the smaller model, i.e. results were less diverse. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
- `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
- `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
- `temperature` is **omitted** for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
- chatd-smoke runs with **all cloud providers disabled** intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
- **Langfuse Go-side client lives at `internal/langfuse/`** with best-effort fail-open posture. URL+creds from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.
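The "non-loopback bind requires `auth.token`" rule from the auth bullet above can be illustrated with a small check. This is a sketch of the idea only: `requireTokenForBind` is a hypothetical name, and how `shared.Run` actually resolves the bind address and token is not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

var errTokenRequired = errors.New("non-loopback bind requires auth.token")

// requireTokenForBind allows token-less startup only when the listen
// address is loopback. An empty host (":3110") binds all interfaces and
// is therefore treated as non-loopback.
func requireTokenForBind(addr, token string) error {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("bad bind address %q: %w", addr, err)
	}
	if host == "localhost" {
		return nil
	}
	if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
		return nil
	}
	if token == "" {
		return errTokenRequired
	}
	return nil
}

func main() {
	for _, addr := range []string{"127.0.0.1:3110", "[::1]:3110", "0.0.0.0:3110", ":3110"} {
		fmt.Printf("%-16s no token -> %v\n", addr, requireTokenForBind(addr, ""))
	}
}
```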
---

## OPEN — what's not done yet

The list is intentionally short. Items move to closed when the work demands them, not on a calendar. Ordered by leverage on the active product theory (multi-coord staffing co-pilot via the 5-loop substrate), not by effort.

**All 4 prior OPEN items closed (substrate or fully) in the 2026-04-30
"fix the other 4" wave.** No new items pending; the substrate is in
a steady state. Future items will land here as production triggers fire.

---
## RECENT VERIFIED WAVE (2026-04-30)

`05273ac..e4ee002` — 4 phases + scrum + tooling, all gate-tested.

| SHA | What |
|---|---|
| `ec1d031` | Phase 1: `[models]` tier config (additive, no callers migrate) |
| `622e124` | Phase 2: `matrix.downgrade` reads `cfg.Models.WeakModels` |
| `848cbf5` | Phase 3: `playbook_lift` harness defaults from config |
| `05273ac` | Phase 4: chatd + 5 providers (1,624 LoC) |
| `0efc736` | Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review |
| `e4ee002` | `scripts/scrum_review.sh` — reusable 3-lineage driver |
| `b2e45f7` | playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%) |
| `6c02c90` | scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast) |
| `2c71d1c` | ADR-005: observer fail-safe semantics |
| `9ce067b` | observerd: test that locks ADR-005 5.3 (provenance recorded post-run) |
| `e9822f0` | playbook_lift v2: paraphrase pass — exposed boost-only limit (0/2 paraphrase recovery) |
| `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
| `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
| `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
| `84a32f0` | multi-coord stress Phase 2 — `ExcludeIDs`, 200-worker swap, fresh-resume |
| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
| `e7fc63b` | observerd `/observer/inbox` + multi-coord stress phase 1c (priority-ordered events) |
| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |
| `68d9e55` | shared: auto-emit Langfuse trace+span per HTTP request — closes OPEN #2 |
| `a2fa9a2` | scrum_review: pipe diff via temp files — fixes argv overflow on large bundles |
| (prep) | G5 cutover prep: `embed_parity` probe — Rust `/ai/embed` ↔ Go `/v1/embed` 5/5 cos=1.000 (both v1 and v2-moe). Verdict + drift catalog in `reports/cutover/SUMMARY.md`. Wire-format remap (`embeddings`/`vectors`, `dimensions`/`dimension`) is the only real cutover work; math is provably equivalent. |
| (probe) | Reality test real_001: 10 real-shape queries from `fill_events.parquet` through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced **same-client+city cross-role bleed** — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Findings: `reports/reality-tests/real_001_findings.md`. |
| (fix) | Cross-role gate: `Role` on `PlaybookEntry`, `QueryRole` on `SearchRequest`, gate fires in both `ApplyPlaybookBoost` + `InjectPlaybookMisses`. `roleEqual` handles case + plural. Backward-compat: empty role on either side = gate disabled (preserves lift suite + free-form callers). 5 new unit tests use exact real_001 distance + role values. Re-run real_002: bleed closed (Q#5 Pickers, Q#10 CNC Operator stay at cold-pass top-1; same-role lifts still fire). Closes OPEN #1. Findings: `reports/reality-tests/real_002_findings.md`. |
| (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
| (fix) | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
| (scrum) | 3-lineage scrum review on `7f2f112..0331288` (Opus + Kimi + Qwen3-coder via `scripts/scrum_review.sh`). Convergent finding (3/3): `roleNormalize` plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). **Fixed**: `nonPluralSWords` allowlist + `-ss` ending check + `strings.ToLower`/`TrimSpace` cleanup. New tests `TestRoleNormalize_NonPluralS` + `TestRoleEqual_NonPluralS` lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per `feedback_cross_lineage_review.md`). Disposition: `reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md` (local). |
| (probe) | Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has **zero negation handling** — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). **No code change needed**: production UI should handle exclusion via `ExcludeIDs` (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: `reports/reality-tests/real_005_findings.md`. |
| (wire-up) | Multi-coord stress role wire-through: `Demand.Role` was already extracted at every call site (44 occurrences) but never threaded into matrix retrieve or playbook record. Cross-role gate was bypassed for the entire multi-coord harness. **Fixed** by extending `tracedSearch`, `matrixSearch`, and `playbookRecord` signatures with `role string` and updating all 14 call sites — passing `d.Role` (demand loops), `parsed.Role` (LLM-parsed inbox path), `warehouseDemand.Role` (swap path), `ev.Role` (reissue path), `""` (fresh-verify resume snippet — no clean role). Build + vet + tests green; multi-coord stress now honors role gate end-to-end. |
| (close-1) | **OPEN #1: vectord merge endpoint** — `POST /v1/vectors/index/{src}/merge` with body `{dest, clear_source}`. Idempotent on re-runs (existing-in-dest items skipped). New `Index.IDs()` snapshot method backs it; new `i.ids` tracker field is the canonical ID set (independent of meta map's nil-vs-{} sparseness). 4 cmd-level tests + 1 unit test. |
| (close-2) | **OPEN #2: distillation SFT export substrate** — `internal/distillation/sft_export.go`: `IsSftNever` predicate + `ListScoredRunFiles` (data/scored-runs/YYYY/MM/DD walk) + `LoadScoredRunsFromFile` + partial `ExportSft` that wires the firewall but leaves synthesis (instruction/input/response generation) as the next wave. Firewall pinning test fails if `SftNever` set changes without review. 5 new tests. The synthesis port remains on Rust at `scripts/distillation/export_sft.ts`. |
| (close-2 full) | **OPEN #2 fully ported** (2026-05-01): `SynthesizeSft` + `LoadEvidenceByRunID` + `buildInstruction` ported byte-for-byte from `scripts/distillation/export_sft.ts`. All 8 source-class instruction templates (scrum_reviews / mode_experiments / auto_apply / audits / observer_reviews / contract_analyses / outcomes / default) match Rust output exactly so a/b validation between runtimes can diff JSONL byte-for-byte. `ExportSft` writes to `data/distilled/sft/sft_export.jsonl`. 5 additional tests including per-source-class template verification, extraction-rejection, empty-text-rejection, context-assembly, end-to-end fixture write. |
| (close-2 lineage) | **Audit-baselines lineage ported** (2026-05-01): `internal/distillation/audit_baseline.go` mirrors Rust `audit_full.ts`'s LoadBaseline/AppendBaseline/buildDriftTable. `LoadLastBaseline` reads the most recent JSON line from `data/_kb/audit_baselines.jsonl`; `AppendBaseline` appends one JSON line per entry via bufio + `O_APPEND`. `BuildAuditDriftTable` flags drift `>20%` (configurable); zero-baseline and new-metric edge cases handled (no division-by-zero, no false-stable on zero→nonzero). `FormatAuditDriftTable` for stdout dumps. Generic on metric names so callers running both runtimes can pin Rust-compat names (`AuditBaselineRustCompat` constant lists them). 13 tests including last-line-wins, trailing-blank-tolerance, malformed-line-errors, threshold-boundary, zero-baseline-handling, sort-stability. |
| (close-3) | **OPEN #3: distribution drift via PSI** — `internal/drift/drift.go`: `ComputeDistributionDrift` returns Population Stability Index + verdict tier (stable < 0.10, minor 0.10–0.25, major ≥ 0.25). Equal-width bucketing over combined min/max range, epsilon-clamping for empty buckets, per-bucket breakdown for drilldown. 7 new tests including identical-is-stable, hard-shift-is-major, moderate-detected-not-stable, empty-inputs-safe, all-identical-safe, bucket-counts-conserved, num-buckets-clamping. |
| (close-4) | **OPEN #4: ops nice-to-haves** — (a) Real-time wall-clock for the stress harness: per-phase elapsed time is logged to stdout as it runs (`[stress] phase NAME starting (T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`), and `Output.PhaseTimings` + `Output.TotalElapsedMs` are written to JSON. (b) chatd fixture-mode S3 mock and (c) liberal-paraphrase calibration: not actioned; no trigger has fired yet, so building either now would be speculative. Both are documented as deferred-until-need rather than ignored. |
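
The drift rules in the (close-2 lineage) row can be sketched roughly as below. This is a minimal illustration, not the code in `internal/distillation/audit_baseline.go`: the `driftRow` type, the `buildDriftTable` name, and the metric names are hypothetical. It also assumes the threshold comparison is strict (`>`), that percent change is taken relative to the absolute prior value, and that metrics present only in the prior baseline are dropped.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// driftRow is a hypothetical row shape; the real struct may differ.
type driftRow struct {
	Metric    string
	Prior     *float64 // nil when the metric has no prior value
	Current   float64
	PctChange *float64 // nil when no ratio can be computed
	Flag      string   // "first_run", "ok", or "warn"
}

// buildDriftTable compares current metrics against a prior baseline.
// A nil prior map models the very first run.
func buildDriftTable(prior, current map[string]float64, threshold float64) []driftRow {
	rows := make([]driftRow, 0, len(current))
	for name, cur := range current {
		row := driftRow{Metric: name, Current: cur}
		p, seen := prior[name] // reading from a nil map is safe in Go
		switch {
		case !seen:
			// New metric (or no baseline at all): never reported as 0% change.
			row.Flag = "first_run"
		case p == 0:
			// Zero baseline: no division, PctChange stays nil.
			row.Prior = &p
			row.Flag = "ok"
			if cur != 0 {
				row.Flag = "warn" // zero -> nonzero is always notable
			}
		default:
			pct := (cur - p) / math.Abs(p)
			row.Prior = &p
			row.PctChange = &pct
			row.Flag = "ok"
			if math.Abs(pct) > threshold { // assumption: strict comparison
				row.Flag = "warn"
			}
		}
		rows = append(rows, row)
	}
	// Sort by metric name so output is deterministic despite map iteration order.
	sort.SliceStable(rows, func(i, j int) bool { return rows[i].Metric < rows[j].Metric })
	return rows
}

func main() {
	// Hypothetical metric names, for illustration only.
	prior := map[string]float64{"evidence_rows": 100, "rejected": 0, "scored_runs": 50}
	current := map[string]float64{"evidence_rows": 130, "rejected": 3, "scored_runs": 55, "sft_rows": 12}
	for _, r := range buildDriftTable(prior, current, 0.20) {
		pct := "n/a"
		if r.PctChange != nil {
			pct = fmt.Sprintf("%+.1f%%", *r.PctChange*100)
		}
		fmt.Printf("%-14s %8.0f %8s  %s\n", r.Metric, r.Current, pct, r.Flag)
	}
}
```

In this sketch `evidence_rows` (100 to 130) and `rejected` (0 to 3) are flagged `warn`, `scored_runs` (50 to 55) is `ok`, and `sft_rows` is `first_run` because it has no prior value.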

Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
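For the PSI drift check in the (close-3) row, a minimal sketch of the calculation might look like the following. It is not the code in `internal/drift/drift.go`: the function names `psi` and `verdict` are hypothetical, and the epsilon value, the lower clamp on the bucket count, and the treatment of empty or all-identical inputs as `stable` are assumptions. The verdict tiers and the equal-width bucketing over the combined min/max range come from the row itself.

```go
package main

import (
	"fmt"
	"math"
)

// epsilon clamps empty-bucket shares so the log ratio stays finite.
// Assumption: the real constant may differ.
const epsilon = 1e-6

// verdict maps a PSI value onto the tiers listed in the status table.
func verdict(v float64) string {
	switch {
	case v < 0.10:
		return "stable"
	case v < 0.25:
		return "minor"
	default:
		return "major"
	}
}

// psi computes the Population Stability Index between two samples using
// equal-width buckets over their combined min/max range.
func psi(baseline, current []float64, numBuckets int) (float64, string) {
	if len(baseline) == 0 || len(current) == 0 {
		return 0, "stable" // assumption: empty input is treated as no drift
	}
	if numBuckets < 2 {
		numBuckets = 2 // assumption: clamp to a usable minimum
	}
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, xs := range [][]float64{baseline, current} {
		for _, x := range xs {
			lo, hi = math.Min(lo, x), math.Max(hi, x)
		}
	}
	if hi == lo {
		return 0, "stable" // every value is identical, nothing to bucket
	}
	width := (hi - lo) / float64(numBuckets)

	// share returns each bucket's fraction of the sample, clamped to epsilon.
	share := func(xs []float64) []float64 {
		out := make([]float64, numBuckets)
		for _, x := range xs {
			i := int((x - lo) / width)
			if i >= numBuckets {
				i = numBuckets - 1 // the maximum value lands in the last bucket
			}
			out[i]++
		}
		for i := range out {
			out[i] = math.Max(out[i]/float64(len(xs)), epsilon)
		}
		return out
	}

	b, c := share(baseline), share(current)
	total := 0.0
	for i := range b {
		total += (c[i] - b[i]) * math.Log(c[i]/b[i])
	}
	return total, verdict(total)
}

func main() {
	baseline := []float64{0.10, 0.20, 0.30, 0.40, 0.50}
	shifted := []float64{0.60, 0.70, 0.80, 0.90, 1.00}
	v, tier := psi(baseline, shifted, 5)
	fmt.Printf("PSI=%.3f verdict=%s\n", v, tier)
}
```

Comparing a sample against itself gives a PSI of exactly zero in this sketch, since every per-bucket difference is zero.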

---

## RUNTIME CHEATSHEET

```bash
# Verify everything green
cd /home/profit/golangLAKEHOUSE
just verify   # vet + tests + 9 core smokes (~31s)
just doctor   # dep probe (go/gcc/minio/ollama/secrets)

# Boot the chat dispatcher (Phase 4)
nohup ./bin/chatd -config lakehouse.toml > /tmp/chatd.log 2>&1 & disown
nohup ./bin/gateway -config lakehouse.toml > /tmp/gateway.log 2>&1 & disown
curl -sf http://127.0.0.1:3110/v1/chat/providers | jq   # all 5 providers should report true

# Test a chat call to each lineage
for m in "qwen3.5:latest" "opencode/claude-opus-4-7" "openrouter/moonshotai/kimi-k2-0905"; do
  curl -sS -X POST http://127.0.0.1:3110/v1/chat \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$m\",\"messages\":[{\"role\":\"user\",\"content\":\"reply: OK\"}],\"max_tokens\":8}" \
    | jq -c '{model,provider,content}'
done

# Run the scrum on a diff
./scripts/scrum_review.sh path/to/bundle.diff bundle_label
ls reports/scrum/_evidence/$(date +%Y-%m-%d)/verdicts/

# Domain smokes (not in `just verify`)
for s in chatd matrix observer pathway playbook relevance downgrade workflow; do
  bash scripts/${s}_smoke.sh > /tmp/${s}.log 2>&1 && echo "$s ✓" || echo "$s ✗"
done
```

---

## VISION — what we're actually building

J's framing (canonical at `/root/.claude/projects/-home-profit/memory/project_small_model_pipeline_vision.md`): a small-model-driven autonomous pipeline that gets better with each run. Frontier APIs (Opus, Kimi, GPT-5) are too expensive and too rate-limited for the inner loop, so they live in audit/oversight via the `frontier_*` tier. The hot path runs on local `qwen3.5:latest`, given:

1. **Pathway memory** — what we tried before, how it went (Mem0 substrate ✓)
2. **Matrix indexer** — multi-corpus retrieve+merge giving the small model the right slice for this task (5/5 components ✓)
3. **Observer** — watches each run, refines configs (not prompts) toward good pathways

Successful runs get **rated and distilled back into the playbook**. With each iteration the playbook gets denser, runs get cheaper, and results get better. **Drift** in the distilled playbook is a measured signal, not vibes.

**The single load-bearing gate:** *"the playbook + matrix indexer must give the results we're looking for."* Throughput, scaling, and code elegance are all secondary. The `playbook_lift` reality test is the regression gate before Enterprise cutover (where real contracts + live profile updates land).

When evaluating any Go workstream, ask: which of the 5 loops does this advance? Strong workstreams advance ≥1; weak workstreams sit in infra-for-its-own-sake.

---

## SIBLING TOOLS (separate repos, intentional integration target later)

**`local-review-harness`** at `git.agentview.dev/profit/local-review-harness` (also SMB-mounted at `/home/profit/share/local-review-harness-full-md/`). A local-first code review harness: 12 evidence-bearing static analyzers, Scrum-style reports, no cloud deps. Phase A + B (MVP) shipped 2026-04-30. Phases C–E (Ollama LLM review, validation, memory) are pending.

**Cross-pollination plan when both stabilize:**

- Replace harness's `internal/llm/ollama.go` with a chatd `/v1/chat` client → frontier judges via config toggle
- Feed harness findings into Lakehouse pathway memory as a drift signal
- Treat harness's `.memory/known-risks.json` as a matrix-indexer corpus

Detail at `docs/SPEC.md` §3.10. Don't re-port harness functionality into Lakehouse-Go; the standalone tool is the design.
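As a rough idea of what the first cross-pollination item could involve, here is a minimal sketch of a chatd `/v1/chat` client. The request fields (`model`, `messages`, `max_tokens`) are taken from the curl call in the runtime cheatsheet, and the response fields (`model`, `provider`, `content`) from its `jq` filter; everything else is an assumption. The type names, the `chat` function, and the `fakeChatd` stand-in server are hypothetical, and the real response may carry more fields or a different layout.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model     string        `json:"model"`
	Messages  []chatMessage `json:"messages"`
	MaxTokens int           `json:"max_tokens"`
}

// chatResponse covers only the three fields the cheatsheet's jq filter reads.
type chatResponse struct {
	Model    string `json:"model"`
	Provider string `json:"provider"`
	Content  string `json:"content"`
}

// chat posts one user prompt to baseURL + "/v1/chat" and decodes the reply.
func chat(client *http.Client, baseURL, model, prompt string, maxTokens int) (chatResponse, error) {
	var out chatResponse
	body, err := json.Marshal(chatRequest{
		Model:     model,
		Messages:  []chatMessage{{Role: "user", Content: prompt}},
		MaxTokens: maxTokens,
	})
	if err != nil {
		return out, err
	}
	resp, err := client.Post(baseURL+"/v1/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("chatd returned %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

// fakeChatd is a stand-in server so the sketch runs without a live chatd.
// It echoes the requested model and always answers "OK".
func fakeChatd() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req chatRequest
		if r.URL.Path != "/v1/chat" || json.NewDecoder(r.Body).Decode(&req) != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(chatResponse{Model: req.Model, Provider: "fake", Content: "OK"})
	}))
}

func main() {
	srv := fakeChatd()
	defer srv.Close()
	reply, err := chat(srv.Client(), srv.URL, "qwen3.5:latest", "reply: OK", 8)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s via %s: %s\n", reply.Model, reply.Provider, reply.Content)
}
```

Against a live dispatcher, `baseURL` would be `http://127.0.0.1:3110` as in the cheatsheet, and switching between the local model and a frontier judge would come down to the `model` string passed in.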