Frames the Rust system accurately — it's receiving parity work +
infrastructure (Lance gauntlet, sidecar drop, observability parity),
not just security fixes. Points readers at lakehouse/STATE_OF_PLAY.md
+ docs/ARCHITECTURE_COMPARISON.md for the cross-runtime view.
Also commits today's parity probe report regenerations (5/5 still
32/32 post-Lance work).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to lakehouse 98b6647. Architecture comparison decisions
tracker now captures:
- Go validatord direct header read (fixes 6847bbc): closes the
case where Langfuse-off middleware passthrough silently dropped
forwarded X-Lakehouse-Trace-Id
- Rust IterateResponse trace_id echo (fixes 98b6647): closes the
asymmetry where Go's response carried the join key and Rust's
didn't
- Unified longitudinal log demonstrated end-to-end: both daemons
co-writing /tmp/lakehouse-validator/sessions.jsonl, distinct
daemon tags, one DuckDB query covers both
24/24 parity assertions (validator 6/6, extract_json 12/12,
session_log 4/4, materializer 2/2) hold against live :3100 + :4110.
Both runtimes deployed with today's full stack.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to lakehouse commit 57bde63 (Rust gateway gains
trace-id propagation + coordinator session JSONL). The
cross-runtime parity probe is the regression gate that prevents
silent schema drift between the two runtimes.
scripts/cutover/parity/session_log_parity.sh:
- 4 fixtures (accepted_grounded, max_iter_exhausted, infra_error,
unicode_in_prompt) feed identical input to both helpers
- jq -e validity gate + non-trivial-equal guard prevents the
"both sides fail identically → spurious match" failure mode
(caught one IFS='||' bug during initial authoring — recorded
in the script comment)
- normalize() strips timestamp + daemon (legitimate per-producer
differences); everything else must be byte-equal
- Result: 4/4 fixtures match, including unicode
scripts/cutover/parity/session_log_helper/main.go:
- Tiny stdin/stdout Go helper that round-trips a fixture
through validator.SessionRecord serde
- Counterpart to crates/gateway/src/bin/parity_session_log.rs
docs/ARCHITECTURE_COMPARISON.md decisions tracker:
- "Rust observability parity" row added (DONE 2026-05-02)
- Cross-runtime probe documented as reusable gate
STATE_OF_PLAY refreshed.
Both observability pieces (trace-id propagation, session JSONL)
now exist on both runtimes. Operators who point Rust gateway and
Go validatord at the same session-log path get a unified
longitudinal stream queryable via DuckDB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the second half of J's 2026-05-02 multi-call observability
concern. Trace-id propagation (commit d6d2fdf) gave us the *live*
view in Langfuse; this gives us the *longitudinal* view for ad-hoc
DuckDB queries over thousands of sessions:
"show me every session where the model produced a real candidate
without ever needing a retry"
"find sessions where validation rejected three times in a row"
"first-shot success rate per model — did we feed it enough corpus?"
## What's in
internal/validator/session_log.go:
- SessionRecord type (schema=session.iterate.v1)
- SessionLogger writer — mutex-guarded append, best-effort posture,
nil-safe (NewSessionLogger("") = nil = no-op on Append)
- BuildSessionRecord helper — assembles a row from any
iterate response/failure/infra-error combination, callable from
other daemons that wrap iterate (cross-daemon shared schema)
- 7 unit tests including concurrent-append safety + the three
code paths (success / max_iter_exhausted / infra_error)
cmd/validatord/main.go:
- handlers.sessionLog field + wiring from cfg.Validatord.SessionLogPath
- Iterate handler: build + append a SessionRecord on every call
- rosterCheckFor("fill") closure stamps grounded_in_roster — the
load-bearing forensic property J flagged ("we can never
hallucinate available staff members to contracts")
internal/shared/config.go + lakehouse.toml:
- [validatord].session_log_path field; empty = disabled
- Production: /var/lib/lakehouse/validator/sessions.jsonl
scripts/validatord_smoke.sh:
- Adds a probe verifying validatord announces session log path on
startup. Smoke is now 6/6 (was 5/5).
docs/SESSION_LOG.md:
- Schema reference + 5 worked DuckDB query examples including the
"alarm" query (sessions where grounded_in_roster=false on an
accepted fill — should always be empty; if not, something is
bypassing FillValidator).
## What this is NOT
This is NOT a duplicate of replay_runs.jsonl. They're siblings:
- replay_runs.jsonl: replay tool's per-task retrieval+model output
- sessions.jsonl: validatord's per-iterate full retry chain +
grounded-in-roster verdict
A single coordinator session can produce rows in both streams; the
session_id (= Langfuse trace_id) is the join key.
## Layered observability now in place
Live view: Langfuse trace tree (X-Lakehouse-Trace-Id propagation)
`iterate.attempt[N]` spans with prompt/raw/verdict
Offline: coordinator_sessions.jsonl (this commit)
DuckDB-queryable; longitudinal forensics
Hard gate: FillValidator + WorkerLookup (existing)
phantom IDs structurally rejected, never reach
session log's grounded_in_roster=true bucket
Per the architecture invariant in STATE_OF_PLAY's DO NOT RELITIGATE
section — these layers are wired; future work targets the data, not
the wiring.
## Verification
- internal/validator: 7 new tests (session_log_test.go) — all PASS
- cmd/validatord: 3 new integration tests covering the success,
failure, and grounded=false paths — all PASS
- validatord_smoke.sh: 6/6 PASS through gateway :3110
- Full go test ./... green across 33 packages
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes J's 2026-05-02 multi-call observability gap: a single
/v1/iterate session with N retries used to surface in Langfuse as
N+1 disconnected traces (one per /v1/chat hop + one for the iterate
request itself), with no parent/child linkage. Operators couldn't
scroll the retry chain in one trace tree to spot where grounding
failed.
## Wire-level change
- New header constant `shared.TraceIDHeader = "X-Lakehouse-Trace-Id"`
- `langfuseMiddleware` honors the header on inbound requests: if
set, reuses that trace id instead of minting a new one. Stashes
the trace id on the request context so handlers can attach
application-level child spans.
- `validatord.chatCaller` forwards the header to chatd. Every chat
hop in an iterate session lands as a child of the parent trace.
## Application-level spans
- `validator.IterateConfig` gains `Tracer` (optional callback).
When wired, each iteration attempt emits one Langfuse span
via `validator.AttemptSpan`:
Name: iterate.attempt[N]
Input: { iteration, model, provider, prompt }
Output: { verdict, raw, error }
Level: WARNING when verdict != accepted
- `validatord.iterTracer` is the production hook — bridges
`validator.Tracer` → `langfuse.Client.Span`.
- `IterateRequest`/`IterateResponse`/`IterateFailure` gain
`TraceID`; each `IterateAttempt` gains `SpanID`. The /v1/iterate
caller can pivot from the JSON response straight into the
Langfuse trace tree.
## What an operator sees post-cutover
GET /v1/iterate {kind=fill, prompt=...} → Trace TR-1
├─ http.request span (from middleware)
├─ iterate.attempt[0] span (validator.Iterate emit)
│ input: prompt+model
│ output: { verdict: validation_failed, error: ..., raw }
├─ chatd /v1/chat call (X-Lakehouse-Trace-Id: TR-1)
│ ├─ http.request span (chatd middleware)
│ └─ chatd-internal spans (existing)
├─ iterate.attempt[1] span
└─ ...
All in one Langfuse trace tree, not N+1 separate traces.
## Hallucinated-worker safety net is unchanged
The /v1/iterate flow's hard correctness gate is still
FillValidator + WorkerLookup. Phantom candidate IDs raise
ValidationError::Consistency which 422s and forces the iteration
loop to retry. The trace-id propagation is the OBSERVABILITY layer
on top — it makes the existing safety net's outcomes visible per-call,
not a replacement for it.
## Verification
- internal/validator: 4 new tests
- TestIterate_TracerEmitsSpanPerAttempt — span/attempt count + SpanID
- TestIterate_NoTraceIDSkipsTracer — no orphan spans without trace_id
- TestIterate_ChatCallerReceivesTraceID — propagation contract
- (existing iterate tests updated for new ChatCaller signature)
- internal/shared: 1 new test
- TestLangfuseMiddleware_HonorsTraceIDHeader — cross-service linkage
- cmd/validatord: existing HTTP tests still PASS via the dual-shape
UnmarshalJSON contract.
- validatord_smoke.sh: 5/5 PASS through gateway :3110 (unchanged).
- Full go test ./... green across 33 packages.
## Architecture invariant added
STATE_OF_PLAY "DO NOT RELITIGATE" gains a paragraph documenting
the X-Lakehouse-Trace-Id header contract + the iterate.attempt[N]
span emission. Future-Claude won't re-propose "wire trace-id
propagation" — the header IS the wiring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the 2026-05-02 parity finding: validator_parity probe found
5/6 body shapes diverging because Go emitted {"Kind":"...","Field":"...","Reason":"..."}
while Rust emits the externally-tagged-enum {"Schema":{"field":"...","reason":"..."}}.
A caller parsing the error envelope would break silently in cutover.
## Changes
internal/validator/types.go:
- Custom MarshalJSON emits the Rust shape:
Schema: {"Schema": {"field":"x","reason":"y"}}
Completeness: {"Completeness":{"reason":"y"}}
Consistency: {"Consistency": {"reason":"y"}}
Policy: {"Policy": {"reason":"y"}}
- Custom UnmarshalJSON accepts BOTH the new Rust shape AND the legacy
flat shape (migration safety for any persisted error rows).
- Unknown variants (e.g. a future Rust addition Go hasn't learned)
surface as an Unmarshal error, not a silent default.
internal/validator/types_test.go:
- 4 pinning tests anchor the wire format. Failing them = wire-format
drift; the parity probe is the secondary line of defense.
scripts/validatord_smoke.sh:
- Updated probes to read the new variant-name shape (jq keys[0],
.Schema.field) instead of legacy .Kind/.Field.
## Verification
- internal/validator unit tests: PASS (4 new + all existing).
- cmd/validatord HTTP tests: PASS (UnmarshalJSON falls through to flat
shape so existing tests reading ValidationError still work).
- validatord_smoke.sh: 5/5 PASS through gateway :3110.
- validator parity probe re-run: **6/6 match** (was 1/6).
## Pattern
Per architecture_comparison's "use the dual-implementation as a
measurement instrument" thesis: a parity probe surfaced this gap;
50 LOC of MarshalJSON closed it; 4 pinning tests prevent regression;
the probe is the longitudinal gate. Cutover-friendly direction (Go
matches Rust) chosen because Rust is the existing production
contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new cross-runtime parity probes joining the validator probe from
the gauntlet wave. Pattern: feed identical input through Rust and Go;
diff outputs. Each probe surfaced a different signal.
## Materializer parity probe
scripts/cutover/parity/materializer_parity.sh runs Bun + Go
materializer against an identical synthetic data/_kb/ root, diffs the
resulting evidence/ JSONL byte-equivalent (modulo provenance.recorded_at).
**First run: 0/2 match.** Real finding: Go's Provenance.LineOffset
had `json:"line_offset,omitempty"` which strips the field when value
is 0. Line offset 0 is the FIRST ROW of every source file — a real
semantic value, not absent. Bun side always emits it.
Fix: drop `omitempty` on Provenance.LineOffset. Updated comment
explaining why.
**Re-run: 2/2 match.** On-wire JSON parity holds.
## extract_json parity probe
scripts/cutover/parity/extract_json_parity.sh feeds 12 fixture
strings through both runtimes' extract_json:
- fenced ```json``` blocks
- unfenced ``` blocks
- bare braces with prose around
- first-balanced-of-many
- nested objects
- unicode in string values
- escaped quotes
- empty object
- top-level array (both return first inner object)
- no JSON
- depth-balanced but invalid syntax
- trailing garbage
Substrate gate: cargo test -p gateway extract_json PASS before probe.
**Result: 12/12 match.** Algorithms genuinely equivalent.
## scripts/cutover/parity/extract_json_helper/main.go
Tiny Go binary that reads stdin, calls validator.ExtractJSON, prints
{matched, value} JSON. Counterpart to the Rust parity_extract_json
binary in golangLAKEHOUSE's sibling lakehouse repo (separate commit).
## Pattern crystallized
Every cross-runtime port should land with a parity probe. Three
probes now exist:
- validator (5/6 wire-format gap captured 2026-05-02)
- materializer (caught + fixed real bug 2026-05-02)
- extract_json (12/12 match 2026-05-02)
The instrument is reusable — each new shared HTTP/CLI surface gets
a probe row added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production-readiness gauntlet exploiting the dual Rust/Go
implementation as a measurement instrument.
## Phase 1 — Full smoke chain
21/21 PASS in ~60s. Substrate intact across the full service surface.
## Phase 2 — Per-component scrum (token-volume fix)
Prior wave (165KB diff): Kimi 62 tokens out, Qwen 297 → no useful
analysis. This wave splits today's commits into 4 focused bundles
(36-71KB each):
c1 validatord (46KB) → 0 convergent / 11 distinct
c2 vectord substrate (36KB) → 0 convergent / 10 distinct
c3 materializer (71KB) → 0 convergent / 6 distinct (Opus emitted
a BLOCK then self-retracted in same response)
c4 replay (45KB) → 0 convergent / 10 distinct
Reviewer engagement vs prior wave: Kimi went 62 → ~250 tokens out
once bundles dropped below 60KB.
scripts/scrum_review.sh hardening:
* Diff-size guard (warn >60KB, hard-fail >100KB,
SCRUM_FORCE_OVERSIZE=1 override)
* Tightened prompt — file path must appear EXACTLY as in diff
so post-processor can grep WHERE: lines reliably
* Auto-tally step dedupes by (reviewer, location); convergence
counts distinct lineages (closes the prior `opus+opus+opus`
false-convergence bug)
## Phase 3 — Cross-runtime validator parity probe (the headline finding)
scripts/cutover/parity/validator_parity.sh sends 6 identical
/v1/validate cases to Rust :3100 AND Go :4110, compares status+body.
Result: **6/6 status codes match · 5/6 body shapes diverge.**
Rust returns serde-tagged enum: {"Schema":{"field":"x","reason":"y"}}
Go returns flat exported-fields: {"Kind":"schema","Field":"x","Reason":"y"}
Both round-trip inside their own runtime; a caller swapping one for
the other would break parsing silently. Captured as new _open_ row
in docs/ARCHITECTURE_COMPARISON.md decisions tracker.
This is the "use the dual-implementation as a measurement instrument"
return — single-repo scrums can't catch this class of cross-runtime
drift.
## Phase 4 — Production assessment
ship-with-known-gap. Validator wire-format gap is documented, not
regressed. ~50 LOC future fix on Go side (custom MarshalJSON on
ValidationError to match Rust's serde shape).
Persistent stack config (/tmp/lakehouse-persistent.toml) gains
validatord on :3221 + persistent-validatord binary so operators
bringing up the persistent stack get the new daemon automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two threads landing together — the doc edits interleave so they ship
in a single commit.
1. **vectord substrate fix verified at original scale** (closes the
2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211
scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix).
Throughput dropped 1,115 → 438/sec because previously-broken
scenarios now do real HNSW Add work — honest cost of correctness.
The fix (i.vectors side-store + safeGraphAdd recover wrappers +
smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the
footprint that originally surfaced the bug.
2. **Materializer port** — internal/materializer + cmd/materializer +
scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts
(12 transforms) + build_evidence_index.ts (idempotency, day-partition,
receipt). On-wire JSON shape matches TS so Bun and Go runs are
interchangeable. 14 tests green.
3. **Replay port** — internal/replay + cmd/replay +
scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts
(retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL
phase 7 live invocation on the Go side. Both runtimes append to the
same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.
Side effect on internal/distillation/types.go: EvidenceRecord gained
prompt_tokens, completion_tokens, and metadata fields to mirror the TS
shape the materializer transforms produce.
STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions
tracker moves the materializer + replay items from _open_ to DONE and
adds the substrate-fix scale verification row.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
J said "let's go" → "next" (option 3): actual flip via Bun
mcp-server. Done. Real Bun-frontend traffic now reaches the Go
substrate via /_go/* on Bun :3700, routed to the persistent Go
gateway at :4110.
Companion change in /home/profit/lakehouse (Rust legacy):
mcp-server/index.ts: new /_go/* pass-through, opt-in via
GO_LAKEHOUSE_URL env var. Off-by-default (returns 503 on
/_go/* with rationale). Existing /api/* (Rust gateway) path
unchanged. Committed locally on the demo/post-pr11 branch.
System config:
/etc/systemd/system/lakehouse-agent.service.d/go-cutover.conf
adds Environment=GO_LAKEHOUSE_URL=http://127.0.0.1:4110 to
the systemd-managed Bun service. Reversible via systemctl
revert lakehouse-agent.
Live verification (operator curl through Bun frontend):
- /_go/health: gateway responds {"status":"ok","service":"gateway"}
- /_go/v1/embed: nomic-embed-text-v2-moe vectors, dim=768
- /_go/v1/matrix/search vs persistent 200-worker corpus:
rank=0 id=w-43 Brian Ramirez (Forklift Operator, Springfield IL)
rank=1 id=w-102 Laura Long (Forklift Operator, Cleveland OH)
rank=2 id=w-101 Terrence Gray (Forklift Operator, Champaign IL)
3/3 role match, top-1 in IL exactly
- /api/health: lakehouse ok (Rust path unchanged — control verified)
What this is NOT:
- Not an nginx flip — devop.live/lakehouse/* still goes through
/api/* → Rust :3100. /_go/* is parallel slice for opt-in.
- Not a tool-level cutover — each /_go/<path> is a manual choice;
no automatic mapping of Rust paths to Go equivalents.
- Not a transformation layer — caller sends Go-shaped requests
(e.g. /_go/v1/embed expects {texts, model}, not {text}).
Three cutover unit properties verified:
- ADDITIVE: zero modification to any existing Bun tool
- REVERSIBLE: unset GO_LAKEHOUSE_URL → /_go/* → 503
- ISOLATED: Rust gateway state unaffected (different port,
different binary, different MinIO bucket)
This is the cutover slice operators can use to validate Go-side
handlers under realistic frontend conditions before any
production-traffic flip. Next step (deferred): pick a specific
mcp-server tool to optionally route through Go with response-
shape adapter — that's a product-visible flip rather than this
infrastructure-visible slice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
J's "let's go" instruction: leave OPEN list behind, push the Go
substrate forward into actual deployment shape. This commit marks
the first time the Go side has run as long-running daemons rather
than per-harness transient processes, and the first time the
shared cross-runtime longitudinal log has carried a Go-emitted
entry alongside the Rust ones.
What landed:
scripts/cutover/start_go_stack.sh — the persistent-stack runbook.
Brings up all 11 daemons (storaged → catalogd → ingestd → queryd
→ embedd → vectord → pathwayd → observerd → matrixd → gateway,
plus chatd-if-not-already-up) in dependency order via nohup +
disown. Anchored pkill per feedback_pkill_scope (never bare
"bin/"). Logs land in /tmp/gostack-logs/<bin>.log, one per daemon.
Verified live state:
- All 11 services healthy on :3110 + :3211-:3220
- gateway → embedd proxy returns nomic-embed-text-v2-moe vectors
- chatd reports 5/5 providers loaded
- No port collision with Rust gateway on :3100
- Daemons stay up after exit of the start script (production shape,
not harness-transient)
audit_baselines.jsonl crosses the runtime boundary:
- 7 Rust-emitted entries (last: ca7375ea 2026-04-27)
- 1 Go-emitted entry (ee2a40c 2026-05-01T07:53:54Z) appended via
./bin/audit_full -append-baseline
- Same envelope shape, same metric set, same drift comparator
semantics — operators running either runtime grow the same log
What this DOES prove:
- Substrate parity at deployment shape (not just unit tests)
- Cross-runtime artifact write-side compatibility (was previously
proven on read side via audit_baselines roundtrip)
- The deploy machinery works end-to-end for the persistent case
What this does NOT prove (still ahead):
- Real coordinator traffic against the Go stack (no nginx flip yet;
devop.live/lakehouse/ still serves through Rust)
- Go-side production materializer (Phase 2 is observer-only)
- Replay tool parity (Phase 7 is observer-only)
- The 5-loop product gate against actual humans
reports/cutover/SUMMARY.md now logs three new rows:
- audit-FULL with 12/12 phases ported
- First Go-emitted audit_baselines entry
- Persistent Go stack live
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes 4 of the 5 phases the initial audit-FULL port left as
deferred. The pattern: most "deferred" phases didn't actually need
the un-ported Rust pieces — they were observer-mode by design and
just needed to read existing on-disk artifacts.
Phase 1 (schema validators) → ported via exec.Command:
Invokes `go test ./internal/distillation/...` — the Go equivalent
of Rust's `bun test auditor/schemas/distillation/`. New
GoTestModule field on AuditFullOptions controls the package
pattern; empty disables the invocation (test mode, prevents
recursion when audit-full is invoked from inside `go test`).
Phase 2 (evidence materialization) → ported as observer:
Reads data/evidence/ directly and tallies rows + tier-1 source
hits. Doesn't re-run the materializer (which is Rust-side TS).
Emits p2_evidence_rows + p2_evidence_skips metrics matching
Rust shape — drop-in audit_baselines.jsonl entries possible.
Phase 5 (run summary) → ported as observer:
Reads reports/distillation/{run_id}/summary.json + 5 stage
receipts. Validates schema_version=1, run_hash sha256, git_commit
40-char hex, all stage receipts decode as JSON. Full schema
validation (StageReceipt schema) is intentionally NOT ported —
it would require porting the TS schemas/distillation/ validators
in full; basic shape checks catch the load-bearing invariants.
Phase 7 (replay log) → ported as observer:
Reads data/_kb/replay_runs.jsonl, validates last 50 rows parse
as JSON. Skips the live-replay invocation that Rust's phase 7
also does — porting Rust replay.ts is substantial and not in
scope. The "log shape sanity" check is what audit-full actually
needs; the live invocation is a separate concern.
Phase 6 (acceptance gate) — STILL SKIPPED:
Rust acceptance.ts is a TS-only fixture harness with bun-specific
deps. Porting the fixtures (tests/fixtures/distillation/acceptance/)
+ the 22-invariant runner to Go is an ADR-worth undertaking.
Documented in the header comment.
Live-data probe (against /home/profit/lakehouse):
Skips count: 4 → 1 (only phase 6).
Required checks: 6/6 → 12/12 PASS.
New metric: p2_evidence_rows=1055, BYTE-EQUAL to the Rust
pipeline's collect.records_out from the latest summary.json.
Cross-runtime parity now extends across phases 0/1/2/3/4/5/7.
6 new tests:
- TestPhase2_EvidenceTallyFromOnDisk: row + tier-1-hit tallying
- TestPhase5_FullSummaryFlow: complete run-summary fixture passes
- TestPhase5_ShortRunHashCaught: bad run_hash fails required check
- TestPhase7_ReplayLogReadsFromDisk: row-count reporting
- TestPhase7_MalformedTailRowsCaught: structural parse failure
- TestRunAuditFull_FullFixtureFlow updated to seed evidence/ +
reports/distillation/ for the phases now wired.
Cleanup: removed local sortStrings helper (replaced with sort.Strings
now that `sort` is imported for phase 5's mtime-sort).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-lineage scrum on 434f466..0d4f033 surfaced one convergent finding
(Opus + Kimi) and 3 Opus-only real bugs. All actioned in this
commit. Two false positives (Kimi rollback misreading, Opus stale-
comment claim) verified + rejected — both required manual control-
flow inspection to refute, matching the documented Kimi-truncation
behavior in feedback_cross_lineage_review.md.
Convergent fix — DecodeIndex lost nil-meta items:
- Envelope version bumped 1 → 2.
- New v2 field: IDs []string carries the canonical ID set
explicitly, independent of meta map's nil-vs-{} sparseness.
- DecodeIndex accepts both versions: v2 reads from env.IDs; v1
falls back to meta-key inference (with the documented
limitation that nil-meta items are invisible — preserved for
backward-compat with already-persisted indexes).
- Encode emits v2 going forward.
- 2 new regression tests:
- TestEncodeDecode_NilMetaItemsSurviveRoundTrip: items added
with nil metadata MUST survive Encode → Decode and remain
visible to IDs(). Pre-fix would have yielded IDs() == [].
- TestDecodeIndex_V1BackwardCompat: hand-crafted v1 envelope
still decodes (proves the fallback path).
Opus-only fixes:
- handleMerge: non-ErrIndexNotFound errors at h.reg.Get(name) /
h.reg.Get(req.Dest) now return 500 + log instead of falling
through with nil src/dest pointers (which would panic on the
next deref). Real bug — only the sentinel error was handled.
- internal/drift/drift.go: mathLog wrapper removed; math.Log
inlined. Wrapper added no value (math was already imported).
- internal/distillation/audit_baseline.go: BuildAuditDriftTable's
bubble sort replaced with sort.Slice. Idiomatic + shorter.
Rejected after verification:
- Kimi WARN "missing rollback on partial merge": misread the
control flow. Code at cmd/vectord/main.go:404-414 does NOT
delete from src when dest.Add fails (continue before reaching
src.Delete). Only successful Adds trigger Deletes.
- Opus INFO "TimestampUnixNano comment references missing field":
field exists at scripts/multi_coord_stress/main.go:128. Opus
saw only the diff context, not the full file.
Deferred (no fired trigger):
- Opus WARN "no per-index lock during merge": no concurrent merge
callers today (operators run merge as deliberate one-shot job).
Worth a lock if/when matrixd or chatd start auto-triggering.
Disposition: reports/scrum/_evidence/2026-05-01/verdicts/post_role_gate_v1_disposition.md.
Build + vet + tests green; 2 new regression tests + all prior tests
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original OPEN #2 line called for "SFT export pipeline +
audit_baselines lineage." Commit 7bb432f shipped the SFT export.
This commit ports the audit_baselines half — the longitudinal
drift signal that distinguishes "metrics shifted because the world
changed" from "metrics shifted because we broke something."
Mirrors Rust scripts/distillation/audit_full.ts's substrate:
- LoadLastBaseline(path) reads the most recent entry from
data/_kb/audit_baselines.jsonl. Returns (nil, nil) on missing
file (first run), errors on truncated last line (partial-write
detection — operators don't lose drift signal silently).
- AppendBaseline(path, baseline) appends one entry as a JSON line.
Atomic at the line level via bufio + O_APPEND. Creates the
parent directory if missing.
- BuildAuditDriftTable(prior, current, threshold) computes
per-metric drift. flag values mirror Rust exactly: first_run,
ok, warn. DefaultDriftWarnThreshold = 0.20 = Rust's 20%.
- FormatAuditDriftTable renders a fixed-width text grid for
stdout dumps in audit-full runs.
Edge cases handled:
- Zero-baseline: prior=0 means no division — PctChange stays nil.
current=0 → ok (no change). current>0 → warn (zero→nonzero is
always notable, never silently fine).
- New metric in current: flagged first_run, not "0%-change".
Operators see "this is a new signal we haven't tracked before."
- Sort: stable by metric name for deterministic JSON output and
clean CI diffs.
Generic on metric name (vs Rust's pinned p2_evidence_rows etc.):
the Rust phase numbering doesn't translate to Go directly. The
AuditBaselineRustCompat constant pins the Rust names so operators
running both runtimes use the same labels, which makes drift
comparison meaningful across the two pipelines.
13 new tests covering: missing file, last-line-wins, blank-line
tolerance, malformed-line errors, append round-trip, append-to-
existing, schema validation, first-run, threshold boundary,
zero-baseline, new-metric-in-current, sort-by-metric stability,
formatter output rendering.
OPEN #2's "audit_baselines lineage" half now closed. The
distillation package surface is at parity with the Rust pipeline:
scorer, scored runs, SFT export, audit baselines all available
on the Go side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to b216b7e (which shipped the SFT export substrate). This
commit ports the synthesis logic, completing the migration:
- SynthesizeSft(scored, ev, recordedAt, sftID) → *SftSample
Mirrors the Rust synthesizeSft byte-for-byte. Returns nil for
extraction-class records + empty-text records (same skip
semantics as Rust).
- LoadEvidenceByRunID(scoredPath, cache) reads the paired evidence
JSONL (path derived by /scored-runs/ → /evidence/ replacement).
Per-call cache so multiple scored-runs files in the same dir
don't reload the same evidence.
- buildInstruction maps source_file stem → per-class instruction
template. All 8 templates (scrum_reviews, mode_experiments,
auto_apply, audits, observer_reviews, contract_analyses,
outcomes, default) match Rust output exactly so a/b validation
between runtimes can diff JSONL byte-for-byte.
- stemFromSourceFile strips data/_kb/ prefix + .jsonl suffix.
- ExportSft now writes data/distilled/sft/sft_export.jsonl with
the synthesized samples (DryRun=true skips file write).
Per-class templates verified by 8-case sub-test:
- scrum_reviews → "Review the file '...' against the PRD..."
- mode_experiments → "Run task_class='...' for file..."
- auto_apply → "Auto-apply: emit a 6-line surgical patch..."
- audits with phase: prefix → strips to bare phase name
- observer_reviews → "Observer-review the latest attempt..."
- contract_analyses with permit: prefix → strips to permit ID
- outcomes → "Run scenario; report per-event outcome..."
- unknown source → "Source 'X' run; produce the appropriate output"
Caveat documented inline: contract_analyses uses ev.metadata.contractor
in Rust to produce "Analyze contractor 'X' for permit 'Y'" when
present. Go's EvidenceRecord doesn't carry a free-form metadata bag
yet, so we always emit the no-contractor form. Operators needing
contractor-aware instructions can extend EvidenceRecord with an
explicit Metadata field (separate ADR).
Test additions (5 new):
- TestSynthesizeSft_PerSourceClass: 8 sub-cases, one per template
- TestSynthesizeSft_RejectsExtraction: extraction-role records skipped
- TestSynthesizeSft_RejectsEmptyText: empty/whitespace text skipped
- TestSynthesizeSft_ContextAssembly: matrix + pathway + model
context string formatting matches Rust " · " join
- TestExportSft_FullPort_WritesJSONL: end-to-end fixture, asserts
output contains expected instruction + omits firewalled records
Pre-existing TestExportSft_PartialPort_FirewallFires renamed +
updated to TestExportSft_FirewallFiresBeforeEvidenceLoad — reflects
the new contract that records passing the firewall but lacking
evidence land in "not-instructable" rather than being silently
exported. Honest semantics shift documented in the test.
OPEN #2 now fully closed (was: substrate-only). The synthesis path
no longer requires the Rust pipeline to be invoked — Go-side
operators can run the full distillation export end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Substantial wave addressing all 4 prior OPEN items. Three closed in
full, one partially (the speculative half deliberately deferred).
OPEN #1 — Periodic fresh→main index merge (FULL):
- POST /v1/vectors/index/{src}/merge with {dest, clear_source}
- Idempotent on re-runs (existing-in-dest items skipped)
- internal/vectord/index.go: new Index.IDs() snapshot method +
i.ids tracker field as canonical ID set, independent of meta
map's nil-vs-{} sparseness (was a real bug — IDs() backed by meta
alone missed items added with nil metadata)
- 4 cmd-level integration tests (happy path drain+clear, dim
mismatch, dest not found, self-merge rejection) + 1 unit test
- DecodeIndex backward-compat: old envelopes restore i.ids from
meta keys (best effort; new items going forward use the tracker)
OPEN #2 — Distillation SFT export (SUBSTRATE):
- internal/distillation/sft_export.go ports the load-bearing half:
IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/
MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft.
- Synthesis (instruction/input/response generation) deferred to a
separate wave — too big for this session, but the substrate
makes the next wave a port-not-design exercise.
- TestSftNever_PinsExpectedSet locks the contamination firewall
set: if a future commit adds/removes from SftNever, this test
fails — forcing the change through review.
- 5 new tests; firewall fires end-to-end through the partial port.
OPEN #3 — Distribution drift via PSI (FULL):
- internal/drift/drift.go: ComputeDistributionDrift via Population
Stability Index. Standard finance/risk metric, well-defined
verdict tiers (stable < 0.10, minor 0.10–0.25, major ≥ 0.25).
- Equal-width bucketing over combined min/max so neither dist
falls outside; epsilon-clamping for empty buckets so log doesn't
blow up. Per-bucket breakdown for drilldown.
- Pairs with the existing ComputeScorerDrift: scorer drift is
categorical, distribution drift is continuous. Different shapes,
same package.
- 7 new tests covering identical-is-stable, hard-shift-is-major,
moderate-detected-not-stable, empty-inputs-safe, all-identical-
safe, bucket-counts-conserved, num-buckets-clamping.
OPEN #4 — Ops nice-to-haves (PARTIAL — wall-clock done, others
deferred):
- (a) Real-time wall-clock for stress harness: per-phase elapsed
time logged to stdout as it runs (`[stress] phase NAME starting
(T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`).
Output.PhaseTimings + Output.TotalElapsedMs in JSON.
- (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase
calibration: not actioned — no fired trigger, would be
speculative. Documented as deferred-until-need rather than
ignored. Per the project's discipline ("don't add features
beyond what the task requires").
OPEN list now empty / steady-state. Future items will land as
production triggers fire.
Build + vet + tests green; 18 new tests across the 4 closures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real wire-up gap discovered post-scrum: Demand.Role was already
extracted at every call site in multi_coord_stress (44 occurrences,
both contract-driven and LLM-parsed inbox-triggered paths), but
neither matrixSearch nor playbookRecord accepted role in their
signatures. Cross-role gate (real_001..real_004 work) was bypassed
for the entire multi-coord harness — recordings and queries went
through with empty role, gate fell back to lenient behavior.
Fix:
- matrixSearchReq gains query_role field
- matrixSearch signature: (..., query, role string, ...)
- tracedSearch wrapper gains role param + emits it in span input
metadata for Langfuse visibility
- playbookRecord signature: (..., query, role, ...) — body emits
role only when non-empty (preserves backward compat at API)
- 14 call sites updated:
contract-driven Demand loops → d.Role
LLM-parsed inbox path → parsed.Role (qwen2.5 already extracts it)
swap path (warehouseDemand) → warehouseDemand.Role
reissue path → ev.Role (captured at original event time)
fresh-verify (resume snippet, no role concept) → ""
Build clean, vet clean, all tests pass. Cross-role gate now fires
end-to-end across the multi-coord harness — matches the playbook_lift
harness's coverage from the original real_001 fix.
This closes the symmetric gap to scripts/playbook_lift's existing
wire-through. Both production-shape harnesses now exercise the role
gate; future reality tests automatically inherit the protection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has negation handling or silently treats
"NOT X" as "X".
Headline: substrate has zero negation handling. Cosine on dense
embeddings tokenizes "NOT in Detroit" identical to "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.
Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK
The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.
No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
(already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
pretend to honor unparseable constraints
Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.
Findings: reports/reality-tests/real_005_findings.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-lineage scrum review of the role-gate work (commits 7f2f112..0331288)
ran Opus 4.7 / Kimi K2.6 / Qwen3-coder via scripts/scrum_review.sh.
All three flagged the same edge case: the homegrown plural-stripper
in roleNormalize would collapse non-plural-s tokens like "Sales" →
"Sale", "Logistics" → "Logistic", "Operations" → "Operation". In a
staffing domain those are real role names; the silent normalization
would have caused false role-equality matches and re-opened the
cross-role bleed for those clusters.
Fix:
- nonPluralSWords allowlist for known staffing-domain non-plural-s
tokens (sales, logistics, operations, facilities, premises, news,
physics, economics, mathematics, analytics).
- Last-word-only stripping ("Sales Associate" stays whole; only
"Associates" head noun is plural-checked).
- -ss ending check so "Press Operator", "Boss" don't lose their s.
- strings.ToLower + strings.TrimSpace replace the homegrown rune-
loop ASCII normalizer (Opus INFO — minor cleanup, folded in).
Tests:
- TestRoleNormalize_NonPluralS: 18 cases covering the allowlist,
-ss ending, real plurals (Operators → Operator, Boxes → Box),
multi-word real plurals (Forklift Operators → forklift operator),
whitespace/case tolerance.
- TestRoleEqual_NonPluralS: gate-level pairing — proves equal-
shape allowlisted tokens compare equal AND that "Sales" ≠ "Sale"
(the original bug shape).
- Existing TestRoleEqual_PluralAndCase still green (refactor
preserved behavior).
Other scrum findings dispositioned (not actioned):
- Opus WARN on empty-role fail-open semantics: documented
backward-compat behavior; production path closes via opt-in LLM
extractor (real_004).
- Opus INFO on unsynchronized package-global cache map: harness is
single-goroutine; add sync.Mutex when/if it parallelizes.
- Opus INFO on parallel constructor (NewPlaybookEntryWithRole vs
optional arg): API smell only, both forms preserved.
- Kimi 2 BLOCKs (NewPlaybookEntryWithRole missing, ApplyPlaybookBoost
signature breakage): FALSE positives. Pre-push smoke chain green
on 0331288, both symbols + all call sites compile clean. Matches
feedback_cross_lineage_review.md's documented Kimi truncation
behavior — Kimi BLOCKs warrant trace verification before action.
Disposition (local): reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.
This commit adds a roleExtractor with regex-first + LLM fallback:
- Regex first (fast, deterministic) — handles need + client_first +
looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
Ollama-shape /api/chat with format=json, schema-tight prompt,
temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
config unchanged unless explicitly enabled.
8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
witness for the real_003 scenario — both record + query are
shorthand, regex returns "" for both, LLM produces DIFFERENT
role tokens for CNC vs Forklift, so matrix gate's cross-role
rejection (locked separately in
TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
correctly. This is the load-bearing verification.
Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.
Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stress-tests the role gate with 40 queries (10 fill_events rows × 4
styles): need, client_first, looking, shorthand. Each row's role +
client + city stays the same; only the surface phrasing changes.
real_003 (original extractor) confirmed the shorthand-vs-shorthand
failure mode: CNC Operator shorthand recording leaked w-2404 onto
Forklift Operator shorthand query within the same Beacon Freight
Detroit cluster. Both record + query had empty role (extractor
returns "" for shorthand because there's no separator between role
and city), gate disabled, distance check passed, bleed fired.
Fix: extended extractRoleFromNeed to handle client_first
("{client} needs N {role} in...") and looking ("Looking for N
{role} at...") patterns. Shorthand left intentionally unmatched —
"Forklift Operator Detroit" is shape-indistinguishable from
"Forklift" + "Operator Detroit" without an LLM extractor or known-
cities lookup.
real_003b (extended extractor) verifies bleed closed across all 4
styles for this dataset. Forklift Operator queries keep w-2136 (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles — a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.
scripts/cutover/gen_real_queries.go: added -styles flag with values
need|client_first|looking|shorthand|all (default need preserves
real_001/002 behavior). Tests/reality/real_coord_queries_v2.txt is
the 40-query stress file.
scripts/playbook_lift/main_test.go: 10 sub-tests lock the four
documented patterns + shorthand limitation + lift-suite-style
queries (no clean role, returns empty as expected).
Aggregate metrics:
- real_003 (original): disc=7, lift=7, boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202
The growth reflects more LEGITIMATE same-role same-cluster transfer
firing across styles, not bleed (verified by per-cluster bleed
table — Forklift Operator queries unchanged across all 4 styles).
Known limitation documented in real_003_findings.md: same-cluster,
same-role queries in shorthand still embed close enough that a
shorthand recording could bleed onto a different-role shorthand
query if both record + query strip role. Closing this requires
LLM extraction or known-cities lookup at record + query time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
real_001 surfaced same-client+city queries bleeding across roles:
Q#2 (Forklift Operator @ Beacon Freight Detroit) recorded e-6193
in the playbook corpus. Q#5 (Pickers same client+city) and Q#10
(CNC Operator same client+city) embedded within 0.13-0.18 cosine of
Q#2's query — well inside the 0.20 inject threshold — so e-6193
injected on both, demoting the cold-pass-correct workers.
Root cause: the inject distance threshold isn't tight enough on
the same-client+city cluster. Cosine collapses queries that share
city + client + count-token + time-token regardless of role. The
existing judge gate is per-injection at record time and doesn't
fire at retrieve time.
Fix: structural role gate in front of both Shape A boost and
Shape B inject. PlaybookEntry gains Role; SearchRequest gains
QueryRole. When both are non-empty and differ under roleEqual's
case+plural normalization, the entry is rejected before BoostFactor
or judge-gate logic runs.
Backward-compat: empty role on either side disables the gate —
preserves behavior for the lift suite's free-form multi-constraint
queries that have no clean single role. Caller-supplied (not
inferred), so existing recordings unaffected.
Wire-through:
- internal/matrix/playbook.go: Role field, NewPlaybookEntryWithRole,
roleEqual helper with plural+case normalization
- internal/matrix/retrieve.go: QueryRole on SearchRequest, threaded
to both ApplyPlaybookBoost + InjectPlaybookMisses
- cmd/matrixd/main.go: role on POST /matrix/playbooks/record + bulk
- scripts/playbook_lift/main.go: extractRoleFromNeed regex pulls
role from "Need N {role}{s} in" queries (the fill_events shape);
free-form queries fall back to empty (gate disabled)
Tests (5 new):
- TestInjectPlaybookMisses_RoleGateRejectsCrossRole: exact Q#10
scenario (distance 0.135, recorded "Forklift Operator", query
"CNC Operator") — locks the bleed at unit level
- TestInjectPlaybookMisses_RoleGateAllowsSameRole: Forklift Operator
recording fires on Forklift Operators query (plural normalization)
- TestInjectPlaybookMisses_RoleGateBackwardCompat: empty Role on
either side = gate disabled, preserves current behavior
- TestApplyPlaybookBoost_RoleGateRejectsCrossRole: Shape A defense
in depth — boost doesn't fire on cross-role even when answer is
in cold top-K
- TestRoleEqual_PluralAndCase: case + -s + -es plural normalization
Verification (real_002, same query set as real_001):
- Q#5 Pickers @ Beacon Freight: e-6193 → e-8499 (no bleed)
- Q#10 CNC Operator @ Beacon Freight: e-6193 → w-2404 (no bleed)
- Discoveries + lifts unchanged at 2 each (same-role lift still fires)
- Mean Δdist tightens from -0.127 to -0.040 (boosts no longer
pulling distances through the floor on cross-role mismatches)
Findings: reports/reality-tests/real_002_findings.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".
Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.
Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.
Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.
Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.
Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.
Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.
Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading
Repro:
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First concrete cutover artifact: scripts/cutover/embed_parity.sh
brings up Go embedd + gateway alongside the live Rust gateway,
hits both /ai/embed and /v1/embed with the same forced model, and
emits a per-date verdict report under reports/cutover/.
Why embed first: the parity invariant is one math identity (cosine
sim of vectors against same input). Retrieve has thousands of edge
cases. If embed parity holds, all downstream vector consumers
inherit confidence; if it doesn't, we catch it in 30s instead of
after a flip.
Verdict 2026-04-30: 5/5 samples cosine=1.000000 with model forced
to nomic-embed-text (v1). Same with nomic-embed-text-v2-moe (both
Ollamas have it loaded). Math is provably equivalent across the
gateway plumbing.
Drift catalog (reports/cutover/SUMMARY.md):
- URL: Rust /ai/embed vs Go /v1/embed
- Wire: Rust {embeddings, dimensions} (plural) vs Go {vectors,
dimension} (singular). Wire-format adapter is the only real
cutover work for this endpoint.
- L2 norm: Rust unit vectors (~1.0); Go raw Ollama (~20-23). Same
direction (cos=1.0); harmless under cosine-distance HNSW (which
is Go vectord's default), but worth fixing in internal/embed/
before extending to euclidean indexes.
reports/cutover/ now tracked (joined the scrum/ + reality-tests/
exemptions in .gitignore).
Next probe: /v1/matrix/retrieve ↔ Rust /vectors/hybrid for the
real user-facing retrieve path. Embed parity gives that probe a
clean foundation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds langfuseMiddleware in internal/shared so every daemon's
shared.Run gets free production-traffic trace visibility when
LANGFUSE_URL + LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY are set.
Same env names + file shape as the multi_coord_stress driver, so
operators ship one /etc/lakehouse/langfuse.env across the deploy.
Wiring is auth-gated: middleware runs INSIDE the RequireAuth group,
so 401s from credential-stuffing don't pollute traces. /health is
exempt so LB probes don't either. Missing env vars → nil client →
middleware is a passthrough no-op (fail-open per ADR-005 5.1).
Bundled deploy:
- langfuse.env.example template (mode 0640, root:lakehouse)
- 11 systemd units gain `EnvironmentFile=-/etc/lakehouse/langfuse.env`
(leading - so missing file = OK)
- REPLICATION.md bootstrap section documents setup
Tests (4): nil passthrough, /health bypass, real-request emission,
status-writer wrapping. All green.
STATE_OF_PLAY OPEN list: 5 rows → 4 rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lift suite run #004 left two unresolved tail issues:
- Q6 ("Forklift loader") ↔ Q7 ("Hazmat warehouse, cold storage")
swap recordings as warm top-1 because their embeddings are within
0.20 cosine of each other. Distance gate can't tell them apart.
- Q9 + Q15 lose paraphrase recovery when qwen2.5 rephrases past the
0.20 threshold. Distance says "drift too far"; sometimes the drift
is real (skip), sometimes the paraphrase is still on-domain (don't
want to skip).
Multi-coord run #008's judge re-rating proved the LLM can
distinguish: Q3 crane case landed at distance 0.23 (looks tight)
but rating 1 (irrelevant). The judge sees domain mismatch the
embedder doesn't.
This commit lifts that pattern into the matrix substrate. Shape B
inject now optionally routes every candidate through a judge gate
before the rank insert lands. Distance + judge BOTH have to approve.
internal/matrix/playbook.go:
- InjectPlaybookMisses signature gains a query string + an
optional InjectGate. nil gate preserves pre-judge-gating
behavior (current tests already pass with nil).
- New InjectGate interface + InjectGateFunc adapter for tests
and non-LLM callers.
- Per-candidate gate.Approve(query, hit) call inserted between
the dedup and the inject. Rejected candidates skip silently;
injected count reflects post-gate decision.
internal/matrix/judge.go (new, ~140 lines):
- LLMJudgeGate calls an Ollama-shape /api/chat endpoint with the
same 1-5 staffing-rubric prompt that worked in multi_coord
run #008. fail-closed on HTTP/JSON errors (don't inject if
judge can't speak — better miss than wrong-domain).
- NewLLMJudgeGate returns nil when URL or Model is empty,
matching InjectGate's nil-means-no-judge semantics.
internal/matrix/retrieve.go:
- SearchRequest gains JudgeURL, JudgeModel, JudgeMinRating
fields. Run() builds an LLMJudgeGate when set; passes nil
otherwise. Backward compatible — existing callers see no
behavior change.
Tests:
- TestInjectPlaybookMisses_GateRejectsCandidate (rejectAll → 0
injected, even with tight distance)
- TestInjectPlaybookMisses_GateApprovesCandidate (approveAll →
same as nil-gate behavior)
- TestInjectPlaybookMisses_GateSeesCorrectQuery (gate receives
CURRENT query + RECORDED query separately so it can score
the (current, candidate) pair)
- All 5 existing inject tests updated to new signature
go test ./internal/matrix → all 8 inject tests pass.
go test ./internal/matrix ./internal/shared ./cmd/{matrixd,
queryd,pathwayd,observerd} → all green.
STATE_OF_PLAY:
- OPEN item #1 (judge-gated injection) closed.
- DO NOT RELITIGATE adds the substrate-level judge-gate lock.
- OPEN list now 5 rows (was 6).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sprint 4 row removed (shipped: a59ef5b systemd + 54a05d9 docker).
ADR-006 row already dropped on the previous STATE update.
Two lift-suite tail items (Q6↔Q7 adjacent-query, Q9/Q15 liberal-
paraphrase) consolidated into one "judge-gated playbook injection"
row — both are downstream of the same fix (let the judge approve
before Shape B inserts). Captures the design lineage from
multi-coord run #008's judge-rating pattern.
Three items folded into a single "operational nice-to-haves" row:
real-time clock, chatd fixture storage half, liberal-paraphrase
calibration. None are product-blocking; each lights up when
someone hits its specific trigger.
Reorder reflects leverage on the active product theory (multi-
coord staffing co-pilot via the 5-loop substrate), not effort:
1. Judge-gated injection (lift quality + lift-tail closure)
2. Wider Langfuse instrumentation (production observability)
3. Fresh→main merge (operational hygiene as the corpus grows)
4. Distillation full port (production dependency, not yet)
5. Drift quantification (research)
6. Operational nice-to-haves
Lead-in note added: "Items move to closed when the work demands
them, not on a calendar." Locks intent against future drift toward
a sprawling todo list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-003 locked the auth substrate; ADR-006 ratifies the operator
playbook + adds two implementation pieces needed for Sprint 4
deployment: env-resolved tokens and dual-token rotation.
Six decisions locked in docs/DECISIONS.md:
- 6.1: Non-loopback bind requires auth.token (mechanical gate at
shared.Run, already implemented; this ratifies it).
- 6.2: Token from env, not TOML. /etc/lakehouse/auth.env (mode 0600)
loaded by systemd EnvironmentFile=. New TokenEnv field on
AuthConfig defaults to "AUTH_TOKEN".
- 6.3: AllowedIPs for inter-service same-trust-domain; Token for
cross-trust-boundary (gateway ↔ external).
- 6.4: /health stays unauthenticated; everything else under
shared.Run is gated. Already implemented; ratified here.
- 6.5: Token rotation is dual-token. New SecondaryTokens []string
on AuthConfig — both primary and any secondary pass auth
during the rotation window. Implemented in this commit.
- 6.6: TLS terminates at the network edge (nginx/Caddy), not
in-process. Daemons stay HTTP-only; internal traffic stays
on private subnets per Decision 6.3.
Implementation:
- internal/shared/config.go: AuthConfig gains TokenEnv +
SecondaryTokens fields. New resolveAuthFromEnv() called by
LoadConfig fills Token from os.Getenv(TokenEnv) when Token is
empty. TokenEnv defaults to "AUTH_TOKEN" so the happy path needs
no TOML config.
- internal/shared/auth.go: RequireAuth pre-encodes Bearer headers
for primary + every secondary token; per-request constant-time
compare walks the slice. Fast path is 1 compare (primary).
Tests:
- TestLoadConfig_AuthTokenFromEnv (3 sub-tests): default env name,
custom token_env, explicit Token wins over env.
- TestRequireAuth_SecondaryTokenAccepted: both primary + secondary
tokens pass during rotation window.
- TestRequireAuth_SecondaryTokensOnly: only-secondary path works
for the case where primary was just promoted-to-empty mid-rotation.
go test ./internal/shared all green; existing auth_test.go
unchanged (constant-time compare path preserved).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anchor was last touched at v4 split-threshold; since then the
multi-coord stress harness landed end-to-end across 11 commits.
Future sessions reading this file need to see the verified state,
not derive from git log.
Major additions:
- New "Multi-coordinator stress test (Phase 1 → 3)" section in
VERIFIED WORKING. 11-row capability table covering per-coord
playbook isolation, diversity metrics, paraphrase handover,
ExcludeIDs swap, fresh-resume two-tier, inbox endpoints, LLM
demand parsing, judge re-rating, Langfuse tracing.
- Substrate-gains list under that section: ExcludeIDs on
SearchRequest, observer.SourceInbox + /observer/inbox,
internal/langfuse client, embedd default bumped to v2-moe,
two-tier fresh_workers index pattern.
- Last-verified bumped to 16:42 CDT on the run #011 anchor.
DO NOT RELITIGATE expanded with five new locks:
1. Boost / inject use SEPARATE thresholds (0.5 / 0.20)
2. Multi-coord product theory is empirically VALIDATED
3. Fresh content uses two-tier indexing (fresh_workers)
4. embedd.default_model = nomic-embed-text-v2-moe (don't downgrade)
5. Inbox flow: parse + search + judge + trace
6. Langfuse Go-side client lives at internal/langfuse/
OPEN list refresh:
- Removed: re-judge metric (shipped as b13b5cd), adjacent-query as
separate item (folded into a single "judge-approves-before-inject"
follow-up), liberal-paraphrase (kept).
- Added: real-time 48-hour clock, wider Langfuse instrumentation,
periodic fresh→main merge job.
RECENT VERIFIED WAVE table extended with 11 commits (b13b5cd..5d49967).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reality test table extends from #001-#003 to #001-#004; v4 row marked
as "the honest configuration" because OOD cross-pollination is gone.
- Shape B section gains the split-threshold rationale (boost safe at
loose, inject structurally riskier so tighter).
- Verbatim drop framing rewritten — v3→v4 is configuration evolution,
not regression.
- OPEN: closed "Shape B cap/decay" + the conditional Q15 boost-math
item (Shape B + split threshold addressed both). Replaced with two
finer-grained follow-ups: adjacent-query Q6↔Q7 swap (might be
correct, verify with v4 re-judge metric) and liberal-paraphrase
recovery loss (Q9/Q15 missed because qwen2.5 drifted >0.20).
- RECENT VERIFIED WAVE adds 94fc3b6 + 67d1957.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reality test section now spans v1/v2/v3 across one table — the
product story (boost-only verbatim → paraphrase gap → Shape B
closes the gap) is legible without reading the reports.
- Verbatim-lift drop v1→v3 (7→2) explicitly framed as
cross-pollination, NOT regression — and filed as v4 re-judge metric
in OPEN.
- "DO NOT RELITIGATE" gains: Shape B is the stance now (don't revert
to boost-only); local_judge stays on qwen2.5 (don't bump to qwen3.5
for cleanliness — vision-SSM cost geometry).
- OPEN list: removed the now-closed paraphrase v2 row + the boost-math
Q15 row (Shape B may have addressed it; flagged for verify after v4).
Added v4 re-judge metric and Shape B injection cap/decay design call.
- RECENT VERIFIED WAVE adds the four new commits past 6c02c90
(2c71d1c, 9ce067b, e9822f0, 154a72e).
- Matrix indexer §5/5 component description now references
InjectPlaybookMisses + the run #002→#003 evidence chain.
- [models] tier registry comment locks the local_judge=qwen2.5 choice
with the rationale inline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the OPEN item from STATE_OF_PLAY. Required because observerd is
now on the prod-realistic data path via the lift harness boot (b2e45f7),
so the next consumer (scrum runner / distillation rebuild / production
workflow) needs the fail-safe rationale locked, not implicit.
The Rust "verdict:accept on crash" anti-pattern doesn't translate
one-to-one to the Go observer (witness, not gate). But four adjacent
fail-safe decisions are real and live:
5.1 Persist failure is logged-not-fatal; ring is in-flight source of
truth. Persist-required mode deferred to a future opt-in ADR.
5.2 Mode failure → Success=false, no panic-swallow path. The runner
catches mode errors and surfaces them via node.Error; downstream
consumers see failures explicitly rather than as fake successes
(the Rust anti-pattern surface).
5.3 One row per node, recorded post-run. A workflow with N nodes
produces N audit rows, never a per-workflow catch-all that
survives partial crashes. Known gap: recording happens after
runner.Run returns (acceptable for short workflows; streaming
callback is the right shape when workflows get longer).
5.4 /observer/event accepts on full ring (oldest evicted). Refusing
to write would translate every burst into client errors — wrong
direction for an audit witness.
Mostly ratifies existing behavior; cross-checked claims against
actual code (caught one error in Decision 5.3 draft — recording is
post-run-batched, not per-node-as-it-completes — and the ADR now
states reality).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.
Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:
1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
wrong domain for staffing queries; replaced with ethereal_workers
(10K rows, real staffing schema, "e-" id prefix to avoid collision
with workers' "w-"). staffing_workers driver gains -index-name +
-id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
~30s per judge call against the lift loop; reverted to
qwen2.5:latest (~1s/call, 30× faster, held lift theory)
Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:
- cmd/matrixd/main_test.go (new) — playbook record drift detector +
score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode
`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
Queries 21 (staffing-domain, 7 categories)
Discoveries 8 (judge ≠ cosine top-1)
Lifts 7/8 (87.5%)
Boosts triggered 9
Mean Δ distance -0.053 (warm closer than cold)
OOD honesty dental/RN/SWE rated 1, no fake matches
Cross-corpus boosts confirmed (e- ↔ w- swaps in lifts)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- SPEC §1 component table: add chatd row marked DONE; replaces
Rust gateway's v1::ollama_cloud / openrouter / opencode adapters
+ the aibridge crate.
- SPEC §3.9 — chatd shipped: 5-provider routing (ollama, ollama_cloud,
openrouter, opencode, kimi) by model-name prefix or :cloud suffix.
Captures the Anthropic 4.7 temperature-deprecation quirk + the
local-Ollama think=false default that the playbook_lift judge
needed. Mentions scrum_review.sh as the reusable cross-lineage
vehicle eating chatd's own /v1/chat.
- SPEC §3.10 — local-review-harness sibling tool: separate repo at
git.agentview.dev/profit/local-review-harness, MVP shipped today.
Documents the cross-pollination plan for when both substrates
stabilize (chatd as the harness's LLM backend; harness findings
as Lakehouse pathway-memory drift signal; .memory/known-risks
as a matrix corpus). Explicit "don't re-port" so future Claudes
don't try to absorb the harness into Lakehouse.
- STATE_OF_PLAY.md: SIBLING TOOLS section with 1-line summary
+ pointer to SPEC §3.10.
No code changes. just verify still PASS — touched only docs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>