golangLAKEHOUSE

Author	SHA1	Message	Date
root	277884b5eb	multitier_100k: 335k scenarios @ 1,115/sec against 100k corpus, 4/6 at 0% fail J asked for a much more sophisticated test using the 100k corpus from the Rust legacy database. This commit ships: scripts/cutover/multitier/main.go — 6-scenario harness with weighted random selection per goroutine. Mixes search, email/SMS/fill validators (in-process via internal/validator), profile swap with ExcludeIDs, repeat-cache exercise, and playbook record/replay. Scenarios + weights (cumulative scenario fractions): 35% cold_search_email — search + email outreach + EmailValidator 15% surge_fill_validate — search + fill proposal + FillValidator + record 15% profile_swap — original search + ExcludeIDs swap + no-overlap check 15% repeat_cache — same query × 5 (cache effectiveness) 10% sms_validate — SMS draft (≤160 chars, phone for SSN-FP guard) 10% playbook_record_replay — cold → record → warm w/ use_playbook=true Test results (5-min sustained, conc=50, 100k workers indexed): TOTAL 335,257 scenarios @ 1,115/sec cold_search_email 117k @ 0.0% fail · p50 2.2ms · p99 8.6ms surge_fill_validate 50k @ 98.8% fail (substrate bug below) profile_swap 50k @ 0.0% fail · p50 4.5ms · ExcludeIDs verified repeat_cache 50k × 5 = 252k searches @ 0.0% fail · p50 11.7ms sms_validate 33k @ 0.0% fail · phone-pattern guard works playbook_record_replay 33k @ 96.8% fail (substrate bug below) Total successful workflows: ~250k+ Validator integration verified at load: 150,930 EmailValidator passes across cold_search_email + sms_validate 35 + 1,061 successful FillValidator + playbook_record (where the bug didn't fire) zero false positives on the SSN-pattern guard against phone numbers Resource footprint at 100k: vectord 1.23GB RSS (linear with 100k vectors) matrixd 26MB, 75% CPU (1-core saturated at conc=50) Total across 11 daemons: 1.7GB Compare to Rust at 14.9GB — ~10× less even at 100k. SUBSTRATE BUG SURFACED: coder/hnsw v0.6.1 nil-deref in layerNode.search at graph.go:95. 
Triggers on /v1/matrix/playbooks/record under sustained writes to the small playbook_memory index. Both Add and Search paths can panic. Workaround applied (this commit) in internal/vectord/index.go BatchAdd: recover() guard converts panic to error; daemon stays up instead of crashing the request handler. Operator recovery procedure (also documented in the report): curl -X DELETE http://localhost:4215/vectors/index/playbook_memory Next record recreates the index fresh. Real fix DEFERRED — open in docs/ARCHITECTURE_COMPARISON.md Decisions tracker. Three options: a) upstream patch to coder/hnsw b) custom small-index Add path that always rebuilds when len < threshold c) alternate store for playbook_memory (Lance? in-memory map?) Evidence: reports/cutover/multitier_100k.md (full methodology + results + repro + bug analysis). docs/ARCHITECTURE_COMPARISON.md Decisions tracker updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 06:28:50 -05:00
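The workaround in the commit above (a recover() guard in BatchAdd that converts a panic into an error) can be sketched as follows. This is a minimal illustration, not the code in internal/vectord/index.go: the `adder` interface, `safeBatchAdd`, `brokenGraph`, and `healthyGraph` are hypothetical names, and nothing here uses the coder/hnsw API.

```go
package main

import "fmt"

// adder is a hypothetical stand-in for the index's write path.
// It is not the coder/hnsw API.
type adder interface {
	Add(ids []string)
}

// safeBatchAdd runs g.Add and turns any panic into a returned error,
// so the caller can report a failure instead of crashing.
func safeBatchAdd(g adder, ids []string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("index add panicked: %v", r)
		}
	}()
	g.Add(ids)
	return nil
}

// brokenGraph simulates a nil-pointer dereference inside the graph code.
type brokenGraph struct{}

func (brokenGraph) Add(ids []string) {
	var node *struct{ id string }
	node.id = ids[0] // write through a nil pointer: panics at run time
}

// healthyGraph just counts what it receives.
type healthyGraph struct{ n int }

func (g *healthyGraph) Add(ids []string) { g.n += len(ids) }

func main() {
	fmt.Println("broken: ", safeBatchAdd(brokenGraph{}, []string{"pb-1"}))
	fmt.Println("healthy:", safeBatchAdd(&healthyGraph{}, []string{"pb-1"}))
}
```

The named return value is what lets the deferred function overwrite the result after the panic is recovered. A guard like this keeps the process alive but does not repair the index, which is why the commit still documents the DELETE-and-recreate recovery step.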
root	3a2823c02f	g5 cutover: bigger load test — 5.87M req, 0 errors, 370MB RSS Larger-scale follow-up to the original load test. Three axis expansions: corpus 200→5K workers, body variety 6→200 distinct queries, concurrency sweep 10/50/100/200, plus mixed embed+search workload. Concurrency sweep on /v1/matrix/search direct (3 min each): conc=10: 486,733 req · 2,704 RPS · p50 2.19ms · p99 6.7ms conc=50: 1,148,543 req · 6,381 RPS · p50 7.08ms · p99 20ms conc=100: 1,253,389 req · 6,963 RPS · p50 13.34ms · p99 37ms conc=200: 1,460,676 req · 8,114 RPS · p50 23.45ms · p99 56ms Mixed embed+search at 60 conc each, 90s: /v1/embed: 1,127,854 req · 12,531 RPS · p50 3.31ms · p99 14.6ms /v1/matrix/search: 392,229 req · 4,358 RPS · p50 12.68ms · p99 33.8ms TOTAL: 5,869,424 requests across ~13.5 minutes. ZERO errors. Resource footprint during peak load: matrixd 105% CPU, 33MB RSS (bottleneck — pegs 1 core) vectord 39% CPU, 82MB RSS gateway 44% CPU, 41MB RSS embedd 30% CPU, 67MB RSS Total RSS across 11 daemons: ~370MB Compare to Rust gateway under similar load: 14.9GB RSS, 374% CPU. Go uses ~40x less memory + spreads load across daemons rather than packing into one mega-process. Saturation analysis: - conc 10→50: +135% RPS (linear-ish scaling) - conc 50→100: +9% RPS (saturation begins) - conc 100→200: +17% RPS (matrixd 1-core pegged) Headroom paths if production exceeds current demand: 1. Run multiple matrixd instances behind a load balancer. Substrate is stateless (recordings via storaged), horizontal scale is straightforward. 2. Profile matrixd's per-request work (role-gate + judge-eligibility + result merge). 3. Skip Bun for hot endpoints (direct nginx → Go = 5.7x previously measured). Evidence: reports/cutover/g5_load_test_big.md (full tables + methodology + repro script). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 05:18:00 -05:00
root	2a974d6dea	docs: ARCHITECTURE_COMPARISON.md as living source file Per J's request: move the parallel-runtime comparison from reports/cutover/ (where it lived as cutover-prep evidence) into docs/ as the source-of-truth file. J will keep updating it as fixes ship on either side. Restructured for living-document use: - Status header (last refresh date, owner, update triggers) - 'How to update this doc' section with explicit dos and don'ts - Decisions tracker at top — actioned items with commit refs + open backlog with LOC estimates - Each comparison section now has 'Last verified' columns where numbers are time-sensitive - Change log section at bottom for one-line entries on every meaningful refresh The original at reports/cutover/architecture_comparison.md gains a 'THIS IS A SNAPSHOT' header pointing at the docs/ source. Kept as historical record but no longer the place to update. Sister pointer file in /home/profit/lakehouse/docs/ARCHITECTURE_COMPARISON.md so the doc is reachable from either repo side. That file explicitly says the source lives in golangLAKEHOUSE and warns against authoritative content in the pointer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 04:56:20 -05:00
root	b03521a506	validator: port FillValidator + EmailValidator from Rust validator crate Per architecture_comparison.md universal-win for Go side: ports the Rust crates/validator/src/staffing/ to internal/validator/. Production safety net Go was missing — FillValidator catches phantom worker IDs + status/blacklist/geo/role mismatches; EmailValidator catches SSN-shape PII + salary disclosure + wrong-target name in email/SMS drafts. Files: - types.go: Artifact (FillProposal | EmailDraft), Validator interface, WorkerLookup interface, ValidationError + Finding + Severity - lookup.go: InMemoryWorkerLookup with case-insensitive ID lookup - fill.go: FillValidator — schema → completeness → cross-roster (phantom ID / status / blacklist / geo / role) - email.go: EmailValidator — schema → length → PII (SSN + salary) → worker-name consistency - fill_test.go + email_test.go: 24 tests covering happy path + every error variant + the load-bearing edge cases (phone-pattern not flagged as SSN, flanking-digit guard rejects extended numeric runs) Validator names match Rust (staffing.fill / staffing.email) so cross-runtime audit logs share the same identifier. PII scanners (containsSSNPattern, containsSalaryDisclosure) ported byte-for-byte so a draft flagged by one runtime is flagged by the other. Caveat: the Rust validator crate also has parquet_lookup.rs (loads workers_500k.parquet at startup) and playbook.rs (additional checks). Those weren't ported in this wave — only the two load-bearing validators that were named in the comparison doc. Closes one of the two universal-win items for Go side. The other (materializer port) remains deferred — it's a bigger surface change and depends on transforms.ts source-class adapters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 04:49:55 -05:00
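The two edge cases called out in the commit above (a phone number must not be flagged as an SSN, and a flanking digit must disqualify a match) can be illustrated with a small sketch. This is not the ported containsSSNPattern: `looksLikeSSN` is a hypothetical function, and it assumes the SSN shape is `ddd-dd-dddd` with no digit immediately before or after.

```go
package main

import (
	"fmt"
	"regexp"
)

// ssnShape matches the bare ddd-dd-dddd shape. Go's regexp package has no
// lookaround, so the flanking-digit guard is applied by hand below.
var ssnShape = regexp.MustCompile(`\d{3}-\d{2}-\d{4}`)

func isDigit(b byte) bool { return b >= '0' && b <= '9' }

// looksLikeSSN reports whether s contains an SSN-shaped run that is not
// part of a longer numeric run on either side.
func looksLikeSSN(s string) bool {
	for _, m := range ssnShape.FindAllStringIndex(s, -1) {
		start, end := m[0], m[1]
		if start > 0 && isDigit(s[start-1]) {
			continue // digit directly before the match
		}
		if end < len(s) && isDigit(s[end]) {
			continue // digit directly after the match
		}
		return true
	}
	return false
}

func main() {
	for _, draft := range []string{
		"SSN 123-45-6789 on file",
		"Call me at 555-123-4567",
		"Order ref 123-45-67890",
	} {
		fmt.Printf("%-28q flagged=%v\n", draft, looksLikeSSN(draft))
	}
}
```

A phone number in `ddd-ddd-dddd` form never matches because its middle group has three digits, and the third sample is rejected by the trailing-digit check.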
root	b3ad14832d	architecture_comparison: Rust vs Go lakehouse — weaknesses, strengths, abstracts to address J asked for the comparison before locking in primary line. This report documents what's actually structurally different vs implementation-level different, and what to do about each. Key findings: 1. Python sidecar is the single biggest architectural lever - Rust: gateway → HTTP → Python sidecar :3200 → HTTP → Ollama - Go: gateway → HTTP → embedd → HTTP → Ollama (no Python) - Sidecar adds zero compute over Ollama (just pydantic + httpx) - 63× perf gap (8,119 vs 128 RPS) driven by sidecar + cache absence 2. Process model: Rust 1 mega-binary (14.9G RSS), Go 11 daemons - Rust: simpler ops at small scale, panic blast radius = whole system - Go: per-daemon scale + crash isolation, more config surface 3. Code volume: Go 15,128 lines vs Rust 35,447 + 1,237 sidecar - Go is 43% the size doing similar work - Gap concentrated in vectord (Rust 11k lines, Go 804 — Lance + benchmarking) 4. Distillation pipeline asymmetry - Audit/observation: BOTH sides parallel-mature - Production: Rust-only (materializer + replay + RAG/pref export) - Go can READ everything but can't PRODUCE evidence 5. Production validators (FillValidator/EmailValidator/'/v1/validate') - Rust has them (1,286 lines, 12 tests each) - Go doesn't — matrix gate covers role bleed but not structural validation Cross-cutting abstracts to address regardless of which wins: - Drop Python sidecar from Rust (call Ollama directly) - Add LRU embed cache to Rust aibridge - Port materializer + replay + validators to Go - Pin shared JSONL schemas as canonical (both runtimes consume same spec) - Decide on Lance backend (defer until corpus > 5M rows) If keeping Go primary: port materializer first, validators second, skip Lance. If keeping Rust primary: drop Python + add cache, port chatd 5-provider dispatcher + cross-role gate from Go. 
Bottom line: substrate is parallel-mature on observation; producer side is Rust-only; performance structurally favors Go ~60× on warm workloads; operations favors Go on isolation; production deployment favors Rust today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 04:34:24 -05:00
root	c164a3da96	g5 cutover: production load test — 0 errors / 101k req · Go direct = 2,772 RPS Sustained-traffic load test against the cutover slice. Three runs, zero correctness errors across 101,770 total requests. Substrate holds up under concurrent load — matrix gate, vectord HNSW, embedd cache, gateway proxy all hold. This was the load test's primary question; latency numbers are secondary. scripts/cutover/loadgen — focused Go load generator. 6-query rotating body mix (Forklift/CNC/Warehouse/Picker/Loader/Shipping). Configurable URL/concurrency/duration. Reports per-status-code counts + p50/p95/p99 latencies + JSON summary on stderr. Three runs: baseline (Bun → Go, conc=1, 10s): 4,085 req · 408 RPS · p50 1.3ms · p99 32ms · max 215ms sustained (Bun → Go, conc=10, 30s): 14,527 req · 484 RPS · p50 4.6ms · p99 92ms · max 372ms direct (→ Go, conc=10, 30s): 83,158 req · 2,772 RPS · p50 2.5ms · p99 8.5ms · max 16ms Critical findings: 1. ZERO correctness errors across 101k requests. No 5xx, no transport errors, no panics. Concurrency-safety verified across matrix gate / vectord / gateway / embedd cache. 2. Direct-to-Go is production-grade. 2,772 RPS at p99 8.5ms on a single host, no scaling cliff at concurrency=10. 3. Bun frontend is the bottleneck. -82% RPS, +982% p99 vs direct. Single-process JS event loop queueing under concurrent requests — known Bun proxy-mode characteristic. The substrate itself isn't the limiter. 4. For staffing-domain demand levels (<1 RPS typical per coordinator), Bun-fronted 484 RPS has 480× headroom. No urgency to optimize Bun out of the data path. If/when concurrent demand grows orders of magnitude, the path is nginx → Go direct for hot endpoints, skip Bun. Substrate is now load-tested and verified production-ready. What this load test does NOT cover (documented in g5_load_test.md): cold-cache embed, larger corpus, mixed read/write, multi-host, full 5-loop traffic with judge gate calls. Each is its own probe shape. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 04:20:41 -05:00
root	6507dff26d	g5 cutover: first 5-loop end-to-end through Bun frontend Companion to c522ace (cutover slice live). That commit proved infrastructure (Bun /_go/* → Go gateway). This commit proves the SUBSTRATE'S CORE LEARNING BEHAVIOR through the same path. Two tests against persistent Go stack on :4110 with the 200-worker corpus, all traffic via Bun frontend on :3700: TEST 1: same-role boost fires with exact math Q1: Need 3 Forklift Operators in Aurora IL for Parallel Machining query_role: "Forklift Operator" cold (use_playbook=false): rank=0 id=w-43 dist=0.4449 Brian Ramirez, Springfield IL POST /_go/v1/matrix/playbooks/record: query_text=Q1, role=Forklift Operator, answer_id=w-43, score=1.0 → playbook_id=pb-1126c52bd106df6b warm (use_playbook=true): rank=0 id=w-43 dist=0.2224 ← halved boosted=1, injected=0 Math check: BoostFactor = 1 - 0.5 * score = 0.5 (for score=1.0). Expected warm_dist = 0.4449 * 0.5 = 0.22245. Observed: 0.2224. 4-decimal exact through 3 HTTP hops. TEST 2: cross-role gate prevents bleed Q2: Need 1 CNC Operator in Detroit MI for Beacon Freight query_role: "CNC Operator" use_playbook: true (Forklift recording from Test 1 in playbook corpus) result: rank=0 id=w-175 Kevin Ruiz (Machine Operator, Detroit MI) rank=2 id=w-102 Laura Long (Forklift Operator, Cleveland OH) boosted=0, injected=0 ← role gate fired correctly w-102 (Forklift Operator) appears at rank 2 organically via cosine retrieval — but boosted=0 confirms the Forklift PLAYBOOK did NOT influence this query. Surgical: gate suppresses playbook-driven boosts from cross-role recordings, leaves organic retrieval untouched. What this confirms about the substrate: 1. Learning works — single recording → measurable, math-exact boost 2. Bleed protection works — role gate (real_001 fix) holds through cutover slice 3. Math holds across HTTP hops — Bun → gateway → matrixd → vectord with no drift 4. 
Substrate works through real production-shape framing — CORS, content-type, body forwarding, all transparent The substrate's reason-for-being (5-loop learning) is now demonstrably executing on persistent daemons under production-shape frontend traffic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 04:14:21 -05:00
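The math check in the commit above works out as BoostFactor = 1 - 0.5 × score, with the warm distance equal to the cold distance times that factor. A minimal sketch of the arithmetic, using the numbers from Test 1; the function names `boostFactor` and `warmDistance` are hypothetical and not taken from matrixd:

```go
package main

import "fmt"

// boostFactor follows the formula quoted in the commit message:
// a score of 1.0 halves the distance, a score of 0 leaves it unchanged.
func boostFactor(score float64) float64 {
	return 1 - 0.5*score
}

// warmDistance applies the boost to a cold (un-boosted) distance.
func warmDistance(cold, score float64) float64 {
	return cold * boostFactor(score)
}

func main() {
	cold, score := 0.4449, 1.0 // values from Test 1
	fmt.Printf("factor=%.1f warm=%.5f\n", boostFactor(score), warmDistance(cold, score))
}
```

With cold=0.4449 and score=1.0 this gives a factor of 0.5 and a warm distance of 0.22245, which agrees with the observed 0.2224 to the four decimals reported.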
root	c522acec8b	g5 cutover slice live — first real Bun-frontend traffic to Go substrate J said "let's go" → "next" (option 3): actual flip via Bun mcp-server. Done. Real Bun-frontend traffic now reaches the Go substrate via /_go/* on Bun :3700, routed to the persistent Go gateway at :4110. Companion change in /home/profit/lakehouse (Rust legacy): mcp-server/index.ts: new /_go/* pass-through, opt-in via GO_LAKEHOUSE_URL env var. Off-by-default (returns 503 on /_go/* with rationale). Existing /api/* (Rust gateway) path unchanged. Committed locally on the demo/post-pr11 branch. System config: /etc/systemd/system/lakehouse-agent.service.d/go-cutover.conf adds Environment=GO_LAKEHOUSE_URL=http://127.0.0.1:4110 to the systemd-managed Bun service. Reversible via systemctl revert lakehouse-agent. Live verification (operator curl through Bun frontend): - /_go/health: gateway responds {"status":"ok","service":"gateway"} - /_go/v1/embed: nomic-embed-text-v2-moe vectors, dim=768 - /_go/v1/matrix/search vs persistent 200-worker corpus: rank=0 id=w-43 Brian Ramirez (Forklift Operator, Springfield IL) rank=1 id=w-102 Laura Long (Forklift Operator, Cleveland OH) rank=2 id=w-101 Terrence Gray (Forklift Operator, Champaign IL) 3/3 role match, top-1 in IL exactly - /api/health: lakehouse ok (Rust path unchanged — control verified) What this is NOT: - Not an nginx flip — devop.live/lakehouse/* still goes through /api/* → Rust :3100. /_go/* is parallel slice for opt-in. - Not a tool-level cutover — each /_go/<path> is a manual choice; no automatic mapping of Rust paths to Go equivalents. - Not a transformation layer — caller sends Go-shaped requests (e.g. /_go/v1/embed expects {texts, model}, not {text}). 
Three cutover unit properties verified: - ADDITIVE: zero modification to any existing Bun tool - REVERSIBLE: unset GO_LAKEHOUSE_URL → /_go/* → 503 - ISOLATED: Rust gateway state unaffected (different port, different binary, different MinIO bucket) This is the cutover slice operators can use to validate Go-side handlers under realistic frontend conditions before any production-traffic flip. Next step (deferred): pick a specific mcp-server tool to optionally route through Go with response- shape adapter — that's a product-visible flip rather than this infrastructure-visible slice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 03:45:41 -05:00
root	4fd560cad6	start_go_stack.sh: third isolation layer (port range :4xxx for persistent) Earlier push exposed the gap in the previous 2-layer isolation: smokes still failed because they tried to bind :3211-:3220 which my persistent stack already had. Smoke catalogd's bind-failure went undetected because poll_health 3212 succeeded responding to the persistent catalogd, and smoke proceeded against the wrong backend with the wrong bucket expectations. Fix: persistent stack now uses :4110 + :4211-:4219 via additional sed in the temp toml (bind addresses + upstream URLs). Smoke harnesses keep :3110 + :3211-:3219. Both reach the SAME chatd at :3220 because chatd is read-mostly (no state to clobber) and operators don't want to maintain two LLM provider key sets. Three isolation layers now in effect: 1. Binary names (bin/persistent-* via symlinks) 2. MinIO buckets (lakehouse-go-persistent vs lakehouse-go-primary) 3. Port range (:4xxx vs :3xxx, with shared chatd on :3220) Verified pre-push: - 11 persistent ports listening on :4xxx + :3220 - 0 smoke ports listening on :3110-:3219 (free for smokes) Pushed while persistent stack live — first cross-isolation test (no port collision, no bucket collision, no name collision). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 03:26:41 -05:00
root	c48b58ff8d	start_go_stack.sh: 2-layer isolation from smoke harness The 2026-05-01 persistent-stack milestone exposed two collision modes between the long-running Go stack and the pre-push smoke harness: 1. PKILL COLLISION: smoke teardown uses anchored `pkill -f "bin/(storaged|...|gateway)$"`. Same-named persistent processes match → smokes kill 7 of 11 persistent daemons. 2. MINIO STATE COLLISION: persistent stack writes `_vectors/workers.lhv1` to the shared lakehouse-go-primary bucket. Smoke vectord rehydrates from same bucket → sees both smoke-owned and persistent-owned indexes → assertion failures. Both fixed in this commit by adding two isolation layers: LAYER 1 — distinct binary names via symlink: bin/persistent-<daemon> → bin/<daemon> Persistent stack runs as ./bin/persistent-gateway etc. Smoke pattern `bin/(name)$` matches `bin/gateway$` but NOT `bin/persistent-gateway$` (regex group requires bin/ followed immediately by a daemon name; "bin/p..." doesn't qualify). Cmdline lookup verified: 7 persistent procs, 0 match smoke pkill. LAYER 2 — separate MinIO bucket via temp config: Persistent stack writes to lakehouse-go-persistent (configurable via $LH_PERSISTENT_BUCKET). Temp toml at /tmp/lakehouse-persistent.toml inherits everything from lakehouse.toml except [s3].bucket which is sed-replaced. Bucket auto-created via mc if missing. Verified: workers.lhv1 lands in persistent bucket; primary bucket _vectors/ stays empty. Net effect: the persistent stack should survive `git push` (which runs smokes that rehydrate vectord from primary bucket and pkill their own bin/<name>$ daemons). This commit is the first push test WITH the persistent stack live. Caveat: bin/persistent-* symlinks are gitignored already (/bin/ is in .gitignore wholesale), so the symlinks need to be created on each fresh checkout — which start_go_stack.sh does idempotently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 03:20:00 -05:00
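Layer 1 in the commit above relies on the anchored smoke pattern matching `bin/gateway` but not `bin/persistent-gateway`. The sketch below checks that claim with Go's regexp package. It is an illustration under assumptions: the alternation is shortened to two daemon names (the full list is elided in the commit message), and Go's RE2 syntax stands in for the extended regex that `pkill -f` uses, which behaves the same way for a pattern this simple.

```go
package main

import (
	"fmt"
	"regexp"
)

// smokePattern abbreviates the smoke teardown pattern to two daemon names
// for illustration; the real alternation lists more daemons.
var smokePattern = regexp.MustCompile(`bin/(storaged|gateway)$`)

func main() {
	cmdlines := []string{
		"./bin/gateway",            // smoke-owned daemon
		"./bin/storaged",           // smoke-owned daemon
		"./bin/persistent-gateway", // persistent stack via symlink
	}
	for _, c := range cmdlines {
		fmt.Printf("%-26s matched=%v\n", c, smokePattern.MatchString(c))
	}
}
```

The persistent name fails to match because the group must start immediately after `bin/`, and `persistent-gateway` is not one of the listed alternatives.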
root	77a3dcf266	cutover: first end-to-end coordinator query against persistent Go stack Three real-shape demand queries against the long-running 11-daemon stack with 500 workers ingested from workers_500k.parquet (real production data). Substrate is producing useful answers: Q1 (Forklift @ Aurora IL): 5/5 role match, top 3 in IL, dist 0.44-0.46 Q2 (CNC @ Detroit MI): top-1 in Detroit MI exactly, role pulls Machine Operator (semantic neighbor) Q3 (Warehouse @ Indianapolis IN): top-1 in Indianapolis IN, 5/5 role match, dist 0.42-0.54 This is the FIRST end-to-end coordinator-shape query against the persistent Go stack — every prior reality test (real_001..real_005) ran through harness-transient stacks that died on exit. This one ran against daemons that have been up for minutes and stayed up through retrieval. Geo is load-bearing: top-1 city/state matched in 3/3 queries. Embedder treats geography as a primary feature. Q2's CNC→Machine Operator gap exposes the playbook learning loop's purpose: judge would rate this ~3/5; the first time a coordinator approves a Machine Operator for a CNC Operator query, that recording starts shifting substrate behavior. That's the loop we've been building toward — the persistent stack is now the substrate that loop will run on. Evidence: reports/cutover/persistent_stack_first_query.md (full top-K tables + read on each query). What this does NOT prove: - Production-volume load (3 queries, 500 workers) - Concurrent latency - Full 5-loop substrate (this exercised retrieval only; no playbook recordings exist on the persistent stack yet) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 03:10:09 -05:00
root	54b2e7db76	start_go_stack.sh: document smoke-vs-persistent-stack pkill conflict Caught immediately after the prior commit pushed: pre-push smokes killed 7 of 11 persistent Go daemons because the smokes' anchored `pkill -f "bin/(name)$"` teardown matches ANY process named `bin/<daemon>`, not just the smokes' own children. Documented in the script header as a KNOWN CONSTRAINT with a workaround (re-run start_go_stack.sh after every push) and a proper-fix sketch (give the persistent stack a different binary name via build tag or symlink). Proper fix deferred until trigger fires — operators living through this once will know to want it. Persistent stack restored (all 11 healthy as of this commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:56:52 -05:00
root	09904d5222	cutover: persistent Go stack milestone — first long-running deployment + first Go-emitted audit_baselines entry J's "let's go" instruction: leave OPEN list behind, push the Go substrate forward into actual deployment shape. This commit marks the first time the Go side has run as long-running daemons rather than per-harness transient processes, and the first time the shared cross-runtime longitudinal log has carried a Go-emitted entry alongside the Rust ones. What landed: scripts/cutover/start_go_stack.sh — the persistent-stack runbook. Brings up all 11 daemons (storaged → catalogd → ingestd → queryd → embedd → vectord → pathwayd → observerd → matrixd → gateway, plus chatd-if-not-already-up) in dependency order via nohup + disown. Anchored pkill per feedback_pkill_scope (never bare "bin/"). Logs land in /tmp/gostack-logs/<bin>.log, one per daemon. Verified live state: - All 11 services healthy on :3110 + :3211-:3220 - gateway → embedd proxy returns nomic-embed-text-v2-moe vectors - chatd reports 5/5 providers loaded - No port collision with Rust gateway on :3100 - Daemons stay up after exit of the start script (production shape, not harness-transient) audit_baselines.jsonl crosses the runtime boundary: - 7 Rust-emitted entries (last: ca7375ea 2026-04-27) - 1 Go-emitted entry (ee2a40c 2026-05-01T07:53:54Z) appended via ./bin/audit_full -append-baseline - Same envelope shape, same metric set, same drift comparator semantics — operators running either runtime grow the same log What this DOES prove: - Substrate parity at deployment shape (not just unit tests) - Cross-runtime artifact write-side compatibility (was previously proven on read side via audit_baselines roundtrip) - The deploy machinery works end-to-end for the persistent case What this does NOT prove (still ahead): - Real coordinator traffic against the Go stack (no nginx flip yet; devop.live/lakehouse/ still serves through Rust) - Go-side production materializer (Phase 2 is observer-only) - 
Replay tool parity (Phase 7 is observer-only) - The 5-loop product gate against actual humans reports/cutover/SUMMARY.md now logs three new rows: - audit-FULL with 12/12 phases ported - First Go-emitted audit_baselines entry - Persistent Go stack live Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:55:29 -05:00
root	ee2a40c505	audit-FULL: port phases 1/2/5/7 — only acceptance.ts (TS-only) remains skipped Closes 4 of the 5 phases the initial audit-FULL port left as deferred. The pattern: most "deferred" phases didn't actually need the un-ported Rust pieces — they were observer-mode by design and just needed to read existing on-disk artifacts. Phase 1 (schema validators) → ported via exec.Command: Invokes `go test ./internal/distillation/...` — the Go equivalent of Rust's `bun test auditor/schemas/distillation/`. New GoTestModule field on AuditFullOptions controls the package pattern; empty disables the invocation (test mode, prevents recursion when audit-full is invoked from inside `go test`). Phase 2 (evidence materialization) → ported as observer: Reads data/evidence/ directly and tallies rows + tier-1 source hits. Doesn't re-run the materializer (which is Rust-side TS). Emits p2_evidence_rows + p2_evidence_skips metrics matching Rust shape — drop-in audit_baselines.jsonl entries possible. Phase 5 (run summary) → ported as observer: Reads reports/distillation/{run_id}/summary.json + 5 stage receipts. Validates schema_version=1, run_hash sha256, git_commit 40-char hex, all stage receipts decode as JSON. Full schema validation (StageReceipt schema) is intentionally NOT ported — it would require porting the TS schemas/distillation/ validators in full; basic shape checks catch the load-bearing invariants. Phase 7 (replay log) → ported as observer: Reads data/_kb/replay_runs.jsonl, validates last 50 rows parse as JSON. Skips the live-replay invocation that Rust's phase 7 also does — porting Rust replay.ts is substantial and not in scope. The "log shape sanity" check is what audit-full actually needs; the live invocation is a separate concern. Phase 6 (acceptance gate) — STILL SKIPPED: Rust acceptance.ts is a TS-only fixture harness with bun-specific deps. 
Porting the fixtures (tests/fixtures/distillation/acceptance/) + the 22-invariant runner to Go is an ADR-worth undertaking. Documented in the header comment. Live-data probe (against /home/profit/lakehouse): Skips count: 4 → 1 (only phase 6). Required checks: 6/6 → 12/12 PASS. New metric: p2_evidence_rows=1055, BYTE-EQUAL to the Rust pipeline's collect.records_out from the latest summary.json. Cross-runtime parity now extends across phases 0/1/2/3/4/5/7. 6 new tests: - TestPhase2_EvidenceTallyFromOnDisk: row + tier-1-hit tallying - TestPhase5_FullSummaryFlow: complete run-summary fixture passes - TestPhase5_ShortRunHashCaught: bad run_hash fails required check - TestPhase7_ReplayLogReadsFromDisk: row-count reporting - TestPhase7_MalformedTailRowsCaught: structural parse failure - TestRunAuditFull_FullFixtureFlow updated to seed evidence/ + reports/distillation/ for the phases now wired. Cleanup: removed local sortStrings helper (replaced with sort.Strings now that `sort` is imported for phase 5's mtime-sort). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:35:13 -05:00
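The Phase 5 "basic shape checks" described in the commit above (schema_version=1, a sha256 run_hash, a 40-char hex git_commit) can be sketched like this. It is a hedged illustration, not the audit_full.go code: the struct, the JSON key names, the lowercase-hex assumption, and the helpers `checkSummary` and `sampleSummary` are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// runSummary holds only the fields the shape checks look at.
// The JSON key names are assumed from the commit message.
type runSummary struct {
	SchemaVersion int    `json:"schema_version"`
	RunHash       string `json:"run_hash"`
	GitCommit     string `json:"git_commit"`
}

var (
	sha256Hex = regexp.MustCompile(`^[0-9a-f]{64}$`) // sha256 as lowercase hex
	commitHex = regexp.MustCompile(`^[0-9a-f]{40}$`) // full git commit id
)

// checkSummary returns one message per failed shape check.
func checkSummary(raw []byte) []string {
	var s runSummary
	if err := json.Unmarshal(raw, &s); err != nil {
		return []string{"summary does not decode as JSON: " + err.Error()}
	}
	var problems []string
	if s.SchemaVersion != 1 {
		problems = append(problems, fmt.Sprintf("schema_version=%d, want 1", s.SchemaVersion))
	}
	if !sha256Hex.MatchString(s.RunHash) {
		problems = append(problems, "run_hash is not 64 hex characters")
	}
	if !commitHex.MatchString(s.GitCommit) {
		problems = append(problems, "git_commit is not 40 hex characters")
	}
	return problems
}

// sampleSummary builds a fixture with hashes of the requested lengths.
func sampleSummary(hashLen, commitLen int) []byte {
	return []byte(fmt.Sprintf(`{"schema_version":1,"run_hash":%q,"git_commit":%q}`,
		strings.Repeat("a", hashLen), strings.Repeat("b", commitLen)))
}

func main() {
	fmt.Println(checkSummary(sampleSummary(64, 40))) // well-formed fixture
	fmt.Println(checkSummary(sampleSummary(12, 40))) // short run_hash
}
```

The second call mirrors the case TestPhase5_ShortRunHashCaught is described as covering: a truncated run_hash fails the check while the rest of the summary passes.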
root	55b8c76a8c	distillation: audit-FULL pipeline port (phases 0/3/4) — cross-runtime metric parity verified Ports the metric-collection passes from scripts/distillation/audit_full.ts. The substrate that PRODUCES audit_baselines.jsonl entries — the half OPEN #2 left as "deferred to next wave" after the read/write substrate landed in ca142b9. Phase coverage: Phase 0 (file presence) ported Phase 1 (schema validators) skipped (Go's `go test` covers it) Phase 2 (materializer dry-run) deferred (Go materializer not yet ported) Phase 3 (scored-runs distribution) ported Phase 4 (contamination firewall) ported Phase 5 (receipts validation) deferred (Go run-summary JSON not yet emitted) Phase 6 (replay sanity) deferred (Go replay tool not ported) Phase 7 (run summary lineage) deferred (same) Cross-runtime parity verified end-to-end: Go-side audit-full against /home/profit/lakehouse produced metrics IDENTICAL to the last Rust-emitted audit_baselines.jsonl entry. All 8 ported metrics match byte-for-byte: p3_accepted=386, p3_partial=132, p3_rejected=57, p3_human=480, p4_sft_rows=353, p4_rag_rows=448, p4_pref_pairs=83, p4_total_quarantined=1325 6/6 required checks pass on live data. Components: - internal/distillation/audit_full.go: PhaseCheck struct (mirrors Rust shape), PhaseCheckReport aggregation, RunAuditFull orchestrator, auditPhase0/3/4 implementations, FormatAuditFullReport Markdown writer. - cmd/audit_full/main.go: CLI binary with -root, -out, -json, -append-baseline flags. Operators run "./bin/audit_full -append-baseline" to grow the longitudinal log alongside the Rust pipeline (entries are interchangeable — same envelope shape). - 6 new tests: empty-root failure handling, full-fixture clean PASS (locks all 8 metrics + all 6 required checks), SFT firewall contamination detection, preference self-pair detection, sig_hash regex correctness (rejects wrong-length + uppercase), Markdown formatter smoke. 
Live-data probe captured at reports/cutover/audit_full_go_vs_rust.md (linked from reports/cutover/SUMMARY.md). Same shape as the audit_baselines round-trip evidence — both Go-side ports of the distillation surface are now validated against real Rust data, not just fixtures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 01:30:23 -05:00
root	eb0dfdff04	vectord: v2 envelope + handleMerge robustness — actions post_role_gate_v1 scrum 3-lineage scrum on 434f466..0d4f033 surfaced one convergent finding (Opus + Kimi) and 3 Opus-only real bugs. All actioned in this commit. Two false positives (Kimi rollback misreading, Opus stale- comment claim) verified + rejected — both required manual control- flow inspection to refute, matching the documented Kimi-truncation behavior in feedback_cross_lineage_review.md. Convergent fix — DecodeIndex lost nil-meta items: - Envelope version bumped 1 → 2. - New v2 field: IDs []string carries the canonical ID set explicitly, independent of meta map's nil-vs-{} sparseness. - DecodeIndex accepts both versions: v2 reads from env.IDs; v1 falls back to meta-key inference (with the documented limitation that nil-meta items are invisible — preserved for backward-compat with already-persisted indexes). - Encode emits v2 going forward. - 2 new regression tests: - TestEncodeDecode_NilMetaItemsSurviveRoundTrip: items added with nil metadata MUST survive Encode → Decode and remain visible to IDs(). Pre-fix would have yielded IDs() == []. - TestDecodeIndex_V1BackwardCompat: hand-crafted v1 envelope still decodes (proves the fallback path). Opus-only fixes: - handleMerge: non-ErrIndexNotFound errors at h.reg.Get(name) / h.reg.Get(req.Dest) now return 500 + log instead of falling through with nil src/dest pointers (which would panic on the next deref). Real bug — only the sentinel error was handled. - internal/drift/drift.go: mathLog wrapper removed; math.Log inlined. Wrapper added no value (math was already imported). - internal/distillation/audit_baseline.go: BuildAuditDriftTable's bubble sort replaced with sort.Slice. Idiomatic + shorter. Rejected after verification: - Kimi WARN "missing rollback on partial merge": misread the control flow. Code at cmd/vectord/main.go:404-414 does NOT delete from src when dest.Add fails (continue before reaching src.Delete). 
Only successful Adds trigger Deletes. - Opus INFO "TimestampUnixNano comment references missing field": field exists at scripts/multi_coord_stress/main.go:128. Opus saw only the diff context, not the full file. Deferred (no fired trigger): - Opus WARN "no per-index lock during merge": no concurrent merge callers today (operators run merge as deliberate one-shot job). Worth a lock if/when matrixd or chatd start auto-triggering. Disposition: reports/scrum/_evidence/2026-05-01/verdicts/post_role_gate_v1_disposition.md. Build + vet + tests green; 2 new regression tests + all prior tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 01:20:37 -05:00
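The v1/v2 fallback described in this commit can be sketched as below. This is a minimal illustration, assuming a JSON envelope; the `envelope` type, its field tags, and `decodeIDs` are hypothetical names and not the actual `DecodeIndex` code in vectord.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// envelope is a hypothetical stand-in for the persisted index envelope.
// The real on-disk encoding is not shown in the log; JSON is assumed here.
type envelope struct {
	Version int                        `json:"version"`
	IDs     []string                   `json:"ids,omitempty"` // v2 only
	Meta    map[string]json.RawMessage `json:"meta,omitempty"`
}

// decodeIDs returns the ID set for either envelope version.
// v2 reads the explicit IDs list; v1 infers IDs from meta keys, so items
// stored with nil metadata are invisible (the documented v1 limitation).
func decodeIDs(raw []byte) ([]string, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	var ids []string
	switch env.Version {
	case 2:
		ids = append(ids, env.IDs...)
	case 1:
		for id := range env.Meta {
			ids = append(ids, id)
		}
	default:
		return nil, fmt.Errorf("unsupported envelope version %d", env.Version)
	}
	sort.Strings(ids) // deterministic order for callers and tests
	return ids, nil
}

func main() {
	// Item "b" has no metadata: visible in v2, lost in v1.
	v2 := []byte(`{"version":2,"ids":["a","b"],"meta":{"a":{"k":"v"}}}`)
	v1 := []byte(`{"version":1,"meta":{"a":{"k":"v"}}}`)
	ids2, _ := decodeIDs(v2)
	ids1, _ := decodeIDs(v1)
	fmt.Println(ids2, ids1)
}
```

Carrying the ID list explicitly means the decoder no longer depends on whether an item happened to have metadata.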
root	0d4f033b34	audit_baselines: round-trip validation against live Rust data Same shape of proof as embed_parity.sh for the embed endpoint: take the just-shipped Go port (ca142b9) and validate it against the actual production data the Rust legacy emits, not just unit- test fixtures. Locks the cross-runtime parity that operators running mixed pipelines depend on. scripts/cutover/audit_baselines_validate.go: - Reads /home/profit/lakehouse/data/_kb/audit_baselines.jsonl - Parses every entry via the Go AuditBaseline struct - Round-trips the last entry: encode → decode → field-by-field equality check (catches any silently-dropped JSON keys) - Calls LoadLastBaseline against the live file (proves the public API works on real shapes, not just inline parsing) - Computes BuildAuditDriftTable(first → last) — full-window lineage drift over the captured baselines Live-data probe results (reports/cutover/audit_baselines_roundtrip.md): - 7 entries parse without error - Round-trip is byte-equal on every metric + every header field - Drift table fires the expected verdicts: - p2_evidence_rows 12→82 (+583%) → warn (above 20% threshold) - p3_accepted/partial/rejected/human 0→non-zero → warn (the zero-baseline edge case TestBuildAuditDriftTable_ZeroBaseline was designed to lock — verified now firing on real history) - p4_* metrics +0% → ok (stable across the window) What this does NOT prove (documented in the report): the Go-side audit-FULL pipeline that PRODUCES baselines doesn't exist yet. Only the load/append/drift substrate is ported. Operators running audit-full from Go would still need a metric-collection pass — that's a separate port deliberately not in this wave. reports/cutover/SUMMARY.md gains a new row alongside the embed parity entries; cutover-prep verification log keeps the discipline of "verified against real data, not just fixtures." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:20:18 -05:00
root	ca142b9271	distillation: audit-baselines lineage port — fully closes the OPEN #2 surface The original OPEN #2 line called for "SFT export pipeline + audit_baselines lineage." Commit 7bb432f shipped the SFT export. This commit ports the audit_baselines half — the longitudinal drift signal that distinguishes "metrics shifted because the world changed" from "metrics shifted because we broke something." Mirrors Rust scripts/distillation/audit_full.ts's substrate: - LoadLastBaseline(path) reads the most recent entry from data/_kb/audit_baselines.jsonl. Returns (nil, nil) on missing file (first run), errors on truncated last line (partial-write detection — operators don't lose drift signal silently). - AppendBaseline(path, baseline) appends one entry as a JSON line. Atomic at the line level via bufio + O_APPEND. Creates the parent directory if missing. - BuildAuditDriftTable(prior, current, threshold) computes per-metric drift. flag values mirror Rust exactly: first_run, ok, warn. DefaultDriftWarnThreshold = 0.20 = Rust's 20%. - FormatAuditDriftTable renders a fixed-width text grid for stdout dumps in audit-full runs. Edge cases handled: - Zero-baseline: prior=0 means no division — PctChange stays nil. current=0 → ok (no change). current>0 → warn (zero→nonzero is always notable, never silently fine). - New metric in current: flagged first_run, not "0%-change". Operators see "this is a new signal we haven't tracked before." - Sort: stable by metric name for deterministic JSON output and clean CI diffs. Generic on metric name (vs Rust's pinned p2_evidence_rows etc.): the Rust phase numbering doesn't translate to Go directly. The AuditBaselineRustCompat constant pins the Rust names so operators running both runtimes use the same labels, which makes drift comparison meaningful across the two pipelines. 
13 new tests covering: missing file, last-line-wins, blank-line tolerance, malformed-line errors, append round-trip, append-to-existing, schema validation, first-run, threshold boundary, zero-baseline, new-metric-in-current, sort-by-metric stability, formatter output rendering. OPEN #2's "audit_baselines lineage" half is now closed. The distillation package surface is at parity with the Rust pipeline: scorer, scored runs, SFT export, and audit baselines are all available on the Go side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:11:47 -05:00
root	7bb432f6c8	distillation: full SFT export port — closes OPEN #2 fully Follow-up to b216b7e (which shipped the SFT export substrate). This commit ports the synthesis logic, completing the migration: - SynthesizeSft(scored, ev, recordedAt, sftID) → *SftSample Mirrors the Rust synthesizeSft byte-for-byte. Returns nil for extraction-class records + empty-text records (same skip semantics as Rust). - LoadEvidenceByRunID(scoredPath, cache) reads the paired evidence JSONL (path derived by /scored-runs/ → /evidence/ replacement). Per-call cache so multiple scored-runs files in the same dir don't reload the same evidence. - buildInstruction maps source_file stem → per-class instruction template. All 8 templates (scrum_reviews, mode_experiments, auto_apply, audits, observer_reviews, contract_analyses, outcomes, default) match Rust output exactly so a/b validation between runtimes can diff JSONL byte-for-byte. - stemFromSourceFile strips data/_kb/ prefix + .jsonl suffix. - ExportSft now writes data/distilled/sft/sft_export.jsonl with the synthesized samples (DryRun=true skips file write). Per-class templates verified by 8-case sub-test: - scrum_reviews → "Review the file '...' against the PRD..." - mode_experiments → "Run task_class='...' for file..." - auto_apply → "Auto-apply: emit a 6-line surgical patch..." - audits with phase: prefix → strips to bare phase name - observer_reviews → "Observer-review the latest attempt..." - contract_analyses with permit: prefix → strips to permit ID - outcomes → "Run scenario; report per-event outcome..." - unknown source → "Source 'X' run; produce the appropriate output" Caveat documented inline: contract_analyses uses ev.metadata.contractor in Rust to produce "Analyze contractor 'X' for permit 'Y'" when present. Go's EvidenceRecord doesn't carry a free-form metadata bag yet, so we always emit the no-contractor form. 
Operators needing contractor-aware instructions can extend EvidenceRecord with an explicit Metadata field (separate ADR). Test additions (5 new): - TestSynthesizeSft_PerSourceClass: 8 sub-cases, one per template - TestSynthesizeSft_RejectsExtraction: extraction-role records skipped - TestSynthesizeSft_RejectsEmptyText: empty/whitespace text skipped - TestSynthesizeSft_ContextAssembly: matrix + pathway + model context string formatting matches Rust " · " join - TestExportSft_FullPort_WritesJSONL: end-to-end fixture, asserts output contains expected instruction + omits firewalled records Pre-existing TestExportSft_PartialPort_FirewallFires renamed + updated to TestExportSft_FirewallFiresBeforeEvidenceLoad — reflects the new contract that records passing the firewall but lacking evidence land in "not-instructable" rather than being silently exported. Honest semantics shift documented in the test. OPEN #2 now fully closed (was: substrate-only). The synthesis path no longer requires the Rust pipeline to be invoked — Go-side operators can run the full distillation export end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:06:57 -05:00
root	b216b7e5b6	fix the other 4: close all OPEN-list items in one wave Substantial wave addressing all 4 prior OPEN items. Three closed in full, one partially (the speculative half deliberately deferred). OPEN #1 — Periodic fresh→main index merge (FULL): - POST /v1/vectors/index/{src}/merge with {dest, clear_source} - Idempotent on re-runs (existing-in-dest items skipped) - internal/vectord/index.go: new Index.IDs() snapshot method + i.ids tracker field as canonical ID set, independent of meta map's nil-vs-{} sparseness (was a real bug — IDs() backed by meta alone missed items added with nil metadata) - 4 cmd-level integration tests (happy path drain+clear, dim mismatch, dest not found, self-merge rejection) + 1 unit test - DecodeIndex backward-compat: old envelopes restore i.ids from meta keys (best effort; new items going forward use the tracker) OPEN #2 — Distillation SFT export (SUBSTRATE): - internal/distillation/sft_export.go ports the load-bearing half: IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/ MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft. - Synthesis (instruction/input/response generation) deferred to a separate wave — too big for this session, but the substrate makes the next wave a port-not-design exercise. - TestSftNever_PinsExpectedSet locks the contamination firewall set: if a future commit adds/removes from SftNever, this test fails — forcing the change through review. - 5 new tests; firewall fires end-to-end through the partial port. OPEN #3 — Distribution drift via PSI (FULL): - internal/drift/drift.go: ComputeDistributionDrift via Population Stability Index. Standard finance/risk metric, well-defined verdict tiers (stable < 0.10, minor 0.10–0.25, major ≥ 0.25). - Equal-width bucketing over combined min/max so neither dist falls outside; epsilon-clamping for empty buckets so log doesn't blow up. Per-bucket breakdown for drilldown. 
- Pairs with the existing ComputeScorerDrift: scorer drift is categorical, distribution drift is continuous. Different shapes, same package. - 7 new tests covering identical-is-stable, hard-shift-is-major, moderate-detected-not-stable, empty-inputs-safe, all-identical- safe, bucket-counts-conserved, num-buckets-clamping. OPEN #4 — Ops nice-to-haves (PARTIAL — wall-clock done, others deferred): - (a) Real-time wall-clock for stress harness: per-phase elapsed time logged to stdout as it runs (`[stress] phase NAME starting (T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`). Output.PhaseTimings + Output.TotalElapsedMs in JSON. - (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase calibration: not actioned — no fired trigger, would be speculative. Documented as deferred-until-need rather than ignored. Per the project's discipline ("don't add features beyond what the task requires"). OPEN list now empty / steady-state. Future items will land as production triggers fire. Build + vet + tests green; 18 new tests across the 4 closures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 23:42:11 -05:00
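The Population Stability Index computation described under OPEN #3 can be sketched as below. It is a minimal version, assuming equal-width buckets over the combined range and a fixed epsilon of 1e-6 for empty buckets; `psi`, `verdict`, and the epsilon value are illustrative choices, not the code in internal/drift/drift.go.

```go
package main

import (
	"fmt"
	"math"
)

const psiEpsilon = 1e-6 // assumed clamp so math.Log never sees 0

// psi computes the Population Stability Index between two samples using
// equal-width buckets over their combined min/max.
func psi(baseline, current []float64, buckets int) float64 {
	if len(baseline) == 0 || len(current) == 0 {
		return 0 // empty inputs: nothing to compare
	}
	if buckets < 2 {
		buckets = 2
	}
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, xs := range [][]float64{baseline, current} {
		for _, v := range xs {
			lo, hi = math.Min(lo, v), math.Max(hi, v)
		}
	}
	if hi == lo {
		return 0 // all values identical: no drift measurable
	}
	width := (hi - lo) / float64(buckets)
	fractions := func(xs []float64) []float64 {
		out := make([]float64, buckets)
		for _, v := range xs {
			i := int((v - lo) / width)
			if i >= buckets { // v == hi lands in the last bucket
				i = buckets - 1
			}
			out[i]++
		}
		for i := range out {
			out[i] = math.Max(out[i]/float64(len(xs)), psiEpsilon)
		}
		return out
	}
	b, c := fractions(baseline), fractions(current)
	var total float64
	for i := range b {
		total += (c[i] - b[i]) * math.Log(c[i]/b[i])
	}
	return total
}

// verdict maps a PSI value onto the tiers quoted in the commit message.
func verdict(p float64) string {
	switch {
	case p < 0.10:
		return "stable"
	case p < 0.25:
		return "minor"
	default:
		return "major"
	}
}

func main() {
	base := []float64{0.1, 0.2, 0.3, 0.4}
	shifted := []float64{10.1, 10.2, 10.3, 10.4}
	fmt.Println(verdict(psi(base, base, 10)), verdict(psi(base, shifted, 10)))
}
```

Bucketing over the combined range guarantees that no value from either sample falls outside the grid, which is the property the commit message calls out.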
root	356d76b4b0	multi_coord_stress: thread role through matrix retrieve + playbook record Real wire-up gap discovered post-scrum: Demand.Role was already extracted at every call site in multi_coord_stress (44 occurrences, both contract-driven and LLM-parsed inbox-triggered paths), but neither matrixSearch nor playbookRecord accepted role in their signatures. Cross-role gate (real_001..real_004 work) was bypassed for the entire multi-coord harness — recordings and queries went through with empty role, gate fell back to lenient behavior. Fix: - matrixSearchReq gains query_role field - matrixSearch signature: (..., query, role string, ...) - tracedSearch wrapper gains role param + emits it in span input metadata for Langfuse visibility - playbookRecord signature: (..., query, role, ...) — body emits role only when non-empty (preserves backward compat at API) - 14 call sites updated: contract-driven Demand loops → d.Role LLM-parsed inbox path → parsed.Role (qwen2.5 already extracts it) swap path (warehouseDemand) → warehouseDemand.Role reissue path → ev.Role (captured at original event time) fresh-verify (resume snippet, no role concept) → "" Build clean, vet clean, all tests pass. Cross-role gate now fires end-to-end across the multi-coord harness — matches the playbook_lift harness's coverage from the original real_001 fix. This closes the symmetric gap to scripts/playbook_lift's existing wire-through. Both production-shape harnesses now exercise the role gate; future reality tests automatically inherit the protection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 23:10:49 -05:00
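The "emit role only when non-empty" behaviour mentioned for playbookRecord is commonly achieved in Go with the `omitempty` JSON tag. The sketch below assumes that approach; `playbookRecordReq`, its field names, and `encodeRecord` are hypothetical and may not match the harness.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// playbookRecordReq is a hypothetical request body for the record endpoint.
type playbookRecordReq struct {
	Query string `json:"query"`
	// omitempty drops the key entirely when Role is "", so older servers
	// that do not know about roles see an unchanged payload.
	Role string `json:"role,omitempty"`
}

// encodeRecord marshals a record request to its JSON wire form.
func encodeRecord(query, role string) (string, error) {
	b, err := json.Marshal(playbookRecordReq{Query: query, Role: role})
	return string(b), err
}

func main() {
	withRole, _ := encodeRecord("Need 2 Forklift Operators in Detroit MI", "Forklift Operator")
	noRole, _ := encodeRecord("resume snippet", "") // fresh-verify path: no role concept
	fmt.Println(withRole)
	fmt.Println(noRole)
}
```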
root	cca32344f3	reality_test real_005: negation probe (substrate gap is correctly out-of-scope) Runs 5 explicit-negation queries ("Need Forklift Operators in Aurora IL, NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.) through the standard playbook_lift harness. Goal: characterize whether the substrate has negation handling or silently treats "NOT X" as "X". Headline: substrate has zero negation handling. Under cosine similarity, the dense embedding of "NOT in Detroit" is nearly identical to that of "in Detroit" plus noise: there is no representation of logical negation in the embedding space. This is a structural property of dense embeddings, not a substrate bug. Per-query observations: - Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge - Q2 (NOT Beacon Freight): top-1 rated 4/5, accidentally OK because role+city signal pulled non-Beacon worker naturally - Q3 (excluding Cornerstone): unanimous 1/5 across top-10 - Q4 (NOT Detroit-area): all top-10 rated 1-2/5 - Q5 (exclude Heritage Foods): top-1 rated 4/5, accidentally OK The judge IS the safety net: when retrieval can't honor the constraint, the judge refuses to approve any result. That's the honesty signal; `discovery=0` for the run aggregates it. No code change. The architectural answer for production is: - UI surfaces an "exclude" affordance that populates ExcludeIDs (already supported, added in multi-coord stress 200-worker swap) - Coordinators don't type natural-language negation; they click - Substrate's role: surface honesty signal (judge ratings) + don't pretend to honor unparseable constraints Adding NL-negation handling at the substrate level would be product debt: it would let coordinators type sloppier queries that silently fail when the LLM extractor misses a phrasing. Don't ship until production traffic demonstrates demand for it. Findings: reports/reality-tests/real_005_findings.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 23:06:06 -05:00
root	434f466288	matrix: roleNormalize allowlist for non-plural-s tokens (scrum role_gate_v1) 3-lineage scrum review of the role-gate work (commits 7f2f112..0331288) ran Opus 4.7 / Kimi K2.6 / Qwen3-coder via scripts/scrum_review.sh. All three flagged the same edge case: the homegrown plural-stripper in roleNormalize would collapse non-plural-s tokens like "Sales" → "Sale", "Logistics" → "Logistic", "Operations" → "Operation". In a staffing domain those are real role names; the silent normalization would have caused false role-equality matches and re-opened the cross-role bleed for those clusters. Fix: - nonPluralSWords allowlist for known staffing-domain non-plural-s tokens (sales, logistics, operations, facilities, premises, news, physics, economics, mathematics, analytics). - Last-word-only stripping ("Sales Associate" stays whole; only "Associates" head noun is plural-checked). - -ss ending check so "Press Operator", "Boss" don't lose their s. - strings.ToLower + strings.TrimSpace replace the homegrown rune- loop ASCII normalizer (Opus INFO — minor cleanup, folded in). Tests: - TestRoleNormalize_NonPluralS: 18 cases covering the allowlist, -ss ending, real plurals (Operators → Operator, Boxes → Box), multi-word real plurals (Forklift Operators → forklift operator), whitespace/case tolerance. - TestRoleEqual_NonPluralS: gate-level pairing — proves equal- shape allowlisted tokens compare equal AND that "Sales" ≠ "Sale" (the original bug shape). - Existing TestRoleEqual_PluralAndCase still green (refactor preserved behavior). Other scrum findings dispositioned (not actioned): - Opus WARN on empty-role fail-open semantics: documented backward-compat behavior; production path closes via opt-in LLM extractor (real_004). - Opus INFO on unsynchronized package-global cache map: harness is single-goroutine; add sync.Mutex when/if it parallelizes. - Opus INFO on parallel constructor (NewPlaybookEntryWithRole vs optional arg): API smell only, both forms preserved. 
- Kimi 2 BLOCKs (NewPlaybookEntryWithRole missing, ApplyPlaybookBoost signature breakage): FALSE positives. Pre-push smoke chain green on 0331288, both symbols + all call sites compile clean. Matches feedback_cross_lineage_review.md's documented Kimi truncation behavior — Kimi BLOCKs warrant trace verification before action. Disposition (local): reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 22:58:02 -05:00
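The normalization rules listed in this commit (allowlist, last-word-only stripping, the -ss check) can be sketched as below. The allowlist entries come from the commit message; the function bodies, the name `singular`, and the exact set of "-es" suffixes are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// nonPluralSWords lists tokens that end in "s" but are not plurals.
var nonPluralSWords = map[string]bool{
	"sales": true, "logistics": true, "operations": true, "facilities": true,
	"premises": true, "news": true, "physics": true, "economics": true,
	"mathematics": true, "analytics": true,
}

// singular strips a plural ending from one lowercase word.
func singular(w string) string {
	switch {
	case nonPluralSWords[w], strings.HasSuffix(w, "ss"), !strings.HasSuffix(w, "s"):
		return w // allowlisted, "-ss" ending, or no trailing "s" at all
	}
	// Assumed "-es" plurals; the real suffix list is not shown in the log.
	for _, suf := range []string{"xes", "ches", "shes", "sses"} {
		if strings.HasSuffix(w, suf) {
			return strings.TrimSuffix(w, "es")
		}
	}
	return strings.TrimSuffix(w, "s")
}

// roleNormalize lowercases, trims, and singularizes only the last word.
func roleNormalize(role string) string {
	words := strings.Fields(strings.ToLower(strings.TrimSpace(role)))
	if len(words) == 0 {
		return ""
	}
	words[len(words)-1] = singular(words[len(words)-1])
	return strings.Join(words, " ")
}

func main() {
	for _, r := range []string{"Sales", "Forklift Operators", "Press Operator", "Boxes"} {
		fmt.Printf("%q -> %q\n", r, roleNormalize(r))
	}
}
```

Because only the last word is inspected, "Sales Associate" keeps "sales" intact regardless of the allowlist, and the allowlist only matters when the non-plural token is the head noun.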
root	0331288641	playbook_lift: LLM-based role extractor closes shorthand bleed (real_004) real_003 left a known-weak hole: shorthand-style queries ("{count} {role} {city} {state} ...") have no separator between role and city, so a regex can't reliably extract — leaving the cross-role gate disabled when both record AND query are shorthand. This commit adds a roleExtractor with regex-first + LLM fallback: - Regex first (fast, deterministic) — handles need + client_first + looking from real_003b. ~75% of styles, no LLM cost paid. - LLM fallback when regex returns empty AND model is configured — Ollama-shape /api/chat with format=json, schema-tight prompt, temperature 0. ~1-3s on local qwen2.5. - Per-process cache — paraphrase + rejudge passes reuse the same query 4× per run; cache prevents 4× LLM cost. - Off-by-default — opt-in via -llm-role-extract flag (CLI) and LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping config unchanged unless explicitly enabled. 8 new tests in scripts/playbook_lift/main_test.go: - TestRoleExtractor_RegexFirst: LLM not called when regex matches - TestRoleExtractor_LLMFallback: shorthand goes to LLM - TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved - TestRoleExtractor_Cache: 3 calls = 1 LLM hit - TestRoleExtractor_NilSafe: nil receiver runs regex only - TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths - TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic witness for the real_003 scenario — both record + query are shorthand, regex returns "" for both, LLM produces DIFFERENT role tokens for CNC vs Forklift, so matrix gate's cross-role rejection (locked separately in TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires correctly. This is the load-bearing verification. Reality test real_004 ran the same 40-query stress as real_003 with LLM extraction on. 
Cross-style same-role boosts fired correctly across all 4 styles for Loaders + Packers + Shipping Clerk clusters (including shorthand → other-style transfer). No cross-role bleed observed. The reality test alone can't be a clean "with vs without" comparison (HNSW build is non-deterministic across runs, and real_004 stochastics didn't trigger a shorthand recording at all), which is why the unit-test witness exists. Production note (in real_004_findings.md): LLM extraction is for reality-test coverage of arbitrary query shapes. Production should extract role at INGEST time (when the inbox parser already runs an LLM) and pass already-resolved role through requests — same shape as multi_coord_stress's existing Demand{Role: ...} model. The hot path should never need the harness extractor's per-query LLM cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 22:51:27 -05:00
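The regex-first, LLM-fallback, cached extractor described in this commit can be sketched as follows. The sketch is illustrative only: the `roleExtractor` fields, the single `need`-style regex, and the injected `llm` function are assumptions, and the LLM call is stubbed out rather than hitting an Ollama-shape endpoint.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// needRe covers only the "Need N {role} in ..." style; other styles omitted.
var needRe = regexp.MustCompile(`(?i)^need\s+\d+\s+(.+?)\s+in\s`)

// roleExtractor is a hypothetical shape for the harness extractor.
type roleExtractor struct {
	llm   func(query string) (string, error) // nil means LLM fallback is off
	cache map[string]string                  // per-process cache of LLM answers
}

// Extract tries the regex first, then the cached or live LLM fallback.
// It is safe to call on a nil receiver, which runs the regex only.
func (e *roleExtractor) Extract(query string) string {
	if m := needRe.FindStringSubmatch(query); m != nil {
		return strings.TrimSpace(m[1])
	}
	if e == nil || e.llm == nil {
		return "" // opt-in default: no LLM cost unless configured
	}
	if role, ok := e.cache[query]; ok {
		return role
	}
	role, err := e.llm(query)
	if err != nil {
		return "" // failure leaves the role empty, which disables the gate
	}
	if e.cache == nil {
		e.cache = make(map[string]string)
	}
	e.cache[query] = role
	return role
}

func main() {
	calls := 0
	e := &roleExtractor{llm: func(q string) (string, error) {
		calls++
		return "CNC Operator", nil // stub standing in for the model reply
	}}
	fmt.Println(e.Extract("Need 2 Forklift Operators in Detroit MI"))
	fmt.Println(e.Extract("2 CNC Operator Detroit MI"), e.Extract("2 CNC Operator Detroit MI"))
	fmt.Println("llm calls:", calls)
}
```

Injecting the LLM as a function keeps the sketch testable without a network and mirrors the "3 calls = 1 LLM hit" property the commit's cache test asserts.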
root	3263254f1c	reality_test real_003: 40-query paraphrase stress + extractor extension Stress-tests the role gate with 40 queries (10 fill_events rows × 4 styles): need, client_first, looking, shorthand. Each row's role + client + city stays the same; only the surface phrasing changes. real_003 (original extractor) confirmed the shorthand-vs-shorthand failure mode: CNC Operator shorthand recording leaked w-2404 onto Forklift Operator shorthand query within the same Beacon Freight Detroit cluster. Both record + query had empty role (extractor returns "" for shorthand because there's no separator between role and city), gate disabled, distance check passed, bleed fired. Fix: extended extractRoleFromNeed to handle client_first ("{client} needs N {role} in...") and looking ("Looking for N {role} at...") patterns. Shorthand left intentionally unmatched: "Forklift Operator Detroit" is shape-indistinguishable from "Forklift" + "Operator Detroit" without an LLM extractor or known-cities lookup. real_003b (extended extractor) verifies bleed closed across all 4 styles for this dataset. Forklift Operator queries keep w-2136 (the cold-pass-correct match) regardless of which style the query came in. Same-role boosts now fire correctly across styles: a CNC Operator recording made in `looking` style boosts the CNC need-form query. scripts/cutover/gen_real_queries.go: added -styles flag with values need|client_first|looking|shorthand|all (default need preserves real_001/002 behavior). tests/reality/real_coord_queries_v2.txt is the 40-query stress file. scripts/playbook_lift/main_test.go: 10 sub-tests lock the four documented patterns + shorthand limitation + lift-suite-style queries (no clean role, returns empty as expected). 
Aggregate metrics: - real_003 (original): disc=7, lift=7, boost=14, meanΔ=-0.108 - real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202 The growth reflects more LEGITIMATE same-role same-cluster transfer firing across styles, not bleed (verified by per-cluster bleed table — Forklift Operator queries unchanged across all 4 styles). Known limitation documented in real_003_findings.md: same-cluster, same-role queries in shorthand still embed close enough that a shorthand recording could bleed onto a different-role shorthand query if both record + query strip role. Closing this requires LLM extraction or known-cities lookup at record + query time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:42:02 -05:00
root	997527be4d	matrix: cross-role playbook gate, closes real_001 bleed (OPEN #1) real_001 surfaced same-client+city queries bleeding across roles: Q#2 (Forklift Operator @ Beacon Freight Detroit) recorded e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and Q#10 (CNC Operator same client+city) embedded within 0.13-0.18 cosine distance of Q#2's query, well inside the 0.20 inject threshold, so e-6193 injected on both, demoting the cold-pass-correct workers. Root cause: the inject distance threshold isn't tight enough on the same-client+city cluster. Cosine collapses queries that share city + client + count-token + time-token regardless of role. The existing judge gate is per-injection at record time and doesn't fire at retrieve time. Fix: structural role gate in front of both Shape A boost and Shape B inject. PlaybookEntry gains Role; SearchRequest gains QueryRole. When both are non-empty and differ under roleEqual's case+plural normalization, the entry is rejected before BoostFactor or judge-gate logic runs. Backward-compat: empty role on either side disables the gate, which preserves behavior for the lift suite's free-form multi-constraint queries that have no clean single role. Caller-supplied (not inferred), so existing recordings are unaffected. 
Wire-through: - internal/matrix/playbook.go: Role field, NewPlaybookEntryWithRole, roleEqual helper with plural+case normalization - internal/matrix/retrieve.go: QueryRole on SearchRequest, threaded to both ApplyPlaybookBoost + InjectPlaybookMisses - cmd/matrixd/main.go: role on POST /matrix/playbooks/record + bulk - scripts/playbook_lift/main.go: extractRoleFromNeed regex pulls role from "Need N {role}{s} in" queries (the fill_events shape); free-form queries fall back to empty (gate disabled) Tests (5 new): - TestInjectPlaybookMisses_RoleGateRejectsCrossRole: exact Q#10 scenario (distance 0.135, recorded "Forklift Operator", query "CNC Operator") — locks the bleed at unit level - TestInjectPlaybookMisses_RoleGateAllowsSameRole: Forklift Operator recording fires on Forklift Operators query (plural normalization) - TestInjectPlaybookMisses_RoleGateBackwardCompat: empty Role on either side = gate disabled, preserves current behavior - TestApplyPlaybookBoost_RoleGateRejectsCrossRole: Shape A defense in depth — boost doesn't fire on cross-role even when answer is in cold top-K - TestRoleEqual_PluralAndCase: case + -s + -es plural normalization Verification (real_002, same query set as real_001): - Q#5 Pickers @ Beacon Freight: e-6193 → e-8499 (no bleed) - Q#10 CNC Operator @ Beacon Freight: e-6193 → w-2404 (no bleed) - Discoveries + lifts unchanged at 2 each (same-role lift still fires) - Mean Δdist tightens from -0.127 to -0.040 (boosts no longer pulling distances through the floor on cross-role mismatches) Findings: reports/reality-tests/real_002_findings.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:34:10 -05:00
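The gate semantics in this commit (reject only when both roles are non-empty and differ) can be sketched as below. `entry`, `roleGateAllows`, and `filterByRole` are hypothetical names, and `crudeNorm` is a deliberately simplified stand-in for roleEqual's normalization that only lowercases and trims one trailing "s".

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a hypothetical playbook entry carrying the recorded role.
type entry struct {
	WorkerID string
	Role     string
}

// crudeNorm is a simplified stand-in for the real case+plural normalization.
func crudeNorm(role string) string {
	return strings.TrimSuffix(strings.ToLower(strings.TrimSpace(role)), "s")
}

// roleGateAllows reports whether an entry may boost or inject for a query.
// An empty role on either side disables the gate (backward compatibility).
func roleGateAllows(entryRole, queryRole string) bool {
	if entryRole == "" || queryRole == "" {
		return true
	}
	return crudeNorm(entryRole) == crudeNorm(queryRole)
}

// filterByRole drops cross-role entries before any boost or judge logic runs.
func filterByRole(entries []entry, queryRole string) []entry {
	var kept []entry
	for _, e := range entries {
		if roleGateAllows(e.Role, queryRole) {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	corpus := []entry{
		{WorkerID: "e-6193", Role: "Forklift Operator"},
		{WorkerID: "legacy", Role: ""}, // recorded before roles existed
	}
	fmt.Println(len(filterByRole(corpus, "CNC Operator")))       // cross-role entry dropped
	fmt.Println(len(filterByRole(corpus, "Forklift Operators"))) // plural still matches
}
```

Running the gate before distance-based boosting means a tight cosine distance alone can no longer carry a worker across roles inside the same client+city cluster.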
root	7f2f112e6a	reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed First retrieval probe with non-synthetic query distribution. Pulls N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet (real-shape demand data) and translates each to the natural language a coordinator would type: "Need {count} {role}s in {city} {state} starting at {at} for {client}". Headline: 8/10 cold-pass top-1 = judge-best on real distribution. Substrate works on queries it was never trained for. v2-moe + workers corpus carry the load. Surfaced finding (the real value of running this): same-client+city queries cluster, and Shape A's distance boost bleeds across roles within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1 even though: - Neither query has its own recorded playbook. - Neither warm pass triggers a Shape B inject (boosted=0). - The roles are different staffing categories. Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating 4 at rank 0) for a worker who was approved by the judge for a different role on a different query. Why the lift suite missed it: synthetic queries use 7 disjoint scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand clusters on (client, city). The cluster doesn't exist in the synthetic distribution. Why the judge gate doesn't catch it: the gate (5a3364f) is per-injection at record time. After approval the worker rides Shape A distance boosts on all later same-cluster queries with no second gate call. Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus metadata + Shape A boost gate on role match. Cheap; doesn't need new judge calls. 
Files: - scripts/cutover/gen_real_queries.go: parquet → coordinator NL - tests/reality/real_coord_queries.txt: 10 generated queries - reports/reality-tests/playbook_lift_real_001.md: harness output - reports/reality-tests/real_001_findings.md: the reading Repro: go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \ WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:18:40 -05:00
root	5687ec65c2	G5 cutover prep: embed parity probe — Rust /ai/embed ↔ Go /v1/embed verified First concrete cutover artifact: scripts/cutover/embed_parity.sh brings up Go embedd + gateway alongside the live Rust gateway, hits both /ai/embed and /v1/embed with the same forced model, and emits a per-date verdict report under reports/cutover/. Why embed first: the parity invariant is one math identity (cosine sim of vectors against same input). Retrieve has thousands of edge cases. If embed parity holds, all downstream vector consumers inherit confidence; if it doesn't, we catch it in 30s instead of after a flip. Verdict 2026-04-30: 5/5 samples cosine=1.000000 with model forced to nomic-embed-text (v1). Same with nomic-embed-text-v2-moe (both Ollamas have it loaded). Math is provably equivalent across the gateway plumbing. Drift catalog (reports/cutover/SUMMARY.md): - URL: Rust /ai/embed vs Go /v1/embed - Wire: Rust {embeddings, dimensions} (plural) vs Go {vectors, dimension} (singular). Wire-format adapter is the only real cutover work for this endpoint. - L2 norm: Rust unit vectors (~1.0); Go raw Ollama (~20-23). Same direction (cos=1.0); harmless under cosine-distance HNSW (which is Go vectord's default), but worth fixing in internal/embed/ before extending to euclidean indexes. reports/cutover/ now tracked (joined the scrum/ + reality-tests/ exemptions in .gitignore). Next probe: /v1/matrix/retrieve ↔ Rust /vectors/hybrid for the real user-facing retrieve path. Embed parity gives that probe a clean foundation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:07:04 -05:00
root	a2fa9a2ce7	scripts/scrum_review: pipe diff via temp files — fixes argv overflow on large bundles `jq --arg` and `curl --data-binary @-` both read stdin/argv-bound buffers. Diffs >~128KB blow past the kernel's argv limit even when piped via stdin (because we still build `body` as a shell variable first, then feed it to curl). Voice-ai full bundle was 156K and hit it. Switch to writing user/system/body to mktemp files, jq reads via --rawfile, curl reads via @file. Same on-the-wire shape, no argv involvement. Cleanup with rm at the end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:57:34 -05:00
root	68d9e554b0	shared: auto-emit Langfuse trace+span per HTTP request — closes OPEN #2 Adds langfuseMiddleware in internal/shared so every daemon's shared.Run gets free production-traffic trace visibility when LANGFUSE_URL + LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY are set. Same env names + file shape as the multi_coord_stress driver, so operators ship one /etc/lakehouse/langfuse.env across the deploy. Wiring is auth-gated: middleware runs INSIDE the RequireAuth group, so 401s from credential-stuffing don't pollute traces. /health is exempt so LB probes don't either. Missing env vars → nil client → middleware is a passthrough no-op (fail-open per ADR-005 5.1). Bundled deploy: - langfuse.env.example template (mode 0640, root:lakehouse) - 11 systemd units gain `EnvironmentFile=-/etc/lakehouse/langfuse.env` (leading - so missing file = OK) - REPLICATION.md bootstrap section documents setup Tests (4): nil passthrough, /health bypass, real-request emission, status-writer wrapping. All green. STATE_OF_PLAY OPEN list: 5 rows → 4 rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:55:42 -05:00
root	5a3364f539	matrix: judge-gated Shape B inject — closes lift-suite tail issues Lift suite run #004 left two unresolved tail issues: - Q6 ("Forklift loader") ↔ Q7 ("Hazmat warehouse, cold storage") swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Distance gate can't tell them apart. - Q9 + Q15 lose paraphrase recovery when qwen2.5 rephrases past the 0.20 threshold. Distance says "drift too far"; sometimes the drift is real (skip), sometimes the paraphrase is still on-domain (don't want to skip). Multi-coord run #008's judge re-rating proved the LLM can distinguish: Q3 crane case landed at distance 0.23 (looks tight) but rating 1 (irrelevant). The judge sees domain mismatch the embedder doesn't. This commit lifts that pattern into the matrix substrate. Shape B inject now optionally routes every candidate through a judge gate before the rank insert lands. Distance + judge BOTH have to approve. internal/matrix/playbook.go: - InjectPlaybookMisses signature gains a query string + an optional InjectGate. nil gate preserves pre-judge-gating behavior (current tests already pass with nil). - New InjectGate interface + InjectGateFunc adapter for tests and non-LLM callers. - Per-candidate gate.Approve(query, hit) call inserted between the dedup and the inject. Rejected candidates skip silently; injected count reflects post-gate decision. internal/matrix/judge.go (new, ~140 lines): - LLMJudgeGate calls an Ollama-shape /api/chat endpoint with the same 1-5 staffing-rubric prompt that worked in multi_coord run #008. fail-closed on HTTP/JSON errors (don't inject if judge can't speak — better miss than wrong-domain). - NewLLMJudgeGate returns nil when URL or Model is empty, matching InjectGate's nil-means-no-judge semantics. internal/matrix/retrieve.go: - SearchRequest gains JudgeURL, JudgeModel, JudgeMinRating fields. Run() builds an LLMJudgeGate when set; passes nil otherwise. 
Backward compatible — existing callers see no behavior change. Tests: - TestInjectPlaybookMisses_GateRejectsCandidate (rejectAll → 0 injected, even with tight distance) - TestInjectPlaybookMisses_GateApprovesCandidate (approveAll → same as nil-gate behavior) - TestInjectPlaybookMisses_GateSeesCorrectQuery (gate receives CURRENT query + RECORDED query separately so it can score the (current, candidate) pair) - All 5 existing inject tests updated to new signature go test ./internal/matrix → all 8 inject tests pass. go test ./internal/matrix ./internal/shared ./cmd/{matrixd,queryd,pathwayd,observerd} → all green. STATE_OF_PLAY: - OPEN item #1 (judge-gated injection) closed. - DO NOT RELITIGATE adds the substrate-level judge-gate lock. - OPEN list now 5 rows (was 6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:38:12 -05:00
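The "distance + judge BOTH have to approve" rule from 5a3364f539 could look something like the sketch below. The names `InjectGate` and `InjectGateFunc` come from the commit message; the `Hit` struct, the `injectMisses` helper, and the worker IDs are assumptions made for illustration and do not reproduce the real InjectPlaybookMisses signature.

```go
package main

import "fmt"

// Hit is a hypothetical stand-in for a recorded playbook candidate.
type Hit struct {
	ID       string
	Query    string  // query text recorded with the playbook entry
	Distance float64 // cosine distance to the current query
}

// InjectGate decides whether a candidate may be injected.
type InjectGate interface {
	Approve(query string, hit Hit) bool
}

// InjectGateFunc adapts a plain function to InjectGate.
type InjectGateFunc func(query string, hit Hit) bool

func (f InjectGateFunc) Approve(query string, hit Hit) bool { return f(query, hit) }

// injectMisses returns the IDs that pass the distance threshold and, when a
// gate is supplied, the gate as well. A nil gate means distance-only.
func injectMisses(query string, candidates []Hit, maxDist float64, gate InjectGate) []string {
	var injected []string
	for _, c := range candidates {
		if c.Distance > maxDist {
			continue // distance gate rejects
		}
		if gate != nil && !gate.Approve(query, c) {
			continue // judge gate rejects; skip silently
		}
		injected = append(injected, c.ID)
	}
	return injected
}

func main() {
	cands := []Hit{
		{ID: "w-1", Query: "Forklift loader", Distance: 0.12},
		{ID: "w-2", Query: "Hazmat warehouse, cold storage", Distance: 0.18},
	}
	rejectAll := InjectGateFunc(func(string, Hit) bool { return false })
	fmt.Println(len(injectMisses("Forklift loader", cands, 0.20, nil)))
	fmt.Println(len(injectMisses("Forklift loader", cands, 0.20, rejectAll)))
}
```

Passing a literal nil keeps the pre-gate behavior, which mirrors the "nil gate preserves pre-judge-gating behavior" note in the commit.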
root	247e36e687	STATE_OF_PLAY: trim OPEN list — 9 rows → 6, ordered by product leverage Sprint 4 row removed (shipped: a59ef5b systemd + 54a05d9 docker). ADR-006 row already dropped on the previous STATE update. Two lift-suite tail items (Q6↔Q7 adjacent-query, Q9/Q15 liberal-paraphrase) consolidated into one "judge-gated playbook injection" row — both are downstream of the same fix (let the judge approve before Shape B inserts). Captures the design lineage from multi-coord run #008's judge-rating pattern. Three items folded into a single "operational nice-to-haves" row: real-time clock, chatd fixture storage half, liberal-paraphrase calibration. None are product-blocking; each lights up when someone hits its specific trigger. Reorder reflects leverage on the active product theory (multi-coord staffing co-pilot via the 5-loop substrate), not effort: 1. Judge-gated injection (lift quality + lift-tail closure) 2. Wider Langfuse instrumentation (production observability) 3. Fresh→main merge (operational hygiene as the corpus grows) 4. Distillation full port (production dependency, not yet) 5. Drift quantification (research) 6. Operational nice-to-haves Lead-in note added: "Items move to closed when the work demands them, not on a calendar." Locks intent against future drift toward a sprawling todo list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:32:31 -05:00
root	54a05d9311	Sprint 4 deployment artifacts: Dockerfile + docker-compose Parallel deploy target to the systemd units that landed in a59ef5b. Single image carries all 11 daemons; docker-compose runs one container per daemon with the same dependency graph as the systemd units. Useful when systemd isn't available (Mac dev, remote VMs without root) or when isolation to a private docker network is preferred. Dockerfile (multi-stage): - Builder: golang:1.25-bookworm. DuckDB cgo needs gcc + glibc; alpine's musl doesn't link the official duckdb-go bindings cleanly. - Runtime: debian:bookworm-slim — same libc, much smaller surface. Adds ca-certificates (outbound HTTPS to OpenRouter/OpenCode/Kimi), curl + jq (in-container healthchecks + smoke probes), tini (PID 1 signal forwarding so docker stop sends SIGTERM to the daemon, not to a wrapper). - Single image, multiple binaries. Ships all 11 cmd/* + 3 scripts/ (staffing_workers, playbook_lift, multi_coord_stress) so deployed stacks can run reality tests against themselves. - Non-root runtime user (uid 999 lakehouse). Layout matches /usr/local/bin/lakehouse/<daemon> from REPLICATION.md. - ENTRYPOINT=tini; no default CMD — operators / compose pick which daemon explicitly. docker-compose.yml (11 services): - Same dependency graph as deploy/systemd/. depends_on with service_healthy condition matches Requires= equivalents: catalogd → storaged ingestd → storaged + catalogd queryd → catalogd matrixd → embedd + vectord - Gateway uses bare depends_on (no health condition) — Wants= equivalent so single-upstream restart doesn't cascade. - chatd has per-provider env_file entries (one each for ollama_cloud, openrouter, opencode, kimi) — missing files are silently OK, matching the systemd unit's EnvironmentFile=- list. - Persistent state on the lakehouse-state named volume; commented driver_opts shows how to bind to a host path for off-volume backups. .dockerignore: - Excludes bin/ + reports/ + data/ + git metadata + .env files. 
- Especially excludes lakehouse.toml/secrets-go.toml/auth.env so local dev configs don't accidentally bake into a published image. REPLICATION.md gains a Docker section between systemd setup and the logs section. Ten-line copy-paste from "git clone" to "docker compose up -d", plus a docker-vs-systemd differences table covering process supervision, logs, restart policy, file ownership, host networking quirks, and backup targets. Validation: docker compose config --quiet → exit 0 (with placeholder env files in place). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:58:47 -05:00
root	a59ef5b930	Sprint 4 deployment artifacts: 11 systemd units + REPLICATION.md + env templates Builds on ADR-006 to ship the operator-facing bits Sprint 4 was blocked on. Single-host deploy is now a documented procedure. deploy/systemd/ (12 files): - 11 .service units, one per daemon. Each follows the same template: Type=simple, User=lakehouse, hardening (NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp, ReadWritePaths scoped to /var/lib/lakehouse + /var/log/lakehouse), JSON to journald with per-daemon SyslogIdentifier, EnvironmentFile=- on /etc/lakehouse/auth.env. - Dependency graph baked in via After=/Requires=: storaged → standalone (only network-online) catalogd → Requires storaged ingestd → Requires storaged + catalogd queryd → Requires catalogd matrixd → Requires embedd + vectord gateway → Wants every other daemon (Wants= not Requires= so a single upstream restart doesn't cascade-restart the gateway) pathwayd / observerd / vectord / embedd / chatd → standalone - chatd unit reads 4 cloud-provider EnvironmentFile=s (ollama_cloud / openrouter / opencode / kimi) — each is its own file so per-provider key rotation doesn't restart the others. - lakehouse-go.target: convenience aggregator. Operators systemctl start/stop/enable lakehouse-go.target instead of managing 11 daemons individually. Per-daemon WantedBy= this target. deploy/etc-lakehouse/ (2 templates): - auth.env.example: AUTH_TOKEN per ADR-006 6.2 + rotation playbook comments. The committed file is empty — operators copy + fill in. - secrets-go.toml.example: [s3.primary] template with REPLACE_ME placeholders. Multi-bucket G2 example commented. REPLICATION.md (top-level): - Operator runbook from fresh box → 11 daemons running. - Prereqs (Go 1.25+, gcc, MinIO, Ollama, optionally Langfuse + Postgres for Langfuse) with reachability checks. - Bind ports table (3110–3220, shifted by 10 from Rust legacy). - Bootstrap: useradd → build → install → config → secrets → systemd → validation. 
- Auth posture matrix (loopback / non-loopback / multi-host / TLS). - Token rotation procedure inline (ADR-006 Decision 6.5). - Logs (journalctl), backup paths, troubleshooting matrix. Validation: systemd-analyze verify passed on all 11 .service files (only "not executable" warnings, expected since binaries don't live at /usr/local/bin/lakehouse/ until step 2 of bootstrap runs). Sprint 4 is now operator-ready. Next: Dockerfile + multi-stage build for container deploys (separate concern; deploy targets either systemd OR docker, not both). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:54:49 -05:00
root	814197cfd3	ADR-006: auth posture for non-loopback deploy + token rotation impl ADR-003 locked the auth substrate; ADR-006 ratifies the operator playbook + adds two implementation pieces needed for Sprint 4 deployment: env-resolved tokens and dual-token rotation. Six decisions locked in docs/DECISIONS.md: - 6.1: Non-loopback bind requires auth.token (mechanical gate at shared.Run, already implemented; this ratifies it). - 6.2: Token from env, not TOML. /etc/lakehouse/auth.env (mode 0600) loaded by systemd EnvironmentFile=. New TokenEnv field on AuthConfig defaults to "AUTH_TOKEN". - 6.3: AllowedIPs for inter-service same-trust-domain; Token for cross-trust-boundary (gateway ↔ external). - 6.4: /health stays unauthenticated; everything else under shared.Run is gated. Already implemented; ratified here. - 6.5: Token rotation is dual-token. New SecondaryTokens []string on AuthConfig — both primary and any secondary pass auth during the rotation window. Implemented in this commit. - 6.6: TLS terminates at the network edge (nginx/Caddy), not in-process. Daemons stay HTTP-only; internal traffic stays on private subnets per Decision 6.3. Implementation: - internal/shared/config.go: AuthConfig gains TokenEnv + SecondaryTokens fields. New resolveAuthFromEnv() called by LoadConfig fills Token from os.Getenv(TokenEnv) when Token is empty. TokenEnv defaults to "AUTH_TOKEN" so the happy path needs no TOML config. - internal/shared/auth.go: RequireAuth pre-encodes Bearer headers for primary + every secondary token; per-request constant-time compare walks the slice. Fast path is 1 compare (primary). Tests: - TestLoadConfig_AuthTokenFromEnv (3 sub-tests): default env name, custom token_env, explicit Token wins over env. - TestRequireAuth_SecondaryTokenAccepted: both primary + secondary tokens pass during rotation window. - TestRequireAuth_SecondaryTokensOnly: only-secondary path works for the case where primary was just promoted-to-empty mid-rotation. 
go test ./internal/shared all green; existing auth_test.go unchanged (constant-time compare path preserved). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:51:14 -05:00
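The dual-token rotation from ADR-006 Decision 6.5 can be illustrated with a small sketch. It assumes hypothetical helpers `encodeBearer` and `authorized` and example token values; the real RequireAuth in internal/shared/auth.go is an HTTP middleware and may differ in detail. The primary token is checked first, so the common case costs one constant-time compare.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// encodeBearer pre-builds the accepted Authorization header values,
// primary first. Empty tokens are skipped so a secondary-only config works.
func encodeBearer(primary string, secondary []string) [][]byte {
	var out [][]byte
	for _, tok := range append([]string{primary}, secondary...) {
		if tok != "" {
			out = append(out, []byte("Bearer "+tok))
		}
	}
	return out
}

// authorized walks the accepted headers with a constant-time compare and
// returns on the first match.
func authorized(header string, accepted [][]byte) bool {
	got := []byte(header)
	for _, want := range accepted {
		if subtle.ConstantTimeCompare(got, want) == 1 {
			return true
		}
	}
	return false
}

func main() {
	// Token values are placeholders for illustration only.
	accepted := encodeBearer("new-token", []string{"old-token"})
	fmt.Println(authorized("Bearer new-token", accepted))
	fmt.Println(authorized("Bearer old-token", accepted))
	fmt.Println(authorized("Bearer stale", accepted))
}
```

During a rotation window both tokens pass; removing the old value from the secondary list ends the window.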
root	6c93a38093	scrum multi_coord_phase3: 4 fixes from cross-lineage review Cross-lineage scrum on bundle 87cbd10..f971e64 (3,652 lines) produced 4 actionable findings, all defensive hardening. 1. (Opus WARN) internal/langfuse/client.go:queue Synchronous Flush at maxBatch threshold blocked the calling goroutine for the full 5s HTTP timeout when Langfuse hiccupped, defeating the "best-effort, never blocks calling path" contract in the package doc. Now fire-and-forget via goroutine. 2. (Opus + Kimi convergent) cmd/observerd/main.go:handleInbox - Free-form priority string was accepted; "nonsense" passed through unchecked. Now closed enum: urgent|high|medium|low (+ empty defaults to medium). Tested: TestInbox_RejectsBadPriority. - No size cap on body, only emptiness check; multi-MB payloads would bloat observer's ring + JSONL. Now 8 KiB cap returns 413. Tested: TestInbox_RejectsOversizedBody. - Subject/sender/tag concatenated into InputSummary without newline stripping; embedded \n could corrupt JSONL line-based parsers. New sanitizeInboxField strips \r\n + caps at 256 chars before interpolation. 3. (Opus INFO) scripts/multi_coord_stress/main.go Removed dead `must[T]` generic — tracedSearch took over the fail-fast role for matrix searches, so the helper became unused. 4. (Opus INFO) scripts/multi_coord_stress/main.go:Event `JudgeRating int` collapsed "judge errored" and "judge said unrated" both to 0. Changed to *int — nil = errored, 1-5 = verdict. judgeInboxResult still returns 0 on error; caller gates on > 0 before assigning. Dismissed (with rationale): - Opus WARN ExcludeIDs ordering: verified by code read — filter applies after sort + before top-K truncation as documented; no slot waste possible. - Opus INFO 10 prior-run reports contradict #011: those are point-in-time snapshots; intentional history. - Kimi INFO Langfuse error suppression: design intent (best-effort per package doc). 
- Kimi INFO contract schema validation: defer until contract count grows enough to make hand-edit drift a real risk. - Kimi INFO paraphrase prompt duplicated across lift + multi_coord: defer (lift to internal/paraphrase/ when a third consumer appears). - Qwen HOLD: single-line, no actionable finding. go test ./cmd/observerd ./internal/langfuse all green; multi_coord driver builds clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:42:07 -05:00
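The inbox hardening in 6c93a38093 (closed priority enum, newline stripping, 256-character cap) might be sketched as follows. `sanitizeField` and `normalizePriority` are hypothetical names; this sketch assumes that "strips \r\n" means the characters are removed outright and that the cap counts runes rather than bytes, neither of which the commit message confirms.

```go
package main

import (
	"fmt"
	"strings"
)

// maxFieldRunes mirrors the 256-character cap named in the commit message.
const maxFieldRunes = 256

// sanitizeField removes CR/LF so a field cannot break a JSONL line, then
// truncates on a rune boundary (assumption: the cap is in characters).
func sanitizeField(s string) string {
	s = strings.NewReplacer("\r", "", "\n", "").Replace(s)
	if r := []rune(s); len(r) > maxFieldRunes {
		s = string(r[:maxFieldRunes])
	}
	return s
}

// normalizePriority enforces the closed enum; empty defaults to medium.
func normalizePriority(p string) (string, bool) {
	switch p {
	case "":
		return "medium", true
	case "urgent", "high", "medium", "low":
		return p, true
	}
	return "", false
}

func main() {
	fmt.Printf("%q\n", sanitizeField("Need 50 forklift\r\noperators"))
	p, ok := normalizePriority("nonsense")
	fmt.Printf("%q %v\n", p, ok)
}
```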
root	f971e64745	g2_smoke: accept nomic-embed-text* family members as default Pre-push hook caught the regression — the smoke hardcoded MODEL = "nomic-embed-text" and the bump to nomic-embed-text-v2-moe in 4da32ad failed the gate. Fix: glob-match the family prefix (nomic-embed-text*). Both v1 and v2-moe are 768d drop-ins; the property the smoke is locking is dim + distinct-vectors, not the exact model variant. Operators swap the variant in lakehouse.toml without needing to touch the smoke. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:37:20 -05:00
root	db2e57402e	STATE_OF_PLAY: capture multi-coord stress wave (Phase 1-3 verified) Anchor was last touched at v4 split-threshold; since then the multi-coord stress harness landed end-to-end across 11 commits. Future sessions reading this file need to see the verified state, not derive from git log. Major additions: - New "Multi-coordinator stress test (Phase 1 → 3)" section in VERIFIED WORKING. 11-row capability table covering per-coord playbook isolation, diversity metrics, paraphrase handover, ExcludeIDs swap, fresh-resume two-tier, inbox endpoints, LLM demand parsing, judge re-rating, Langfuse tracing. - Substrate-gains list under that section: ExcludeIDs on SearchRequest, observer.SourceInbox + /observer/inbox, internal/langfuse client, embedd default bumped to v2-moe, two-tier fresh_workers index pattern. - Last-verified bumped to 16:42 CDT on the run #011 anchor. DO NOT RELITIGATE expanded with six new locks: 1. Boost / inject use SEPARATE thresholds (0.5 / 0.20) 2. Multi-coord product theory is empirically VALIDATED 3. Fresh content uses two-tier indexing (fresh_workers) 4. embedd.default_model = nomic-embed-text-v2-moe (don't downgrade) 5. Inbox flow: parse + search + judge + trace 6. Langfuse Go-side client lives at internal/langfuse/ OPEN list refresh: - Removed: re-judge metric (shipped as b13b5cd), adjacent-query as separate item (folded into a single "judge-approves-before-inject" follow-up), liberal-paraphrase (kept). - Added: real-time 48-hour clock, wider Langfuse instrumentation, periodic fresh→main merge job. RECENT VERIFIED WAVE table extended with 11 commits (b13b5cd..5d49967). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:30:04 -05:00
root	5d49967833	multi_coord_stress: full Langfuse coverage — every phase + every call Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept. This commit threads tracing through every phase: baseline / fresh-resume / inbox burst / surge / swap / merge / handover (verbatim + paraphrase) / split / reissue. Each phase is a parent span; each matrix.search / LLM call inside is a child span. Refactor: - One run-level trace is created at driver startup. - New startPhase(name, hour, meta) helper emits a phase span as a child of the run trace; subsequent emitSpan calls nest under it. - New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch with span emission. Every search call site replaced with this so the input/output JSON (query, corpora, k, playbook, exclude_n → top-K ids, top1 distance, boost/inject counts) lands in Langfuse. - Phase 4b's paraphrase generation also emits llm.paraphrase spans. - Phase 1c's existing inline span emission converted to use the new helpers (no more inboxTraceID variable). Run #011 result: trace landed at http://localhost:3001 with 111 observations attached. Span breakdown: phase.* parents: 9 (one per phase that ran) matrix.search.baseline: 10 matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh) observerd.inbox.record: 6 llm.parse_demand: 6 matrix.search.inbox: 6 llm.judge_top1: 6 matrix.search.surge: 12 matrix.search.swap_orig: 1 matrix.search.swap_replace: 1 matrix.search.merge: 6 matrix.search.handover_verbatim: 4 llm.paraphrase: 4 matrix.search.handover_paraphrase: 4 matrix.search.split: 4 matrix.search.reissue: 12 matrix.search.reissue_retrieval_only: 12 ───────────── Total: 111 Browse: http://localhost:3001 → Traces → "multi_coord_stress run" Each phase is a collapsible section showing per-call timing and input/output JSON. Operators can drill into any single retrieval to see exactly what query was issued and what came back. 
All other metrics held: diversity 0.026, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3 at top-1 (two-tier index), 200-worker swap Jaccard 0.000. This is the FULL TEST J asked for — every action in the run visible in Langfuse, full input/output drilldown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:43:32 -05:00
root	08a086779b	multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 Runs #003-#009 surfaced the same finding: fresh workers added mid-run to the main 'workers' vectord index (5K items) reliably absorbed (HTTP 200) but failed to surface in semantic queries even with content-matching prompts. Distances on the verify queries sat at 0.25-0.65 against existing workers; fresh items were beyond top-K. Better embedder (v2-moe) didn't help — distances got TIGHTER on existing items, pushing fresh items further out of reach. Root cause: coder/hnsw incremental adds to a populated graph land in poorly-connected regions and disappear from search traversal. Known property of HNSW post-build adds; not a bug. Fix: two-tier index pattern (canonical NRT search architecture). Fresh content goes to a small "hot" corpus (fresh_workers); main queries include it in the corpora list and merge results. Hot corpus has no recall crowding because it's tiny; periodic batch job (post-G3) merges it into the main index. Implementation: - ensureFreshIndex(hc, gw, name, dim) — idempotent POST /v1/vectors/index. 409 from re-create treated as "already there." - ingestFreshWorker now takes idx parameter so callers can target fresh_workers instead of workers. - multi_coord_stress phase 1b creates fresh_workers index + ingests 3 fresh workers there + searches verifyCorpora=[workers, ethereal_workers, fresh_workers]. Run #010 result: fresh-001 (Senior tower crane rigger NCCCO Chicago) top-1: fresh-001 from fresh_workers, distance 0.143 fresh-002 (Bilingual Spanish/English OSHA trainer Indianapolis) top-1: fresh-002 from fresh_workers, distance 0.146 fresh-003 (FAA Part 107 drone surveyor Chicago) top-1: fresh-003 from fresh_workers, distance 0.129 3/3 fresh workers surface at top-1 — the absorption-but-not-findable issue from runs #003-#009 is closed. 
All other metrics held: diversity 0.007, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4, swap Jaccard 0.000, inbox burst all 6 events accepted + traced to Langfuse. This is the final structural fix for the multi-coord stress suite. Phase 3 is feature-complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:31:45 -05:00
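The two-tier pattern in 08a086779b relies on merging per-corpus results into one ranked list. A minimal sketch of that merge step, assuming a hypothetical `Result` type and `mergeTopK` helper (the real merge lives in the matrix retrieve path and is not shown in this log); the `w-100` and `w-200` IDs and their distances are made up, while the fresh-001 distance is taken from run #010:

```go
package main

import (
	"fmt"
	"sort"
)

// Result is a hypothetical search hit tagged with its source corpus.
type Result struct {
	ID       string
	Corpus   string
	Distance float64
}

// mergeTopK concatenates per-corpus hits, sorts by ascending distance, and
// keeps the best k. The sort is stable, so ties keep corpus order.
func mergeTopK(k int, perCorpus ...[]Result) []Result {
	var all []Result
	for _, hits := range perCorpus {
		all = append(all, hits...)
	}
	sort.SliceStable(all, func(i, j int) bool { return all[i].Distance < all[j].Distance })
	if len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	mainHits := []Result{
		{ID: "w-100", Corpus: "workers", Distance: 0.25},
		{ID: "w-200", Corpus: "workers", Distance: 0.31},
	}
	freshHits := []Result{{ID: "fresh-001", Corpus: "fresh_workers", Distance: 0.143}}
	for _, r := range mergeTopK(2, mainHits, freshHits) {
		fmt.Println(r.ID, r.Corpus, r.Distance)
	}
}
```

Because the hot corpus is searched on its own, a fresh item only has to beat the main index's best distances at merge time, not survive traversal of the large graph.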
root	7e6431e4fd	langfuse: Go-side client + Phase 1c instrumentation The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs); this commit lands Go-side parity so the multi-coord stress harness can emit traces visible at http://localhost:3001. internal/langfuse/client.go: - Minimal Trace + Span + Flush API mirroring what the Rust emitter uses. Auth: Basic over public_key:secret_key. - Best-effort posture: errors are slog.Warn'd, never block calling paths. Same fail-open as observerd's persistor (ADR-005 Decision 5.1) — observability is a witness, not a gate. - Events buffered until 50, then auto-flushed; explicit Flush() at process exit. - Each Trace/Span returns its id so callers can build hierarchies. multi_coord_stress driver wiring: - New --langfuse-env flag (default /etc/lakehouse/langfuse.env). Empty / missing / unparseable file → skip tracing with a logged warning; run still proceeds. - Phase 1c (inbox burst) now emits one parent trace + 4 spans per inbox event: 1. observerd.inbox.record (post to /v1/observer/inbox) 2. llm.parse_demand (qwen2.5 → structured fields) 3. matrix.search (parsed query → top-K) 4. llm.judge_top1 (rate top-1 vs original body) Each span carries input/output JSON + start/end times so the Langfuse UI shows a full waterfall per event. Run #009 result: Trace landed: "multi_coord_stress phase 1c inbox burst" Observations attached: 24 (= 6 events × 4 spans) Tags: stress, phase-1c, inbox Browseable at http://localhost:3001 by tag query. Other harness metrics: diversity 0.016, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4 — all unchanged by the tracing addition (best-effort post in parallel). Phase 1c is the proof-of-concept; future commits can wrap other phases (baseline / merge / handover / split) in traces too. Once that's done, the entire stress run becomes scrubbable in Langfuse without grepping the events JSON. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:25:03 -05:00
root	ce940f4a14	multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much tighter cosine distances (0.05-0.10 in three cases) but lose the "system has no good match" signal that high-distance results give. A coordinator UI showing only distance can't tell wrong-domain matches apart from real ones. Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the LLM-parsed query). Coordinators see both: - distance: how close was retrieval in vector space - rating: does this person actually fit the original ask The pair tells the honest story. Run #008 result on the 6 inbox events: Demand Top-1 Distance Rating Reading ───────────────────────────────────────────────────────────── Forklift Cleveland w-3573 0.29 4 Strong Production Indy e-1764 0.41 3 Adjacent Crane Chicago e-7798 0.23 1 TIGHT BUT WRONG Bilingual safety Indy w-3918 0.05 5 Perfect Drone Chicago e-1058 0.06 5 Perfect (verify e-1058) Warehouse Milwaukee w-460 0.32 4 Strong The crane-Chicago case is the architectural-honesty signal at work: distance 0.23 says "tight match" but the judge says rating 1 reading the original body. A coordinator seeing only distance would ship the wrong worker; coordinator seeing distance+rating sees the disagreement and escalates. Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1 (irrelevant despite tight cosine). The substrate-honesty signal is recovered without losing the LLM-parse quality wins. Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes when judge runs only on top-1 of high-priority inbox events; the search-cost-vs-quality tradeoff lives in the priority gate. Implementation: - New JudgeRating int field on Event (omitempty so non-judged events stay clean in JSON) - New judgeInboxResult helper, reusing the same prompt structure as playbook_lift's judgeRate. The two could share an internal package if a third judge consumer appears. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:16:49 -05:00
root	186d209aae	multi_coord_stress: LLM-parsed inbox demands (qwen2.5) Replaced the hard-coded DemandQuery on inbox events with an actual LLM call: each email/SMS body is parsed by qwen2.5 (format=json, schema-anchored) into structured {role, count, location, certs, skills, shift}. The driver then composes a query string from those fields and runs matrix.search. This is the real-product flow that the Phase 3 stress test was asking for: real bodies → real LLM parsing → real search. Before this commit, the DemandQuery was my hand-crafted string, which made the inbox phase trivial. Run #007 result vs #006 (same bodies, parser swapped): All 6 inbox events parsed cleanly — qwen2.5 nailed: "Need 50 forklift operators in Cleveland OH for Monday day shift. OSHA-30 + active forklift cert required." → {role:"forklift operator", count:50, location:"Cleveland, OH", certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"} Other 5 similarly faithful (indy stayed as "indy", count defaulted to 1 when unspecified, no hallucinated fields). LLM-parsed queries produced TIGHTER matches than hard-coded: Demand #006 dist #007 dist Δ Crane Chicago 0.499 0.093 -82% Drone Chicago 0.707 0.073 -90% Bilingual safety 0.240 0.048 -80% Forklift Cleveland 0.330 0.273 -17% Production Indy 0.260 0.399 +53% Warehouse Milwaukee 0.458 0.420 -8% Three matches landed at distance < 0.10 — verbatim-replay-tight territory. Structured queries embed sharper than conversational hand-crafted strings. Other metrics unchanged: diversity 0.000, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4. Tradeoff worth flagging: the drone-Chicago case dropped from distance 0.71 (clear "we don't have one") to 0.07 (confident match returned). The OOD honesty signal weakens when LLM-parsed structure makes any closest-neighbor look tight. 
Future Phase 4 work: judge re-rates the top match before surfacing, so coordinators see "your demand was for X but the closest match scored 2/5" rather than just the worker ID + distance. Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5). Production would amortize via a small dedicated parser model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:51:19 -05:00
root	e7fc63b216	observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events) Phase 3 ask: real-world inbox-style event injection during the stress test. Coordinators in production receive emails + SMS that trigger contract responses; the substrate has to RECORD these signals AND react with a search using the embedded demand. This commit lands the endpoint and exercises it end-to-end in the stress harness. observerd surface: - New POST /observer/inbox route — accepts {type, sender, subject, body, priority, tag} and records as ObservedOp with Source=SourceInbox. Type must be email|sms; body required; priority defaults to medium. The handler ONLY records — downstream triggers (search, ingest, etc.) are the caller's concern, recorded separately. Keeps the witness role pure. - New observer.SourceInbox = "inbox" alongside SourceMCP / SourceScenario / SourceWorkflow. - Three contract tests on the new route (happy path / bad type / empty body), router-mount test extended, all green. Stress harness phase 1c (Hour 9): - 6 inbox events fire in priority order (urgent → high → medium): 2 urgent emails (forklift Cleveland, production Indianapolis) 1 high email (crane Chicago) 1 high sms (bilingual safety Indianapolis) 1 medium sms (drone Chicago) 1 medium email (warehouse Milwaukee FYI) - Each event: 1. POSTs to /v1/observer/inbox (recorded by observerd) 2. Triggers matrix.search using a parsed demand (the demand extraction is hard-coded for now; production needs a small LLM to parse from body) 3. 
Captures both as events in the run JSON Run #006 result (with v2-moe embedder + all phases including inbox): Diversity: Same-role-across-contracts Jaccard = 0.000 (n=9) Different-roles-same-contract Jaccard = 0.046 (n=18) Determinism: 1.000 Verbatim handover: 4/4 (100%) Paraphrase handover: 4/4 (100%) Inbox burst: 6/6 events accepted by observerd (200 status, all recorded) 6/6 triggered searches produced distinct top-1 worker IDs distance distribution: 0.24 (Indy production) → 0.71 (Chicago drone surveyor — honest stretch since drones aren't in the 5K-worker corpus, system surfaces closest neighbor at high distance rather than fabricating) The drone-Chicago case is the architectural-honesty signal: when the demand asks for a specialist NOT in the roster, the system returns the closest semantic neighbor with a distance that flags "this is a stretch." Coordinators reading distances see "we don't have a great match here" rather than a confident wrong answer. Total events captured: 67 (was 61 pre-inbox). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:34:36 -05:00
root	4da32ad102	embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) Local Ollama has three embedding models loaded: nomic-embed-text:latest 137M 768d (previous default) nomic-embed-text-v2-moe:latest 475M 768d (this commit's default) qwen3-embedding:latest 7.6B 4096d (would require dim change) v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE architecture. Workers index doesn't need rebuilding, just future ingests embed with the stronger model. Run #005 result on the multi-coord stress suite: Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9) → MoE is more discriminating: zero worker overlap across Milwaukee / Indianapolis / Chicago for shared role names. The geo + cert + skill context fully separates worker pools. Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff) Determinism: 1.000 (unchanged) Verbatim handover: 4/4 (100%) Paraphrase handover: 4/4 (100%) 200-worker swap: Jaccard 0.000 (unchanged — still perfect) Fresh-resume verify: STILL doesn't surface fresh workers in top-8. With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39) — the embedder is MORE discriminating, but the fresh worker's vector still doesn't outrank the 8th-best existing worker. Now suspect of being an HNSW post-build add issue (coder/hnsw incremental adds can land in hard-to-reach graph regions, not an embedder problem). Better embedder didn't fix it; needs a different strategy: full index rebuild after fresh adds, or explicit playbook-layer score boost for fresh workers, or hybrid (keyword + semantic) retrieval. Phase 3 investigation. Cost: ingest is ~5× slower (workers 20s→100s; ethereal 35s→112s). Acceptable for the quality jump on diversity. Real production with incremental ingest won't pay this once-per-deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:26:52 -05:00
root	84a32f0d29	multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap Three Phase 2 additions land in this commit: 1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific worker IDs out of results post-retrieval, AND skips them at the playbook boost+inject step (so excluded answers can't sneak back via Shape B). Real-world driver: coordinator placed N workers, client asks for replacements, system needs alternatives, not the same N. Threaded through retrieve.go after merge but before metadata filter so excluded IDs don't waste post-filter top-K slots. 2. New harness phase 2b: 200-worker swap simulation. Captures the top-K from alpha's warehouse query, then re-issues with exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether the substrate finds genuine alternatives. 3. New harness phase 1b: fresh-resume mid-run injection. Three new workers ingested via /v1/embed + /v1/vectors/index/workers/add, then verified findable via semantic queries matching resume content. Plus Hour labels on every event (operational narrative: 0/6/12/18/ 24/30/36/42/48) and a refactor of captureEvent to take hour as a param. Run #003 + #004 results (5K workers + 10K ethereal): Diversity (#004): Same-role-across-contracts Jaccard = 0.080 (n=9) Different-roles-same-contract Jaccard = 0.013 (n=18) Determinism: 1.000 (#004 unchanged) Verbatim handover: 4/4 = 100% Paraphrase handover: 4/4 = 100% Phase 2b — 200-worker swap (Jaccard 0.000): 8 originally-placed workers fully replaced by 8 alternatives. ExcludeIDs substrate change works end-to-end — boost AND inject both honor the exclusion, so excluded workers don't return via the playbook either. Phase 1b — fresh-resume injection: REAL PRODUCT FINDING. Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add calls at 200 status, 3 vectors persisted. But none of the 3 fresh workers surfaced in top-8 even with semantic queries matching their resume content (e.g. 
"Senior tower crane rigger NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12 years tower-crane signaling..." NCCCO + Chicago). Top-1 came from existing workers at distance ~0.25; fresh workers' distances must be > 0.25, pushing them past rank 8. Cause: dense retrieval at 5000+ workers means many existing profiles cluster near any specific query in cosine space; nomic-embed-text (137M) introduces enough noise that a fresh worker doesn't reliably outrank them just because the text content overlaps. Workarounds (Phase 3 work): (a) hybrid retrieval (keyword + semantic), (b) playbook-layer score boost for fresh adds, (c) larger embedder. Documented in run #004 report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:19:29 -05:00
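The Phase 2b swap check in 84a32f0d29 combines two small pieces: filtering excluded IDs before top-K truncation, and scoring overlap with a Jaccard index. A rough sketch, assuming hypothetical helpers `excludeThenTopK` and `jaccard` and made-up worker IDs (the harness's actual metric code is not shown in this log):

```go
package main

import "fmt"

// excludeThenTopK drops excluded IDs from a ranked list and then truncates,
// so excluded entries never occupy a top-K slot.
func excludeThenTopK(ranked, exclude []string, k int) []string {
	skip := make(map[string]bool, len(exclude))
	for _, id := range exclude {
		skip[id] = true
	}
	var out []string
	for _, id := range ranked {
		if skip[id] {
			continue
		}
		out = append(out, id)
		if len(out) == k {
			break
		}
	}
	return out
}

// jaccard returns |A ∩ B| / |A ∪ B| over the distinct IDs in a and b.
// Two empty inputs score 0 by convention here.
func jaccard(a, b []string) float64 {
	setA := make(map[string]bool, len(a))
	for _, id := range a {
		setA[id] = true
	}
	setB := make(map[string]bool, len(b))
	for _, id := range b {
		setB[id] = true
	}
	inter := 0
	for id := range setA {
		if setB[id] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	ranked := []string{"w-1", "w-2", "w-3", "w-4", "w-5", "w-6"}
	placed := excludeThenTopK(ranked, nil, 3)
	swap := excludeThenTopK(ranked, placed, 3)
	fmt.Println(placed, swap, jaccard(placed, swap))
}
```

A Jaccard of 0 between the original and swap lists means every placed worker was replaced by a different one, which is what the commit reports for the 200-worker swap.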
root	0fa42a0cc3	multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover Phase 1 had two known gaps: (1) the 3 contracts had zero shared role names, so same-role-across-contracts Jaccard was vacuous (n=0); (2) the verbatim handover at 100% was the trivial case, not the hard learning test (paraphrased queries against another coord's playbook). Both fixed in this commit. Contract redesign — all 3 contracts now share warehouse worker / admin assistant / heavy equipment operator roles, plus a unique specialist per contract (industrial electrician / bilingual safety coord / drone surveyor — the "specialist not on the standard roster" case from J's spec). Counts and skill mixes vary per region. New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased versions of Alice's contract queries against Alice's playbook namespace. Tests whether institutional memory propagates across coordinators AND across natural wording variation that Bob would introduce when running Alice's contract. Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3 coords + paraphrase handover): Diversity (the question J asked: locking or cycling?): Same-role-across-contracts Jaccard = 0.119 (n=9) → 88% of workers DIFFER across regions for the same role name. Milwaukee warehouse vs Indianapolis warehouse vs Chicago warehouse pull mostly distinct top-K from the same population. The system locks into geo+cert+skill context, not cycling. Different-roles-same-contract Jaccard = 0.004 (n=18) → role-specific retrieval works (unchanged from Phase 1). Determinism: Jaccard = 1.000 (n=12) — unchanged. Learning: Verbatim handover 4/4 = 100% (trivial case, expected) Paraphrase handover 4/4 = 100% (HARD case — passes!) 
Of those 4 paraphrase recoveries: - 2 used boost (Alice's recording was already in Bob's paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1) - 2 used Shape B inject (recording wasn't in Bob's paraphrase top-K; InjectPlaybookMisses brought it in) The boost/inject mix is healthy — both paths are used and both produce correct top-1s. Multi-coord institutional memory propagation is empirically working under wording variation. Sample warehouse worker top-1s across contracts (proves diversity): alice / Milwaukee → w-713 bob / Indianapolis → e-8447 carol / Chicago → e-7145 Three different workers from the same 15K-person population, selected on geo+cert+skill context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:03:16 -05:00
root	61c7b55e48	multi-coord stress harness — Phase 1 of 48-hour mock Three coordinators (alice / bob / carol) with three contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction). 7-phase scenario runner: baseline → surge → merge → handover → split → reissue → analysis. Each coord has a separate playbook namespace (playbook_{name}) so institutional memory stays isolated by default but transferable on demand. Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints, and Langfuse tracing — those are Phase 2/3. Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors): Diversity: Different-roles-same-contract Jaccard = 0.004 (n=18) → role-specific retrieval is working perfectly. Different roles within one contract pull totally different worker pools. System is NOT cycling; locks into per-role retrieval. Same-role-across-contracts Jaccard = N/A (n=0) → TEST-DESIGN ISSUE: the 3 contracts use distinct role names per industry (warehouse worker / production worker / general laborer), so no exact-name overlaps exist. Phase 2 should either share at least one role across contracts OR add a skill-based diversity metric. Determinism: Jaccard = 1.000 (n=12) → HNSW + Ollama retrieval is fully deterministic on identical query text. coder/hnsw + nomic-embed-text are stable. Learning: handover hit rate = 4/4 = 100% → Bob inherits Alice's recordings perfectly when bob runs identical queries with alice's playbook namespace. CAVEAT: this tests the trivial verbatim case, not paraphrase handover. The harder test (bob runs paraphrased queries with alice's playbook) is Phase 2 work. Per-event capture in JSON: every matrix.search response is logged with phase / coordinator / contract / role / query / top-K IDs + distances + per-corpus counts + boosted/injected counts. 
Reviewable via: jq '.events[] | select(.phase == "merge")' jq '.events[] | select(.coordinator == "alice")' jq '.events[] | select(.role == "warehouse worker")' Notable finding from per-event: carol's "general laborer" and "crane operator" queries both surface w-1009 as top-1, with crane operator at distance 0.098 (very tight) and general laborer at 0.297. The system found a worker who legitimately covers both roles — realistic for small construction crews. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 07:55:29 -05:00
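The Jaccard figures quoted in the harness commits above (diversity, determinism, and the 0.000 swap result in 84a32f0d29) compare sets of top-K worker IDs. What follows is a minimal sketch of that comparison together with an ExcludeIDs-style filter. The `Hit` type and the helper names `filterExcluded`, `ids`, and `jaccard` are assumptions made for illustration; the real `matrix.SearchRequest` plumbing in retrieve.go does not appear in this log.

```go
package main

import "fmt"

// Hit is a hypothetical search result; the real matrix types are not shown in the log.
type Hit struct {
	ID       string
	Distance float64
}

// filterExcluded drops hits whose ID appears in excludeIDs, preserving rank order.
func filterExcluded(hits []Hit, excludeIDs []string) []Hit {
	skip := make(map[string]struct{}, len(excludeIDs))
	for _, id := range excludeIDs {
		skip[id] = struct{}{}
	}
	kept := make([]Hit, 0, len(hits))
	for _, h := range hits {
		if _, excluded := skip[h.ID]; !excluded {
			kept = append(kept, h)
		}
	}
	return kept
}

// ids extracts the worker IDs from a result list.
func ids(hits []Hit) []string {
	out := make([]string, 0, len(hits))
	for _, h := range hits {
		out = append(out, h.ID)
	}
	return out
}

// jaccard returns |A ∩ B| / |A ∪ B| with both lists treated as sets.
// Two empty sets yield 0 here; that convention is an assumption.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	inter := 0
	for id := range setB {
		if _, ok := setA[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	// Illustrative data only, not taken from any run.
	merged := []Hit{
		{ID: "w-1", Distance: 0.10}, {ID: "w-2", Distance: 0.12},
		{ID: "w-3", Distance: 0.15}, {ID: "w-4", Distance: 0.20},
	}
	orig := merged[:2] // the originally placed top-2
	swap := filterExcluded(merged, ids(orig))
	if len(swap) > 2 {
		swap = swap[:2]
	}
	fmt.Println(ids(swap), jaccard(ids(orig), ids(swap)))
}
```

Filtering before the top-K cut, as in `main`, mirrors the commit's stated reason for placing the exclusion ahead of the metadata filter: excluded IDs never occupy result slots.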
root	b13b5cd7a1	playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14% The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't distinguish "Shape B surfaced a strictly-better answer" from "Shape B shuffled ranks but quality is unchanged" from "Shape B replaced a good answer with a wrong one." This commit adds Pass 4: judge warm top-1 with the same prompt as cold ratings, then bucket the comparison. Implementation: - New --with-rejudge driver flag (default off). - New WITH_REJUDGE harness env (default 1, on for prod runs). - queryRun gains WarmTop1Metadata (cached during Pass 2 for the rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no rejudge, 0..5 = rating). - summary gains RejudgeAttempted, QualityLifted, QualityNeutral, QualityRegressed (counts of warm-rating > / == / < cold-rating). - Markdown headline gains a Quality block when rejudge ran. - ~21 extra judge calls (~30s on qwen2.5). Run #005 result (split inject threshold 0.20 + paraphrase + rejudge): Quality lifted 5 / 21 (24%) — 3× +2 rating, 2× +1 rating Quality neutral 13 / 21 (62%) — includes OOD queries holding 1 Quality regressed 3 / 21 (14%) Net rating delta +3 across 21 queries (+0.14 average) The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4 warm — Shape B took mediocre matches and substituted substantively better ones. The 3 regressions were small (-1, -1, -3). Q11 is the cautionary tale: cold top-1 "production line worker" (rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator" e-5729 (rating 1). Adjacent-domain cross-pollination — production worker and forklift operator embed within 0.20 cosine because both are warehouse-adjacent staffing queries, even though the judge correctly distinguishes them. The split-threshold defense (0.5 boost / 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed neutral at rating 1) but not adjacent-domain cross-pollination. 
Net product verdict: working, net-positive on quality, but the worst case (Q11 4→1) is customer-visible and warrants a tighter inject threshold OR an additional gate beyond cosine distance. Filed in STATE_OF_PLAY OPEN as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 07:42:04 -05:00
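The Pass 4 bucketing described in b13b5cd7a1 can be sketched as follows. Only `WarmTop1Rating *int` and the four summary counters are named in the commit message; the `ColdTop1Rating` field, the `NetDelta` field, and the `bucket` and `rating` helpers are hypothetical names introduced for this illustration.

```go
package main

import "fmt"

// queryRun carries only what the bucketing needs.
// ColdTop1Rating is an assumed field name; WarmTop1Rating follows the commit message.
type queryRun struct {
	ColdTop1Rating int
	WarmTop1Rating *int // nil = no rejudge, 0..5 = rating
}

// summary holds the counters named in the commit, plus an assumed NetDelta.
type summary struct {
	RejudgeAttempted int
	QualityLifted    int
	QualityNeutral   int
	QualityRegressed int
	NetDelta         int // sum of (warm - cold) over rejudged queries
}

// rating is a small helper for building *int values.
func rating(v int) *int { return &v }

// bucket compares warm and cold ratings for every run that was rejudged.
func bucket(runs []queryRun) summary {
	var s summary
	for _, r := range runs {
		if r.WarmTop1Rating == nil {
			continue // rejudge did not run for this query
		}
		s.RejudgeAttempted++
		delta := *r.WarmTop1Rating - r.ColdTop1Rating
		s.NetDelta += delta
		switch {
		case delta > 0:
			s.QualityLifted++
		case delta < 0:
			s.QualityRegressed++
		default:
			s.QualityNeutral++
		}
	}
	return s
}

func main() {
	// Illustrative ratings only, not the run #005 data.
	runs := []queryRun{
		{ColdTop1Rating: 2, WarmTop1Rating: rating(4)}, // lifted
		{ColdTop1Rating: 4, WarmTop1Rating: rating(1)}, // regressed
		{ColdTop1Rating: 1, WarmTop1Rating: rating(1)}, // neutral
		{ColdTop1Rating: 3},                            // not rejudged
	}
	fmt.Printf("%+v\n", bucket(runs))
}
```

The pointer keeps a missing rejudge distinct from a genuine rating of 0. The run #005 net delta follows the same arithmetic: 3×(+2) + 2×(+1) − 1 − 1 − 3 = +3, or about +0.14 per query over 21 queries.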
root	87cbd10090	STATE_OF_PLAY: v4 split-threshold result + adjacent-query observation - Reality test table extends from #001-#003 to #001-#004; v4 row marked as "the honest configuration" because OOD cross-pollination is gone. - Shape B section gains the split-threshold rationale (boost safe at loose, inject structurally riskier so tighter). - Verbatim drop framing rewritten — v3→v4 is configuration evolution, not regression. - OPEN: closed "Shape B cap/decay" + the conditional Q15 boost-math item (Shape B + split threshold addressed both). Replaced with two finer-grained follow-ups: adjacent-query Q6↔Q7 swap (might be correct, verify with v4 re-judge metric) and liberal-paraphrase recovery loss (Q9/Q15 missed because qwen2.5 drifted >0.20). - RECENT VERIFIED WAVE adds 94fc3b6 + 67d1957. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 07:26:23 -05:00
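The split-threshold idea (0.5 for boost, 0.20 for inject) might look roughly like the gate below. This is a sketch under stated assumptions: the thresholds are read as cosine distance between the incoming query and the recorded playbook query, the comparison is taken as inclusive, and `decide`, `playbookAction`, and the constant names are hypothetical. The actual logic in ApplyPlaybookBoost and InjectPlaybookMisses is not shown in this log.

```go
package main

import "fmt"

// Thresholds as quoted in the commit messages; reading them as inclusive
// upper bounds on query-to-recording cosine distance is an assumption.
const (
	boostMaxDistance  = 0.50
	injectMaxDistance = 0.20
)

// playbookAction is a hypothetical label for what the playbook layer does.
type playbookAction string

const (
	actionNone   playbookAction = "none"
	actionBoost  playbookAction = "boost"
	actionInject playbookAction = "inject"
)

// decide picks an action for one recorded answer.
// alreadyInTopK reports whether retrieval already returned the recorded answer:
// re-ranking an existing hit is allowed at a loose threshold, while bringing in
// a missing one requires the tighter threshold.
func decide(queryDistance float64, alreadyInTopK bool) playbookAction {
	switch {
	case alreadyInTopK && queryDistance <= boostMaxDistance:
		return actionBoost
	case !alreadyInTopK && queryDistance <= injectMaxDistance:
		return actionInject
	default:
		return actionNone
	}
}

func main() {
	fmt.Println(decide(0.35, true))  // recorded answer is in top-K, loose gate applies
	fmt.Println(decide(0.35, false)) // recorded answer is missing, too far to inject
	fmt.Println(decide(0.15, false)) // recorded answer is missing, close enough to inject
}
```

A gate of this shape still lets through an adjacent-domain recording that sits within 0.20 of the query, which matches the Q11 regression and the log's suggestion of an additional gate beyond cosine distance.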

131 Commits