golangLAKEHOUSE

Author	SHA1	Message	Date
root	c164a3da96	g5 cutover: production load test — 0 errors / 101k req · Go direct = 2,772 RPS Sustained-traffic load test against the cutover slice. Three runs, zero correctness errors across 101,770 total requests. Substrate holds up under concurrent load — matrix gate, vectord HNSW, embedd cache, gateway proxy all hold. This was the load test's primary question; latency numbers are secondary. scripts/cutover/loadgen — focused Go load generator. 6-query rotating body mix (Forklift/CNC/Warehouse/Picker/Loader/Shipping). Configurable URL/concurrency/duration. Reports per-status-code counts + p50/p95/p99 latencies + JSON summary on stderr. Three runs: baseline (Bun → Go, conc=1, 10s): 4,085 req · 408 RPS · p50 1.3ms · p99 32ms · max 215ms sustained (Bun → Go, conc=10, 30s): 14,527 req · 484 RPS · p50 4.6ms · p99 92ms · max 372ms direct (→ Go, conc=10, 30s): 83,158 req · 2,772 RPS · p50 2.5ms · p99 8.5ms · max 16ms Critical findings: 1. ZERO correctness errors across 101k requests. No 5xx, no transport errors, no panics. Concurrency-safety verified across matrix gate / vectord / gateway / embedd cache. 2. Direct-to-Go is production-grade. 2,772 RPS at p99 8.5ms on a single host, no scaling cliff at concurrency=10. 3. Bun frontend is the bottleneck. -82% RPS, +982% p99 vs direct. Single-process JS event loop queueing under concurrent requests — known Bun proxy-mode characteristic. The substrate itself isn't the limiter. 4. For staffing-domain demand levels (<1 RPS typical per coordinator), Bun-fronted 484 RPS has 480× headroom. No urgency to optimize Bun out of the data path. If/when concurrent demand grows orders of magnitude, the path is nginx → Go direct for hot endpoints, skip Bun. Substrate is now load-tested and verified production-ready. What this load test does NOT cover (documented in g5_load_test.md): cold-cache embed, larger corpus, mixed read/write, multi-host, full 5-loop traffic with judge gate calls. Each is its own probe shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 04:20:41 -05:00
root	4fd560cad6	start_go_stack.sh: third isolation layer (port range :4xxx for persistent) Earlier push exposed the gap in the previous 2-layer isolation: smokes still failed because they tried to bind :3211-:3220 which my persistent stack already had. Smoke catalogd's bind-failure went undetected because poll_health 3212 succeeded responding to the persistent catalogd, and smoke proceeded against the wrong backend with the wrong bucket expectations. Fix: persistent stack now uses :4110 + :4211-:4219 via additional sed in the temp toml (bind addresses + upstream URLs). Smoke harnesses keep :3110 + :3211-:3219. Both reach the SAME chatd at :3220 because chatd is read-mostly (no state to clobber) and operators don't want to maintain two LLM provider key sets. Three isolation layers now in effect: 1. Binary names (bin/persistent-* via symlinks) 2. MinIO buckets (lakehouse-go-persistent vs lakehouse-go-primary) 3. Port range (:4xxx vs :3xxx, with shared chatd on :3220) Verified pre-push: - 11 persistent ports listening on :4xxx + :3220 - 0 smoke ports listening on :3110-:3219 (free for smokes) Pushed while persistent stack live — first cross-isolation test (no port collision, no bucket collision, no name collision). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 03:26:41 -05:00
root	c48b58ff8d	start_go_stack.sh: 2-layer isolation from smoke harness The 2026-05-01 persistent-stack milestone exposed two collision modes between the long-running Go stack and the pre-push smoke harness: 1. PKILL COLLISION: smoke teardown uses anchored `pkill -f "bin/(storaged\|...\|gateway)$"`. Same-named persistent processes match → smokes kill 7 of 11 persistent daemons. 2. MINIO STATE COLLISION: persistent stack writes `_vectors/workers.lhv1` to the shared lakehouse-go-primary bucket. Smoke vectord rehydrates from same bucket → sees both smoke-owned and persistent-owned indexes → assertion failures. Both fixed in this commit by adding two isolation layers: LAYER 1 — distinct binary names via symlink: bin/persistent-<daemon> → bin/<daemon> Persistent stack runs as ./bin/persistent-gateway etc. Smoke pattern `bin/(name)$` matches `bin/gateway$` but NOT `bin/persistent-gateway$` (regex group requires bin/ followed immediately by a daemon name; "bin/p..." doesn't qualify). Cmdline lookup verified: 7 persistent procs, 0 match smoke pkill. LAYER 2 — separate MinIO bucket via temp config: Persistent stack writes to lakehouse-go-persistent (configurable via $LH_PERSISTENT_BUCKET). Temp toml at /tmp/lakehouse-persistent.toml inherits everything from lakehouse.toml except [s3].bucket which is sed-replaced. Bucket auto-created via mc if missing. Verified: workers.lhv1 lands in persistent bucket; primary bucket _vectors/ stays empty. Net effect: the persistent stack should survive `git push` (which runs smokes that rehydrate vectord from primary bucket and pkill their own bin/<name>$ daemons). This commit is the first push test WITH the persistent stack live. Caveat: bin/persistent-* symlinks are gitignored already (/bin/ is in .gitignore wholesale), so the symlinks need to be created on each fresh checkout — which start_go_stack.sh does idempotently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 03:20:00 -05:00
root	54b2e7db76	start_go_stack.sh: document smoke-vs-persistent-stack pkill conflict Caught immediately after the prior commit pushed: pre-push smokes killed 7 of 11 persistent Go daemons because the smokes' anchored `pkill -f "bin/(name)$"` teardown matches ANY process named `bin/<daemon>`, not just the smokes' own children. Documented in the script header as a KNOWN CONSTRAINT with a workaround (re-run start_go_stack.sh after every push) and a proper-fix sketch (give the persistent stack a different binary name via build tag or symlink). Proper fix deferred until trigger fires — operators living through this once will know to want it. Persistent stack restored (all 11 healthy as of this commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:56:52 -05:00
root	09904d5222	cutover: persistent Go stack milestone — first long-running deployment + first Go-emitted audit_baselines entry J's "let's go" instruction: leave OPEN list behind, push the Go substrate forward into actual deployment shape. This commit marks the first time the Go side has run as long-running daemons rather than per-harness transient processes, and the first time the shared cross-runtime longitudinal log has carried a Go-emitted entry alongside the Rust ones. What landed: scripts/cutover/start_go_stack.sh — the persistent-stack runbook. Brings up all 11 daemons (storaged → catalogd → ingestd → queryd → embedd → vectord → pathwayd → observerd → matrixd → gateway, plus chatd-if-not-already-up) in dependency order via nohup + disown. Anchored pkill per feedback_pkill_scope (never bare "bin/"). Logs land in /tmp/gostack-logs/<bin>.log, one per daemon. Verified live state: - All 11 services healthy on :3110 + :3211-:3220 - gateway → embedd proxy returns nomic-embed-text-v2-moe vectors - chatd reports 5/5 providers loaded - No port collision with Rust gateway on :3100 - Daemons stay up after exit of the start script (production shape, not harness-transient) audit_baselines.jsonl crosses the runtime boundary: - 7 Rust-emitted entries (last: ca7375ea 2026-04-27) - 1 Go-emitted entry (ee2a40c 2026-05-01T07:53:54Z) appended via ./bin/audit_full -append-baseline - Same envelope shape, same metric set, same drift comparator semantics — operators running either runtime grow the same log What this DOES prove: - Substrate parity at deployment shape (not just unit tests) - Cross-runtime artifact write-side compatibility (was previously proven on read side via audit_baselines roundtrip) - The deploy machinery works end-to-end for the persistent case What this does NOT prove (still ahead): - Real coordinator traffic against the Go stack (no nginx flip yet; devop.live/lakehouse/ still serves through Rust) - Go-side production materializer (Phase 2 is observer-only) - Replay tool parity (Phase 7 is observer-only) - The 5-loop product gate against actual humans reports/cutover/SUMMARY.md now logs three new rows: - audit-FULL with 12/12 phases ported - First Go-emitted audit_baselines entry - Persistent Go stack live Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:55:29 -05:00
root	0d4f033b34	audit_baselines: round-trip validation against live Rust data Same shape of proof as embed_parity.sh for the embed endpoint: take the just-shipped Go port (ca142b9) and validate it against the actual production data the Rust legacy emits, not just unit- test fixtures. Locks the cross-runtime parity that operators running mixed pipelines depend on. scripts/cutover/audit_baselines_validate.go: - Reads /home/profit/lakehouse/data/_kb/audit_baselines.jsonl - Parses every entry via the Go AuditBaseline struct - Round-trips the last entry: encode → decode → field-by-field equality check (catches any silently-dropped JSON keys) - Calls LoadLastBaseline against the live file (proves the public API works on real shapes, not just inline parsing) - Computes BuildAuditDriftTable(first → last) — full-window lineage drift over the captured baselines Live-data probe results (reports/cutover/audit_baselines_roundtrip.md): - 7 entries parse without error - Round-trip is byte-equal on every metric + every header field - Drift table fires the expected verdicts: - p2_evidence_rows 12→82 (+583%) → warn (above 20% threshold) - p3_accepted/partial/rejected/human 0→non-zero → warn (the zero-baseline edge case TestBuildAuditDriftTable_ZeroBaseline was designed to lock — verified now firing on real history) - p4_* metrics +0% → ok (stable across the window) What this does NOT prove (documented in the report): the Go-side audit-FULL pipeline that PRODUCES baselines doesn't exist yet. Only the load/append/drift substrate is ported. Operators running audit-full from Go would still need a metric-collection pass — that's a separate port deliberately not in this wave. reports/cutover/SUMMARY.md gains a new row alongside the embed parity entries; cutover-prep verification log keeps the discipline of "verified against real data, not just fixtures." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:20:18 -05:00
root	b216b7e5b6	fix the other 4: close all OPEN-list items in one wave Substantial wave addressing all 4 prior OPEN items. Three closed in full, one partially (the speculative half deliberately deferred). OPEN #1 — Periodic fresh→main index merge (FULL): - POST /v1/vectors/index/{src}/merge with {dest, clear_source} - Idempotent on re-runs (existing-in-dest items skipped) - internal/vectord/index.go: new Index.IDs() snapshot method + i.ids tracker field as canonical ID set, independent of meta map's nil-vs-{} sparseness (was a real bug — IDs() backed by meta alone missed items added with nil metadata) - 4 cmd-level integration tests (happy path drain+clear, dim mismatch, dest not found, self-merge rejection) + 1 unit test - DecodeIndex backward-compat: old envelopes restore i.ids from meta keys (best effort; new items going forward use the tracker) OPEN #2 — Distillation SFT export (SUBSTRATE): - internal/distillation/sft_export.go ports the load-bearing half: IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/ MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft. - Synthesis (instruction/input/response generation) deferred to a separate wave — too big for this session, but the substrate makes the next wave a port-not-design exercise. - TestSftNever_PinsExpectedSet locks the contamination firewall set: if a future commit adds/removes from SftNever, this test fails — forcing the change through review. - 5 new tests; firewall fires end-to-end through the partial port. OPEN #3 — Distribution drift via PSI (FULL): - internal/drift/drift.go: ComputeDistributionDrift via Population Stability Index. Standard finance/risk metric, well-defined verdict tiers (stable < 0.10, minor 0.10–0.25, major ≥ 0.25). - Equal-width bucketing over combined min/max so neither dist falls outside; epsilon-clamping for empty buckets so log doesn't blow up. Per-bucket breakdown for drilldown. - Pairs with the existing ComputeScorerDrift: scorer drift is categorical, distribution drift is continuous. Different shapes, same package. - 7 new tests covering identical-is-stable, hard-shift-is-major, moderate-detected-not-stable, empty-inputs-safe, all-identical- safe, bucket-counts-conserved, num-buckets-clamping. OPEN #4 — Ops nice-to-haves (PARTIAL — wall-clock done, others deferred): - (a) Real-time wall-clock for stress harness: per-phase elapsed time logged to stdout as it runs (`[stress] phase NAME starting (T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`). Output.PhaseTimings + Output.TotalElapsedMs in JSON. - (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase calibration: not actioned — no fired trigger, would be speculative. Documented as deferred-until-need rather than ignored. Per the project's discipline ("don't add features beyond what the task requires"). OPEN list now empty / steady-state. Future items will land as production triggers fire. Build + vet + tests green; 18 new tests across the 4 closures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 23:42:11 -05:00
root	356d76b4b0	multi_coord_stress: thread role through matrix retrieve + playbook record Real wire-up gap discovered post-scrum: Demand.Role was already extracted at every call site in multi_coord_stress (44 occurrences, both contract-driven and LLM-parsed inbox-triggered paths), but neither matrixSearch nor playbookRecord accepted role in their signatures. Cross-role gate (real_001..real_004 work) was bypassed for the entire multi-coord harness — recordings and queries went through with empty role, gate fell back to lenient behavior. Fix: - matrixSearchReq gains query_role field - matrixSearch signature: (..., query, role string, ...) - tracedSearch wrapper gains role param + emits it in span input metadata for Langfuse visibility - playbookRecord signature: (..., query, role, ...) — body emits role only when non-empty (preserves backward compat at API) - 14 call sites updated: contract-driven Demand loops → d.Role LLM-parsed inbox path → parsed.Role (qwen2.5 already extracts it) swap path (warehouseDemand) → warehouseDemand.Role reissue path → ev.Role (captured at original event time) fresh-verify (resume snippet, no role concept) → "" Build clean, vet clean, all tests pass. Cross-role gate now fires end-to-end across the multi-coord harness — matches the playbook_lift harness's coverage from the original real_001 fix. This closes the symmetric gap to scripts/playbook_lift's existing wire-through. Both production-shape harnesses now exercise the role gate; future reality tests automatically inherit the protection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 23:10:49 -05:00
root	0331288641	playbook_lift: LLM-based role extractor closes shorthand bleed (real_004) real_003 left a known-weak hole: shorthand-style queries ("{count} {role} {city} {state} ...") have no separator between role and city, so a regex can't reliably extract — leaving the cross-role gate disabled when both record AND query are shorthand. This commit adds a roleExtractor with regex-first + LLM fallback: - Regex first (fast, deterministic) — handles need + client_first + looking from real_003b. ~75% of styles, no LLM cost paid. - LLM fallback when regex returns empty AND model is configured — Ollama-shape /api/chat with format=json, schema-tight prompt, temperature 0. ~1-3s on local qwen2.5. - Per-process cache — paraphrase + rejudge passes reuse the same query 4× per run; cache prevents 4× LLM cost. - Off-by-default — opt-in via -llm-role-extract flag (CLI) and LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping config unchanged unless explicitly enabled. 8 new tests in scripts/playbook_lift/main_test.go: - TestRoleExtractor_RegexFirst: LLM not called when regex matches - TestRoleExtractor_LLMFallback: shorthand goes to LLM - TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved - TestRoleExtractor_Cache: 3 calls = 1 LLM hit - TestRoleExtractor_NilSafe: nil receiver runs regex only - TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths - TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic witness for the real_003 scenario — both record + query are shorthand, regex returns "" for both, LLM produces DIFFERENT role tokens for CNC vs Forklift, so matrix gate's cross-role rejection (locked separately in TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires correctly. This is the load-bearing verification. Reality test real_004 ran the same 40-query stress as real_003 with LLM extraction on. Cross-style same-role boosts fired correctly across all 4 styles for Loaders + Packers + Shipping Clerk clusters (including shorthand → other-style transfer). No cross-role bleed observed. The reality test alone can't be a clean "with vs without" comparison (HNSW build is non-deterministic across runs, and real_004 stochastics didn't trigger a shorthand recording at all), which is why the unit-test witness exists. Production note (in real_004_findings.md): LLM extraction is for reality-test coverage of arbitrary query shapes. Production should extract role at INGEST time (when the inbox parser already runs an LLM) and pass already-resolved role through requests — same shape as multi_coord_stress's existing Demand{Role: ...} model. The hot path should never need the harness extractor's per-query LLM cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 22:51:27 -05:00
root	3263254f1c	reality_test real_003: 40-query paraphrase stress + extractor extension Stress-tests the role gate with 40 queries (10 fill_events rows × 4 styles): need, client_first, looking, shorthand. Each row's role + client + city stays the same; only the surface phrasing changes. real_003 (original extractor) confirmed the shorthand-vs-shorthand failure mode: CNC Operator shorthand recording leaked w-2404 onto Forklift Operator shorthand query within the same Beacon Freight Detroit cluster. Both record + query had empty role (extractor returns "" for shorthand because there's no separator between role and city), gate disabled, distance check passed, bleed fired. Fix: extended extractRoleFromNeed to handle client_first ("{client} needs N {role} in...") and looking ("Looking for N {role} at...") patterns. Shorthand left intentionally unmatched — "Forklift Operator Detroit" is shape-indistinguishable from "Forklift" + "Operator Detroit" without an LLM extractor or known- cities lookup. real_003b (extended extractor) verifies bleed closed across all 4 styles for this dataset. Forklift Operator queries keep w-2136 (the cold-pass-correct match) regardless of which style the query came in. Same-role boosts now fire correctly across styles — a CNC Operator recording made in `looking` style boosts the CNC need-form query. scripts/cutover/gen_real_queries.go: added -styles flag with values need\|client_first\|looking\|shorthand\|all (default need preserves real_001/002 behavior). Tests/reality/real_coord_queries_v2.txt is the 40-query stress file. scripts/playbook_lift/main_test.go: 10 sub-tests lock the four documented patterns + shorthand limitation + lift-suite-style queries (no clean role, returns empty as expected). Aggregate metrics: - real_003 (original): disc=7, lift=7, boost=14, meanΔ=-0.108 - real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202 The growth reflects more LEGITIMATE same-role same-cluster transfer firing across styles, not bleed (verified by per-cluster bleed table — Forklift Operator queries unchanged across all 4 styles). Known limitation documented in real_003_findings.md: same-cluster, same-role queries in shorthand still embed close enough that a shorthand recording could bleed onto a different-role shorthand query if both record + query strip role. Closing this requires LLM extraction or known-cities lookup at record + query time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:42:02 -05:00
root	997527be4d	matrix: cross-role playbook gate — closes real_001 bleed (OPEN #1 ) real_001 surfaced same-client+city queries bleeding across roles: Q#2 (Forklift Operator @ Beacon Freight Detroit) recorded e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and Q#10 (CNC Operator same client+city) embedded within 0.13-0.18 cosine of Q#2's query — well inside the 0.20 inject threshold — so e-6193 injected on both, demoting the cold-pass-correct workers. Root cause: the inject distance threshold isn't tight enough on the same-client+city cluster. Cosine collapses queries that share city + client + count-token + time-token regardless of role. The existing judge gate is per-injection at record time and doesn't fire at retrieve time. Fix: structural role gate in front of both Shape A boost and Shape B inject. PlaybookEntry gains Role; SearchRequest gains QueryRole. When both are non-empty and differ under roleEqual's case+plural normalization, the entry is rejected before BoostFactor or judge-gate logic runs. Backward-compat: empty role on either side disables the gate — preserves behavior for the lift suite's free-form multi-constraint queries that have no clean single role. Caller-supplied (not inferred), so existing recordings unaffected. Wire-through: - internal/matrix/playbook.go: Role field, NewPlaybookEntryWithRole, roleEqual helper with plural+case normalization - internal/matrix/retrieve.go: QueryRole on SearchRequest, threaded to both ApplyPlaybookBoost + InjectPlaybookMisses - cmd/matrixd/main.go: role on POST /matrix/playbooks/record + bulk - scripts/playbook_lift/main.go: extractRoleFromNeed regex pulls role from "Need N {role}{s} in" queries (the fill_events shape); free-form queries fall back to empty (gate disabled) Tests (5 new): - TestInjectPlaybookMisses_RoleGateRejectsCrossRole: exact Q#10 scenario (distance 0.135, recorded "Forklift Operator", query "CNC Operator") — locks the bleed at unit level - TestInjectPlaybookMisses_RoleGateAllowsSameRole: Forklift Operator recording fires on Forklift Operators query (plural normalization) - TestInjectPlaybookMisses_RoleGateBackwardCompat: empty Role on either side = gate disabled, preserves current behavior - TestApplyPlaybookBoost_RoleGateRejectsCrossRole: Shape A defense in depth — boost doesn't fire on cross-role even when answer is in cold top-K - TestRoleEqual_PluralAndCase: case + -s + -es plural normalization Verification (real_002, same query set as real_001): - Q#5 Pickers @ Beacon Freight: e-6193 → e-8499 (no bleed) - Q#10 CNC Operator @ Beacon Freight: e-6193 → w-2404 (no bleed) - Discoveries + lifts unchanged at 2 each (same-role lift still fires) - Mean Δdist tightens from -0.127 to -0.040 (boosts no longer pulling distances through the floor on cross-role mismatches) Findings: reports/reality-tests/real_002_findings.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:34:10 -05:00
root	7f2f112e6a	reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed First retrieval probe with non-synthetic query distribution. Pulls N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet (real-shape demand data) and translates each to the natural language a coordinator would type: "Need {count} {role}s in {city} {state} starting at {at} for {client}". Headline: 8/10 cold-pass top-1 = judge-best on real distribution. Substrate works on queries it was never trained for. v2-moe + workers corpus carry the load. Surfaced finding (the real value of running this): same-client+city queries cluster, and Shape A's distance boost bleeds across roles within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1 even though: - Neither query has its own recorded playbook. - Neither warm pass triggers a Shape B inject (boosted=0). - The roles are different staffing categories. Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating 4 at rank 0) for a worker who was approved by the judge for a different role on a different query. Why the lift suite missed it: synthetic queries use 7 disjoint scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand clusters on (client, city). The cluster doesn't exist in the synthetic distribution. Why the judge gate doesn't catch it: the gate (5a3364f) is per-injection at record time. After approval the worker rides Shape A distance boosts on all later same-cluster queries with no second gate call. Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus metadata + Shape A boost gate on role match. Cheap; doesn't need new judge calls. Files: - scripts/cutover/gen_real_queries.go: parquet → coordinator NL - tests/reality/real_coord_queries.txt: 10 generated queries - reports/reality-tests/playbook_lift_real_001.md: harness output - reports/reality-tests/real_001_findings.md: the reading Repro: go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \ WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:18:40 -05:00
root	5687ec65c2	G5 cutover prep: embed parity probe — Rust /ai/embed ↔ Go /v1/embed verified First concrete cutover artifact: scripts/cutover/embed_parity.sh brings up Go embedd + gateway alongside the live Rust gateway, hits both /ai/embed and /v1/embed with the same forced model, and emits a per-date verdict report under reports/cutover/. Why embed first: the parity invariant is one math identity (cosine sim of vectors against same input). Retrieve has thousands of edge cases. If embed parity holds, all downstream vector consumers inherit confidence; if it doesn't, we catch it in 30s instead of after a flip. Verdict 2026-04-30: 5/5 samples cosine=1.000000 with model forced to nomic-embed-text (v1). Same with nomic-embed-text-v2-moe (both Ollamas have it loaded). Math is provably equivalent across the gateway plumbing. Drift catalog (reports/cutover/SUMMARY.md): - URL: Rust /ai/embed vs Go /v1/embed - Wire: Rust {embeddings, dimensions} (plural) vs Go {vectors, dimension} (singular). Wire-format adapter is the only real cutover work for this endpoint. - L2 norm: Rust unit vectors (~1.0); Go raw Ollama (~20-23). Same direction (cos=1.0); harmless under cosine-distance HNSW (which is Go vectord's default), but worth fixing in internal/embed/ before extending to euclidean indexes. reports/cutover/ now tracked (joined the scrum/ + reality-tests/ exemptions in .gitignore). Next probe: /v1/matrix/retrieve ↔ Rust /vectors/hybrid for the real user-facing retrieve path. Embed parity gives that probe a clean foundation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:07:04 -05:00
root	a2fa9a2ce7	scripts/scrum_review: pipe diff via temp files — fixes argv overflow on large bundles `jq --arg` and `curl --data-binary @-` both read stdin/argv-bound buffers. Diffs >~128KB blow past the kernel's argv limit even when piped via stdin (because we still build `body` as a shell variable first, then feed it to curl). Voice-ai full bundle was 156K and hit it. Switch to writing user/system/body to mktemp files, jq reads via --rawfile, curl reads via @file. Same on-the-wire shape, no argv involvement. Cleanup with rm at the end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:57:34 -05:00
root	6c93a38093	scrum multi_coord_phase3: 4 fixes from cross-lineage review Cross-lineage scrum on bundle 87cbd10..f971e64 (3,652 lines) produced 4 actionable findings, all defensive hardening. 1. (Opus WARN) internal/langfuse/client.go:queue Synchronous Flush at maxBatch threshold blocked the calling goroutine for the full 5s HTTP timeout when Langfuse hiccupped, defeating the "best-effort, never blocks calling path" contract in the package doc. Now fire-and-forget via goroutine. 2. (Opus + Kimi convergent) cmd/observerd/main.go:handleInbox - Free-form priority string was accepted; "nonsense" passed through unchecked. Now closed enum: urgent\|high\|medium\|low (+ empty defaults to medium). Tested: TestInbox_RejectsBadPriority. - No size cap on body, only emptiness check; multi-MB payloads would bloat observer's ring + JSONL. Now 8 KiB cap returns 413. Tested: TestInbox_RejectsOversizedBody. - Subject/sender/tag concatenated into InputSummary without newline stripping; embedded \n could corrupt JSONL line-based parsers. New sanitizeInboxField strips \r\n + caps at 256 chars before interpolation. 3. (Opus INFO) scripts/multi_coord_stress/main.go Removed dead `must[T]` generic — tracedSearch took over the fail-fast role for matrix searches, so the helper became unused. 4. (Opus INFO) scripts/multi_coord_stress/main.go:Event `JudgeRating int` collapsed "judge errored" and "judge said unrated" both to 0. Changed to *int — nil = errored, 1-5 = verdict. judgeInboxResult still returns 0 on error; caller gates on > 0 before assigning. Dismissed (with rationale): - Opus WARN ExcludeIDs ordering: verified by code read — filter applies after sort + before top-K truncation as documented; no slot waste possible. - Opus INFO 10 prior-run reports contradict #011: those are point-in-time snapshots; intentional history. - Kimi INFO Langfuse error suppression: design intent (best-effort per package doc). - Kimi INFO contract schema validation: defer until contract count grows enough to make hand-edit drift a real risk. - Kimi INFO paraphrase prompt duplicated across lift + multi_coord: defer (lift to internal/paraphrase/ when a third consumer appears). - Qwen HOLD: single-line, no actionable finding. go test ./cmd/observerd ./internal/langfuse all green; multi_coord driver builds clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:42:07 -05:00
root	f971e64745	g2_smoke: accept nomic-embed-text* family members as default Pre-push hook caught the regression — the smoke hardcoded MODEL = "nomic-embed-text" and the bump to nomic-embed-text-v2-moe in 4da32ad failed the gate. Fix: glob-match the family prefix (nomic-embed-text*). Both v1 and v2-moe are 768d drop-ins; the property the smoke is locking is dim + distinct-vectors, not the exact model variant. Operators swap the variant in lakehouse.toml without needing to touch the smoke. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:37:20 -05:00
root	5d49967833	multi_coord_stress: full Langfuse coverage — every phase + every call Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept. This commit threads tracing through every phase: baseline / fresh- resume / inbox burst / surge / swap / merge / handover (verbatim + paraphrase) / split / reissue. Each phase is a parent span; each matrix.search / LLM call inside is a child span. Refactor: - One run-level trace is created at driver startup. - New startPhase(name, hour, meta) helper emits a phase span as a child of the run trace; subsequent emitSpan calls nest under it. - New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch with span emission. Every search call site replaced with this so the input/output JSON (query, corpora, k, playbook, exclude_n → top-K ids, top1 distance, boost/inject counts) lands in Langfuse. - Phase 4b's paraphrase generation also emits llm.paraphrase spans. - Phase 1c's existing inline span emission converted to use the new helpers (no more inboxTraceID variable). Run #011 result: trace landed at http://localhost:3001 with 111 observations attached. Span breakdown: phase.* parents: 9 (one per phase that ran) matrix.search.baseline: 10 matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh) observerd.inbox.record: 6 llm.parse_demand: 6 matrix.search.inbox: 6 llm.judge_top1: 6 matrix.search.surge: 12 matrix.search.swap_orig: 1 matrix.search.swap_replace: 1 matrix.search.merge: 6 matrix.search.handover_verbatim: 4 llm.paraphrase: 4 matrix.search.handover_paraphrase: 4 matrix.search.split: 4 matrix.search.reissue: 12 matrix.search.reissue_retrieval_only: 12 ───────────── Total: 111 Browse: http://localhost:3001 → Traces → "multi_coord_stress run" Each phase is a collapsible section showing per-call timing and input/output JSON. Operators can drill into any single retrieval to see exactly what query was issued and what came back. All other metrics held: diversity 0.026, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3 at top-1 (two-tier index), 200-worker swap Jaccard 0.000. This is the FULL TEST J asked for — every action in the run visible in Langfuse, full input/output drilldown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:43:32 -05:00
root	08a086779b	multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 Runs #003-#009 surfaced the same finding: fresh workers added mid-run to the main 'workers' vectord index (5K items) reliably absorbed (HTTP 200) but failed to surface in semantic queries even with content-matching prompts. Distances on the verify queries sat at 0.25-0.65 against existing workers; fresh items were beyond top-K. Better embedder (v2-moe) didn't help — distances got TIGHTER on existing items, pushing fresh items further out of reach. Root cause: coder/hnsw incremental adds to a populated graph land in poorly-connected regions and disappear from search traversal. Known property of HNSW post-build adds; not a bug. Fix: two-tier index pattern (canonical NRT search architecture). Fresh content goes to a small "hot" corpus (fresh_workers); main queries include it in the corpora list and merge results. Hot corpus has no recall crowding because it's tiny; periodic batch job (post- G3) merges it into the main index. Implementation: - ensureFreshIndex(hc, gw, name, dim) — idempotent POST /v1/vectors/index. 409 from re-create treated as "already there." - ingestFreshWorker now takes idx parameter so callers can target fresh_workers instead of workers. - multi_coord_stress phase 1b creates fresh_workers index + ingests 3 fresh workers there + searches verifyCorpora=[workers, ethereal_workers, fresh_workers]. Run #010 result: fresh-001 (Senior tower crane rigger NCCCO Chicago) top-1: fresh-001 from fresh_workers, distance 0.143 fresh-002 (Bilingual Spanish/English OSHA trainer Indianapolis) top-1: fresh-002 from fresh_workers, distance 0.146 fresh-003 (FAA Part 107 drone surveyor Chicago) top-1: fresh-003 from fresh_workers, distance 0.129 3/3 fresh workers surface at top-1 — the absorption-but-not- findable issue from runs #003-#009 is closed. All other metrics held: diversity 0.007, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4, swap Jaccard 0.000, inbox burst all 6 events accepted + traced to Langfuse. This is the final structural fix for the multi-coord stress suite. Phase 3 is feature-complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:31:45 -05:00
root	7e6431e4fd	langfuse: Go-side client + Phase 1c instrumentation The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs); this commit lands Go-side parity so the multi-coord stress harness can emit traces visible at http://localhost:3001. internal/langfuse/client.go: - Minimal Trace + Span + Flush API mirroring what the Rust emitter uses. Auth: Basic over public_key:secret_key. - Best-effort posture: errors are slog.Warn'd, never block calling paths. Same fail-open as observerd's persistor (ADR-005 Decision 5.1) — observability is a witness, not a gate. - Events buffered until 50, then auto-flushed; explicit Flush() at process exit. - Each Trace/Span returns its id so callers can build hierarchies. multi_coord_stress driver wiring: - New --langfuse-env flag (default /etc/lakehouse/langfuse.env). Empty / missing / unparseable file → skip tracing with a logged warning; run still proceeds. - Phase 1c (inbox burst) now emits one parent trace + 4 spans per inbox event: 1. observerd.inbox.record (post to /v1/observer/inbox) 2. llm.parse_demand (qwen2.5 → structured fields) 3. matrix.search (parsed query → top-K) 4. llm.judge_top1 (rate top-1 vs original body) Each span carries input/output JSON + start/end times so the Langfuse UI shows a full waterfall per event. Run #009 result: Trace landed: "multi_coord_stress phase 1c inbox burst" Observations attached: 24 (= 6 events × 4 spans) Tags: stress, phase-1c, inbox Browseable at http://localhost:3001 by tag query. Other harness metrics: diversity 0.016, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4 — all unchanged by the tracing addition (best-effort post in parallel). Phase 1c is the proof-of-concept; future commits can wrap other phases (baseline / merge / handover / split) in traces too. Once that's done, the entire stress run becomes scrubbable in Langfuse without grepping the events JSON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:25:03 -05:00
root	ce940f4a14	multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much tighter cosine distances (0.05-0.10 in three cases) but lose the "system has no good match" signal that high-distance results give. A coordinator UI showing only distance can't tell wrong-domain matches apart from real ones. Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the LLM-parsed query). Coordinators see both: - distance: how close was retrieval in vector space - rating: does this person actually fit the original ask The pair tells the honest story. Run #008 result on the 6 inbox events: Demand Top-1 Distance Rating Reading ───────────────────────────────────────────────────────────── Forklift Cleveland w-3573 0.29 4 Strong Production Indy e-1764 0.41 3 Adjacent Crane Chicago e-7798 0.23 1 TIGHT BUT WRONG Bilingual safety Indy w-3918 0.05 5 Perfect Drone Chicago e-1058 0.06 5 Perfect (verify e-1058) Warehouse Milwaukee w-460 0.32 4 Strong The crane-Chicago case is the architectural-honesty signal at work: distance 0.23 says "tight match" but the judge says rating 1 reading the original body. A coordinator seeing only distance would ship the wrong worker; coordinator seeing distance+rating sees the disagreement and escalates. Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1 (irrelevant despite tight cosine). The substrate-honesty signal is recovered without losing the LLM-parse quality wins. Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes when judge runs only on top-1 of high-priority inbox events; the search-cost-vs-quality tradeoff lives in the priority gate. Implementation: - New JudgeRating int field on Event (omitempty so non-judged events stay clean in JSON) - New judgeInboxResult helper, reusing the same prompt structure as playbook_lift's judgeRate. The two could share an internal package if a third judge consumer appears. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:16:49 -05:00
root	186d209aae	multi_coord_stress: LLM-parsed inbox demands (qwen2.5) Replaced the hard-coded DemandQuery on inbox events with an actual LLM call: each email/SMS body is parsed by qwen2.5 (format=json, schema-anchored) into structured {role, count, location, certs, skills, shift}. The driver then composes a query string from those fields and runs matrix.search. This is the real-product flow that the Phase 3 stress test was asking for: real bodies → real LLM parsing → real search. Before this commit, the DemandQuery was my hand-crafted string, which made the inbox phase trivial. Run #007 result vs #006 (same bodies, parser swapped): All 6 inbox events parsed cleanly — qwen2.5 nailed: "Need 50 forklift operators in Cleveland OH for Monday day shift. OSHA-30 + active forklift cert required." → {role:"forklift operator", count:50, location:"Cleveland, OH", certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"} Other 5 similarly faithful (indy stayed as "indy", count defaulted to 1 when unspecified, no hallucinated fields). LLM-parsed queries produced TIGHTER matches than hard-coded: Demand #006 dist #007 dist Δ Crane Chicago 0.499 0.093 -82% Drone Chicago 0.707 0.073 -90% Bilingual safety 0.240 0.048 -80% Forklift Cleveland 0.330 0.273 -17% Production Indy 0.260 0.399 +53% Warehouse Milwaukee 0.458 0.420 -8% Three matches landed at distance < 0.10 — verbatim-replay-tight territory. Structured queries embed sharper than conversational hand-crafted strings. Other metrics unchanged: diversity 0.000, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4. Tradeoff worth flagging: the drone-Chicago case dropped from distance 0.71 (clear "we don't have one") to 0.07 (confident match returned). The OOD honesty signal weakens when LLM-parsed structure makes any closest-neighbor look tight. Future Phase 4 work: judge re-rates the top match before surfacing, so coordinators see "your demand was for X but the closest match scored 2/5" rather than just the worker ID + distance. Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5). Production would amortize via a small dedicated parser model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:51:19 -05:00
root	e7fc63b216	observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events) Phase 3 ask: real-world inbox-style event injection during the stress test. Coordinators in production receive emails + SMS that trigger contract responses; the substrate has to RECORD these signals AND react with a search using the embedded demand. This commit lands the endpoint and exercises it end-to-end in the stress harness. observerd surface: - New POST /observer/inbox route — accepts {type, sender, subject, body, priority, tag} and records as ObservedOp with Source=SourceInbox. Type must be email\|sms; body required; priority defaults to medium. The handler ONLY records — downstream triggers (search, ingest, etc.) are the caller's concern, recorded separately. Keeps the witness role pure. - New observer.SourceInbox = "inbox" alongside SourceMCP / SourceScenario / SourceWorkflow. - Three contract tests on the new route (happy path / bad type / empty body), router-mount test extended, all green. Stress harness phase 1c (Hour 9): - 6 inbox events fire in priority order (urgent → high → medium): 2 urgent emails (forklift Cleveland, production Indianapolis) 1 high email (crane Chicago) 1 high sms (bilingual safety Indianapolis) 1 medium sms (drone Chicago) 1 medium email (warehouse Milwaukee FYI) - Each event: 1. POSTs to /v1/observer/inbox (recorded by observerd) 2. Triggers matrix.search using a parsed demand (the demand extraction is hard-coded for now; production needs a small LLM to parse from body) 3. Captures both as events in the run JSON Run #006 result (with v2-moe embedder + all phases including inbox): Diversity: Same-role-across-contracts Jaccard = 0.000 (n=9) Different-roles-same-contract Jaccard = 0.046 (n=18) Determinism: 1.000 Verbatim handover: 4/4 (100%) Paraphrase handover: 4/4 (100%) Inbox burst: 6/6 events accepted by observerd (200 status, all recorded) 6/6 triggered searches produced distinct top-1 worker IDs distance distribution: 0.24 (Indy production) → 0.71 (Chicago drone surveyor — honest stretch since drones aren't in the 5K-worker corpus, system surfaces closest neighbor at high distance rather than fabricating) The drone-Chicago case is the architectural-honesty signal: when the demand asks for a specialist NOT in the roster, the system returns the closest semantic neighbor with a distance that flags "this is a stretch." Coordinators reading distances see "we don't have a great match here" rather than a confident wrong answer. Total events captured: 67 (was 61 pre-inbox). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:34:36 -05:00
root	4da32ad102	embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) Local Ollama has three embedding models loaded: nomic-embed-text:latest 137M 768d (previous default) nomic-embed-text-v2-moe:latest 475M 768d (this commit's default) qwen3-embedding:latest 7.6B 4096d (would require dim change) v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE architecture. Workers index doesn't need rebuilding, just future ingests embed with the stronger model. Run #005 result on the multi-coord stress suite: Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9) → MoE is more discriminating: zero worker overlap across Milwaukee / Indianapolis / Chicago for shared role names. The geo + cert + skill context fully separates worker pools. Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff) Determinism: 1.000 (unchanged) Verbatim handover: 4/4 (100%) Paraphrase handover: 4/4 (100%) 200-worker swap: Jaccard 0.000 (unchanged — still perfect) Fresh-resume verify: STILL doesn't surface fresh workers in top-8. With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39) — the embedder is MORE discriminating, but the fresh worker's vector still doesn't outrank the 8th-best existing worker. Now suspect of being an HNSW post-build add issue (coder/hnsw incremental adds can land in hard-to-reach graph regions, not an embedder problem). Better embedder didn't fix it; needs a different strategy: full index rebuild after fresh adds, or explicit playbook-layer score boost for fresh workers, or hybrid (keyword + semantic) retrieval. Phase 3 investigation. Cost: ingest is ~5× slower (workers 20s→100s; ethereal 35s→112s). Acceptable for the quality jump on diversity. Real production with incremental ingest won't pay this once-per-deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:26:52 -05:00
root	84a32f0d29	multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap Three Phase 2 additions land in this commit: 1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific worker IDs out of results post-retrieval, AND skips them at the playbook boost+inject step (so excluded answers can't sneak back via Shape B). Real-world driver: coordinator placed N workers, client asks for replacements, system needs alternatives, not the same N. Threaded through retrieve.go after merge but before metadata filter so excluded IDs don't waste post-filter top-K slots. 2. New harness phase 2b: 200-worker swap simulation. Captures the top-K from alpha's warehouse query, then re-issues with exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether the substrate finds genuine alternatives. 3. New harness phase 1b: fresh-resume mid-run injection. Three new workers ingested via /v1/embed + /v1/vectors/index/workers/add, then verified findable via semantic queries matching resume content. Plus Hour labels on every event (operational narrative: 0/6/12/18/ 24/30/36/42/48) and a refactor of captureEvent to take hour as a param. Run #003 + #004 results (5K workers + 10K ethereal): Diversity (#004): Same-role-across-contracts Jaccard = 0.080 (n=9) Different-roles-same-contract Jaccard = 0.013 (n=18) Determinism: 1.000 (#004 unchanged) Verbatim handover: 4/4 = 100% Paraphrase handover: 4/4 = 100% Phase 2b — 200-worker swap (Jaccard 0.000): 8 originally-placed workers fully replaced by 8 alternatives. ExcludeIDs substrate change works end-to-end — boost AND inject both honor the exclusion, so excluded workers don't return via the playbook either. Phase 1b — fresh-resume injection: REAL PRODUCT FINDING. Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add calls at 200 status, 3 vectors persisted. But none of the 3 fresh workers surfaced in top-8 even with semantic queries matching their resume content (e.g. "Senior tower crane rigger NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12 years tower-crane signaling..." NCCCO + Chicago). Top-1 came from existing workers at distance ~0.25; fresh workers' distances must be > 0.25, pushing them past rank 8. Cause: dense retrieval at 5000+ workers means many existing profiles cluster near any specific query in cosine space; nomic-embed-text-v2 (137M) introduces enough noise that a fresh worker doesn't reliably outrank them just because the text content overlaps. Workarounds (Phase 3 work): (a) hybrid retrieval (keyword + semantic), (b) playbook-layer score boost for fresh adds, (c) larger embedder. Documented in run #004 report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:19:29 -05:00
root	0fa42a0cc3	multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover Phase 1 had two known gaps: (1) the 3 contracts had zero shared role names, so same-role-across-contracts Jaccard was vacuous (n=0); (2) the verbatim handover at 100% was the trivial case, not the hard learning test (paraphrased queries against another coord's playbook). Both fixed in this commit. Contract redesign — all 3 contracts now share warehouse worker / admin assistant / heavy equipment operator roles, plus a unique specialist per contract (industrial electrician / bilingual safety coord / drone surveyor — the "specialist not on the standard roster" case from J's spec). Counts and skill mixes vary per region. New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased versions of Alice's contract queries against Alice's playbook namespace. Tests whether institutional memory propagates across coordinators AND across natural wording variation that Bob would introduce when running Alice's contract. Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3 coords + paraphrase handover): Diversity (the question J asked: locking or cycling?): Same-role-across-contracts Jaccard = 0.119 (n=9) → 88% of workers DIFFER across regions for the same role name. Milwaukee warehouse vs Indianapolis warehouse vs Chicago warehouse pull mostly distinct top-K from the same population. The system locks into geo+cert+skill context, not cycling. Different-roles-same-contract Jaccard = 0.004 (n=18) → role-specific retrieval works (unchanged from Phase 1). Determinism: Jaccard = 1.000 (n=12) — unchanged. Learning: Verbatim handover 4/4 = 100% (trivial case, expected) Paraphrase handover 4/4 = 100% (HARD case — passes!) Of those 4 paraphrase recoveries: - 2 used boost (Alice's recording was already in Bob's paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1) - 2 used Shape B inject (recording wasn't in Bob's paraphrase top-K; InjectPlaybookMisses brought it in) The boost/inject mix is healthy — both paths are used and both produce correct top-1s. Multi-coord institutional memory propagation is empirically working under wording variation. Sample warehouse worker top-1s across contracts (proves diversity): alice / Milwaukee → w-713 bob / Indianapolis → e-8447 carol / Chicago → e-7145 Three different workers from the same 15K-person population, selected on geo+cert+skill context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:03:16 -05:00
root	61c7b55e48	multi-coord stress harness — Phase 1 of 48-hour mock Three coordinators (alice / bob / carol) with three contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction). 7-phase scenario runner: baseline → surge → merge → handover → split → reissue → analysis. Each coord has a separate playbook namespace (playbook_{name}) so institutional memory stays isolated by default but transferable on demand. Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints, and Langfuse tracing — those are Phase 2/3. Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors): Diversity: Different-roles-same-contract Jaccard = 0.004 (n=18) → role-specific retrieval is working perfectly. Different roles within one contract pull totally different worker pools. System is NOT cycling; locks into per-role retrieval. Same-role-across-contracts Jaccard = N/A (n=0) → TEST-DESIGN ISSUE: the 3 contracts use distinct role names per industry (warehouse worker / production worker / general laborer), so no exact-name overlaps exist. Phase 2 should either share at least one role across contracts OR add a skill-based diversity metric. Determinism: Jaccard = 1.000 (n=12) → HNSW + Ollama retrieval is fully deterministic on identical query text. coder/hnsw + nomic-embed-text are stable. Learning: handover hit rate = 4/4 = 100% → Bob inherits Alice's recordings perfectly when bob runs identical queries with alice's playbook namespace. CAVEAT: this tests the trivial verbatim case, not paraphrase handover. The harder test (bob runs paraphrased queries with alice's playbook) is Phase 2 work. Per-event capture in JSON: every matrix.search response is logged with phase / coordinator / contract / role / query / top-K IDs + distances + per-corpus counts + boosted/injected counts. Reviewable via: jq '.events[] \| select(.phase == "merge")' jq '.events[] \| select(.coordinator == "alice")' jq '.events[] \| select(.role == "warehouse worker")' Notable finding from per-event: carol's "general laborer" and "crane operator" queries both surface w-1009 as top-1, with crane operator at distance 0.098 (very tight) and general laborer at 0.297. The system found a worker who legitimately covers both roles — realistic for small construction crews. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 07:55:29 -05:00
root	b13b5cd7a1	playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14% The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't distinguish "Shape B surfaced a strictly-better answer" from "Shape B shuffled ranks but quality is unchanged" from "Shape B replaced a good answer with a wrong one." This commit adds Pass 4: judge warm top-1 with the same prompt as cold ratings, then bucket the comparison. Implementation: - New --with-rejudge driver flag (default off). - New WITH_REJUDGE harness env (default 1, on for prod runs). - queryRun gains WarmTop1Metadata (cached during Pass 2 for the rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no rejudge, 0..5 = rating). - summary gains RejudgeAttempted, QualityLifted, QualityNeutral, QualityRegressed (counts of warm-rating > / == / < cold-rating). - Markdown headline gains a Quality block when rejudge ran. - ~21 extra judge calls (~30s on qwen2.5). Run #005 result (split inject threshold 0.20 + paraphrase + rejudge): Quality lifted 5 / 21 (24%) — 3× +2 rating, 2× +1 rating Quality neutral 13 / 21 (62%) — includes OOD queries holding 1 Quality regressed 3 / 21 (14%) Net rating delta +3 across 21 queries (+0.14 average) The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4 warm — Shape B took mediocre matches and substituted substantively better ones. The 3 regressions were small (-1, -1, -3). Q11 is the cautionary tale: cold top-1 "production line worker" (rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator" e-5729 (rating 1). Adjacent-domain cross-pollination — production worker and forklift operator embed within 0.20 cosine because both are warehouse-adjacent staffing queries, even though the judge correctly distinguishes them. The split-threshold defense (0.5 boost / 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed neutral at rating 1) but not adjacent-domain cross-pollination. Net product verdict: working, net-positive on quality, but the worst case (Q11 4→1) is customer-visible and warrants a tighter inject threshold OR an additional gate beyond cosine distance. Filed in STATE_OF_PLAY OPEN as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 07:42:04 -05:00
root	154a72ea5e	matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery The v0 boost-only stance documented in internal/matrix/playbook.go:22-27 ("the boost only re-ranks results that ALREADY surfaced from the regular retrieval") couldn't promote recorded answers that dropped out of a paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2 paraphrase recoveries because the recorded answers weren't in regular retrieval at all (rank=-1). Shape B: when warm-pass retrieval doesn't surface a playbook hit's answer, inject a synthetic Result for it directly. Distance = playbook_hit_distance × BoostFactor — same formula as the boost path so injections land in comparable distance space. Caller re-sorts + truncates after both boost and inject have run. Result on playbook_lift_003 (Shape B + paraphrase pass): Verbatim discovery 6 Verbatim lift 2 / 6 Paraphrase top-1 6 / 6 Paraphrase any-rank in K 6 / 6 Mean Δ top-1 distance -0.1637 (warm closer than cold) Every paraphrase the judge generated landed the v1-recorded answer at top-1 of the new query's results. The learning property holds — cosine on embed(paraphrase) finds the recorded query's vector within DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer. Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates recorded answers across queries. w-4435 (Q2's recording) appears as warm top-1 for several other queries because their embeddings are within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This is a feature, not a bug — the matrix layer's purpose is to share knowledge across queries — but the lift metric only counts "warm top-1 == cold judge best," so cross-pollinated lifts don't register. A v3 metric would re-judge warm pass to measure true judge improvement. Tests: - TestInjectPlaybookMisses_AddsMissingAnswers — primary claim - TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject - TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer - TestInjectPlaybookMisses_EmptyHits — fast-path no-op Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int silently dropped rank=0 (top-1, the WANTED value) from JSON, making the v003 report show "null" instead of "0" for every successful recovery. Pointer keeps nil/rank-0 distinguishable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 07:06:13 -05:00
root	e9822f025d	playbook_lift v2: paraphrase pass + run #002 finds boost-only limit Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1 recorded a playbook, ask the judge to rephrase the query, then re-query with playbook=true and check whether the recorded answer surfaces in top-K. This is the test the v1 report's caveat #3 explicitly flagged as the actual learning-property gate (not the cheap verbatim case). Implementation: - New flag --with-paraphrase on the driver (default off). - New WITH_PARAPHRASE env in the harness (default 1, on for prod runs). - New paraphrase_* fields on queryRun + summary, // 0 fallback in jq so re-rendering verbatim-only evidence stays clean. - generateParaphrase() calls the same judge model with format=json and a tight schema; temperature=0.5 for variance without domain drift. - Markdown report adds a paraphrase per-query table (only when the pass ran) and an honesty caveat about judge-also-rephrases coupling. Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}): Verbatim lift 2/2 (100% — Q7 + Q13, both stable from v1) Paraphrase top-1 0/2 Paraphrase any-rank in K 0/2 Both paraphrases dropped the recorded answer OUT of top-K entirely (rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs preserved intent ("Hazmat-certified warehouse worker comfortable with cold storage" → "Warehouse worker with Hazmat certification and experience in cold storage"). It's the v0 boost-only stance documented in internal/matrix/playbook.go:22-27: the boost only re-ranks results that ALREADY surfaced from regular retrieval. If paraphrase's cosine retrieval doesn't include the recorded answer in top-K, no boost can promote it. The "Shape B" upgrade mentioned in the playbook.go comment — inject playbook hits directly even when they weren't in the top-K — is what would close this gap. The reality test surfaced exactly the gap the docs warned about. Worth filing as the next product gate. Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2. HNSW insertion order + judge variance both contribute. Stability of Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable signal in the dataset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 06:47:41 -05:00
root	6c02c905c8	scrum lift_001: 4 fixes from cross-lineage review Cross-lineage scrum on b2e45f7 produced 1 convergent + 3 single-reviewer findings worth fixing. All apply. 1. (Opus WARN + Qwen INFO convergent) scripts/playbook_lift.sh: replace sleep 2.5 in SQL probe with active polling up to 5s. refresh_every=1s is a lower bound; under load the manifest may not be visible in a fixed sleep, which would 4xx the probe and abort the reality run. 2. (Opus INFO) scripts/playbook_lift.sh: report template glued "env JUDGE_MODEL" + value as "env JUDGE_MODELqwen2.5:latest" with no separator. Replaced two :+/:- substitution chains with a single JUDGE_SOURCE variable computed once at the top of the harness. 3. (Opus INFO) scripts/staffing_workers/main.go: -id-prefix "" silently allowed, defeating the flag's purpose (cross-corpus collision prevent). Now log.Fatal at startup with explicit hint. 4. (Opus WARN) cmd/{pathwayd,observerd}/main_test.go: newTestRouter returned http.Handler then re-cast to chi.Router for chi.Walk. Returning chi.Router directly satisfies http.Handler AND avoids an assertion that would panic if future middleware wraps the router. Dismissed (with rationale): - Kimi INFO hardcoded MinIO endpoint: harness is local-by-design. - Kimi WARN matrixd accepts 502/500: documented; real retriever needs real upstreams the test doesn't spin up. - Qwen INFO queryd string.Contains: brittle but very low risk; restating through typed-error path would couple without adding signal. go test ./cmd/{matrixd,queryd,pathwayd,observerd} all green. Verdicts at reports/scrum/_evidence/2026-04-30/verdicts/lift_001_*.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 06:27:24 -05:00
root	b2e45f7f26	playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%) The 5-loop substrate's load-bearing gate is verified — playbook + matrix indexer give the results we're looking for. Per the report's rubric, lift ≥ 50% of discoveries means matrix is doing real work; 7/8 = 87.5% blew through that. Harness was structurally hiding bugs behind a 5-daemon stripped boot. Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade: 1. driver→matrixd: {"query": ...} → {"query_text": ...} field name 2. harness temp toml missing [s3] → wrong default bucket → catalogd rehydrate 500 on first call 3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name 4. expand boot from 5 → 10 daemons in dep-ordered launch 5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion) 6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) — wrong domain for staffing queries; replaced with ethereal_workers (10K rows, real staffing schema, "e-" id prefix to avoid collision with workers' "w-"). staffing_workers driver gains -index-name + -id-prefix flags so the same binary serves both corpora 7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running ~30s per judge call against the lift loop; reverted to qwen2.5:latest (~1s/call, 30× faster, held lift theory) Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go so future drift fires in `go test`, not in a reality run. R-005 closed: - cmd/matrixd/main_test.go (new) — playbook record drift detector + score bounds + 6 routes mounted - cmd/queryd/main_test.go — wrong-field-name drift detector - cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire - cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode `go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green. Reality test results (reports/reality-tests/playbook_lift_001.{json,md}): Queries 21 (staffing-domain, 7 categories) Discoveries 8 (judge ≠ cosine top-1) Lifts 7/8 (87.5%) Boosts triggered 9 Mean Δ distance -0.053 (warm closer than cold) OOD honesty dental/RN/SWE rated 1, no fake matches Cross-corpus boosts confirmed (e- ↔ w- swaps in lifts) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 06:22:21 -05:00
root	740eb0d00c	scrum_review: switch curl to stdin so large diffs don't blow argv Phase 4-bundle review (128KB diff) hit "Argument list too long" when curl --data was passed the body as a literal arg. Pipe via stdin with --data-binary @- instead. Lifts the practical bundle size from ~30KB to whatever fits in process memory. Caught while running the harness scrum on golangLAKEHOUSE today — the bigger Phase A+B harness diff (4566 lines) tripped it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 02:46:52 -05:00
root	e4ee0029c0	scrum_review.sh: reusable 3-lineage cross-review driver Bash driver wrapping /v1/chat for Opus + Kimi + Qwen3-coder review runs. Used today to scrum the 4-phase wave (1,624 LoC of chatd + config-refactor + Rust cleanup) and caught 2 BLOCKs + 2 WARNs. Usage: ./scripts/scrum_review.sh <bundle.diff> <bundle_label> Output: reports/scrum/_evidence/<DATE>/verdicts/<bundle>_<reviewer>.md verbatim, per the evidence-only convention. Per-reviewer latency + token counts captured in the report header. System prompt enforces the BLOCK/WARN/INFO + WHERE/WHAT/WHY shape per feedback_cross_lineage_review.md — leads with verdict, no preamble (Kimi tends to spend tokens thinking otherwise). Reviewer fleet matches project_golang_lakehouse.md "Scrum routing": - opencode/claude-opus-4-7 - openrouter/moonshotai/kimi-k2-0905 - openrouter/qwen/qwen3-coder This is the first dogfood of chatd as the scrum vehicle — eats its own /v1/chat dispatcher. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:29:36 -05:00
root	0efc7363c5	scrum 2026-04-30: 4 real fixes + 2 INFOs from cross-lineage review 3-lineage scrum (Opus 4.7 / Kimi K2.6 / Qwen3-coder) on today's wave landed 4 real findings (2 BLOCK + 2 WARN) and 2 INFO touch-ups. Verbatim verdicts + disposition table at: reports/scrum/_evidence/2026-04-30/ B-1 (BLOCK Opus + INFO Kimi convergent) — ResolveKey API: collapse from 3-arg (envVar, envFileName, envFilePath) to 2-arg (envVar, envFilePath). Pre-fix every chatd caller passed the env var name twice; if operator renamed *_key_env in lakehouse.toml while keeping the canonical KEY= line in the .env file, fallback silently missed. B-2 (WARN Opus + WARN Kimi convergent) — handleProviders probe: drop the synthesize-then-Resolve probe; look up by name directly via Registry.Available(name). Prior probe synthesized "<name>/probe" model strings and routed through Resolve, fragile to any future routing rule (e.g. cloud-suffix special case). B-3 (BLOCK Opus single — verified by trace + end-to-end probe) — OllamaCloud.Chat StripPrefix used "cloud" but registry routes "ollama_cloud/<m>". Result: upstream got the prefixed model name and 400'd. Smoke missed it because chatd_smoke runs without ollama_cloud registered. Now strips the right prefix; new TestOllamaCloud_StripsCorrectPrefix locks both prefix + suffix cases. Verified live: ollama_cloud/deepseek-v3.2 round-trips cleanly through the real ollama.com endpoint. B-4 (WARN Opus single) — Ollama finishReason: read done_reason field instead of inferring from done bool alone. Newer Ollama reports done=true with done_reason="length" on truncation; the prior code mapped that to "stop" and lost the truncation signal the playbook_lift judge needs to retry. New TestFinishReasonFromOllama_PrefersDoneReason covers the fallback ladder. INFOs: - B-5: replace hand-rolled insertion sort in Registry.Names with sort.Strings (Opus called the "avoid sort import" comment a false economy — correct). - A-1: clarify the playbook_lift.sh comment around -judge "" arg passing (Opus noted the comment said "env priority" but didn't reflect that the empty arg also passes through the Go driver's resolution chain). False positives dismissed (3, documented in disposition.md): - Kimi: TestMaybeDowngrade_WithConfigList wrong assertion (test IS correct per design — model excluded from weak list = strong = downgrade) - Qwen: nil-deref claim (defensive code already handles nil) - Opus: qwen3.5:latest doesn't exist on Ollama hub (true on the public hub but local install has it) just verify: PASS. chatd_smoke 6/6 PASS. New regression tests: 3 (B-2, B-3, B-4 each get a focused test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:28:08 -05:00
root	05273ac06b	phase 4: chatd — multi-provider LLM dispatcher (ollama / cloud / openrouter / opencode / kimi) new cmd/chatd on :3220 routes /v1/chat to the right provider based on model-name prefix or :cloud suffix. closes the architectural gap named in lakehouse.toml [models]: tiers map to model IDs, but until phase 4 there was no service that could actually CALL those models from go. routing rules (registry.Resolve): ollama/<m> → local Ollama (prefix stripped) ollama_cloud/<m> → Ollama Cloud <m>:cloud → Ollama Cloud (suffix variant — kimi-k2.6:cloud) openrouter/<v>/<m> → OpenRouter (prefix stripped, OpenAI-compat) opencode/<m> → OpenCode unified Zen+Go kimi/<m> → Kimi For Coding (api.kimi.com/coding/v1) bare names → local Ollama (default) provider implementations: - internal/chat/types.go Provider interface, Request/Response, errors - internal/chat/registry.go prefix + :cloud suffix dispatch - internal/chat/ollama.go local Ollama via /api/chat (think=false default) - internal/chat/ollama_cloud.go Ollama Cloud via /api/generate (Bearer auth) - internal/chat/openai_compat.go shared OpenAI Chat Completions for the OpenRouter/OpenCode/Kimi family - internal/chat/builder.go BuildRegistry from BuilderInput; ResolveKey reads env then .env file fallback config: - ChatdConfig in internal/shared/config.go with bind, ollama_url, per-provider key env names + .env fallback paths, timeout - Gateway gains chatd_url + /v1/chat + /v1/chat/* routes - lakehouse.toml [chatd] block with /etc/lakehouse/<provider>.env defaults tests (19 in internal/chat): - registry: prefix + :cloud + errors + telemetry + provider listing - ollama: happy path + prefix strip + format=json + 500 mapping + flatten_messages - openai_compat: happy path + format=json + 429 mapping + zero-choices think=false default in ollama + ollama_cloud — local hot path skips reasoning, low-budget callers (the playbook_lift judge at max_tokens=10) get direct answers instead of empty content + done_reason=length. proven via chatd_smoke acceptance. acceptance gate: scripts/chatd_smoke.sh — 6/6 PASS: 1. /v1/chat/providers lists exactly registered providers (1 in dev mode) 2. bare model → ollama default with content + token counts + latency 3. explicit ollama/<m> → prefix stripped at upstream 4. <m>:cloud without ollama_cloud registered → 404 (no silent fall-through) 5. unknown/<m> → falls through to default → upstream 502 (no prefix rewrite) 6. missing model field → 400 just verify: PASS (vet + 30 packages × short tests + 9 smokes). chatd_smoke is a domain smoke (not in just verify, mirrors matrix / observer / pathway pattern). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 00:08:29 -05:00
root	848cbf5fef	phase 3: playbook_lift harness reads judge from config migrate the reality-test harness's judge-model default from a hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge. resolution priority: explicit -judge flag > $JUDGE_MODEL env > cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback. bumping the judge for run #N+1 now means editing one line in lakehouse.toml [models].local_judge — no Go file or shell script edits required. changes: - scripts/playbook_lift/main.go: -config flag added, judge default flips to "" so resolution chain runs. Imports internal/shared for config loader. - scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash; EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config grep > qwen3.5:latest fallback). Used for the Ollama presence check + report header. Pre-flight grep avoids requiring jq just to read the toml. - reports/reality-tests/README.md: documents the 4-step priority chain. verified all 4 paths produce the expected judge: - config (no env): qwen3.5:latest (from lakehouse.toml) - env override: env wins - flag override: flag wins over env - missing config: DefaultConfig fallback still gives qwen3.5:latest just verify PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:57:28 -05:00
root	3dd7d9fe30	reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine? First reality test driver. Two-pass design: - Pass 1 (cold): matrix.search use_playbook=false → small-model judge rates top-K → record playbook entry pointing at the highest-rated result (which may NOT be top-1 by distance — that's the discovery). - Pass 2 (warm): same queries with use_playbook=true → measure ranking shift. Lift = real if recorded answer becomes top-1. Files: - scripts/playbook_lift/main.go driver (391 LoC) - scripts/playbook_lift.sh stack-bring-up + report gen - tests/reality/playbook_lift_queries.txt query corpus (5 placeholders; J writes real 20+) - reports/reality-tests/README.md framework + interpretation - .gitignore track reports/reality-tests/ but ignore per-run JSON evidence This answers the gate from project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for." Without ground-truth labels, the LLM judge is the proxy — the same small-model thesis applied to evaluation. Honest about that limitation in the generated reports. Driver compiles clean; full run requires Ollama + workers/candidates ingest. Skips cleanly if Ollama absent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:22:36 -05:00
root	c7e3124208	§3.8 second slice: real modes wired (matrix.relevance/downgrade/search, distillation.score, drift.scorer) Lands the workflow.Mode adapters for the §3.4 components + the distillation scorer + drift quantifier. Workflows can now compose real measurement capabilities; the substrate's parallel capabilities become composable Lego bricks (per the prior commit's closing insight). Modes registered (in observerd's registerBuiltinModes): Pure-function wrappers (no I/O): - matrix.relevance → matrix.FilterChunks - matrix.downgrade → matrix.MaybeDowngrade - distillation.score → distillation.ScoreRecord - drift.scorer → drift.ComputeScorerDrift HTTP-backed: - matrix.search → POST matrixd /matrix/search (registered only when matrixd_url is set) Fixture (kept from §3.8 first slice): - fixture.echo, fixture.upper internal/workflow/modes.go: Each mode follows the same glue pattern: marshal generic input through a typed struct (free schema validation + clear error messages), call the underlying capability, return a generic output map. Roundtrip-via-JSON gives us schema validation without writing custom field-by-field coercion. internal/workflow/modes_test.go (10 tests, all PASS): - matrix.relevance filters adjacency pollution (Connector kept, catalogd::Registry dropped — same headline as the relevance smoke, run through the workflow mode) - matrix.downgrade flips lakehouse→isolation on strong model; keeps lakehouse on weak (qwen3.5:latest); errors on missing fields - distillation.score rates scrum_review attempt_1 as accepted; rejects empty record - drift.scorer reports zero drift on matched inputs; errors on empty inputs slice - matrix.search HTTP flow round-trips through httptest fake matrixd; non-OK status surfaces a clear error scripts/workflow_smoke.sh (5 assertions PASS, was 4): New assertion #5: real-mode chain matrix.downgrade (lakehouse + grok-4.1-fast → isolation) → distillation.score (scrum_review attempt_1 → accepted) Proves §3.4 components compose through the workflow runner with no fixture intermediation. Both nodes ran successfully, runner recorded provenance, status=succeeded. Mode listing assertion now expects 7 modes (5 real + 2 fixture) instead of just the fixtures. 17-smoke regression all green. SPEC §3.8 acceptance gate G3.8.D ("Mode catalog dispatches matrix.search invocation to the matrixd backend without going through HTTP") still pending — current path goes through HTTP for matrix.search, which is the cleaner service- mesh shape but slower than direct in-process. In-process dispatch when matrixd is co-resident is a future optimization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:39:26 -05:00
root	e30da6e5aa	§3.8 first slice: workflow runner skeleton + DAG executor + observerd integration Lands the structural piece of SPEC §3.8 (Observer-KB workflow runner) documented in 97dd3f8: types + DAG runner + reference substitution + provenance recording into observerd. Real-mode integrations (matrix.search, distillation.score, drift.scorer, llm.chat) come in follow-up commits — this commit proves the mechanics. internal/workflow/types.go: - Workflow / Node / NodeResult / RunResult types matching Archon's YAML shape so existing workflows (e.g. lakehouse-architect-review.yaml) load directly. Optional `mode` field added — implicit fall-back is "llm.chat" matching Archon's convention. - Mode signature: func(Context, map[string]any) (map[string]any, error) - 4 sentinel errors: ErrCycle, ErrMissingDep, ErrUnknownMode, ErrDuplicateNodeID, ErrUnresolvedRef - Validate enforces structural invariants: unique IDs, every depends_on resolves, no cycles internal/workflow/runner.go: - Kahn's-algorithm topological sort, stable for declaration-order ties (deterministic execution + JSON output across runs) - Reference substitution: $node_id.output.key.path resolves through nested maps; $node_id alone resolves to the whole output map - Skip cascade: a node whose dependency failed/skipped is skipped with explicit "upstream node X failed" error in NodeResult, never silently dropped - Per-node provenance: NodeResult.StartedAt + DurationMs captured for every execution - Mode pre-validation: every node's mode checked against registry BEFORE any node runs — typo catches in 5ms not after 6 nodes internal/workflow/runner_test.go (14 tests, all PASS): - Validate: missing name, no nodes, duplicate IDs, missing deps, cycles - Run: single node, 3-node DAG with chained $-refs (shape→weakness→improvement), failed-node skip cascade with independent siblings still running, unknown-mode abort, unresolved-reference error, implicit llm.chat fallback, provenance fields populated, inputs (not just prompt) honor $-refs, topological-sort stability for ties cmd/observerd extended: - POST /observer/workflow/run executes a workflow, records each node's execution as an ObservedOp (source="workflow"), returns the full RunResult - GET /observer/workflow/modes lists the registered mode names - registerBuiltinModes wires fixture.echo + fixture.upper for v0; real modes register here in follow-up commits scripts/workflow_smoke.sh (4 assertions PASS): - GET /modes lists fixture.echo + fixture.upper - 3-node DAG executes: shape (uppercase "hello world") → weakness (sees "HELLO WORLD" via $shape.output.upper ref) → improvement (sees "HELLO WORLD" propagated through 2-hop $weakness.output.prompt) - /observer/stats shows by_source.workflow == 3 (one per node) and total == 3 — provenance lands as expected - Unknown mode → 400 with "unknown mode" in error body 17-smoke regression all green. Acceptance gates G3.8.A (Archon-shape workflow loads + executes topologically) + G3.8.B (per-node ObservedOps) + G3.8.C ($prior_node.output ref resolves, error on missing ref) all satisfied. G3.8.D (in-process matrix.search dispatch) deferred until a real mode is wired. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:34:30 -05:00
root	bc9ab93afe	H: observerd — autonomous-iteration witness loop (SPEC §2 port) Port of the load-bearing pieces of mcp-server/observer.ts (Rust system, 852 lines TS) per SPEC §2's named target. Implements PRD loop 3 ("Observer loop — watches each run, refines configs"). Routes (all under /v1/observer/* via gateway): GET /observer/health — liveness GET /observer/stats — total / successes / failures / by_source / recent_scenario_ops (matches Rust JSON shape exactly) POST /observer/event — record one ObservedOp; auto-defaults timestamp + source, validates required fields (endpoint), persists to JSONL, appends to ring buffer Architecture: - internal/observer/types.go — ObservedOp model + Source taxonomy (mcp / scenario / langfuse / overseer_correction). Mirrors the Rust shape so JSON round-trips during cutover. - internal/observer/store.go — Store + Persistor. Ring buffer cap matches Rust's 2000; recent_scenarios cap matches Rust's 10. Same persist-then-apply order as pathwayd; same corruption- tolerant replay (skip malformed lines + warn). - cmd/observerd — :3219 HTTP service, fronted by gateway as /v1/observer/*. - lakehouse.toml + DefaultConfig — [observerd] block matches the pathwayd pattern (Bind + PersistPath; empty path = ephemeral). Tests + smoke (all PASS): - 7 unit tests in store_test.go: validation, default fields, stats aggregation, recent-scenarios cap + ordering, ring-buffer rollover at cap, JSONL round-trip persistence, corruption- tolerant replay (1 valid + 1 corrupt + 1 valid → 2 applied) - scripts/observer_smoke.sh: 4 assertions through gateway — record 5 events (3 ok / 2 fail across 2 sources), stats aggregates correctly, empty-endpoint→400, kill+restart preserves via JSONL replay (5 ops, 3 ok, 2 err survive) Deferred (named in package + cmd doc, not in this commit): - POST /observer/review (cloud-LLM hand-review fall-back). The heuristic-only path could land cheaply but the productized cloud path (qwen3-coder fall-back) is multi-day port. - Background loops: analyzeErrors, consolidatePlaybooks, tailOverseerCorrections (read overseer_corrections.jsonl into the ring buffer once per cycle). - escalateFailureClusterToLLMTeam (failure clustering trigger that posts to LLM Team's /api/run with code_review mode). /relevance is NOT duplicated — already ported in 9588bd8 to internal/matrix/relevance.go (component 3 of SPEC §3.4). 16-smoke regression all green (D1-D6, G1, G1P, G2, storaged_cap, pathway, matrix, relevance, downgrade, playbook, observer). 13 binaries now: gateway, storaged, catalogd, ingestd, queryd, vectord, embedd, pathwayd, matrixd, observerd, mcpd, fake_ollama (plus catalogd-only test build). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:18:02 -05:00
root	6392772f41	C: bulk playbook record — operational rating wiring POST /v1/matrix/playbooks/bulk accepts an array of playbook entries and records each independently — failures per-entry don't abort the batch. Designed for two operational use cases: 1. Backfilling historical placement data into the playbook substrate (the Rust system has 4,701 fill operations recorded with embeddings; that data deserves to feed the Go learning loop without a 4,701-call procedural script). 2. Batched click-tracking from a session's worth of coordinator interactions, posted once at idle rather than per-click. Per-entry response shape: {index, playbook_id} on success or {index, error} on failure. Caller can inspect failures without diffing. Smoke (scripts/playbook_smoke.sh, new assertion #4): Bulk POST 3 entries: 2 valid (alpha→widget-a, bravo→widget-b) + 1 invalid (empty query_text). Verifies recorded=2, failed=1, the 2 valid ones get playbook_ids back, and the invalid one surfaces its validation error in-line. Single-record /matrix/playbooks/record from 06e7152 still works unchanged; bulk is additive. The corpus field can be set per- entry or once at the batch level (entry-level wins on collision). Per the small-model autonomous pipeline framing: this is the "the playbook gets denser with each iteration" mechanism. Click tracking → bulk POST → playbook entries → future similar queries get those answers boosted via the existing /matrix/search use_playbook path. The learning loop now has both inflows wired (single + bulk) — what remains is the demo UI shim that calls /feedback on result interaction (deferred — no Go demo UI yet). 15-smoke regression all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:10:13 -05:00
root	b199093d1f	B: matrix metadata filter — post-retrieval structured gate Addresses the reality-test gap surfaced by the candidates and multi-corpus e2e runs (0d1553c, a97881d): semantic-only retrieval can't gate by status / state / availability. SearchRequest now takes an optional MetadataFilter map; results whose metadata doesn't match every key are dropped before top-K truncation. Filter value semantics: string\|number\|bool → exact equality (JSON-canonical, so 1 ≡ 1.0) []any → OR within key (any element matching wins) AND across keys: every filter key must match. Missing key in metadata = drop. Malformed metadata = drop. Filter absent or empty = pass through (zero overhead). The response now reports MetadataFilterDropped so callers can see how aggressive the filter was without re-querying. Caveat (also captured in code comment): this is POST-retrieval, not PRE-filtering via SQL. Aggressive filters can shrink the result set below K; caller should bump PerCorpusK to compensate. A queryd- backed pre-filter is a future commit; this lands the user-visible fix today. Tests: - 7 unit tests (internal/matrix/filter_test.go) covering: nil/ empty filter pass-through, missing-metadata always-fails, single-value exact match (incl. numeric 5 ≡ 5.0), AND across keys, OR within list, bool match, malformed JSON metadata - matrix_smoke.sh: new assertion #7 — filter label∈{"a near","b near"} drops the 4 mid/far entries from the 6-entry pool, keeping exactly 2 (one per corpus, both with the matching label). Dropped count surfaces in the response. 15-smoke regression all green. vet clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:08:56 -05:00
root	7f42089521	D: embed-text iteration — clean negative finding (3 variants tested) Workers driver embed text reverted to V0 after testing 3 variants on the "Forklift operator with OSHA-30 certification, warehouse experience" reality-test query against 5000 workers (which contains 569 actual Forklift Operators per the 31b4088 probe). V0 (current, restored): "Worker role: <role>. Skills: ... Certifications: ... <resume_text>" → 6 workers in top-8, 0 Forklift Ops, top distance 0.327, top role "Production Worker" V4a (role-doubled): "<role>. <role> with <skills>. ..." drop archetype + resume_text → 6 workers in top-8, 0 Forklift Ops, top distance 0.254, top role "Production Worker" V4b (resume-only): just the resume_text natural-language sentence, no structured prefix → 4 workers in top-8 (WORSE mix — software-engineer candidates filled the displaced slots), 0 Forklift Ops, top distance 0.379 Conclusion: all three variants surface Production Workers / Machine Operators / Line Leads ABOVE Forklift Operators for this query. The 569 actual Forklift Operators in the 5000-row sample don't appear in any top-8. Embed-text design isn't the bottleneck — nomic-embed-text 137M's geometry doesn't separate "Forklift Operator" from "Production Worker" / "Machine Operator" / "Line Lead" in this query's neighborhood. Real fixes belong elsewhere: - Hybrid SQL+semantic (B): pre-filter by role/certs via queryd before semantic ranking. Addresses the gap directly. - Different embedding model: mxbai-embed-large or a staffing- fine-tuned model. Costs an Ollama model swap + re-embedding. - Playbook boost (component 5, already shipped): record successful Forklift placements; future queries surface those workers via similarity. Compounds with use. V0 restored because it has the best worker/candidate mix in top-8 (6 vs 4 in V4b), preserving the multi-corpus reality-test signal quality even if the role match is imperfect. Comments updated to record the experiment so future sessions don't relitigate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:58:39 -05:00
root	a730fc2016	scrum fixes: 4 real findings landed, 4 false positives dismissed Cross-lineage scrum review on the 12 commits of this session (afbb506..06e7152) via Rust gateway :3100 with Opus + Kimi + Qwen3-coder. Results: Real findings landed: 1. Opus BLOCK — vectord BatchAdd intra-batch duplicates panic coder/hnsw's "node not added" length-invariant. Fixed with last-write-wins dedup inside BatchAdd before the pre-pass. Regression test TestBatchAdd_IntraBatchDedup added. 2. Opus + Kimi convergent WARN — strings.Contains(err.Error(), "status 404") was brittle string-matching to detect cold- start playbook state. Fixed: ErrCorpusNotFound sentinel returned by searchCorpus on HTTP 404; fetchPlaybookHits uses errors.Is. 3. Opus WARN — corpusingest.Run returned nil on total batch failure, masking broken pipelines as "empty corpora." Fixed: Stats.FailedBatches counter, ErrPartialFailure sentinel returned when nonzero. New regression test TestRun_NonzeroFailedBatchesReturnsError. 4. Opus WARN — dead var _ = io.EOF in staffing_500k/main.go was justified by a fictional comment. Removed. Drivers (staffing_500k, staffing_candidates, staffing_workers) updated to handle ErrPartialFailure gracefully — print warn, keep running queries — rather than fatal'ing on transient hiccups while still surfacing the failure clearly in the output. Documented (no code change): - Opus WARN: matrixd /matrix/downgrade reads LH_FORCE_FULL_ENRICHMENT from process env when body omits it. Comment now explains the opinionated default and points callers wanting deterministic behavior to pass the field explicitly. False positives dismissed (caught and verified, NOT acted on): A. Kimi BLOCK on errors.Is + wrapped error in cmd/matrixd:223. Verified false: Search wraps with %w (fmt.Errorf("%w: %v", ErrEmbed, err)), so errors.Is matches the chain correctly. B. Kimi INFO "BatchAdd has no unit tests." Verified false: batch_bench_test.go has BenchmarkBatchAdd; the new dedup test TestBatchAdd_IntraBatchDedup adds another. C. Opus BLOCK on missing finite/zero-norm pre-validation in cmd/vectord:280-291. Verified false: line 272 already calls vectord.ValidateVector before BatchAdd, so finite + zero- norm IS checked. Pre-validation is exhaustive. D. Opus WARN on relevance.go tokenRe (Opus self-corrected mid-finding when realizing leading char counts toward token length). Qwen3-coder returned NO FINDINGS — known issue with very long diffs through the OpenRouter free tier; lineage rotation worked as designed (Opus + Kimi between them caught everything Qwen would have). 15-smoke regression sweep all green (D1-D6, G1, G1P, G2, storaged_cap, pathway, matrix, relevance, downgrade, playbook). Unit tests all green (corpusingest +1, vectord +1). Per feedback_cross_lineage_review.md: convergent finding #2 (404 detection) is the highest-signal one — both Opus and Kimi flagged it independently. The other Opus findings stand on single-reviewer signal but each one verified against the actual code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:42:39 -05:00
root	06e71520c4	matrix: playbook memory + boost — SPEC §3.4 component 5 of 5 (LEARNING LOOP) Closes SPEC §3.4. The matrix indexer is now a learning meta-index per feedback_meta_index_vision.md — every successful (query → answer) pair recorded via /matrix/playbooks/record boosts that answer for future similar queries. This is the architectural piece that lifts vectord from "static hybrid search" to the meta-index J originally framed in Phase 19 of the Rust system. What's new: - internal/matrix/playbook.go — PlaybookEntry, PlaybookHit, ApplyPlaybookBoost. Pure-function boost math: distance' = distance * (1 - 0.5 * score) Score 0 = no boost (factor 1.0); score 1 = halve distance (factor 0.5). Capped at 0.5 deliberately so a single high- confidence playbook can't dominate the base ranking forever (runaway-feedback-loop guard). - Retriever.Record(entry, corpus) — embeds query_text, ensures playbook corpus exists (idempotent), upserts via deterministic sha256-derived ID (last score wins on re-record of same triple). - Retriever.Search extended with UsePlaybook + PlaybookCorpus + PlaybookTopK + PlaybookMaxDistance. Reuses the query vector — no extra embed call. Missing-corpus 404 = no-op (cold-start state before any Record call), not an error. - POST /v1/matrix/playbooks/record (matrixd) — caller submits {query_text, answer_id, answer_corpus, score, tags?}; gets {playbook_id} back. Storage: a vectord index named "playbook_memory" (configurable per request) with embed(query_text) as the vector and the PlaybookEntry JSON as metadata. Just another corpus — observable from /vectors/index, persistable through G1P, etc. Match key for boost: (AnswerID, AnswerCorpus). Cross-corpus ID collisions don't false-match — verified by TestApplyPlaybookBoost_CorpusAttributionRespected. End-to-end smoke (scripts/playbook_smoke.sh, all assertions PASS): - Baseline search: widget-c at distance 0.6566 (rank 3) - Record playbook: query → widget-c, score=1.0 - Re-search with use_playbook=true: widget-c distance: 0.3283 (rank 2) ratio: 0.5 EXACTLY (matches boost math precisely) playbook_boosted: 1 - widget-c jumped from #3 to #2 — learning loop visible Tests: - 8 unit tests in internal/matrix/playbook_test.go covering Validate, BoostFactor (5 cases), the no-boost identity, the boost-moves-result-up scenario, highest-score wins on duplicate matches, cross-corpus attribution, JSON round-trip, and rejection of empty metadata - scripts/playbook_smoke.sh integration test (3 assertions PASS) 15-smoke regression sweep all green (D1-D6, G1, G1P, G2, storaged_cap, pathway, matrix, relevance, downgrade, playbook). SPEC §3.4 NOW COMPLETE: 5 of 5 components shipped. The matrix indexer's port is done as a substrate; remaining work is operational (rating signal sources, telemetry, eventual structured filtering for staffing data — none in §3.4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:34:24 -05:00
root	31b408882b	multi_corpus_e2e: WORKERS_LIMIT env knob — and the embed-text-not-sample-size finding Adds WORKERS_LIMIT env override (default 5000) so the e2e can be re-run at different sample sizes. Tiny change; the interesting part is the FINDING that motivated the run. Investigation: a97881d's reality test put zero Forklift Operators in the top-6 for "Forklift operator with OSHA-30 certification, warehouse experience" — instead returned Production Worker / Machine Operator / Assembler. Hypothesis tested: maybe the 5000-row sample didn't contain forklift operators in retrievable density. Result: hypothesis falsified. Direct probe of workers_500k.parquet: All 500K rows → 55,349 Forklift Operators (11.07%) → 150,328 with "forklift" in certs → 74,852 with OSHA-30 specifically First 5K rows → 569 Forklift Operators (11.38%) → distribution matches global, no ordering bias So 569 forklift operators were IN the corpus the matrix indexer searched and STILL didn't surface in top-6. That means the bottleneck isn't sample size — it's nomic-embed-text + our embed-text template ranking "Production Worker" / "Machine Operator" / "Assembler" as semantically nearer to the query than literal "Forklift Operator". The reality test exposed this faithfully. Three real follow-ups, none in scope of this commit: 1. Embed text design — front-loading role + certs (currently "Worker role: <role>" then skills then certs) might dominate retrieval better. Worth A/B-testing. 2. Hybrid SQL+semantic — pre-filter by role/certs via queryd before semantic ranking. Not in SPEC §3.4 today; would address the "available" / "Chicago" gap from the candidates reality test (0d1553c) too. 3. Playbook-memory boost — SPEC §3.4 component 5. When a query "Forklift OSHA-30" was answered with worker w-X in the past, boost w-X's score for similar future queries. The retrieval gap CAN be bridged by the learning loop without changing the base embedder. Commits the env knob; the finding lives in the commit body so future sessions don't re-run the sample-size hypothesis. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:26:32 -05:00
root	a97881d80c	workers corpus + multi-corpus reality test — matrix indexer end-to-end Lands the second real-data corpus (workers_500k) and the first multi-corpus reality test through /v1/matrix/search composing both corpora live. What's new: - scripts/staffing_workers/main.go — parquet driver over workers_500k.parquet, multi-chunk arrow handling (workers parquet has multiple row groups vs candidates' one). Embed text: role + skills + certifications + city + state + archetype + resume_text. IDs prefixed "w-". - scripts/multi_corpus_e2e.sh — first end-to-end test composing both corpora through the matrix indexer. Real-data multi-corpus result (this commit): Query: "Forklift operator with OSHA-30 certification, warehouse experience" Corpora: workers (5000 rows) + candidates (1000 rows) Merged top-8: workers=6, candidates=2 Top hits: w d=0.327 w-4573 Production Worker w d=0.353 w-1726 Machine Operator w d=0.362 w-3806 Production Worker w d=0.366 w-1000 Machine Operator w d=0.374 w-1436 Assembler w d=0.395 w-162 Machine Operator c d=0.440 c-CAND-00727 C#,.NET,Azure c d=0.446 c-CAND-00031 React,TypeScript,Node The matrix indexer correctly chose the right domain — manufacturing/ warehouse roles in workers (correct semantic match for the staffing query) rank ABOVE software-engineer candidates from the candidates corpus. 0.11 gap between the worst worker (0.395) and the best candidate (0.440) — clean distance separation. Compared to the candidates-only e2e run from 0d1553c: candidates-only top: c-CAND-00727 at d=0.4404 multi-corpus top: w-4573 at d=0.3265 (a Production Worker) That's the matrix indexer's whole point made visible: composing domain-distinct corpora surfaces better matches than single-corpus search. Without workers in the search space, the staffing query returned software engineers (wrong domain). With workers, it returns roles in the right ballpark. What's still imperfect (signal for component 5 + future work): - No top-6 worker actually has "Forklift" or "OSHA-30" visible in metadata; "Production Worker" is semantically nearest in this sample. Likely needs a larger workers ingest (5000 from 500K) or skill-keyword boost. - Status/availability still not gated. The staffing-side structured filtering gap from 0d1553c persists; relevance filter (CODE-aware) doesn't address it. Pipeline timings: workers ingest: 5000 rows / 19.2s = 260/sec end-to-end candidates ingest: 1000 rows / 3.1s = 322/sec multi-corpus query (text → embed → 2 parallel vectord → merge): 14ms 14-smoke regression sweep all green (D1-D6, G1, G1P, G2, storaged_cap, pathway, matrix, relevance, downgrade). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:22:16 -05:00
root	3968ec8a7b	matrix: strong-model downgrade gate — SPEC §3.4 component 4 of 5 Pure-Go port of mode.rs::execute's pass5 downgrade gate (Rust 2026-04-26). Adds POST /v1/matrix/downgrade endpoint via matrixd. The gate captures the pass5 finding: composing matrix corpora into codereview_lakehouse on a strong model LOST 5/5 head-to-head reps against matrix-free codereview_isolation on grok-4.1-fast (p=0.031). Strong models have enough native capacity that bug fingerprints + adversarial framing + file content carry them; matrix chunks displace depth-of-analysis. Logic (matches Rust mode.rs:614-632): if mode == codereview_lakehouse && !forced_mode && !LH_FORCE_FULL_ENRICHMENT && !is_weak_model(model) → flip to codereview_isolation, record downgraded_from is_weak_model captures the empirical weak-list: - `:free` suffix or `:free/` infix (OpenRouter free tier) - qwen3.5:latest, qwen3:latest (local last-resort rungs) - everything else → strong by default Tests: - 3 unit tests in internal/matrix/downgrade_test.go: IsWeakModel coverage, MaybeDowngrade truth table (5 rows), forced-mode precedence (forced beats every other bypass) - scripts/downgrade_smoke.sh: 6 assertions through gateway covering all 5 truth-table rows + empty-mode 400 14-smoke regression sweep all green (D1-D6, G1, G1P, G2, storaged_cap, pathway, matrix, relevance, downgrade). SPEC §3.4 progress: 4 of 5 components shipped (corpus builders, multi-corpus retrieve+merge, relevance filter, downgrade gate). Last component is learning-loop integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:17:55 -05:00
root	9588bd82ae	matrix: relevance filter — SPEC §3.4 component 3 of 5 Faithful port of mcp-server/relevance.ts (Rust observer's adjacency- pollution filter). Same 5-signal scoring, same default threshold 0.3. Adds POST /v1/matrix/relevance endpoint via matrixd. Scoring signals (additive, can sign-flip): path_match +1.0 chunk source/doc_id encodes focus.path filename_match +0.6 chunk text mentions focus's filename defined_match +0.6 chunk text mentions focus.defined_symbols token_overlap +0.4 jaccard of non-stopword tokens prefix_match +0.3 chunk source shares first-2-segment prefix import_penalty -0.5 mentions ONLY imported symbols, no defined ones What this does and doesn't do: - DOES filter code-aware corpora (eventually lakehouse_arch_v1, lakehouse_symbols_v1, scrum_findings_v1) — drops chunks about code the focus file IMPORTS rather than DEFINES, the "adjacency pollution" pattern that makes a reviewer LLM hallucinate imported-crate internals as belonging to the focus - DOES NOT meaningfully filter staffing data — the candidates reality test 2026-04-29 had "exact skill match buried at #3" which is a different problem (semantic-only ranking dominated by secondary text). Staffing needs structured filtering (status gates, location gates) that lives outside this package — future work, not in SPEC §3.4 yet Headline smoke assertion: focus = crates/queryd/src/db.go which defines Connector and imports catalogd::Registry. The filter scores: Connector chunk: +0.68 (defined_match fires, kept) Registry chunk: -0.46 (import_only penalty fires, dropped) unrelated junk: 0.00 (no signals, dropped) That's a 1.14-point gap between what we ARE and what we IMPORT — the entire purpose of the filter. Tests: - 9 unit tests in internal/matrix/relevance_test.go covering Tokenize, Jaccard, ExtractDefinedSymbols (Rust + TS), ExtractImportedSymbols, FilePrefix, ScoreRelevance per-signal, FilterChunks threshold splitting, and the headline AdjacencyPollutionScenario - scripts/relevance_smoke.sh integration smoke (3 assertions PASS): adjacency-pollution scenario, empty-chunks 400, threshold honored 13-smoke regression sweep all green (D1-D6, G1, G1P, G2, storaged_cap, pathway, matrix, relevance). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:13:22 -05:00
root	0d1553ca88	candidates corpus: first deep-field reality test on real staffing data Lands the second staffing corpus and the first end-to-end reality test through the full Go pipeline: parquet → corpusingest → embedd → vectord → matrixd → gateway. What's new: - scripts/staffing_candidates/main.go — parquet Source over candidates.parquet (1000 rows, 11 cols), single-chunk arrow-go pqarrow read. Embed text: "Candidate skills: <s>. Based in <city>, <state>. <years> years experience. Status: <status>. <first> <last>." IDs prefixed "c-" so multi-corpus merges against workers ("w-") stay unambiguous. - scripts/candidates_e2e.sh — first integration smoke that runs the full stack (storaged + embedd + vectord + matrixd + gateway), ingests via corpusingest, runs a real query through /v1/matrix/search, prints results. Ephemeral mode (vectord persistence disabled via custom toml) so re-runs don't pollute MinIO _vectors/ and break g1p_smoke's "only-one-persisted-index" assertion. Real bug caught + fixed in corpusingest: When LogProgress > 0, the progress goroutine's only exit was ctx.Done(). With context.Background() in the production driver, Run hung forever after the pipeline finished. Added a stopProgress channel that close()s after wg.Wait(). Regression test TestRun_ProgressLoggerExits bounds Run's wall to 2s with LogProgress=50ms. This is the bug the unit tests didn't catch because every prior test set LogProgress: 0. Reality test surfaced it on first real-data run — exactly the hyperfocus-and-find-architectural-weakness property J framed as the reason for the Go pass. End-to-end output (1000 candidates, query "Python AWS Docker engineer in Chicago available now"): populate: scanned=1000 embedded=1000 added=1000 wall=3.5s matrix returned 5 hits in 26ms The result quality is the interesting signal: top-5 had ZERO Chicago candidates, ZERO active-status candidates, and the exact- skill-match (Python,AWS,Docker) ranked #3 not #1. Pipeline works; retrieval quality has real architectural limits (no structured filtering, no relevance gate, semantic-only ranking dominated by secondary signals like "1 year experience" and "engineer"). This motivates SPEC §3.4 components 3 (relevance filter) and eventually structured filtering — exactly the kind of finding the deep field reality tests are supposed to surface before Enterprise cutover. 12-smoke regression sweep all green. 9 corpusingest unit tests including the new regression. vet clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:06:27 -05:00

1 2

67 Commits