golangLAKEHOUSE

Author	SHA1	Message	Date
root	95f155b017	real_006: distribution-shift test on rows 10-59 of fill_events

Methodology fix: gen_real_queries.go gains -offset N flag. Every prior real_NNN test sourced queries from rows 0-9 of fill_events.parquet (default -limit 10), so the substrate's published "8/10 cold-pass top-1 = judge-best" was measured on a memorized slice, not held-out data. real_006 samples 50 fresh rows (offset 10, never seen by the workers or ethereal_workers corpora). Same harness, same local qwen2.5:latest judge, same K=10. ~14 min wall total. Local-only, no cloud calls.

Headline findings:
- Cold-pass top-1 = judge-best (rank match): 41/50 (82%) vs real_001's 8/10 (80%); substrate generalizes at rank level.
- Strict (rating ≥ 2): 34/50 (68%), a 12-point drop from real_001's 80%. ~7 of 41 "no-discovery" queries had cold top-1 the judge rated 1; the corpus has gaps for some role-city combos in the v3 slice.
- Verbatim lift: 9/9 discoveries → warm top-1 (clean, matches real_001 2/2)
- Paraphrase recovery: 6/9 → top-1, 9/9 any-rank
- Quality regressed: 3/50; Q43 is the structural one

Q43 (Packer at Midway Distribution / Chicago IL) regressed from rating 5 to rating 2 on warm pass with `warm_boosted_count=0` and `playbook_recorded=false`. Q18 (Shipping Clerks at the same client+city) recorded a playbook entry. The regression suggests Q18's recording leaked into Q43 via the warm-pass playbook corpus retrieval surface even though the role gate from real_002 should have blocked it. Three possible paths: extractor failed on one query, gate fires on boost path but not Shape B inject, or cosine drift puts the recorded worker close enough to Q43's embedding that warm-pass retrieval picks it up directly. Diagnosis is the next move.

Three same-(client, city) clusters tested:
- Heritage Foods Gary IN × 3 distinct roles: clean, distinct workers
- Riverfront Steel Columbus OH × 4: cosine-level confusion (Q9/Q25 surface same worker w-281 for Assemblers vs Quality Techs at cold-pass), but no playbook bleed
- Midway Distribution Chicago IL × 3: Q43 regression as above

What this confirms: substrate works on the fresh distribution at the rank level, verbatim lift is real, paraphrase recovery is real.

What this falsifies: real_002's role-gate fix is not structurally airtight. The bleed pattern can still fire under conditions the prior tests didn't reach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 04:54:03 -05:00
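
The -offset/-limit window described in real_006 can be illustrated with a minimal Go sketch. This is not the code in gen_real_queries.go; the `window` helper and the integer stand-in rows are assumptions made only to show how offset 10 with limit 50 selects rows 10-59 while the earlier default covered rows 0-9.

```go
package main

import "fmt"

// window returns rows[offset : offset+limit], clamped to the slice bounds.
// Hypothetical helper: the real flag handling in gen_real_queries.go is not shown in the log.
func window[T any](rows []T, offset, limit int) []T {
	if offset < 0 {
		offset = 0
	}
	if offset >= len(rows) || limit <= 0 {
		return nil
	}
	end := offset + limit
	if end > len(rows) {
		end = len(rows)
	}
	return rows[offset:end]
}

func main() {
	// Stand-in for parquet rows: row i is just the integer i.
	rows := make([]int, 100)
	for i := range rows {
		rows[i] = i
	}

	seen := window(rows, 0, 10)   // slice used by the earlier real_NNN runs
	fresh := window(rows, 10, 50) // held-out slice used by real_006

	fmt.Println(seen[0], seen[len(seen)-1])
	fmt.Println(fresh[0], fresh[len(fresh)-1])
}
```

Clamping rather than erroring is a design choice of this sketch; a generator that must guarantee exactly N queries might prefer to fail when the file is too short.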
root	cca32344f3	reality_test real_005: negation probe; substrate gap is correctly out-of-scope

5 explicit-negation queries ("Need Forklift Operators in Aurora IL, NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.) through the standard playbook_lift harness. Goal: characterize whether the substrate has negation handling or silently treats "NOT X" as "X".

Headline: substrate has zero negation handling. Cosine similarity on dense embeddings treats "NOT in Detroit" as nearly identical to "in Detroit" plus noise; there is no logical-quantifier representation in the embedding space. This is a structural property of dense embeddings, not a substrate bug.

Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5, accidentally OK because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5, accidentally OK

The judge IS the safety net: when retrieval can't honor the constraint, the judge refuses to approve any result. That's the honesty signal; `discovery=0` for the run aggregates it.

No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs (already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation; they click
- Substrate's role: surface honesty signal (judge ratings) + don't pretend to honor unparseable constraints

Adding NL-negation handling at the substrate level would be product debt: it would let coordinators type sloppier queries that silently fail when the LLM extractor misses a phrasing. Don't ship until production traffic demonstrates demand for it.

Findings: reports/reality-tests/real_005_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 23:06:06 -05:00
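
The "click to exclude" answer above amounts to a structured post-filter rather than language understanding. A rough Go sketch of that idea follows; the `Hit` type and `filterExcluded` function are hypothetical and only the name ExcludeIDs comes from the commit message, so the real request shape may differ.

```go
package main

import "fmt"

// Hit is a hypothetical retrieval result; lower Distance means a closer match.
type Hit struct {
	ID       string
	Distance float64
}

// filterExcluded drops every hit whose ID appears in excludeIDs,
// preserving the order of the remaining hits.
func filterExcluded(hits []Hit, excludeIDs []string) []Hit {
	if len(excludeIDs) == 0 {
		return hits
	}
	blocked := make(map[string]struct{}, len(excludeIDs))
	for _, id := range excludeIDs {
		blocked[id] = struct{}{}
	}
	kept := make([]Hit, 0, len(hits))
	for _, h := range hits {
		if _, drop := blocked[h.ID]; !drop {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	// Illustrative IDs and distances, not taken from any run.
	hits := []Hit{{"w-1", 0.10}, {"e-2", 0.12}, {"w-3", 0.20}}
	for _, h := range filterExcluded(hits, []string{"e-2"}) {
		fmt.Println(h.ID, h.Distance)
	}
}
```

Because the exclusion is an exact ID match, it cannot silently fail the way an extractor that misses a phrasing could.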
root	3263254f1c	reality_test real_003: 40-query paraphrase stress + extractor extension

Stress-tests the role gate with 40 queries (10 fill_events rows × 4 styles): need, client_first, looking, shorthand. Each row's role + client + city stays the same; only the surface phrasing changes.

real_003 (original extractor) confirmed the shorthand-vs-shorthand failure mode: CNC Operator shorthand recording leaked w-2404 onto Forklift Operator shorthand query within the same Beacon Freight Detroit cluster. Both record + query had empty role (extractor returns "" for shorthand because there's no separator between role and city), gate disabled, distance check passed, bleed fired.

Fix: extended extractRoleFromNeed to handle client_first ("{client} needs N {role} in...") and looking ("Looking for N {role} at...") patterns. Shorthand left intentionally unmatched: "Forklift Operator Detroit" is shape-indistinguishable from "Forklift" + "Operator Detroit" without an LLM extractor or known-cities lookup.

real_003b (extended extractor) verifies bleed closed across all 4 styles for this dataset. Forklift Operator queries keep w-2136 (the cold-pass-correct match) regardless of which style the query came in. Same-role boosts now fire correctly across styles: a CNC Operator recording made in `looking` style boosts the CNC need-form query.

scripts/cutover/gen_real_queries.go: added -styles flag with values need|client_first|looking|shorthand|all (default need preserves real_001/002 behavior). tests/reality/real_coord_queries_v2.txt is the 40-query stress file.

scripts/playbook_lift/main_test.go: 10 sub-tests lock the four documented patterns + shorthand limitation + lift-suite-style queries (no clean role, returns empty as expected).

Aggregate metrics:
- real_003 (original): disc=7, lift=7, boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202

The growth reflects more LEGITIMATE same-role same-cluster transfer firing across styles, not bleed (verified by per-cluster bleed table; Forklift Operator queries unchanged across all 4 styles).

Known limitation documented in real_003_findings.md: same-cluster, same-role queries in shorthand still embed close enough that a shorthand recording could bleed onto a different-role shorthand query if both record + query strip role. Closing this requires LLM extraction or known-cities lookup at record + query time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:42:02 -05:00
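
The pattern-based role extraction described in real_003 might look roughly like the Go sketch below. It is not the repository's extractRoleFromNeed: the function name `extractRoleSketch`, the exact regular expressions, and the lowercasing are assumptions, and plural normalization ("Operators" vs "Operator") is deliberately left out. It does reproduce the documented behavior that shorthand yields an empty role.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// One pattern per documented query style. Shorthand has no pattern on purpose:
// "Forklift Operator Detroit" carries no separator between role and city.
var rolePatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)^need\s+\d+\s+(.+?)\s+in\s+`),          // need
	regexp.MustCompile(`(?i)^.+?\s+needs\s+\d+\s+(.+?)\s+in\s+`),   // client_first
	regexp.MustCompile(`(?i)^looking\s+for\s+\d+\s+(.+?)\s+at\s+`), // looking
}

// extractRoleSketch returns the lowercased role text, or "" when no style matches.
func extractRoleSketch(query string) string {
	q := strings.TrimSpace(query)
	for _, re := range rolePatterns {
		if m := re.FindStringSubmatch(q); m != nil {
			return strings.ToLower(strings.TrimSpace(m[1]))
		}
	}
	return ""
}

func main() {
	queries := []string{
		"Need 3 Forklift Operators in Detroit MI starting at 06:00 for Beacon Freight",
		"Beacon Freight needs 2 CNC Operator in Detroit MI",
		"Looking for 4 Pickers at Beacon Freight in Detroit MI",
		"Forklift Operator Detroit",
	}
	for _, q := range queries {
		fmt.Printf("%q -> %q\n", q, extractRoleSketch(q))
	}
}
```

An empty return is the signal a caller would need to treat carefully, since the commit notes that an empty role on both record and query is what disabled the gate in the original real_003 run.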
root	7f2f112e6a	reality_test real_001: real-shape coordinator queries; surfaces cross-role bleed

First retrieval probe with non-synthetic query distribution. Pulls N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet (real-shape demand data) and translates each to the natural language a coordinator would type: "Need {count} {role}s in {city} {state} starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution. Substrate works on queries it was never trained for. v2-moe + workers corpus carry the load.

Surfaced finding (the real value of running this): same-client+city queries cluster, and Shape A's distance boost bleeds across roles within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1 even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating 4 at rank 0) for a worker who was approved by the judge for a different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand clusters on (client, city). The cluster doesn't exist in the synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is per-injection at record time. After approval the worker rides Shape A distance boosts on all later same-cluster queries with no second gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus metadata + Shape A boost gate on role match. Cheap; doesn't need new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
  go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
  QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
    WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:18:40 -05:00
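
The fix candidate named in real_001 (role-scoped playbook metadata plus a boost gate on role match) can be sketched in Go as follows. Everything here is hypothetical: the `Hit` and `Recording` types, the `boostWithRoleGate` function, the multiplicative boost factor, and the distances are illustrative, and the sketch also refuses to boost when the query role is empty, which is an assumption rather than the documented behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one retrieval result; lower Distance means a closer match.
type Hit struct {
	ID       string
	Distance float64
}

// Recording is a hypothetical playbook entry tagged with the role it was approved for.
type Recording struct {
	WorkerID string
	Role     string
}

// boostWithRoleGate shrinks the distance of recorded workers, but only when the
// recording's role equals a non-empty query role. It returns a re-sorted copy.
func boostWithRoleGate(hits []Hit, recs []Recording, queryRole string, factor float64) []Hit {
	out := append([]Hit(nil), hits...)
	for i := range out {
		for _, r := range recs {
			sameRole := queryRole != "" && r.Role == queryRole
			if r.WorkerID == out[i].ID && sameRole {
				out[i].Distance *= factor
				break
			}
		}
	}
	sort.SliceStable(out, func(a, b int) bool { return out[a].Distance < out[b].Distance })
	return out
}

func main() {
	// IDs echo the commit message; the distances are made up for illustration.
	cold := []Hit{{"w-3759", 0.30}, {"e-6193", 0.35}}
	recs := []Recording{{"e-6193", "forklift operator"}}

	fmt.Println(boostWithRoleGate(cold, recs, "cnc operator", 0.5)[0].ID)      // gate blocks the boost
	fmt.Println(boostWithRoleGate(cold, recs, "forklift operator", 0.5)[0].ID) // same role, boost applies
}
```

With the gate in place, a CNC Operator query keeps its cold-pass top-1 even though a forklift recording exists in the same client+city cluster.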
root	0fa42a0cc3	multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover

Phase 1 had two known gaps: (1) the 3 contracts had zero shared role names, so same-role-across-contracts Jaccard was vacuous (n=0); (2) the verbatim handover at 100% was the trivial case, not the hard learning test (paraphrased queries against another coord's playbook). Both fixed in this commit.

Contract redesign: all 3 contracts now share warehouse worker / admin assistant / heavy equipment operator roles, plus a unique specialist per contract (industrial electrician / bilingual safety coord / drone surveyor, the "specialist not on the standard roster" case from J's spec). Counts and skill mixes vary per region.

New driver phase 4b: paraphrase handover. Bob runs qwen2.5-paraphrased versions of Alice's contract queries against Alice's playbook namespace. Tests whether institutional memory propagates across coordinators AND across natural wording variation that Bob would introduce when running Alice's contract.

Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3 coords + paraphrase handover):

Diversity (the question J asked: locking or cycling?):
  Same-role-across-contracts Jaccard = 0.119 (n=9)
    → 88% of workers DIFFER across regions for the same role name. Milwaukee warehouse vs Indianapolis warehouse vs Chicago warehouse pull mostly distinct top-K from the same population. The system locks into geo+cert+skill context, not cycling.
  Different-roles-same-contract Jaccard = 0.004 (n=18)
    → role-specific retrieval works (unchanged from Phase 1).

Determinism:
  Jaccard = 1.000 (n=12), unchanged.

Learning:
  Verbatim handover 4/4 = 100% (trivial case, expected)
  Paraphrase handover 4/4 = 100% (HARD case, passes!)

Of those 4 paraphrase recoveries:
- 2 used boost (Alice's recording was already in Bob's paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
- 2 used Shape B inject (recording wasn't in Bob's paraphrase top-K; InjectPlaybookMisses brought it in)

The boost/inject mix is healthy; both paths are used and both produce correct top-1s. Multi-coord institutional memory propagation is empirically working under wording variation.

Sample warehouse worker top-1s across contracts (proves diversity):
  alice / Milwaukee → w-713
  bob / Indianapolis → e-8447
  carol / Chicago → e-7145

Three different workers from the same 15K-person population, selected on geo+cert+skill context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:03:16 -05:00
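
The diversity and determinism figures above are Jaccard similarities between top-K ID sets. A minimal Go sketch of that metric, assuming the harness compares plain sets of worker IDs (the `jaccard` function and the sample IDs are illustrative, not taken from the driver):

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| over the distinct IDs in a and b.
// Two empty inputs return 0 here; that convention is a choice of this sketch.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	// Hypothetical top-K lists for the same role in two regions.
	milwaukee := []string{"w-1", "w-2", "w-3"}
	chicago := []string{"w-2", "w-3", "e-4"}

	fmt.Println(jaccard(milwaukee, chicago))   // partial overlap
	fmt.Println(jaccard(milwaukee, milwaukee)) // identical top-K, the determinism case
}
```

Under this definition a value near 0 means two queries pull almost disjoint worker sets, and 1.000 means identical top-K sets, which is how the determinism result reads.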
root	61c7b55e48	multi-coord stress harness: Phase 1 of 48-hour mock

Three coordinators (alice / bob / carol) with three contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction). 7-phase scenario runner: baseline → surge → merge → handover → split → reissue → analysis. Each coord has a separate playbook namespace (playbook_{name}) so institutional memory stays isolated by default but transferable on demand.

Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints, and Langfuse tracing; those are Phase 2/3.

Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):

Diversity:
  Different-roles-same-contract Jaccard = 0.004 (n=18)
    → role-specific retrieval is working perfectly. Different roles within one contract pull totally different worker pools. System is NOT cycling; locks into per-role retrieval.
  Same-role-across-contracts Jaccard = N/A (n=0)
    → TEST-DESIGN ISSUE: the 3 contracts use distinct role names per industry (warehouse worker / production worker / general laborer), so no exact-name overlaps exist. Phase 2 should either share at least one role across contracts OR add a skill-based diversity metric.

Determinism:
  Jaccard = 1.000 (n=12)
    → HNSW + Ollama retrieval is fully deterministic on identical query text. coder/hnsw + nomic-embed-text are stable.

Learning:
  handover hit rate = 4/4 = 100%
    → Bob inherits Alice's recordings perfectly when bob runs identical queries with alice's playbook namespace. CAVEAT: this tests the trivial verbatim case, not paraphrase handover. The harder test (bob runs paraphrased queries with alice's playbook) is Phase 2 work.

Per-event capture in JSON: every matrix.search response is logged with phase / coordinator / contract / role / query / top-K IDs + distances + per-corpus counts + boosted/injected counts.

Reviewable via:
  jq '.events[] | select(.phase == "merge")'
  jq '.events[] | select(.coordinator == "alice")'
  jq '.events[] | select(.role == "warehouse worker")'

Notable finding from per-event: carol's "general laborer" and "crane operator" queries both surface w-1009 as top-1, with crane operator at distance 0.098 (very tight) and general laborer at 0.297. The system found a worker who legitimately covers both roles, realistic for small construction crews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 07:55:29 -05:00
root	b2e45f7f26	playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)

The 5-loop substrate's load-bearing gate is verified: playbook + matrix indexer give the results we're looking for. Per the report's rubric, lift ≥ 50% of discoveries means matrix is doing real work; 7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot. Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:
1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark), the wrong domain for staffing queries; replaced with ethereal_workers (10K rows, real staffing schema, "e-" id prefix to avoid collision with workers' "w-"). staffing_workers driver gains -index-name + -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running ~30s per judge call against the lift loop; reverted to qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go so future drift fires in `go test`, not in a reality run. R-005 closed:
- cmd/matrixd/main_test.go (new): playbook record drift detector + score bounds + 6 routes mounted
- cmd/queryd/main_test.go: wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new): 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new): 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries            21 (staffing-domain, 7 categories)
  Discoveries        8 (judge ≠ cosine top-1)
  Lifts              7/8 (87.5%)
  Boosts triggered   9
  Mean Δ distance    -0.053 (warm closer than cold)
  OOD honesty        dental/RN/SWE rated 1, no fake matches
  Cross-corpus       boosts confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 06:22:21 -05:00
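
The field-name drift in fixes 1 and 3 (a client sending "query" where the daemon expects "query_text") is the kind of mismatch a strict decoder turns into an immediate error. The Go sketch below shows one way to do that with the standard library's `DisallowUnknownFields`; the `searchRequest` struct and `decodeStrict` function are hypothetical and are not the contents of the repository's main_test.go files.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// searchRequest is a hypothetical request shape; only the field name
// "query_text" comes from the commit message.
type searchRequest struct {
	QueryText string `json:"query_text"`
}

// decodeStrict rejects unknown fields, so a renamed or misspelled key
// fails loudly instead of decoding to an empty query.
func decodeStrict(body []byte) (searchRequest, error) {
	var req searchRequest
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return req, err
	}
	if req.QueryText == "" {
		return req, errors.New("query_text is required")
	}
	return req, nil
}

func main() {
	_, err := decodeStrict([]byte(`{"query_text": "forklift operators in Detroit"}`))
	fmt.Println("current field name:", err)

	_, err = decodeStrict([]byte(`{"query": "forklift operators in Detroit"}`))
	fmt.Println("drifted field name:", err)
}
```

A unit test asserting that the second call returns an error would fail in `go test` as soon as either side of the contract renames the field.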
root	3dd7d9fe30	reality-tests: playbook-lift harness (does the 5-loop substrate beat raw cosine?)

First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge rates top-K → record playbook entry pointing at the highest-rated result (which may NOT be top-1 by distance; that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go: driver (391 LoC)
- scripts/playbook_lift.sh: stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt: query corpus (5 placeholders; J writes real 20+)
- reports/reality-tests/README.md: framework + interpretation
- .gitignore: track reports/reality-tests/ but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for." Without ground-truth labels, the LLM judge is the proxy, the same small-model thesis applied to evaluation. Honest about that limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:22:36 -05:00
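
The two-pass bookkeeping above reduces to two checks per query: whether the judge's best result differs from the cosine top-1 (a discovery), and whether the recorded answer lands at top-1 on the warm pass (a lift). A minimal Go sketch under those definitions follows; the `Rated` type, the function names, the tie-breaking rule, and the sample IDs are assumptions, not the driver's actual code.

```go
package main

import "fmt"

// Rated is one cold-pass result with the judge's rating attached.
// Index 0 is the cosine top-1.
type Rated struct {
	ID     string
	Rating int
}

// judgeBest returns the index of the highest-rated result, or -1 for an
// empty list. Ties go to the lower rank (an assumption of this sketch).
func judgeBest(topK []Rated) int {
	best := -1
	for i, r := range topK {
		if best == -1 || r.Rating > topK[best].Rating {
			best = i
		}
	}
	return best
}

// isDiscovery reports whether the judge preferred something other than cosine top-1.
func isDiscovery(coldTopK []Rated) bool {
	return judgeBest(coldTopK) > 0
}

// isLift reports whether the recorded answer is top-1 on the warm pass.
func isLift(recordedID string, warmIDs []string) bool {
	return len(warmIDs) > 0 && warmIDs[0] == recordedID
}

func main() {
	// Hypothetical cold-pass ratings for one query.
	cold := []Rated{{"w-1", 2}, {"w-2", 5}, {"w-3", 3}}
	best := judgeBest(cold)

	fmt.Println(isDiscovery(cold), cold[best].ID)
	fmt.Println(isLift(cold[best].ID, []string{"w-2", "w-1", "w-3"}))
}
```

Counting lifts over discoveries gives the ratio the later reports quote, such as 7/8.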

8 Commits