golangLAKEHOUSE

Author	SHA1	Message	Date
root	b216b7e5b6	fix the other 4: close all OPEN-list items in one wave Substantial wave addressing all 4 prior OPEN items. Three closed in full, one partially (the speculative half deliberately deferred). OPEN #1 — Periodic fresh→main index merge (FULL): - POST /v1/vectors/index/{src}/merge with {dest, clear_source} - Idempotent on re-runs (existing-in-dest items skipped) - internal/vectord/index.go: new Index.IDs() snapshot method + i.ids tracker field as canonical ID set, independent of meta map's nil-vs-{} sparseness (was a real bug — IDs() backed by meta alone missed items added with nil metadata) - 4 cmd-level integration tests (happy path drain+clear, dim mismatch, dest not found, self-merge rejection) + 1 unit test - DecodeIndex backward-compat: old envelopes restore i.ids from meta keys (best effort; new items going forward use the tracker) OPEN #2 — Distillation SFT export (SUBSTRATE): - internal/distillation/sft_export.go ports the load-bearing half: IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/ MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft. - Synthesis (instruction/input/response generation) deferred to a separate wave — too big for this session, but the substrate makes the next wave a port-not-design exercise. - TestSftNever_PinsExpectedSet locks the contamination firewall set: if a future commit adds/removes from SftNever, this test fails — forcing the change through review. - 5 new tests; firewall fires end-to-end through the partial port. OPEN #3 — Distribution drift via PSI (FULL): - internal/drift/drift.go: ComputeDistributionDrift via Population Stability Index. Standard finance/risk metric, well-defined verdict tiers (stable < 0.10, minor 0.10–0.25, major ≥ 0.25). - Equal-width bucketing over combined min/max so neither dist falls outside; epsilon-clamping for empty buckets so log doesn't blow up. Per-bucket breakdown for drilldown. 
- Pairs with the existing ComputeScorerDrift: scorer drift is categorical, distribution drift is continuous. Different shapes, same package. - 7 new tests covering identical-is-stable, hard-shift-is-major, moderate-detected-not-stable, empty-inputs-safe, all-identical- safe, bucket-counts-conserved, num-buckets-clamping. OPEN #4 — Ops nice-to-haves (PARTIAL — wall-clock done, others deferred): - (a) Real-time wall-clock for stress harness: per-phase elapsed time logged to stdout as it runs (`[stress] phase NAME starting (T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`). Output.PhaseTimings + Output.TotalElapsedMs in JSON. - (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase calibration: not actioned — no fired trigger, would be speculative. Documented as deferred-until-need rather than ignored. Per the project's discipline ("don't add features beyond what the task requires"). OPEN list now empty / steady-state. Future items will land as production triggers fire. Build + vet + tests green; 18 new tests across the 4 closures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 23:42:11 -05:00
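The PSI computation described in OPEN #3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in internal/drift/drift.go: the names `PSI`, `Verdict`, and `psiEpsilon` and the epsilon value are hypothetical. Only the equal-width bucketing over the combined min/max, the epsilon clamp, and the verdict tiers come from the commit message.

```go
package main

import (
	"fmt"
	"math"
)

// psiEpsilon clamps empty-bucket proportions so math.Log stays finite.
// The value is an assumption made for this sketch.
const psiEpsilon = 1e-6

// PSI computes the Population Stability Index between a baseline and a
// current sample, using equal-width buckets over the combined min/max.
func PSI(baseline, current []float64, numBuckets int) float64 {
	if len(baseline) == 0 || len(current) == 0 || numBuckets < 1 {
		return 0 // empty inputs are treated as "no drift" in this sketch
	}
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, sample := range [][]float64{baseline, current} {
		for _, v := range sample {
			lo, hi = math.Min(lo, v), math.Max(hi, v)
		}
	}
	if hi == lo {
		return 0 // all values identical: nothing to compare
	}
	width := (hi - lo) / float64(numBuckets)
	proportions := func(xs []float64) []float64 {
		p := make([]float64, numBuckets)
		for _, v := range xs {
			i := int((v - lo) / width)
			if i >= numBuckets {
				i = numBuckets - 1 // the maximum lands in the last bucket
			}
			p[i]++
		}
		for i := range p {
			p[i] = math.Max(p[i]/float64(len(xs)), psiEpsilon)
		}
		return p
	}
	b, c := proportions(baseline), proportions(current)
	var psi float64
	for i := range b {
		psi += (c[i] - b[i]) * math.Log(c[i]/b[i])
	}
	return psi
}

// Verdict maps a PSI value onto the tiers quoted in the commit message.
func Verdict(psi float64) string {
	switch {
	case psi < 0.10:
		return "stable"
	case psi < 0.25:
		return "minor"
	default:
		return "major"
	}
}

func main() {
	base := []float64{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}
	shifted := []float64{0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0, 1.0}
	fmt.Println(Verdict(PSI(base, base, 4)), Verdict(PSI(base, shifted, 4)))
}
```

Bucketing over the combined range keeps every value of both samples inside some bucket, which is why only the top edge needs a clamp.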
root	356d76b4b0	multi_coord_stress: thread role through matrix retrieve + playbook record Real wire-up gap discovered post-scrum: Demand.Role was already extracted at every call site in multi_coord_stress (44 occurrences, both contract-driven and LLM-parsed inbox-triggered paths), but neither matrixSearch nor playbookRecord accepted role in their signatures. Cross-role gate (real_001..real_004 work) was bypassed for the entire multi-coord harness — recordings and queries went through with empty role, gate fell back to lenient behavior. Fix: - matrixSearchReq gains query_role field - matrixSearch signature: (..., query, role string, ...) - tracedSearch wrapper gains role param + emits it in span input metadata for Langfuse visibility - playbookRecord signature: (..., query, role, ...) — body emits role only when non-empty (preserves backward compat at API) - 14 call sites updated: contract-driven Demand loops → d.Role LLM-parsed inbox path → parsed.Role (qwen2.5 already extracts it) swap path (warehouseDemand) → warehouseDemand.Role reissue path → ev.Role (captured at original event time) fresh-verify (resume snippet, no role concept) → "" Build clean, vet clean, all tests pass. Cross-role gate now fires end-to-end across the multi-coord harness — matches the playbook_lift harness's coverage from the original real_001 fix. This closes the symmetric gap to scripts/playbook_lift's existing wire-through. Both production-shape harnesses now exercise the role gate; future reality tests automatically inherit the protection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 23:10:49 -05:00
root	6c93a38093	scrum multi_coord_phase3: 4 fixes from cross-lineage review Cross-lineage scrum on bundle 87cbd10..f971e64 (3,652 lines) produced 4 actionable findings, all defensive hardening. 1. (Opus WARN) internal/langfuse/client.go:queue Synchronous Flush at maxBatch threshold blocked the calling goroutine for the full 5s HTTP timeout when Langfuse hiccupped, defeating the "best-effort, never blocks calling path" contract in the package doc. Now fire-and-forget via goroutine. 2. (Opus + Kimi convergent) cmd/observerd/main.go:handleInbox - Free-form priority string was accepted; "nonsense" passed through unchecked. Now closed enum: urgent\|high\|medium\|low (+ empty defaults to medium). Tested: TestInbox_RejectsBadPriority. - No size cap on body, only emptiness check; multi-MB payloads would bloat observer's ring + JSONL. Now 8 KiB cap returns 413. Tested: TestInbox_RejectsOversizedBody. - Subject/sender/tag concatenated into InputSummary without newline stripping; embedded \n could corrupt JSONL line-based parsers. New sanitizeInboxField strips \r\n + caps at 256 chars before interpolation. 3. (Opus INFO) scripts/multi_coord_stress/main.go Removed dead `must[T]` generic — tracedSearch took over the fail-fast role for matrix searches, so the helper became unused. 4. (Opus INFO) scripts/multi_coord_stress/main.go:Event `JudgeRating int` collapsed "judge errored" and "judge said unrated" both to 0. Changed to *int — nil = errored, 1-5 = verdict. judgeInboxResult still returns 0 on error; caller gates on > 0 before assigning. Dismissed (with rationale): - Opus WARN ExcludeIDs ordering: verified by code read — filter applies after sort + before top-K truncation as documented; no slot waste possible. - Opus INFO 10 prior-run reports contradict #011: those are point-in-time snapshots; intentional history. - Kimi INFO Langfuse error suppression: design intent (best-effort per package doc). 
- Kimi INFO contract schema validation: defer until contract count grows enough to make hand-edit drift a real risk. - Kimi INFO paraphrase prompt duplicated across lift + multi_coord: defer (lift to internal/paraphrase/ when a third consumer appears). - Qwen HOLD: single-line, no actionable finding. go test ./cmd/observerd ./internal/langfuse all green; multi_coord driver builds clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:42:07 -05:00
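The inbox hardening in finding 2 can be illustrated with a short sketch. Only the name `sanitizeInboxField`, the 256-character cap, and the priority values come from the commit message. `normalizePriority` is a hypothetical helper, and counting the cap in runes and deleting (rather than replacing) CR/LF are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

const maxInboxFieldLen = 256 // cap quoted in the commit message

// sanitizeInboxField removes CR and LF so a field cannot split a JSONL
// record, then caps the result at maxInboxFieldLen runes.
func sanitizeInboxField(s string) string {
	s = strings.NewReplacer("\r", "", "\n", "").Replace(s)
	if r := []rune(s); len(r) > maxInboxFieldLen {
		s = string(r[:maxInboxFieldLen])
	}
	return s
}

// normalizePriority enforces the closed enum; empty defaults to medium.
// (Hypothetical helper name, not taken from the repository.)
func normalizePriority(p string) (string, error) {
	switch p {
	case "":
		return "medium", nil
	case "urgent", "high", "medium", "low":
		return p, nil
	}
	return "", fmt.Errorf("invalid priority %q", p)
}

func main() {
	fmt.Printf("%q\n", sanitizeInboxField("Need forklift ops\r\nASAP"))
	p, err := normalizePriority("nonsense")
	fmt.Printf("%q %v\n", p, err)
}
```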
root	5d49967833	multi_coord_stress: full Langfuse coverage, every phase + every call

Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept. This commit threads tracing through every phase: baseline / fresh-resume / inbox burst / surge / swap / merge / handover (verbatim + paraphrase) / split / reissue. Each phase is a parent span; each matrix.search / LLM call inside is a child span.

Refactor:
- One run-level trace is created at driver startup.
- New startPhase(name, hour, meta) helper emits a phase span as a child of the run trace; subsequent emitSpan calls nest under it.
- New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch with span emission. Every search call site replaced with this so the input/output JSON (query, corpora, k, playbook, exclude_n → top-K ids, top1 distance, boost/inject counts) lands in Langfuse.
- Phase 4b's paraphrase generation also emits llm.paraphrase spans.
- Phase 1c's existing inline span emission converted to use the new helpers (no more inboxTraceID variable).

Run #011 result: trace landed at http://localhost:3001 with 111 observations attached.

Span breakdown:
  phase.* parents:                          9  (one per phase that ran)
  matrix.search.baseline:                  10
  matrix.search.fresh_verify:               3  (top-1 confirmed for all 3 fresh)
  observerd.inbox.record:                   6
  llm.parse_demand:                         6
  matrix.search.inbox:                      6
  llm.judge_top1:                           6
  matrix.search.surge:                     12
  matrix.search.swap_orig:                  1
  matrix.search.swap_replace:               1
  matrix.search.merge:                      6
  matrix.search.handover_verbatim:          4
  llm.paraphrase:                           4
  matrix.search.handover_paraphrase:        4
  matrix.search.split:                      4
  matrix.search.reissue:                   12
  matrix.search.reissue_retrieval_only:    12
  ─────────────
  Total:                                  111

Browse: http://localhost:3001 → Traces → "multi_coord_stress run"
Each phase is a collapsible section showing per-call timing and input/output JSON. Operators can drill into any single retrieval to see exactly what query was issued and what came back.
All other metrics held: diversity 0.026, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3 at top-1 (two-tier index), 200-worker swap Jaccard 0.000. This is the FULL TEST J asked for — every action in the run visible in Langfuse, full input/output drilldown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:43:32 -05:00
root	08a086779b	multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 Runs #003-#009 surfaced the same finding: fresh workers added mid-run to the main 'workers' vectord index (5K items) reliably absorbed (HTTP 200) but failed to surface in semantic queries even with content-matching prompts. Distances on the verify queries sat at 0.25-0.65 against existing workers; fresh items were beyond top-K. Better embedder (v2-moe) didn't help — distances got TIGHTER on existing items, pushing fresh items further out of reach. Root cause: coder/hnsw incremental adds to a populated graph land in poorly-connected regions and disappear from search traversal. Known property of HNSW post-build adds; not a bug. Fix: two-tier index pattern (canonical NRT search architecture). Fresh content goes to a small "hot" corpus (fresh_workers); main queries include it in the corpora list and merge results. Hot corpus has no recall crowding because it's tiny; periodic batch job (post- G3) merges it into the main index. Implementation: - ensureFreshIndex(hc, gw, name, dim) — idempotent POST /v1/vectors/index. 409 from re-create treated as "already there." - ingestFreshWorker now takes idx parameter so callers can target fresh_workers instead of workers. - multi_coord_stress phase 1b creates fresh_workers index + ingests 3 fresh workers there + searches verifyCorpora=[workers, ethereal_workers, fresh_workers]. Run #010 result: fresh-001 (Senior tower crane rigger NCCCO Chicago) top-1: fresh-001 from fresh_workers, distance 0.143 fresh-002 (Bilingual Spanish/English OSHA trainer Indianapolis) top-1: fresh-002 from fresh_workers, distance 0.146 fresh-003 (FAA Part 107 drone surveyor Chicago) top-1: fresh-003 from fresh_workers, distance 0.129 3/3 fresh workers surface at top-1 — the absorption-but-not- findable issue from runs #003-#009 is closed. 
All other metrics held: diversity 0.007, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4, swap Jaccard 0.000, inbox burst all 6 events accepted + traced to Langfuse. This is the final structural fix for the multi-coord stress suite. Phase 3 is feature-complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:31:45 -05:00
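The two-tier pattern above depends on merging per-corpus results into one ranked list. A minimal sketch, assuming the merge is a plain sort on ascending distance across corpora; `Hit`, `mergeCorpora`, and the IDs w-100 / w-200 are hypothetical, and the harness's actual merge logic is not shown in the log.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one search result. Corpus records which index it came from.
type Hit struct {
	ID       string
	Corpus   string
	Distance float64
}

// mergeCorpora flattens per-corpus results, sorts them by ascending
// distance, and keeps the k closest hits overall.
func mergeCorpora(perCorpus map[string][]Hit, k int) []Hit {
	var all []Hit
	for corpus, hits := range perCorpus {
		for _, h := range hits {
			h.Corpus = corpus
			all = append(all, h)
		}
	}
	sort.Slice(all, func(i, j int) bool {
		if all[i].Distance != all[j].Distance {
			return all[i].Distance < all[j].Distance
		}
		return all[i].ID < all[j].ID // deterministic tie-break
	})
	if k < 0 {
		k = 0
	}
	if len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	top := mergeCorpora(map[string][]Hit{
		"workers":       {{ID: "w-100", Distance: 0.25}, {ID: "w-200", Distance: 0.31}},
		"fresh_workers": {{ID: "fresh-001", Distance: 0.143}},
	}, 2)
	for _, h := range top {
		fmt.Printf("%s (%s) %.3f\n", h.ID, h.Corpus, h.Distance)
	}
}
```

Because the hot corpus is searched separately, a fresh item competes on its distance alone and is not affected by how well it is linked into the large index's graph.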
root	7e6431e4fd	langfuse: Go-side client + Phase 1c instrumentation The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs); this commit lands Go-side parity so the multi-coord stress harness can emit traces visible at http://localhost:3001. internal/langfuse/client.go: - Minimal Trace + Span + Flush API mirroring what the Rust emitter uses. Auth: Basic over public_key:secret_key. - Best-effort posture: errors are slog.Warn'd, never block calling paths. Same fail-open as observerd's persistor (ADR-005 Decision 5.1) — observability is a witness, not a gate. - Events buffered until 50, then auto-flushed; explicit Flush() at process exit. - Each Trace/Span returns its id so callers can build hierarchies. multi_coord_stress driver wiring: - New --langfuse-env flag (default /etc/lakehouse/langfuse.env). Empty / missing / unparseable file → skip tracing with a logged warning; run still proceeds. - Phase 1c (inbox burst) now emits one parent trace + 4 spans per inbox event: 1. observerd.inbox.record (post to /v1/observer/inbox) 2. llm.parse_demand (qwen2.5 → structured fields) 3. matrix.search (parsed query → top-K) 4. llm.judge_top1 (rate top-1 vs original body) Each span carries input/output JSON + start/end times so the Langfuse UI shows a full waterfall per event. Run #009 result: Trace landed: "multi_coord_stress phase 1c inbox burst" Observations attached: 24 (= 6 events × 4 spans) Tags: stress, phase-1c, inbox Browseable at http://localhost:3001 by tag query. Other harness metrics: diversity 0.016, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4 — all unchanged by the tracing addition (best-effort post in parallel). Phase 1c is the proof-of-concept; future commits can wrap other phases (baseline / merge / handover / split) in traces too. Once that's done, the entire stress run becomes scrubbable in Langfuse without grepping the events JSON. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:25:03 -05:00
root	ce940f4a14	multi_coord_stress: judge re-rates inbox top-1, recovers honesty signal

Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much tighter cosine distances (0.05-0.10 in three cases) but lose the "system has no good match" signal that high-distance results give. A coordinator UI showing only distance can't tell wrong-domain matches apart from real ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the LLM-parsed query). Coordinators see both:
- distance: how close was retrieval in vector space
- rating: does this person actually fit the original ask
The pair tells the honest story.

Run #008 result on the 6 inbox events:

  Demand                 Top-1    Distance  Rating  Reading
  ─────────────────────────────────────────────────────────────
  Forklift Cleveland     w-3573   0.29      4       Strong
  Production Indy        e-1764   0.41      3       Adjacent
  Crane Chicago          e-7798   0.23      1       TIGHT BUT WRONG
  Bilingual safety Indy  w-3918   0.05      5       Perfect
  Drone Chicago          e-1058   0.06      5       Perfect (verify e-1058)
  Warehouse Milwaukee    w-460    0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work: distance 0.23 says "tight match" but the judge says rating 1 reading the original body. A coordinator seeing only distance would ship the wrong worker; coordinator seeing distance+rating sees the disagreement and escalates.

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1 (irrelevant despite tight cosine). The substrate-honesty signal is recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes when judge runs only on top-1 of high-priority inbox events; the search-cost-vs-quality tradeoff lives in the priority gate.

Implementation:
- New JudgeRating int field on Event (omitempty so non-judged events stay clean in JSON)
- New judgeInboxResult helper, reusing the same prompt structure as playbook_lift's judgeRate. The two could share an internal package if a third judge consumer appears.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:16:49 -05:00
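The distance-versus-rating disagreement can be turned into an escalation check. The sketch below is illustrative only: `InboxResult`, `needsEscalation`, and the thresholds 0.25 and 3 are assumptions, not values from the harness. The six rows are the run #008 results quoted in the commit message.

```go
package main

import "fmt"

// InboxResult pairs the retrieval distance with the judge's verdict.
type InboxResult struct {
	Demand   string
	Top1     string
	Distance float64
	Rating   int // 1-5 judge verdict; 0 means the event was not judged
}

// needsEscalation reports a "tight but wrong" result: the vector match
// looks close, yet the judge rated the fit below minRating.
func needsEscalation(r InboxResult, tightDistance float64, minRating int) bool {
	return r.Rating > 0 && r.Distance <= tightDistance && r.Rating < minRating
}

func main() {
	results := []InboxResult{
		{"Forklift Cleveland", "w-3573", 0.29, 4},
		{"Production Indy", "e-1764", 0.41, 3},
		{"Crane Chicago", "e-7798", 0.23, 1},
		{"Bilingual safety Indy", "w-3918", 0.05, 5},
		{"Drone Chicago", "e-1058", 0.06, 5},
		{"Warehouse Milwaukee", "w-460", 0.32, 4},
	}
	for _, r := range results {
		if needsEscalation(r, 0.25, 3) { // thresholds are assumptions
			fmt.Printf("escalate: %s top-1 %s (distance %.2f, rating %d)\n",
				r.Demand, r.Top1, r.Distance, r.Rating)
		}
	}
}
```

With these assumed thresholds only the crane-Chicago row is flagged, matching the reading in the table.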
root	186d209aae	multi_coord_stress: LLM-parsed inbox demands (qwen2.5)

Replaced the hard-coded DemandQuery on inbox events with an actual LLM call: each email/SMS body is parsed by qwen2.5 (format=json, schema-anchored) into structured {role, count, location, certs, skills, shift}. The driver then composes a query string from those fields and runs matrix.search.

This is the real-product flow that the Phase 3 stress test was asking for: real bodies → real LLM parsing → real search. Before this commit, the DemandQuery was my hand-crafted string, which made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

All 6 inbox events parsed cleanly; qwen2.5 nailed:
  "Need 50 forklift operators in Cleveland OH for Monday day shift. OSHA-30 + active forklift cert required."
  → {role:"forklift operator", count:50, location:"Cleveland, OH", certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
Other 5 similarly faithful (indy stayed as "indy", count defaulted to 1 when unspecified, no hallucinated fields).

LLM-parsed queries produced TIGHTER matches than hard-coded:

  Demand                #006 dist  #007 dist  Δ
  Crane Chicago         0.499      0.093      -82%
  Drone Chicago         0.707      0.073      -90%
  Bilingual safety      0.240      0.048      -80%
  Forklift Cleveland    0.330      0.273      -17%
  Production Indy       0.260      0.399      +53%
  Warehouse Milwaukee   0.458      0.420      -8%

Three matches landed at distance < 0.10, verbatim-replay-tight territory. Structured queries embed sharper than conversational hand-crafted strings.

Other metrics unchanged: diversity 0.000, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from distance 0.71 (clear "we don't have one") to 0.07 (confident match returned). The OOD honesty signal weakens when LLM-parsed structure makes any closest-neighbor look tight.
Future Phase 4 work: judge re-rates the top match before surfacing, so coordinators see "your demand was for X but the closest match scored 2/5" rather than just the worker ID + distance. Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5). Production would amortize via a small dedicated parser model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:51:19 -05:00
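The parse-then-compose step can be sketched like this. The JSON field names come from the commit message; the Go type `ParsedDemand`, the helper `composeQuery`, and the order and wording of the composed query are assumptions, since the log does not show how the driver builds the string.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ParsedDemand mirrors the structured fields the LLM is asked to return.
type ParsedDemand struct {
	Role     string   `json:"role"`
	Count    int      `json:"count"`
	Location string   `json:"location"`
	Certs    []string `json:"certs"`
	Skills   []string `json:"skills"`
	Shift    string   `json:"shift"`
}

// composeQuery joins the non-empty fields into one search string.
// The field order here is an assumption of this sketch.
func composeQuery(d ParsedDemand) string {
	var parts []string
	add := func(s string) {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	add(d.Role)
	add(d.Location)
	add(strings.Join(d.Certs, " "))
	add(strings.Join(d.Skills, " "))
	if d.Shift != "" {
		add(d.Shift + " shift")
	}
	return strings.Join(parts, " ")
}

func main() {
	raw := `{"role":"forklift operator","count":50,"location":"Cleveland, OH",
	         "certs":["OSHA-30","active forklift cert"],"skills":[],"shift":"day"}`
	var d ParsedDemand
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(composeQuery(d))
}
```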
root	e7fc63b216	observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events) Phase 3 ask: real-world inbox-style event injection during the stress test. Coordinators in production receive emails + SMS that trigger contract responses; the substrate has to RECORD these signals AND react with a search using the embedded demand. This commit lands the endpoint and exercises it end-to-end in the stress harness. observerd surface: - New POST /observer/inbox route — accepts {type, sender, subject, body, priority, tag} and records as ObservedOp with Source=SourceInbox. Type must be email\|sms; body required; priority defaults to medium. The handler ONLY records — downstream triggers (search, ingest, etc.) are the caller's concern, recorded separately. Keeps the witness role pure. - New observer.SourceInbox = "inbox" alongside SourceMCP / SourceScenario / SourceWorkflow. - Three contract tests on the new route (happy path / bad type / empty body), router-mount test extended, all green. Stress harness phase 1c (Hour 9): - 6 inbox events fire in priority order (urgent → high → medium): 2 urgent emails (forklift Cleveland, production Indianapolis) 1 high email (crane Chicago) 1 high sms (bilingual safety Indianapolis) 1 medium sms (drone Chicago) 1 medium email (warehouse Milwaukee FYI) - Each event: 1. POSTs to /v1/observer/inbox (recorded by observerd) 2. Triggers matrix.search using a parsed demand (the demand extraction is hard-coded for now; production needs a small LLM to parse from body) 3. 
Captures both as events in the run JSON Run #006 result (with v2-moe embedder + all phases including inbox): Diversity: Same-role-across-contracts Jaccard = 0.000 (n=9) Different-roles-same-contract Jaccard = 0.046 (n=18) Determinism: 1.000 Verbatim handover: 4/4 (100%) Paraphrase handover: 4/4 (100%) Inbox burst: 6/6 events accepted by observerd (200 status, all recorded) 6/6 triggered searches produced distinct top-1 worker IDs distance distribution: 0.24 (Indy production) → 0.71 (Chicago drone surveyor — honest stretch since drones aren't in the 5K-worker corpus, system surfaces closest neighbor at high distance rather than fabricating) The drone-Chicago case is the architectural-honesty signal: when the demand asks for a specialist NOT in the roster, the system returns the closest semantic neighbor with a distance that flags "this is a stretch." Coordinators reading distances see "we don't have a great match here" rather than a confident wrong answer. Total events captured: 67 (was 61 pre-inbox). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:34:36 -05:00
root	4da32ad102	embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) Local Ollama has three embedding models loaded: nomic-embed-text:latest 137M 768d (previous default) nomic-embed-text-v2-moe:latest 475M 768d (this commit's default) qwen3-embedding:latest 7.6B 4096d (would require dim change) v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE architecture. Workers index doesn't need rebuilding, just future ingests embed with the stronger model. Run #005 result on the multi-coord stress suite: Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9) → MoE is more discriminating: zero worker overlap across Milwaukee / Indianapolis / Chicago for shared role names. The geo + cert + skill context fully separates worker pools. Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff) Determinism: 1.000 (unchanged) Verbatim handover: 4/4 (100%) Paraphrase handover: 4/4 (100%) 200-worker swap: Jaccard 0.000 (unchanged — still perfect) Fresh-resume verify: STILL doesn't surface fresh workers in top-8. With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39) — the embedder is MORE discriminating, but the fresh worker's vector still doesn't outrank the 8th-best existing worker. Now suspect of being an HNSW post-build add issue (coder/hnsw incremental adds can land in hard-to-reach graph regions, not an embedder problem). Better embedder didn't fix it; needs a different strategy: full index rebuild after fresh adds, or explicit playbook-layer score boost for fresh workers, or hybrid (keyword + semantic) retrieval. Phase 3 investigation. Cost: ingest is ~5× slower (workers 20s→100s; ethereal 35s→112s). Acceptable for the quality jump on diversity. Real production with incremental ingest won't pay this once-per-deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:26:52 -05:00
root	84a32f0d29	multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap

Three Phase 2 additions land in this commit:

1. matrix.SearchRequest gains ExcludeIDs ([]string): filters specific worker IDs out of results post-retrieval, AND skips them at the playbook boost+inject step (so excluded answers can't sneak back via Shape B). Real-world driver: coordinator placed N workers, client asks for replacements, system needs alternatives, not the same N. Threaded through retrieve.go after merge but before metadata filter so excluded IDs don't waste post-filter top-K slots.
2. New harness phase 2b: 200-worker swap simulation. Captures the top-K from alpha's warehouse query, then re-issues with exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether the substrate finds genuine alternatives.
3. New harness phase 1b: fresh-resume mid-run injection. Three new workers ingested via /v1/embed + /v1/vectors/index/workers/add, then verified findable via semantic queries matching resume content.

Plus Hour labels on every event (operational narrative: 0/6/12/18/24/30/36/42/48) and a refactor of captureEvent to take hour as a param.

Run #003 + #004 results (5K workers + 10K ethereal):

Diversity (#004):
  Same-role-across-contracts Jaccard = 0.080 (n=9)
  Different-roles-same-contract Jaccard = 0.013 (n=18)
Determinism: 1.000 (#004 unchanged)
Verbatim handover: 4/4 = 100%
Paraphrase handover: 4/4 = 100%

Phase 2b, 200-worker swap (Jaccard 0.000): 8 originally-placed workers fully replaced by 8 alternatives. ExcludeIDs substrate change works end-to-end: boost AND inject both honor the exclusion, so excluded workers don't return via the playbook either.

Phase 1b, fresh-resume injection: REAL PRODUCT FINDING. Substrate ABSORPTION is fine: 3 /v1/vectors/index/workers/add calls at 200 status, 3 vectors persisted. But none of the 3 fresh workers surfaced in top-8 even with semantic queries matching their resume content (e.g. "Senior tower crane rigger NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12 years tower-crane signaling..." NCCCO + Chicago). Top-1 came from existing workers at distance ~0.25; fresh workers' distances must be > 0.25, pushing them past rank 8.

Cause: dense retrieval at 5000+ workers means many existing profiles cluster near any specific query in cosine space; nomic-embed-text (137M) introduces enough noise that a fresh worker doesn't reliably outrank them just because the text content overlaps.

Workarounds (Phase 3 work): (a) hybrid retrieval (keyword + semantic), (b) playbook-layer score boost for fresh adds, (c) larger embedder. Documented in run #004 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:19:29 -05:00
root	0fa42a0cc3	multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover Phase 1 had two known gaps: (1) the 3 contracts had zero shared role names, so same-role-across-contracts Jaccard was vacuous (n=0); (2) the verbatim handover at 100% was the trivial case, not the hard learning test (paraphrased queries against another coord's playbook). Both fixed in this commit. Contract redesign — all 3 contracts now share warehouse worker / admin assistant / heavy equipment operator roles, plus a unique specialist per contract (industrial electrician / bilingual safety coord / drone surveyor — the "specialist not on the standard roster" case from J's spec). Counts and skill mixes vary per region. New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased versions of Alice's contract queries against Alice's playbook namespace. Tests whether institutional memory propagates across coordinators AND across natural wording variation that Bob would introduce when running Alice's contract. Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3 coords + paraphrase handover): Diversity (the question J asked: locking or cycling?): Same-role-across-contracts Jaccard = 0.119 (n=9) → 88% of workers DIFFER across regions for the same role name. Milwaukee warehouse vs Indianapolis warehouse vs Chicago warehouse pull mostly distinct top-K from the same population. The system locks into geo+cert+skill context, not cycling. Different-roles-same-contract Jaccard = 0.004 (n=18) → role-specific retrieval works (unchanged from Phase 1). Determinism: Jaccard = 1.000 (n=12) — unchanged. Learning: Verbatim handover 4/4 = 100% (trivial case, expected) Paraphrase handover 4/4 = 100% (HARD case — passes!) 
Of those 4 paraphrase recoveries: - 2 used boost (Alice's recording was already in Bob's paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1) - 2 used Shape B inject (recording wasn't in Bob's paraphrase top-K; InjectPlaybookMisses brought it in) The boost/inject mix is healthy — both paths are used and both produce correct top-1s. Multi-coord institutional memory propagation is empirically working under wording variation. Sample warehouse worker top-1s across contracts (proves diversity): alice / Milwaukee → w-713 bob / Indianapolis → e-8447 carol / Chicago → e-7145 Three different workers from the same 15K-person population, selected on geo+cert+skill context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:03:16 -05:00
root	61c7b55e48	multi-coord stress harness — Phase 1 of 48-hour mock Three coordinators (alice / bob / carol) with three contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction). 7-phase scenario runner: baseline → surge → merge → handover → split → reissue → analysis. Each coord has a separate playbook namespace (playbook_{name}) so institutional memory stays isolated by default but transferable on demand. Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints, and Langfuse tracing — those are Phase 2/3. Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors): Diversity: Different-roles-same-contract Jaccard = 0.004 (n=18) → role-specific retrieval is working perfectly. Different roles within one contract pull totally different worker pools. System is NOT cycling; locks into per-role retrieval. Same-role-across-contracts Jaccard = N/A (n=0) → TEST-DESIGN ISSUE: the 3 contracts use distinct role names per industry (warehouse worker / production worker / general laborer), so no exact-name overlaps exist. Phase 2 should either share at least one role across contracts OR add a skill-based diversity metric. Determinism: Jaccard = 1.000 (n=12) → HNSW + Ollama retrieval is fully deterministic on identical query text. coder/hnsw + nomic-embed-text are stable. Learning: handover hit rate = 4/4 = 100% → Bob inherits Alice's recordings perfectly when bob runs identical queries with alice's playbook namespace. CAVEAT: this tests the trivial verbatim case, not paraphrase handover. The harder test (bob runs paraphrased queries with alice's playbook) is Phase 2 work. Per-event capture in JSON: every matrix.search response is logged with phase / coordinator / contract / role / query / top-K IDs + distances + per-corpus counts + boosted/injected counts. 
Reviewable via: jq '.events[] \| select(.phase == "merge")' jq '.events[] \| select(.coordinator == "alice")' jq '.events[] \| select(.role == "warehouse worker")' Notable finding from per-event: carol's "general laborer" and "crane operator" queries both surface w-1009 as top-1, with crane operator at distance 0.098 (very tight) and general laborer at 0.297. The system found a worker who legitimately covers both roles — realistic for small construction crews. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 07:55:29 -05:00
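The diversity and determinism figures above are Jaccard similarities between top-K ID sets. A minimal sketch of that metric, assuming plain set semantics over worker IDs; the function name and the choice to return 0 for two empty sets are assumptions, and the IDs in `main` are hypothetical.

```go
package main

import "fmt"

// Jaccard returns |A ∩ B| / |A ∪ B| for two ID lists treated as sets.
// Two empty inputs yield 0 here; that convention is an assumption.
func Jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	intersection := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			intersection++
		}
	}
	union := len(setA) + len(setB) - intersection
	if union == 0 {
		return 0
	}
	return float64(intersection) / float64(union)
}

func main() {
	first := []string{"w-1", "w-2", "w-3"}
	second := []string{"w-2", "w-3", "w-4"}
	fmt.Println(Jaccard(first, first), Jaccard(first, second))
}
```

A value of 1.000 means two runs returned the same ID set (the determinism check); a value near 0 means almost no overlap (the diversity check).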

13 Commits