23 Commits

Author SHA1 Message Date
root
95f155b017 real_006: distribution-shift test on rows 10-59 of fill_events
Methodology fix: gen_real_queries.go gains -offset N flag. Every prior
real_NNN test sourced queries from rows 0-9 of fill_events.parquet
(default -limit 10), so the substrate's published "8/10 cold-pass top-1
= judge-best" was measured on a memorized slice, not held-out data.

real_006 samples 50 fresh rows (offset 10, never seen by the workers
or ethereal_workers corpora). Same harness, same local qwen2.5:latest
judge, same K=10. ~14 min wall total. Local-only, no cloud calls.

Headline findings:

- Cold-pass top-1 = judge-best (rank match): 41/50 (82%) vs real_001's
  8/10 (80%) — substrate generalizes at rank level.
- Strict (rating ≥ 2): 34/50 (68%) — 12-point drop from real_001's
  80%. ~7 of 41 "no-discovery" queries had cold top-1 the judge rated
  1; the corpus has gaps for some role-city combos in the v3 slice.
- Verbatim lift: 9/9 discoveries → warm top-1 (clean, matches real_001 2/2)
- Paraphrase recovery: 6/9 → top-1, 9/9 any-rank
- Quality regressed: 3/50 — Q43 is the structural one

Q43 (Packer at Midway Distribution / Chicago IL) regressed from
rating 5 to rating 2 on warm pass with `warm_boosted_count=0` and
`playbook_recorded=false`. Q18 (Shipping Clerks at the same client+city)
recorded a playbook entry. The regression suggests Q18's recording
leaked into Q43 via the warm-pass playbook corpus retrieval surface
even though the role gate from real_002 should have blocked it.
Three possible paths: extractor failed on one query, gate fires on
boost path but not Shape B inject, or cosine drift puts the recorded
worker close enough to Q43's embedding that warm-pass retrieval picks
it up directly. Diagnosis is the next move.

Three same-(client, city) clusters tested:
- Heritage Foods Gary IN × 3 distinct roles: clean, distinct workers
- Riverfront Steel Columbus OH × 4: cosine-level confusion (Q9/Q25
  surface same worker w-281 for Assemblers vs Quality Techs at cold-
  pass), but no playbook bleed
- Midway Distribution Chicago IL × 3: Q43 regression as above

What this confirms: substrate works on the fresh distribution at the
rank level, verbatim lift is real, paraphrase recovery is real.

What this falsifies: real_002's role-gate fix is not structurally
airtight. The bleed pattern can still fire under conditions the
prior tests didn't reach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:54:03 -05:00
root
cca32344f3 reality_test real_005: negation probe — substrate gap is correctly out-of-scope
5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has negation handling or silently treats
"NOT X" as "X".

Headline: substrate has zero negation handling. Cosine on dense
embeddings tokenizes "NOT in Detroit" identical to "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.

Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
  because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK

The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.

No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
  (already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
  pretend to honor unparseable constraints

Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.

Findings: reports/reality-tests/real_005_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:06:06 -05:00
root
0331288641 playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)
real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.

This commit adds a roleExtractor with regex-first + LLM fallback:

- Regex first (fast, deterministic) — handles need + client_first +
  looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
  Ollama-shape /api/chat with format=json, schema-tight prompt,
  temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
  query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
  LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
  config unchanged unless explicitly enabled.

8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
  witness for the real_003 scenario — both record + query are
  shorthand, regex returns "" for both, LLM produces DIFFERENT
  role tokens for CNC vs Forklift, so matrix gate's cross-role
  rejection (locked separately in
  TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
  correctly. This is the load-bearing verification.

Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.

Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:51:27 -05:00
root
3263254f1c reality_test real_003: 40-query paraphrase stress + extractor extension
Stress-tests the role gate with 40 queries (10 fill_events rows × 4
styles): need, client_first, looking, shorthand. Each row's role +
client + city stays the same; only the surface phrasing changes.

real_003 (original extractor) confirmed the shorthand-vs-shorthand
failure mode: CNC Operator shorthand recording leaked w-2404 onto
Forklift Operator shorthand query within the same Beacon Freight
Detroit cluster. Both record + query had empty role (extractor
returns "" for shorthand because there's no separator between role
and city), gate disabled, distance check passed, bleed fired.

Fix: extended extractRoleFromNeed to handle client_first
("{client} needs N {role} in...") and looking ("Looking for N
{role} at...") patterns. Shorthand left intentionally unmatched —
"Forklift Operator Detroit" is shape-indistinguishable from
"Forklift" + "Operator Detroit" without an LLM extractor or known-
cities lookup.

real_003b (extended extractor) verifies bleed closed across all 4
styles for this dataset. Forklift Operator queries keep w-2136 (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles — a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.

scripts/cutover/gen_real_queries.go: added -styles flag with values
need|client_first|looking|shorthand|all (default need preserves
real_001/002 behavior). Tests/reality/real_coord_queries_v2.txt is
the 40-query stress file.

scripts/playbook_lift/main_test.go: 10 sub-tests lock the four
documented patterns + shorthand limitation + lift-suite-style
queries (no clean role, returns empty as expected).

Aggregate metrics:
- real_003  (original): disc=7,  lift=7,  boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202
The growth reflects more LEGITIMATE same-role same-cluster transfer
firing across styles, not bleed (verified by per-cluster bleed
table — Forklift Operator queries unchanged across all 4 styles).

Known limitation documented in real_003_findings.md: same-cluster,
same-role queries in shorthand still embed close enough that a
shorthand recording could bleed onto a different-role shorthand
query if both record + query strip role. Closing this requires
LLM extraction or known-cities lookup at record + query time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:42:02 -05:00
root
997527be4d matrix: cross-role playbook gate — closes real_001 bleed (OPEN #1)
real_001 surfaced same-client+city queries bleeding across roles:
Q#2 (Forklift Operator @ Beacon Freight Detroit) recorded e-6193
in the playbook corpus. Q#5 (Pickers same client+city) and Q#10
(CNC Operator same client+city) embedded within 0.13-0.18 cosine of
Q#2's query — well inside the 0.20 inject threshold — so e-6193
injected on both, demoting the cold-pass-correct workers.

Root cause: the inject distance threshold isn't tight enough on
the same-client+city cluster. Cosine collapses queries that share
city + client + count-token + time-token regardless of role. The
existing judge gate is per-injection at record time and doesn't
fire at retrieve time.

Fix: structural role gate in front of both Shape A boost and
Shape B inject. PlaybookEntry gains Role; SearchRequest gains
QueryRole. When both are non-empty and differ under roleEqual's
case+plural normalization, the entry is rejected before BoostFactor
or judge-gate logic runs.

Backward-compat: empty role on either side disables the gate —
preserves behavior for the lift suite's free-form multi-constraint
queries that have no clean single role. Caller-supplied (not
inferred), so existing recordings unaffected.

Wire-through:
- internal/matrix/playbook.go: Role field, NewPlaybookEntryWithRole,
  roleEqual helper with plural+case normalization
- internal/matrix/retrieve.go: QueryRole on SearchRequest, threaded
  to both ApplyPlaybookBoost + InjectPlaybookMisses
- cmd/matrixd/main.go: role on POST /matrix/playbooks/record + bulk
- scripts/playbook_lift/main.go: extractRoleFromNeed regex pulls
  role from "Need N {role}{s} in" queries (the fill_events shape);
  free-form queries fall back to empty (gate disabled)

Tests (5 new):
- TestInjectPlaybookMisses_RoleGateRejectsCrossRole: exact Q#10
  scenario (distance 0.135, recorded "Forklift Operator", query
  "CNC Operator") — locks the bleed at unit level
- TestInjectPlaybookMisses_RoleGateAllowsSameRole: Forklift Operator
  recording fires on Forklift Operators query (plural normalization)
- TestInjectPlaybookMisses_RoleGateBackwardCompat: empty Role on
  either side = gate disabled, preserves current behavior
- TestApplyPlaybookBoost_RoleGateRejectsCrossRole: Shape A defense
  in depth — boost doesn't fire on cross-role even when answer is
  in cold top-K
- TestRoleEqual_PluralAndCase: case + -s + -es plural normalization

Verification (real_002, same query set as real_001):
- Q#5 Pickers @ Beacon Freight: e-6193 → e-8499 (no bleed)
- Q#10 CNC Operator @ Beacon Freight: e-6193 → w-2404 (no bleed)
- Discoveries + lifts unchanged at 2 each (same-role lift still fires)
- Mean Δdist tightens from -0.127 to -0.040 (boosts no longer
  pulling distances through the floor on cross-role mismatches)

Findings: reports/reality-tests/real_002_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:34:10 -05:00
root
7f2f112e6a reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed
First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.

Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
  go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
  QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
    WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:18:40 -05:00
root
5d49967833 multi_coord_stress: full Langfuse coverage — every phase + every call
Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept.
This commit threads tracing through every phase: baseline / fresh-
resume / inbox burst / surge / swap / merge / handover (verbatim +
paraphrase) / split / reissue. Each phase is a parent span; each
matrix.search / LLM call inside is a child span.

Refactor:
- One run-level trace is created at driver startup.
- New startPhase(name, hour, meta) helper emits a phase span as a
  child of the run trace; subsequent emitSpan calls nest under it.
- New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch
  with span emission. Every search call site replaced with this so
  the input/output JSON (query, corpora, k, playbook, exclude_n →
  top-K ids, top1 distance, boost/inject counts) lands in Langfuse.
- Phase 4b's paraphrase generation also emits llm.paraphrase spans.
- Phase 1c's existing inline span emission converted to use the new
  helpers (no more inboxTraceID variable).

Run #011 result: trace landed at http://localhost:3001 with 111
observations attached. Span breakdown:
  phase.* parents:         9 (one per phase that ran)
  matrix.search.baseline:  10
  matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh)
  observerd.inbox.record:  6
  llm.parse_demand:        6
  matrix.search.inbox:     6
  llm.judge_top1:          6
  matrix.search.surge:     12
  matrix.search.swap_orig: 1
  matrix.search.swap_replace: 1
  matrix.search.merge:     6
  matrix.search.handover_verbatim: 4
  llm.paraphrase:          4
  matrix.search.handover_paraphrase: 4
  matrix.search.split:     4
  matrix.search.reissue:   12
  matrix.search.reissue_retrieval_only: 12
  ─────────────
  Total:                   111

Browse: http://localhost:3001 → Traces → "multi_coord_stress run"
Each phase is a collapsible section showing per-call timing and
input/output JSON. Operators can drill into any single retrieval
to see exactly what query was issued and what came back.

All other metrics held: diversity 0.026, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3
at top-1 (two-tier index), 200-worker swap Jaccard 0.000.

This is the FULL TEST J asked for — every action in the run
visible in Langfuse, full input/output drilldown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:43:32 -05:00
root
08a086779b multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1
Runs #003-#009 surfaced the same finding: fresh workers added
mid-run to the main 'workers' vectord index (5K items) reliably
*absorbed* (HTTP 200) but failed to *surface* in semantic queries
even with content-matching prompts. Distances on the verify queries
sat at 0.25-0.65 against existing workers; fresh items were beyond
top-K. Better embedder (v2-moe) didn't help — distances got TIGHTER
on existing items, pushing fresh items further out of reach.

Root cause: coder/hnsw incremental adds to a populated graph land
in poorly-connected regions and disappear from search traversal.
Known property of HNSW post-build adds; not a bug.

Fix: two-tier index pattern (canonical NRT search architecture).
Fresh content goes to a small "hot" corpus (fresh_workers); main
queries include it in the corpora list and merge results. Hot corpus
has no recall crowding because it's tiny; periodic batch job (post-
G3) merges it into the main index.

Implementation:
- ensureFreshIndex(hc, gw, name, dim) — idempotent POST
  /v1/vectors/index. 409 from re-create treated as "already there."
- ingestFreshWorker now takes idx parameter so callers can target
  fresh_workers instead of workers.
- multi_coord_stress phase 1b creates fresh_workers index + ingests
  3 fresh workers there + searches verifyCorpora=[workers,
  ethereal_workers, fresh_workers].

Run #010 result:
  fresh-001 (Senior tower crane rigger NCCCO Chicago)
    top-1: fresh-001 from fresh_workers, distance 0.143
  fresh-002 (Bilingual Spanish/English OSHA trainer Indianapolis)
    top-1: fresh-002 from fresh_workers, distance 0.146
  fresh-003 (FAA Part 107 drone surveyor Chicago)
    top-1: fresh-003 from fresh_workers, distance 0.129

3/3 fresh workers surface at top-1 — the absorption-but-not-
findable issue from runs #003-#009 is closed.

All other metrics held: diversity 0.007, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, swap Jaccard 0.000,
inbox burst all 6 events accepted + traced to Langfuse.

This is the final structural fix for the multi-coord stress
suite. Phase 3 is feature-complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:31:45 -05:00
root
7e6431e4fd langfuse: Go-side client + Phase 1c instrumentation
The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs);
this commit lands Go-side parity so the multi-coord stress harness can
emit traces visible at http://localhost:3001.

internal/langfuse/client.go:
- Minimal Trace + Span + Flush API mirroring what the Rust emitter
  uses. Auth: Basic over public_key:secret_key.
- Best-effort posture: errors are slog.Warn'd, never block calling
  paths. Same fail-open as observerd's persistor (ADR-005 Decision
  5.1) — observability is a witness, not a gate.
- Events buffered until 50, then auto-flushed; explicit Flush() at
  process exit.
- Each Trace/Span returns its id so callers can build hierarchies.

multi_coord_stress driver wiring:
- New --langfuse-env flag (default /etc/lakehouse/langfuse.env).
  Empty / missing / unparseable file → skip tracing with a logged
  warning; run still proceeds.
- Phase 1c (inbox burst) now emits one parent trace + 4 spans per
  inbox event:
    1. observerd.inbox.record  (post to /v1/observer/inbox)
    2. llm.parse_demand        (qwen2.5 → structured fields)
    3. matrix.search           (parsed query → top-K)
    4. llm.judge_top1          (rate top-1 vs original body)
  Each span carries input/output JSON + start/end times so the
  Langfuse UI shows a full waterfall per event.

Run #009 result:
  Trace landed: "multi_coord_stress phase 1c inbox burst"
  Observations attached: 24 (= 6 events × 4 spans)
  Tags: stress, phase-1c, inbox
  Browseable at http://localhost:3001 by tag query.

Other harness metrics: diversity 0.016, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4 — all unchanged
by the tracing addition (best-effort post in parallel).

Phase 1c is the proof-of-concept; future commits can wrap other
phases (baseline / merge / handover / split) in traces too. Once
that's done, the entire stress run becomes scrubbable in Langfuse
without grepping the events JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:25:03 -05:00
root
ce940f4a14 multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal
Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't tell wrong-domain
matches apart from real ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
  - distance: how close was retrieval in vector space
  - rating:   does this person actually fit the original ask
The pair tells the honest story.

Run #008 result on the 6 inbox events:

  Demand                Top-1     Distance  Rating  Reading
  ─────────────────────────────────────────────────────────────
  Forklift Cleveland    w-3573    0.29      4       Strong
  Production Indy       e-1764    0.41      3       Adjacent
  Crane Chicago         e-7798    0.23      1       TIGHT BUT WRONG
  Bilingual safety Indy w-3918    0.05      5       Perfect
  Drone Chicago         e-1058    0.06      5       Perfect (verify e-1058)
  Warehouse Milwaukee   w-460     0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match" but the judge says rating 1 reading
the original body. A coordinator seeing only distance would ship the
wrong worker; coordinator seeing distance+rating sees the disagreement
and escalates.

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes
when judge runs only on top-1 of high-priority inbox events; the
search-cost-vs-quality tradeoff lives in the priority gate.

Implementation:
- New JudgeRating int field on Event (omitempty so non-judged
  events stay clean in JSON)
- New judgeInboxResult helper, reusing the same prompt structure as
  playbook_lift's judgeRate. The two could share an internal package
  if a third judge consumer appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:16:49 -05:00
root
186d209aae multi_coord_stress: LLM-parsed inbox demands (qwen2.5)
Replaced the hard-coded DemandQuery on inbox events with an actual
LLM call: each email/SMS body is parsed by qwen2.5 (format=json,
schema-anchored) into structured {role, count, location, certs,
skills, shift}. The driver then composes a query string from those
fields and runs matrix.search.

This is the real-product flow that the Phase 3 stress test was
asking for: real bodies → real LLM parsing → real search. Before
this commit, the DemandQuery was my hand-crafted string, which
made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

  All 6 inbox events parsed cleanly — qwen2.5 nailed:
    "Need 50 forklift operators in Cleveland OH for Monday day
     shift. OSHA-30 + active forklift cert required."
    → {role:"forklift operator", count:50, location:"Cleveland, OH",
       certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
    Other 5 similarly faithful (indy stayed as "indy", count
    defaulted to 1 when unspecified, no hallucinated fields).

  LLM-parsed queries produced TIGHTER matches than hard-coded:
    Demand              #006 dist  #007 dist  Δ
    Crane Chicago       0.499      0.093      -82%
    Drone Chicago       0.707      0.073      -90%
    Bilingual safety    0.240      0.048      -80%
    Forklift Cleveland  0.330      0.273      -17%
    Production Indy     0.260      0.399      +53%
    Warehouse Milwaukee 0.458      0.420       -8%

  Three matches landed at distance < 0.10 — verbatim-replay-tight
  territory. Structured queries embed sharper than conversational
  hand-crafted strings.

  Other metrics unchanged: diversity 0.000, determinism 1.000,
  verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from
distance 0.71 (clear "we don't have one") to 0.07 (confident match
returned). The OOD honesty signal weakens when LLM-parsed structure
makes any closest-neighbor look tight. Future Phase 4 work: judge
re-rates the top match before surfacing, so coordinators see "your
demand was for X but the closest match scored 2/5" rather than just
the worker ID + distance.

Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5).
Production would amortize via a small dedicated parser model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:51:19 -05:00
root
e7fc63b216 observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events)
Phase 3 ask: real-world inbox-style event injection during the stress
test. Coordinators in production receive emails + SMS that trigger
contract responses; the substrate has to RECORD these signals AND
react with a search using the embedded demand. This commit lands the
endpoint and exercises it end-to-end in the stress harness.

observerd surface:
- New POST /observer/inbox route — accepts {type, sender, subject,
  body, priority, tag} and records as ObservedOp with
  Source=SourceInbox. Type must be email|sms; body required;
  priority defaults to medium. The handler ONLY records — downstream
  triggers (search, ingest, etc.) are the caller's concern, recorded
  separately. Keeps the witness role pure.
- New observer.SourceInbox = "inbox" alongside SourceMCP /
  SourceScenario / SourceWorkflow.
- Three contract tests on the new route (happy path / bad type / empty
  body), router-mount test extended, all green.

Stress harness phase 1c (Hour 9):
- 6 inbox events fire in priority order (urgent → high → medium):
    2 urgent emails (forklift Cleveland, production Indianapolis)
    1 high email (crane Chicago)
    1 high sms (bilingual safety Indianapolis)
    1 medium sms (drone Chicago)
    1 medium email (warehouse Milwaukee FYI)
- Each event:
    1. POSTs to /v1/observer/inbox (recorded by observerd)
    2. Triggers matrix.search using a parsed demand (the demand
       extraction is hard-coded for now; production needs a small
       LLM to parse from body)
    3. Captures both as events in the run JSON

Run #006 result (with v2-moe embedder + all phases including inbox):

  Diversity:
    Same-role-across-contracts Jaccard = 0.000 (n=9)
    Different-roles-same-contract Jaccard = 0.046 (n=18)
  Determinism: 1.000
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)
  Inbox burst:
    6/6 events accepted by observerd (200 status, all recorded)
    6/6 triggered searches produced distinct top-1 worker IDs
    distance distribution: 0.24 (Indy production) → 0.71 (Chicago
    drone surveyor — honest stretch since drones aren't in the
    5K-worker corpus, system surfaces closest neighbor at high
    distance rather than fabricating)

The drone-Chicago case is the architectural-honesty signal: when
the demand asks for a specialist NOT in the roster, the system
returns the closest semantic neighbor with a distance that flags
"this is a stretch." Coordinators reading distances see "we don't
have a great match here" rather than a confident wrong answer.

Total events captured: 67 (was 61 pre-inbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:34:36 -05:00
root
4da32ad102 embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in)
Local Ollama has three embedding models loaded:
  nomic-embed-text:latest        137M  768d  (previous default)
  nomic-embed-text-v2-moe:latest 475M  768d  (this commit's default)
  qwen3-embedding:latest         7.6B  4096d (would require dim change)

v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE
architecture. Workers index doesn't need rebuilding, just future ingests
embed with the stronger model.

Run #005 result on the multi-coord stress suite:

  Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9)
    → MoE is more discriminating: zero worker overlap across
      Milwaukee / Indianapolis / Chicago for shared role names.
      The geo + cert + skill context fully separates worker pools.
  Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff)
  Determinism: 1.000 (unchanged)
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)

  200-worker swap: Jaccard 0.000 (unchanged — still perfect)

  Fresh-resume verify: STILL doesn't surface fresh workers in top-8.
    With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39)
    — the embedder is MORE discriminating, but the fresh worker's
    vector still doesn't outrank the 8th-best existing worker. Now
    suspect of being an HNSW post-build add issue (coder/hnsw
    incremental adds can land in hard-to-reach graph regions, not an
    embedder problem). Better embedder didn't fix it; needs a
    different strategy: full index rebuild after fresh adds, or
    explicit playbook-layer score boost for fresh workers, or
    hybrid (keyword + semantic) retrieval. Phase 3 investigation.

Cost: ingest is ~5× slower (workers 20s→100s; ethereal 35s→112s).
Acceptable for the quality jump on diversity. Real production with
incremental ingest won't pay this once-per-deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:26:52 -05:00
root
84a32f0d29 multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap
Three Phase 2 additions land in this commit:

1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific
   worker IDs out of results post-retrieval, AND skips them at the
   playbook boost+inject step (so excluded answers can't sneak back
   via Shape B). Real-world driver: coordinator placed N workers,
   client asks for replacements, system needs alternatives, not the
   same N. Threaded through retrieve.go after merge but before
   metadata filter so excluded IDs don't waste post-filter top-K slots.

2. New harness phase 2b: 200-worker swap simulation. Captures the
   top-K from alpha's warehouse query, then re-issues with
   exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether
   the substrate finds genuine alternatives.

3. New harness phase 1b: fresh-resume mid-run injection. Three new
   workers ingested via /v1/embed + /v1/vectors/index/workers/add,
   then verified findable via semantic queries matching resume content.

Plus Hour labels on every event (operational narrative: 0/6/12/18/
24/30/36/42/48) and a refactor of captureEvent to take hour as a
param.

Run #003 + #004 results (5K workers + 10K ethereal):

  Diversity (#004):
    Same-role-across-contracts Jaccard = 0.080 (n=9)
    Different-roles-same-contract Jaccard = 0.013 (n=18)
  Determinism: 1.000 (#004 unchanged)
  Verbatim handover:  4/4 = 100%
  Paraphrase handover: 4/4 = 100%

  Phase 2b — 200-worker swap (Jaccard 0.000):
    8 originally-placed workers fully replaced by 8 alternatives.
    ExcludeIDs substrate change works end-to-end — boost AND inject
    both honor the exclusion, so excluded workers don't return via
    the playbook either.

  Phase 1b — fresh-resume injection: REAL PRODUCT FINDING.
    Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add
    calls at 200 status, 3 vectors persisted. But none of the 3
    fresh workers surfaced in top-8 even with semantic queries
    matching their resume content (e.g. "Senior tower crane rigger
    NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12
    years tower-crane signaling..." NCCCO + Chicago).
    Top-1 came from existing workers at distance ~0.25; fresh
    workers' distances must be > 0.25, pushing them past rank 8.
    Cause: dense retrieval at 5000+ workers means many existing
    profiles cluster near any specific query in cosine space;
    nomic-embed-text-v2 (137M) introduces enough noise that a
    fresh worker doesn't reliably outrank them just because the
    text content overlaps.
    Workarounds (Phase 3 work): (a) hybrid retrieval (keyword +
    semantic), (b) playbook-layer score boost for fresh adds,
    (c) larger embedder. Documented in run #004 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:19:29 -05:00
root
0fa42a0cc3 multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover
Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).

Both fixed in this commit.

Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.

New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.

Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3
coords + paraphrase handover):

  Diversity (the question J asked: locking or cycling?):
    Same-role-across-contracts Jaccard = 0.119 (n=9)
      → 88% of workers DIFFER across regions for the same role name.
        Milwaukee warehouse vs Indianapolis warehouse vs Chicago
        warehouse pull mostly distinct top-K from the same population.
        The system locks into geo+cert+skill context, not cycling.
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval works (unchanged from Phase 1).

  Determinism: Jaccard = 1.000 (n=12) — unchanged.

  Learning:
    Verbatim handover  4/4 = 100%  (trivial case, expected)
    Paraphrase handover 4/4 = 100% (HARD case — passes!)
      Of those 4 paraphrase recoveries:
        - 2 used boost (Alice's recording was already in Bob's
          paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
        - 2 used Shape B inject (recording wasn't in Bob's
          paraphrase top-K; InjectPlaybookMisses brought it in)

The boost/inject mix is healthy — both paths are used and both
produce correct top-1s. Multi-coord institutional memory propagation
is empirically working under wording variation.

Sample warehouse worker top-1s across contracts (proves diversity):
  alice / Milwaukee     → w-713
  bob   / Indianapolis  → e-8447
  carol / Chicago       → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:03:16 -05:00
root
61c7b55e48 multi-coord stress harness — Phase 1 of 48-hour mock
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.

Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.

Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):

  Diversity:
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval is working perfectly. Different
        roles within one contract pull totally different worker
        pools. System is NOT cycling; locks into per-role retrieval.
    Same-role-across-contracts Jaccard = N/A (n=0)
      → TEST-DESIGN ISSUE: the 3 contracts use distinct role names
        per industry (warehouse worker / production worker / general
        laborer), so no exact-name overlaps exist. Phase 2 should
        either share at least one role across contracts OR add a
        skill-based diversity metric.

  Determinism: Jaccard = 1.000 (n=12)
    → HNSW + Ollama retrieval is fully deterministic on identical
      query text. coder/hnsw + nomic-embed-text are stable.

  Learning: handover hit rate = 4/4 = 100%
    → Bob inherits Alice's recordings perfectly when bob runs
      identical queries with alice's playbook namespace. CAVEAT:
      this tests the trivial verbatim case, not paraphrase handover.
      The harder test (bob runs paraphrased queries with alice's
      playbook) is Phase 2 work.

Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
  jq '.events[] | select(.phase == "merge")'
  jq '.events[] | select(.coordinator == "alice")'
  jq '.events[] | select(.role == "warehouse worker")'

Notable finding from per-event: carol's "general laborer" and "crane
operator" queries both surface w-1009 as top-1, with crane operator
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:55:29 -05:00
root
b13b5cd7a1 playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
  rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
  rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
  QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):

  Quality lifted     5 / 21  (24%)  — 3× +2 rating, 2× +1 rating
  Quality neutral   13 / 21  (62%)  — includes OOD queries holding 1
  Quality regressed  3 / 21  (14%)
  Net rating delta  +3 across 21 queries (+0.14 average)

The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. The 3 regressions were small (-1, -1, -3).

Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.

Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:42:04 -05:00
root
67d1957b87 matrix: split boost / inject thresholds — kills Shape B cross-pollination
Run #003 surfaced rampant cross-pollination: Q2's "OSHA-30 forklift
Wisconsin" recording (w-4435) became warm top-1 for Q19 (dental
hygienist), Q20 (RN), Q21 (software engineer), and 6 other unrelated
staffing queries. Cause: InjectPlaybookMisses inherited the same
DefaultPlaybookMaxDistance (0.5) as the boost path, but inject is
structurally riskier than boost — boost only re-ranks results that
already retrieved on their own merits, while inject FORCES a result
into top-K, so a loose match cross-pollinates wrong-domain answers.

Empirical motivation from v3:
  Implied playbook hit distances for cross-pollinated cases: 0.20-0.46
  Implied distances for the 6/6 paraphrase recoveries:        0.23-0.30
  Threshold of 0.20 should keep most paraphrases, kill the OOD bleed.

Implementation:
- New DefaultPlaybookMaxInjectDistance = 0.20 in playbook.go.
- New PlaybookMaxInjectDistance field on SearchRequest (override).
- InjectPlaybookMisses signature gains maxInjectDist param; hits whose
  Distance exceeds it are skipped (boost path may still re-rank them).
- TestInjectPlaybookMisses_RespectsInjectThreshold locks the contract
  with one tight + one loose hit, asserting only the tight one injects.
- Existing tests pass explicit threshold (0 = default for tight tests,
  0.5 for the dedupe test which uses 0.30 hits).

Run #004 result on identical queries with the split threshold:

  Verbatim discovery        8 (vs v3's 6 — judge variance, separate)
  Verbatim lift             6 / 8 (75%)
  Paraphrase top-1          6 / 8 (75%)
  Paraphrase any-rank in K  6 / 8

OOD queries Q19/Q20/Q21 ALL show warm top-1 = cold top-1 (no
injection) — cross-pollination eliminated where it was wrong-direction.
Mean Δ top-1 distance dropped from -0.164 (v3, distorted) to -0.071
(v4, comparable to v1's -0.053).

Two paraphrases missed in v4 (Q9, Q15) were ones where qwen2.5
rephrased liberally enough to drift past 0.20 — Q9: "Inventory
specialist..." → "Individual needed for inventory management..." and
Q15: "Engaged warehouse associate..." → "Warehouse associate currently
engaged with a robust history...". The system correctly refusing to
inject when it's not confident is the right product behavior; the
boost path still re-ranks recorded answers when they appear in regular
retrieval.

The Q6 ↔ Q7 cross-pollination ("Forklift-certified loader" ↔
"Hazmat warehouse worker") is legitimate — these are genuinely similar
staffing queries and the judge ranks both directions as plausible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:24:55 -05:00
root
154a72ea5e matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery
The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).

Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly. Distance =
playbook_hit_distance × BoostFactor — same formula as the boost path so
injections land in comparable distance space. Caller re-sorts +
truncates after both boost and inject have run.

Result on playbook_lift_003 (Shape B + paraphrase pass):

  Verbatim discovery        6
  Verbatim lift             2 / 6
  **Paraphrase top-1**      **6 / 6**
  Paraphrase any-rank in K  6 / 6
  Mean Δ top-1 distance     -0.1637 (warm closer than cold)

Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds — cosine
on embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.

Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as
warm top-1 for several other queries because their embeddings are
within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This
is a feature, not a bug — the matrix layer's purpose is to share
knowledge across queries — but the lift metric only counts "warm top-1
== cold judge best," so cross-pollinated lifts don't register. A v3
metric would re-judge warm pass to measure true judge improvement.

Tests:
- TestInjectPlaybookMisses_AddsMissingAnswers — primary claim
- TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject
- TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer
- TestInjectPlaybookMisses_EmptyHits — fast-path no-op

Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:06:13 -05:00
root
e9822f025d playbook_lift v2: paraphrase pass + run #002 finds boost-only limit
Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1
recorded a playbook, ask the judge to rephrase the query, then re-query
with playbook=true and check whether the recorded answer surfaces in
top-K. This is the test the v1 report's caveat #3 explicitly flagged
as the actual learning-property gate (not the cheap verbatim case).

Implementation:
- New flag --with-paraphrase on the driver (default off).
- New WITH_PARAPHRASE env in the harness (default 1, on for prod runs).
- New paraphrase_* fields on queryRun + summary, // 0 fallback in jq so
  re-rendering verbatim-only evidence stays clean.
- generateParaphrase() calls the same judge model with format=json and
  a tight schema; temperature=0.5 for variance without domain drift.
- Markdown report adds a paraphrase per-query table (only when the
  pass ran) and an honesty caveat about judge-also-rephrases coupling.

Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}):

  Verbatim lift               2/2 (100% — Q7 + Q13, both stable from v1)
  Paraphrase top-1            0/2
  Paraphrase any-rank in K    0/2

Both paraphrases dropped the recorded answer OUT of top-K entirely
(rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs
preserved intent ("Hazmat-certified warehouse worker comfortable with
cold storage" → "Warehouse worker with Hazmat certification and
experience in cold storage"). It's the v0 boost-only stance documented
in internal/matrix/playbook.go:22-27: the boost only re-ranks results
that ALREADY surfaced from regular retrieval. If paraphrase's cosine
retrieval doesn't include the recorded answer in top-K, no boost can
promote it.

The "Shape B" upgrade mentioned in the playbook.go comment — inject
playbook hits directly even when they weren't in the top-K — is what
would close this gap. The reality test surfaced exactly the gap the
docs warned about. Worth filing as the next product gate.

Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2.
HNSW insertion order + judge variance both contribute. Stability of
Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable
signal in the dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:47:41 -05:00
root
b2e45f7f26 playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
   rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
   wrong domain for staffing queries; replaced with ethereal_workers
   (10K rows, real staffing schema, "e-" id prefix to avoid collision
   with workers' "w-"). staffing_workers driver gains -index-name +
   -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:

- cmd/matrixd/main_test.go (new) — playbook record drift detector +
  score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries              21 (staffing-domain, 7 categories)
  Discoveries          8 (judge ≠ cosine top-1)
  Lifts                7/8 (87.5%)
  Boosts triggered     9
  Mean Δ distance      -0.053 (warm closer than cold)
  OOD honesty          dental/RN/SWE rated 1, no fake matches
  Cross-corpus boosts  confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:22:21 -05:00
root
848cbf5fef phase 3: playbook_lift harness reads judge from config
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.

resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.

bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.

changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
  flips to "" so resolution chain runs. Imports internal/shared for
  config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
  EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config
  grep > qwen3.5:latest fallback). Used for the Ollama presence
  check + report header. Pre-flight grep avoids requiring jq just
  to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
  chain.

verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override:    env wins
- flag override:   flag wins over env
- missing config:  DefaultConfig fallback still gives qwen3.5:latest

just verify PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:57:28 -05:00
root
3dd7d9fe30 reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:22:36 -05:00