9 Commits

Author SHA1 Message Date
root
0331288641 playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)
real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.

This commit adds a roleExtractor with regex-first + LLM fallback:

- Regex first (fast, deterministic) — handles need + client_first +
  looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
  Ollama-shape /api/chat with format=json, schema-tight prompt,
  temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
  query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
  LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
  config unchanged unless explicitly enabled.

8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
  witness for the real_003 scenario — both record + query are
  shorthand, regex returns "" for both, LLM produces DIFFERENT
  role tokens for CNC vs Forklift, so matrix gate's cross-role
  rejection (locked separately in
  TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
  correctly. This is the load-bearing verification.

Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.

Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:51:27 -05:00
root
3263254f1c reality_test real_003: 40-query paraphrase stress + extractor extension
Stress-tests the role gate with 40 queries (10 fill_events rows × 4
styles): need, client_first, looking, shorthand. Each row's role +
client + city stays the same; only the surface phrasing changes.

real_003 (original extractor) confirmed the shorthand-vs-shorthand
failure mode: CNC Operator shorthand recording leaked w-2404 onto
Forklift Operator shorthand query within the same Beacon Freight
Detroit cluster. Both record + query had empty role (extractor
returns "" for shorthand because there's no separator between role
and city), gate disabled, distance check passed, bleed fired.

Fix: extended extractRoleFromNeed to handle client_first
("{client} needs N {role} in...") and looking ("Looking for N
{role} at...") patterns. Shorthand left intentionally unmatched —
"Forklift Operator Detroit" is shape-indistinguishable from
"Forklift" + "Operator Detroit" without an LLM extractor or known-
cities lookup.

real_003b (extended extractor) verifies bleed closed across all 4
styles for this dataset. Forklift Operator queries keep w-2136 (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles — a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.

scripts/cutover/gen_real_queries.go: added -styles flag with values
need|client_first|looking|shorthand|all (default need preserves
real_001/002 behavior). Tests/reality/real_coord_queries_v2.txt is
the 40-query stress file.

scripts/playbook_lift/main_test.go: 10 sub-tests lock the four
documented patterns + shorthand limitation + lift-suite-style
queries (no clean role, returns empty as expected).

Aggregate metrics:
- real_003  (original): disc=7,  lift=7,  boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202
The growth reflects more LEGITIMATE same-role same-cluster transfer
firing across styles, not bleed (verified by per-cluster bleed
table — Forklift Operator queries unchanged across all 4 styles).

Known limitation documented in real_003_findings.md: same-cluster,
same-role queries in shorthand still embed close enough that a
shorthand recording could bleed onto a different-role shorthand
query if both record + query strip role. Closing this requires
LLM extraction or known-cities lookup at record + query time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:42:02 -05:00
root
997527be4d matrix: cross-role playbook gate — closes real_001 bleed (OPEN #1)
real_001 surfaced same-client+city queries bleeding across roles:
Q#2 (Forklift Operator @ Beacon Freight Detroit) recorded e-6193
in the playbook corpus. Q#5 (Pickers same client+city) and Q#10
(CNC Operator same client+city) embedded within 0.13-0.18 cosine of
Q#2's query — well inside the 0.20 inject threshold — so e-6193
injected on both, demoting the cold-pass-correct workers.

Root cause: the inject distance threshold isn't tight enough on
the same-client+city cluster. Cosine collapses queries that share
city + client + count-token + time-token regardless of role. The
existing judge gate is per-injection at record time and doesn't
fire at retrieve time.

Fix: structural role gate in front of both Shape A boost and
Shape B inject. PlaybookEntry gains Role; SearchRequest gains
QueryRole. When both are non-empty and differ under roleEqual's
case+plural normalization, the entry is rejected before BoostFactor
or judge-gate logic runs.

Backward-compat: empty role on either side disables the gate —
preserves behavior for the lift suite's free-form multi-constraint
queries that have no clean single role. Caller-supplied (not
inferred), so existing recordings unaffected.

Wire-through:
- internal/matrix/playbook.go: Role field, NewPlaybookEntryWithRole,
  roleEqual helper with plural+case normalization
- internal/matrix/retrieve.go: QueryRole on SearchRequest, threaded
  to both ApplyPlaybookBoost + InjectPlaybookMisses
- cmd/matrixd/main.go: role on POST /matrix/playbooks/record + bulk
- scripts/playbook_lift/main.go: extractRoleFromNeed regex pulls
  role from "Need N {role}{s} in" queries (the fill_events shape);
  free-form queries fall back to empty (gate disabled)

Tests (5 new):
- TestInjectPlaybookMisses_RoleGateRejectsCrossRole: exact Q#10
  scenario (distance 0.135, recorded "Forklift Operator", query
  "CNC Operator") — locks the bleed at unit level
- TestInjectPlaybookMisses_RoleGateAllowsSameRole: Forklift Operator
  recording fires on Forklift Operators query (plural normalization)
- TestInjectPlaybookMisses_RoleGateBackwardCompat: empty Role on
  either side = gate disabled, preserves current behavior
- TestApplyPlaybookBoost_RoleGateRejectsCrossRole: Shape A defense
  in depth — boost doesn't fire on cross-role even when answer is
  in cold top-K
- TestRoleEqual_PluralAndCase: case + -s + -es plural normalization

Verification (real_002, same query set as real_001):
- Q#5 Pickers @ Beacon Freight: e-6193 → e-8499 (no bleed)
- Q#10 CNC Operator @ Beacon Freight: e-6193 → w-2404 (no bleed)
- Discoveries + lifts unchanged at 2 each (same-role lift still fires)
- Mean Δdist tightens from -0.127 to -0.040 (boosts no longer
  pulling distances through the floor on cross-role mismatches)

Findings: reports/reality-tests/real_002_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:34:10 -05:00
root
b13b5cd7a1 playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
  rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
  rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
  QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):

  Quality lifted     5 / 21  (24%)  — 3× +2 rating, 2× +1 rating
  Quality neutral   13 / 21  (62%)  — includes OOD queries holding 1
  Quality regressed  3 / 21  (14%)
  Net rating delta  +3 across 21 queries (+0.14 average)

The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. The 3 regressions were small (-1, -1, -3).

Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.

Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:42:04 -05:00
root
154a72ea5e matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery
The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).

Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly. Distance =
playbook_hit_distance × BoostFactor — same formula as the boost path so
injections land in comparable distance space. Caller re-sorts +
truncates after both boost and inject have run.

Result on playbook_lift_003 (Shape B + paraphrase pass):

  Verbatim discovery        6
  Verbatim lift             2 / 6
  **Paraphrase top-1**      **6 / 6**
  Paraphrase any-rank in K  6 / 6
  Mean Δ top-1 distance     -0.1637 (warm closer than cold)

Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds — cosine
on embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.

Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as
warm top-1 for several other queries because their embeddings are
within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This
is a feature, not a bug — the matrix layer's purpose is to share
knowledge across queries — but the lift metric only counts "warm top-1
== cold judge best," so cross-pollinated lifts don't register. A v3
metric would re-judge warm pass to measure true judge improvement.

Tests:
- TestInjectPlaybookMisses_AddsMissingAnswers — primary claim
- TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject
- TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer
- TestInjectPlaybookMisses_EmptyHits — fast-path no-op

Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:06:13 -05:00
root
e9822f025d playbook_lift v2: paraphrase pass + run #002 finds boost-only limit
Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1
recorded a playbook, ask the judge to rephrase the query, then re-query
with playbook=true and check whether the recorded answer surfaces in
top-K. This is the test the v1 report's caveat #3 explicitly flagged
as the actual learning-property gate (not the cheap verbatim case).

Implementation:
- New flag --with-paraphrase on the driver (default off).
- New WITH_PARAPHRASE env in the harness (default 1, on for prod runs).
- New paraphrase_* fields on queryRun + summary, // 0 fallback in jq so
  re-rendering verbatim-only evidence stays clean.
- generateParaphrase() calls the same judge model with format=json and
  a tight schema; temperature=0.5 for variance without domain drift.
- Markdown report adds a paraphrase per-query table (only when the
  pass ran) and an honesty caveat about judge-also-rephrases coupling.

Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}):

  Verbatim lift               2/2 (100% — Q7 + Q13, both stable from v1)
  Paraphrase top-1            0/2
  Paraphrase any-rank in K    0/2

Both paraphrases dropped the recorded answer OUT of top-K entirely
(rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs
preserved intent ("Hazmat-certified warehouse worker comfortable with
cold storage" → "Warehouse worker with Hazmat certification and
experience in cold storage"). It's the v0 boost-only stance documented
in internal/matrix/playbook.go:22-27: the boost only re-ranks results
that ALREADY surfaced from regular retrieval. If paraphrase's cosine
retrieval doesn't include the recorded answer in top-K, no boost can
promote it.

The "Shape B" upgrade mentioned in the playbook.go comment — inject
playbook hits directly even when they weren't in the top-K — is what
would close this gap. The reality test surfaced exactly the gap the
docs warned about. Worth filing as the next product gate.

Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2.
HNSW insertion order + judge variance both contribute. Stability of
Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable
signal in the dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:47:41 -05:00
root
b2e45f7f26 playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
   rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
   wrong domain for staffing queries; replaced with ethereal_workers
   (10K rows, real staffing schema, "e-" id prefix to avoid collision
   with workers' "w-"). staffing_workers driver gains -index-name +
   -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:

- cmd/matrixd/main_test.go (new) — playbook record drift detector +
  score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries              21 (staffing-domain, 7 categories)
  Discoveries          8 (judge ≠ cosine top-1)
  Lifts                7/8 (87.5%)
  Boosts triggered     9
  Mean Δ distance      -0.053 (warm closer than cold)
  OOD honesty          dental/RN/SWE rated 1, no fake matches
  Cross-corpus boosts  confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:22:21 -05:00
root
848cbf5fef phase 3: playbook_lift harness reads judge from config
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.

resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.

bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.

changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
  flips to "" so resolution chain runs. Imports internal/shared for
  config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
  EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config
  grep > qwen3.5:latest fallback). Used for the Ollama presence
  check + report header. Pre-flight grep avoids requiring jq just
  to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
  chain.

verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override:    env wins
- flag override:   flag wins over env
- missing config:  DefaultConfig fallback still gives qwen3.5:latest

just verify PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:57:28 -05:00
root
3dd7d9fe30 reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:22:36 -05:00