golangLAKEHOUSE/reports/reality-tests/real_006_findings.md
root 95f155b017 real_006: distribution-shift test on rows 10-59 of fill_events
Methodology fix: gen_real_queries.go gains -offset N flag. Every prior
real_NNN test sourced queries from rows 0-9 of fill_events.parquet
(default -limit 10), so the substrate's published "8/10 cold-pass top-1
= judge-best" was measured on a memorized slice, not held-out data.

real_006 samples 50 fresh rows (offset 10, never seen by the workers
or ethereal_workers corpora). Same harness, same local qwen2.5:latest
judge, same K=10. ~14 min wall total. Local-only, no cloud calls.

Headline findings:

- Cold-pass top-1 = judge-best (rank match): 41/50 (82%) vs real_001's
  8/10 (80%) — substrate generalizes at rank level.
- Strict (rating ≥ 2): 34/50 (68%) — 12-point drop from real_001's
  80%. ~7 of 41 "no-discovery" queries had cold top-1 the judge rated
  1; the corpus has gaps for some role-city combos in the v3 slice.
- Verbatim lift: 9/9 discoveries → warm top-1 (clean, matches real_001 2/2)
- Paraphrase recovery: 6/9 → top-1, 9/9 any-rank
- Quality regressed: 3/50 — Q43 is the structural one

Q43 (Packer at Midway Distribution / Chicago IL) regressed from
rating 5 to rating 2 on warm pass with `warm_boosted_count=0` and
`playbook_recorded=false`. Q18 (Shipping Clerks at the same client+city)
recorded a playbook entry. The regression suggests Q18's recording
leaked into Q43 via the warm-pass playbook corpus retrieval surface
even though the role gate from real_002 should have blocked it.
Three possible paths: extractor failed on one query, gate fires on
boost path but not Shape B inject, or cosine drift puts the recorded
worker close enough to Q43's embedding that warm-pass retrieval picks
it up directly. Diagnosis is the next move.

Three same-(client, city) clusters tested:
- Heritage Foods Gary IN × 3 distinct roles: clean, distinct workers
- Riverfront Steel Columbus OH × 4: cosine-level confusion (Q9/Q25
  surface same worker w-281 for Assemblers vs Quality Techs at cold-
  pass), but no playbook bleed
- Midway Distribution Chicago IL × 3: Q43 regression as above

What this confirms: substrate works on the fresh distribution at the
rank level, verbatim lift is real, paraphrase recovery is real.

What this falsifies: real_002's role-gate fix is not structurally
airtight. The bleed pattern can still fire under conditions the
prior tests didn't reach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:54:03 -05:00

188 lines
7.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Reality test real_006 — distribution-shift findings
**Run:** 2026-05-05 04:41:46 → 04:50:08 CDT (8m22s driver wall, ~14 min including ingest)
**Judge:** `qwen2.5:latest` (Ollama, local) — anchor's recommended judge, ~9s/query
**Queries:** 50 from `tests/reality/real_coord_queries_v3.txt` (rows 10-59 of fill_events.parquet, single `need` style)
**Corpora:** `workers,ethereal_workers` (5K + 10K)
**Local-only:** zero cloud calls per PRD line 70.
Companion to `playbook_lift_real_006.{json,md}`. That's the harness output; this is the reading.
---
## Why this test exists
real_001-005 all sourced their queries from the **first 10 rows** of
`fill_events.parquet`. `gen_real_queries.go` had `-limit N` but no
`-offset N`, so every "real" reality test ran on the same memorized
slice. The published "8 / 10 cold-pass top-1 = judge-best" was a
property of those 10 rows, not measured generalization. real_006
closes the methodology gap: new `-offset` flag samples rows 10-59 (5×
the count, never seen by the substrate).
---
## Headline — substrate generalizes (mostly)
| Metric | real_001 (10 queries, rows 0-9) | real_006 (50 queries, rows 10-59) | Verdict |
|---|---:|---:|---|
| Cold-pass top-1 = judge-best (rank match) | 8 / 10 (80%) | **41 / 50 (82%)** | **HOLDS** |
| Cold-pass top-1 = judge-best AND rating ≥ 2 | 8 / 10 (80%) | 34 / 50 (68%) | -12 pts |
| Mean cold top-1 judge rating | ~3.3 | 3.08 | -7% |
| Discoveries (judge promoted non-top-1) | 2 / 10 | 9 / 50 (18%) | comparable |
| Verbatim lift (discovery → warm top-1) | 2 / 2 (100%) | 9 / 9 (100%) | **HOLDS** |
| Paraphrase recovery → top-1 | n/a (disabled) | 6 / 9 (67%) | new |
| Quality regressed on rejudge | 0 (test absent) | 3 / 50 (6%) | new |
**Reading:** the substrate's *rank* behavior generalizes cleanly — the
top-1 worker is judge-approved at the same rate on fresh data as on
memorized data. The *quality* of top-1 (rating ≥ 2) drops 12 points,
which means 7 of the 41 "no-discovery" queries had cold top-1 the
judge rated 1 (irrelevant) but the corpus had nothing better. Honest
signal: parts of the v3 slice are in territory the workers corpus
doesn't cover well.
The verbatim-lift property (discovery → warm top-1) is **clean at
9/9**, matching real_001's 2/2 perfectly. When the playbook records,
the recorded answer comes back next time. That's the load-bearing
learning property.
---
## Cluster analysis — the cross-pollination question
real_001 found that same-(client, city) clusters cause Shape A boost
to bleed across roles. Real_002's role-gate fix (`roleEqual`) was
supposed to close that. real_006 has *more* cluster opportunities than
real_001 did:
| Cluster | Count | Result |
|---|---:|---|
| Riverfront Steel + Columbus OH | 4 | mostly clean — see below |
| Heritage Foods + Gary IN | 3 | **clean** — distinct workers per role, no boost firing |
| Cornerstone Fabrication + Louisville KY | 3 | clean |
| Midway Distribution + Chicago IL | 3 | **bleed: Q43 regressed** |
### Heritage Foods + Gary IN (3 queries, all clean)
```
Q14 Assemblers → e-1315
Q22 Material Handler → e-18
Q42 Machine Operator → e-1089
```
Three different roles → three different workers. Zero boosts fired,
zero playbooks recorded. **Role-disambiguation works at the cosine
level for this cluster.** Comparable to real_002's role-gate
demonstration.
### Riverfront Steel + Columbus OH (4 queries, partial)
```
Q9 Assemblers → w-281 (cold = warm, no boost)
Q25 Quality Techs → w-281 (cold = warm, no boost) ← same worker as Q9!
Q26 Machine Operator → w-4815 (clean)
Q32 Material Handler → e-8676 → w-2589 (judge promoted, playbook recorded)
```
Q9 and Q25 both surface `w-281` cold-pass for *different roles*
that's a **cosine-level confusion** in the workers corpus, not a
playbook bleed. The substrate isn't breaking; the corpus contains a
worker whose resume embeds close to both "Assemblers" and "Quality
Techs" in this client+city. Judge-rating Q25 dropped 2 → 1 on
rejudge, which is the LLM's own consistency drift, not a substrate
fault. Worth noting but not a bug.
### Midway Distribution + Chicago IL (3 queries) — the regression
```
Q18 Shipping Clerks → cold w-4504 → warm w-1522 (boost=1, playbook recorded)
Q19 Machine Operators → cold = warm e-1251 (clean)
Q43 Packer → cold e-7746 (rating 5) → warm w-279 (rating 2) ← regressed
```
**Q43 regressed from rating 5 (perfect match) to rating 2 (weak)
even though `warm_boosted_count=0` and `playbook_recorded=false`.**
Same query, different warm top-1, no boost flag set. The playbook
recording from Q18 (Shipping Clerks at Midway/Chicago) reaches Q43
(Packer at Midway/Chicago) — same client+city, different role —
through the playbook corpus retrieval surface, even though the role
gate exists.
This is the **same pattern real_001 surfaced** (Q5/Q10 demoted by
Q2's playbook), and the role-gate fix from real_002 (`roleEqual`
on `Role` field) was supposed to close it. Possible explanations:
1. Role extractor failed on either Q18 ("Shipping Clerks") or Q43
("Packer") — leaving an empty role bypasses the gate (gate is
"permissive on empty" by design)
2. Gate fires on boost path but not on Shape B inject path — and
"boost=0" in the JSON is `warm_boosted_count` (count of
re-ranked entries), not a flag for "no playbook influence at all"
3. Cosine-level drift: the playbook entry just happens to be close
enough to Q43 in raw cosine space that warm-pass retrieval picks
up `w-279` directly without going through boost or inject
The other regressions (Q4 Centennial Packaging Flint MI, Q25 above)
are smaller (3→2 and 2→1) and likely judge consistency drift on
borderline candidates. Q43 is the structural one.
---
## What this confirms vs falsifies
**Confirmed:**
- Substrate generalizes at the rank level (82% cold-top-1 = judge-best)
- Verbatim lift works (9/9 discoveries → warm top-1)
- Role-disambiguation works at cosine level for clean role-distinct
query distributions (Heritage Foods cluster is the proof)
- Paraphrase recovery is real (6/9 → top-1, 9/9 any-rank)
**Falsified / weakened:**
- "8/10 cold-pass top-1 = judge-best" was 12 points optimistic on
the strict (rating ≥ 2) interpretation. Real number on broader
data is ~68%, not 80%. Headline rank-match number (82%) holds.
- Real_002's role-gate fix is **not structurally airtight**. Q43
shows the cluster-bleed pattern can still fire under conditions
the prior tests didn't reach. Open question: which path is
leaking — extractor failure, gate scope, or cosine drift?
---
## Next moves (informed by this evidence)
1. **Diagnose Q43 specifically**: re-run the role extractor on its
query text, check whether Q18's playbook entry has a role field
recorded, look at the warm-pass top-K to see whether `w-279`
reaches there via boost, inject, or cosine-only.
2. **Strengthen the corpus for the role-city combos that scored
low rating** (the 7 queries where cold top-1 was rating=1). The
workers corpus has gaps the v3 slice surfaced.
3. **Don't ship the "80% generalizes" framing as-is.** The number
real_006 measured (82% rank, 68% rating ≥ 2) is the honest one
to publish.
This is what reality tests are for. Numbers from the memorized slice
gave a clean story; numbers from the held-out slice show where it
needs work.
---
## Repro
```bash
cd /home/profit/golangLAKEHOUSE
PATH=/usr/local/go/bin:$PATH go build -o bin/gen_real_queries ./scripts/cutover/gen_real_queries.go
./bin/gen_real_queries -limit 50 -offset 10 -styles need > tests/reality/real_coord_queries_v3.txt
PATH=/usr/local/go/bin:$PATH \
RUN_ID=real_006 \
JUDGE_MODEL=qwen2.5:latest \
QUERIES_FILE=tests/reality/real_coord_queries_v3.txt \
WITH_PARAPHRASE=1 \
WITH_REJUDGE=1 \
bash scripts/playbook_lift.sh
```
Local-only. No cloud calls.