root e9822f025d playbook_lift v2: paraphrase pass + run #002 finds boost-only limit
Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1
recorded a playbook, ask the judge to rephrase the query, then re-query
with playbook=true and check whether the recorded answer surfaces in
top-K. This is the test the v1 report's caveat #3 explicitly flagged
as the actual learning-property gate (not the cheap verbatim case).
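
A minimal sketch of the top-K check described above, not the driver's
actual code. The name rankInTopK and the 1-based rank convention are
assumptions for illustration; only the -1 "not found" value comes from
the report.

```go
package main

import "fmt"

// rankInTopK returns the 1-based rank of want among the first k results,
// or -1 if it is absent. The 1-based convention is an assumption; the
// report only confirms that -1 means "not in top-K".
func rankInTopK(results []string, want string, k int) int {
	if k > len(results) {
		k = len(results)
	}
	for i := 0; i < k; i++ {
		if results[i] == want {
			return i + 1
		}
	}
	return -1
}

func main() {
	// Hypothetical result IDs returned for a paraphrased query.
	topK := []string{"w-17", "w-42", "w-08"}
	fmt.Println(rankInTopK(topK, "w-42", 3)) // 2
	fmt.Println(rankInTopK(topK, "w-99", 3)) // -1
}
```

Under this convention, "paraphrase top-1" would mean rank == 1 and
"any-rank in K" would mean rank != -1.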

Implementation:
- New flag --with-paraphrase on the driver (default off).
- New WITH_PARAPHRASE env in the harness (default 1, on for prod runs).
- New paraphrase_* fields on queryRun + summary, with a jq "// 0"
  fallback so re-rendering verbatim-only evidence stays clean.
- generateParaphrase() calls the same judge model with format=json and
  a tight schema; temperature=0.5 for variance without domain drift.
- Markdown report adds a paraphrase per-query table (only when the
  pass ran) and an honesty caveat about judge-also-rephrases coupling.

Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}):

  Verbatim lift               2/2 (100% — Q7 + Q13, both stable from v1)
  Paraphrase top-1            0/2
  Paraphrase any-rank in K    0/2

Both paraphrases dropped the recorded answer OUT of top-K entirely
(rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs
preserved intent ("Hazmat-certified warehouse worker comfortable with
cold storage" → "Warehouse worker with Hazmat certification and
experience in cold storage"). It's the v0 boost-only stance documented
in internal/matrix/playbook.go:22-27: the boost only re-ranks results
that ALREADY surfaced from regular retrieval. If the paraphrase's
cosine retrieval doesn't return the recorded answer in top-K, there is
nothing for the boost to promote.
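
A sketch of why a boost-only re-rank cannot recover a missing answer.
This is not the code in playbook.go; the Hit type, boostOnly, the
scores and the additive bonus are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical retrieval result: an ID plus a similarity score.
type Hit struct {
	ID    string
	Score float64
}

// boostOnly adds bonus to every hit whose ID is in the playbook set and
// re-sorts by score. It never adds hits, so an ID that retrieval did
// not return stays absent no matter how large the bonus is.
func boostOnly(hits []Hit, playbook map[string]bool, bonus float64) []Hit {
	out := make([]Hit, len(hits))
	copy(out, hits)
	for i := range out {
		if playbook[out[i].ID] {
			out[i].Score += bonus
		}
	}
	sort.SliceStable(out, func(a, b int) bool { return out[a].Score > out[b].Score })
	return out
}

func main() {
	playbook := map[string]bool{"w-42": true}

	// Verbatim query: the recorded answer is already in top-K, so it is promoted.
	verbatim := []Hit{{"w-17", 0.9}, {"w-08", 0.8}, {"w-42", 0.7}}
	fmt.Println(boostOnly(verbatim, playbook, 0.5)[0].ID) // w-42

	// Paraphrased query: the recorded answer never surfaced, so nothing changes.
	paraphrase := []Hit{{"w-17", 0.9}, {"w-08", 0.8}, {"w-03", 0.7}}
	fmt.Println(boostOnly(paraphrase, playbook, 0.5)[0].ID) // w-17
}
```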

The "Shape B" upgrade mentioned in the playbook.go comment — inject
playbook hits directly even when they weren't in the top-K — is what
would close this gap. The reality test surfaced exactly the gap the
docs warned about. Worth filing as the next product gate.
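
One way the inject behaviour could look, as a sketch under assumptions
rather than the planned design: injectPlaybook, the Hit type, and the
choice to let the caller assign a score to each recorded answer are all
hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical retrieval result: an ID plus a similarity score.
type Hit struct {
	ID    string
	Score float64
}

// injectPlaybook merges recorded playbook answers into the retrieved
// hits even when retrieval did not return them, then keeps the best k.
// How an injected answer should be scored is an open design question;
// here the caller supplies the score.
func injectPlaybook(hits, recorded []Hit, k int) []Hit {
	seen := make(map[string]bool, len(hits))
	out := make([]Hit, 0, len(hits)+len(recorded))
	for _, h := range hits {
		seen[h.ID] = true
		out = append(out, h)
	}
	for _, r := range recorded {
		if !seen[r.ID] {
			seen[r.ID] = true
			out = append(out, r)
		}
	}
	sort.SliceStable(out, func(a, b int) bool { return out[a].Score > out[b].Score })
	if k >= 0 && len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	// Paraphrased query whose retrieval missed the recorded answer w-42.
	hits := []Hit{{"w-17", 0.9}, {"w-08", 0.8}}
	recorded := []Hit{{"w-42", 0.85}}
	for _, h := range injectPlaybook(hits, recorded, 2) {
		fmt.Println(h.ID) // w-17, then w-42
	}
}
```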

Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2.
HNSW insertion order + judge variance both contribute. Stability of
Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable
signal in the dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:47:41 -05:00