4 Commits

Author SHA1 Message Date
root
4da32ad102 embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in)
Local Ollama has three embedding models loaded:
  nomic-embed-text:latest        137M  768d  (previous default)
  nomic-embed-text-v2-moe:latest 475M  768d  (this commit's default)
  qwen3-embedding:latest         7.6B  4096d (would require dim change)

v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE
architecture. Workers index doesn't need rebuilding, just future ingests
embed with the stronger model.

Run #005 result on the multi-coord stress suite:

  Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9)
    → MoE is more discriminating: zero worker overlap across
      Milwaukee / Indianapolis / Chicago for shared role names.
      The geo + cert + skill context fully separates worker pools.
  Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff)
  Determinism: 1.000 (unchanged)
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)

  200-worker swap: Jaccard 0.000 (unchanged — still perfect)

  Fresh-resume verify: STILL doesn't surface fresh workers in top-8.
    With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39)
    — the embedder is MORE discriminating, but the fresh worker's
    vector still doesn't outrank the 8th-best existing worker. Now
    suspect of being an HNSW post-build add issue (coder/hnsw
    incremental adds can land in hard-to-reach graph regions, not an
    embedder problem). Better embedder didn't fix it; needs a
    different strategy: full index rebuild after fresh adds, or
    explicit playbook-layer score boost for fresh workers, or
    hybrid (keyword + semantic) retrieval. Phase 3 investigation.

Cost: ingest is ~5× slower (workers 20s→100s; ethereal 35s→112s).
Acceptable for the quality jump on diversity. Real production with
incremental ingest won't pay this once-per-deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:26:52 -05:00
root
84a32f0d29 multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap
Three Phase 2 additions land in this commit:

1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific
   worker IDs out of results post-retrieval, AND skips them at the
   playbook boost+inject step (so excluded answers can't sneak back
   via Shape B). Real-world driver: coordinator placed N workers,
   client asks for replacements, system needs alternatives, not the
   same N. Threaded through retrieve.go after merge but before
   metadata filter so excluded IDs don't waste post-filter top-K slots.

2. New harness phase 2b: 200-worker swap simulation. Captures the
   top-K from alpha's warehouse query, then re-issues with
   exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether
   the substrate finds genuine alternatives.

3. New harness phase 1b: fresh-resume mid-run injection. Three new
   workers ingested via /v1/embed + /v1/vectors/index/workers/add,
   then verified findable via semantic queries matching resume content.

Plus Hour labels on every event (operational narrative: 0/6/12/18/
24/30/36/42/48) and a refactor of captureEvent to take hour as a
param.

Run #003 + #004 results (5K workers + 10K ethereal):

  Diversity (#004):
    Same-role-across-contracts Jaccard = 0.080 (n=9)
    Different-roles-same-contract Jaccard = 0.013 (n=18)
  Determinism: 1.000 (#004 unchanged)
  Verbatim handover:  4/4 = 100%
  Paraphrase handover: 4/4 = 100%

  Phase 2b — 200-worker swap (Jaccard 0.000):
    8 originally-placed workers fully replaced by 8 alternatives.
    ExcludeIDs substrate change works end-to-end — boost AND inject
    both honor the exclusion, so excluded workers don't return via
    the playbook either.

  Phase 1b — fresh-resume injection: REAL PRODUCT FINDING.
    Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add
    calls at 200 status, 3 vectors persisted. But none of the 3
    fresh workers surfaced in top-8 even with semantic queries
    matching their resume content (e.g. "Senior tower crane rigger
    NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12
    years tower-crane signaling..." NCCCO + Chicago).
    Top-1 came from existing workers at distance ~0.25; fresh
    workers' distances must be > 0.25, pushing them past rank 8.
    Cause: dense retrieval at 5000+ workers means many existing
    profiles cluster near any specific query in cosine space;
    nomic-embed-text-v2 (137M) introduces enough noise that a
    fresh worker doesn't reliably outrank them just because the
    text content overlaps.
    Workarounds (Phase 3 work): (a) hybrid retrieval (keyword +
    semantic), (b) playbook-layer score boost for fresh adds,
    (c) larger embedder. Documented in run #004 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:19:29 -05:00
root
0fa42a0cc3 multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover
Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).

Both fixed in this commit.

Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.

New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.

Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3
coords + paraphrase handover):

  Diversity (the question J asked: locking or cycling?):
    Same-role-across-contracts Jaccard = 0.119 (n=9)
      → 88% of workers DIFFER across regions for the same role name.
        Milwaukee warehouse vs Indianapolis warehouse vs Chicago
        warehouse pull mostly distinct top-K from the same population.
        The system locks into geo+cert+skill context, not cycling.
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval works (unchanged from Phase 1).

  Determinism: Jaccard = 1.000 (n=12) — unchanged.

  Learning:
    Verbatim handover  4/4 = 100%  (trivial case, expected)
    Paraphrase handover 4/4 = 100% (HARD case — passes!)
      Of those 4 paraphrase recoveries:
        - 2 used boost (Alice's recording was already in Bob's
          paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
        - 2 used Shape B inject (recording wasn't in Bob's
          paraphrase top-K; InjectPlaybookMisses brought it in)

The boost/inject mix is healthy — both paths are used and both
produce correct top-1s. Multi-coord institutional memory propagation
is empirically working under wording variation.

Sample warehouse worker top-1s across contracts (proves diversity):
  alice / Milwaukee     → w-713
  bob   / Indianapolis  → e-8447
  carol / Chicago       → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:03:16 -05:00
root
61c7b55e48 multi-coord stress harness — Phase 1 of 48-hour mock
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.

Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.

Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):

  Diversity:
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval is working perfectly. Different
        roles within one contract pull totally different worker
        pools. System is NOT cycling; locks into per-role retrieval.
    Same-role-across-contracts Jaccard = N/A (n=0)
      → TEST-DESIGN ISSUE: the 3 contracts use distinct role names
        per industry (warehouse worker / production worker / general
        laborer), so no exact-name overlaps exist. Phase 2 should
        either share at least one role across contracts OR add a
        skill-based diversity metric.

  Determinism: Jaccard = 1.000 (n=12)
    → HNSW + Ollama retrieval is fully deterministic on identical
      query text. coder/hnsw + nomic-embed-text are stable.

  Learning: handover hit rate = 4/4 = 100%
    → Bob inherits Alice's recordings perfectly when bob runs
      identical queries with alice's playbook namespace. CAVEAT:
      this tests the trivial verbatim case, not paraphrase handover.
      The harder test (bob runs paraphrased queries with alice's
      playbook) is Phase 2 work.

Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
  jq '.events[] | select(.phase == "merge")'
  jq '.events[] | select(.coordinator == "alice")'
  jq '.events[] | select(.role == "warehouse worker")'

Notable finding from per-event: carol's "general laborer" and "crane
operator" queries both surface w-1009 as top-1, with crane operator
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:55:29 -05:00