2 Commits

Author SHA1 Message Date
root
31b408882b multi_corpus_e2e: WORKERS_LIMIT env knob — and the embed-text-not-sample-size finding
Adds WORKERS_LIMIT env override (default 5000) so the e2e can be
re-run at different sample sizes. Tiny change; the interesting part
is the FINDING that motivated the run.

Investigation: a97881d's reality test put zero Forklift Operators in
the top-6 for "Forklift operator with OSHA-30 certification,
warehouse experience" — instead returned Production Worker / Machine
Operator / Assembler.

Hypothesis tested: maybe the 5000-row sample didn't contain
forklift operators in retrievable density.

Result: hypothesis falsified. Direct probe of workers_500k.parquet:

  All 500K rows         → 55,349 Forklift Operators (11.07%)
                       → 150,328 with "forklift" in certs
                       → 74,852 with OSHA-30 specifically
  First 5K rows         → 569 Forklift Operators (11.38%)
                       → distribution matches global, no ordering bias

So 569 forklift operators were IN the corpus the matrix indexer
searched and STILL didn't surface in top-6. That means the bottleneck
isn't sample size — it's nomic-embed-text + our embed-text template
ranking "Production Worker" / "Machine Operator" / "Assembler" as
semantically nearer to the query than literal "Forklift Operator".

The reality test exposed this faithfully. Three real follow-ups, none
in scope of this commit:

  1. Embed text design — front-loading role + certs (currently
     "Worker role: <role>" then skills then certs) might dominate
     retrieval better. Worth A/B-testing.
  2. Hybrid SQL+semantic — pre-filter by role/certs via queryd
     before semantic ranking. Not in SPEC §3.4 today; would address
     the "available" / "Chicago" gap from the candidates reality
     test (0d1553c) too.
  3. Playbook-memory boost — SPEC §3.4 component 5. When a query
     "Forklift OSHA-30" was answered with worker w-X in the past,
     boost w-X's score for similar future queries. The retrieval
     gap CAN be bridged by the learning loop without changing the
     base embedder.

Commits the env knob; the finding lives in the commit body so future
sessions don't re-run the sample-size hypothesis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:32 -05:00
root
a97881d80c workers corpus + multi-corpus reality test — matrix indexer end-to-end
Lands the second real-data corpus (workers_500k) and the first
multi-corpus reality test through /v1/matrix/search composing both
corpora live.

What's new:
  - scripts/staffing_workers/main.go — parquet driver over
    workers_500k.parquet, multi-chunk arrow handling (workers
    parquet has multiple row groups vs candidates' one). Embed text:
    role + skills + certifications + city + state + archetype +
    resume_text. IDs prefixed "w-".
  - scripts/multi_corpus_e2e.sh — first end-to-end test composing
    both corpora through the matrix indexer.

Real-data multi-corpus result (this commit):
  Query: "Forklift operator with OSHA-30 certification, warehouse
          experience"
  Corpora: workers (5000 rows) + candidates (1000 rows)
  Merged top-8: workers=6, candidates=2

  Top hits:
    w  d=0.327  w-4573  Production Worker
    w  d=0.353  w-1726  Machine Operator
    w  d=0.362  w-3806  Production Worker
    w  d=0.366  w-1000  Machine Operator
    w  d=0.374  w-1436  Assembler
    w  d=0.395  w-162   Machine Operator
    c  d=0.440  c-CAND-00727  C#,.NET,Azure
    c  d=0.446  c-CAND-00031  React,TypeScript,Node

The matrix indexer correctly chose the right domain — manufacturing/
warehouse roles in workers (correct semantic match for the staffing
query) rank ABOVE software-engineer candidates from the candidates
corpus. 0.11 gap between the worst worker (0.395) and the best
candidate (0.440) — clean distance separation.

Compared to the candidates-only e2e run from 0d1553c:
  candidates-only top: c-CAND-00727 at d=0.4404
  multi-corpus top:    w-4573 at d=0.3265 (a Production Worker)

That's the matrix indexer's whole point made visible: composing
domain-distinct corpora surfaces better matches than single-corpus
search. Without workers in the search space, the staffing query
returned software engineers (wrong domain). With workers, it
returns roles in the right ballpark.

What's still imperfect (signal for component 5 + future work):
  - No top-6 worker actually has "Forklift" or "OSHA-30" visible in
    metadata; "Production Worker" is semantically nearest in this
    sample. Likely needs a larger workers ingest (5000 from 500K)
    or skill-keyword boost.
  - Status/availability still not gated. The staffing-side
    structured filtering gap from 0d1553c persists; relevance filter
    (CODE-aware) doesn't address it.

Pipeline timings:
  workers ingest: 5000 rows / 19.2s = 260/sec end-to-end
  candidates ingest: 1000 rows / 3.1s = 322/sec
  multi-corpus query (text → embed → 2 parallel vectord → merge): 14ms

14-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:22:16 -05:00