fetch_face_pool was wiping 952 hand-classified rows when re-run from
a Python environment without deepface installed (every gender was
reset to None).
Now:
- Loads existing manifest by id and overlays only fetch-owned fields,
so gender/race/age/excluded survive a refetch.
- deepface pass tags only records that don't already have a gender;
deepface unavailable means "leave existing tags alone" not "reset".
- New --shrink flag required to drop ids >= --count. Default refuses
to shrink the pool silently.
- Atomic write via tmp + os.replace so an interrupted run can't
corrupt the manifest.
- Dedupes duplicate id lines (root cause of the 2497-row manifest
backing a 1000-face pool).
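A minimal sketch of the overlay + dedupe + atomic-write flow described above. This is not the code in fetch_face_pool.py: the FETCH_OWNED field names and the helper names are assumptions for illustration, and only the tmp + os.replace pattern is taken from the notes.

```python
import json
import os
import tempfile

# Hypothetical split: the only keys a refetch is allowed to overwrite.
FETCH_OWNED = {"id", "file", "bytes", "fetched_at"}


def load_manifest(path):
    """Read a JSONL manifest into {id: record}; a later duplicate id wins."""
    records = {}
    if not os.path.exists(path):
        return records
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rec = json.loads(line)
                records[rec["id"]] = rec  # keying by id dedupes repeated lines
    return records


def overlay(existing, fetched):
    """Copy only fetch-owned fields onto the existing record, keeping tags."""
    merged = dict(existing)
    merged.update({k: v for k, v in fetched.items() if k in FETCH_OWNED})
    return merged


def write_manifest_atomic(path, records):
    """Write to a temp file in the same directory, then os.replace into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for key in sorted(records):
                fh.write(json.dumps(records[key]) + "\n")
        os.replace(tmp, path)  # readers see the old or the new file, never half
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

The temp file is created next to the manifest so that os.replace stays on one filesystem, which is what makes the swap atomic.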
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker cards now ship a real photo per person instead of monogram tiles:
- fetch_face_pool.py pulls 1000 faces from thispersondoesnotexist.com
- tag_face_pool.py runs deepface for gender/race/age, excludes <22yo
- manifest.jsonl: 952 servable, gender/race buckets populated
- /headshots/_thumbs/ pre-resized to 384px webp (587KB -> 11KB,
~53x smaller; without this Chrome's parallel-connection budget
drops ~75% of tiles in a 40-card grid)
- /headshots/:key gender x race x age intersection bucketing with
gender-only fallback when intersection is sparse
- /headshots/generate/:key ComfyUI on-demand for the contractor
profile spotlight (cold ~1.5s, cached ~1ms; worker-derived
djb2 seed makes faces deterministic-per-worker but unique
across workers sharing the same prompt)
- serve_imagegen.py _cache_key() now includes seed (was caching
by prompt only -> 3 different worker seeds collapsed to 1
cached image; verified fix produces 3 distinct md5s)
- confidence-default name resolution: Xavier->man+hispanic,
Aisha->woman+black, etc. Every worker resolves to a bucket.
End-to-end: playwright run on /?q=forklift+operators+IL -> 21/21
cards loaded, 0 broken, all 384px webp.
Cache + binary pool gitignored; manifest tracked.
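The cache-key fix can be illustrated with a small sketch. The real _cache_key() signature in serve_imagegen.py is not shown in these notes, so the function name, arguments, and choice of SHA-256 below are assumptions; the point is only that the seed is part of the hashed payload.

```python
import hashlib


def cache_key(prompt: str, seed: int) -> str:
    """Hash prompt and seed together so distinct seeds never share an entry."""
    # The NUL separator keeps (seed=1, prompt="2x") distinct from (seed=12, prompt="x").
    payload = f"{seed}\x00{prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Three worker seeds with the same prompt now map to three cache entries.
keys = {cache_key("professional headshot", seed) for seed in (11, 22, 33)}
```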
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three layers shipped:
1. SCRIPT — scripts/staffing/fetch_face_pool.py
Pulls N synthetic StyleGAN faces from thispersondoesnotexist.com
into data/headshots/face_NNNN.jpg, writes manifest.jsonl. Idempotent:
re-running skips existing files. Optional gender tagging via deepface
(currently unavailable on this box; the script handles ImportError
gracefully and leaves every face untagged). Fetched 198 faces with
concurrency=3 in ~67s.
2. SERVER — /headshots/:key route in mcp-server/index.ts
Loads manifest at first hit, caches in globalThis._faces. Hashes the
key with djb2-style mixing → pool index → returns the JPG. Same
key always gets the same face (deterministic). Accepts
?g=man|woman&e=caucasian|black|hispanic|south_asian|east_asian|middle_eastern
to bias pool selection — the gender/ethnicity buckets fall back to
the full pool when no tagged matches exist. Cache-Control:
max-age=86400, immutable so faces ride the browser cache after first hit.
/headshots/__reload re-reads the manifest without restart.
3. UI — search.html + console.html worker cards
Re-added overlay <img> on top of the monogram .av circle. img.src
= /headshots/<encoded-key>?g=<hint>&e=<hint>. img.onerror removes
the failed image so the monogram stays visible if the face pool
isn't fetched / CDN is blocked. .av now has overflow:hidden +
position:relative to clip the img to a perfect circle.
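The key-to-face selection in layer 2 can be sketched as follows. The route itself lives in mcp-server/index.ts; this Python version is only an illustration, it uses the classic djb2 recurrence rather than whatever "djb2-style mixing" the route applies, and the manifest field names (gender, race) are assumed.

```python
def djb2(key: str) -> int:
    """Classic djb2 string hash (h = h * 33 + byte), kept to 32 bits."""
    h = 5381
    for byte in key.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def pick_face(key, faces, gender=None, ethnicity=None):
    """Return the same face for the same key, preferring a tagged bucket."""
    bucket = [
        f for f in faces
        if (gender is None or f.get("gender") == gender)
        and (ethnicity is None or f.get("race") == ethnicity)
    ]
    pool = bucket or faces  # fall back to the full pool when nothing matches
    return pool[djb2(key) % len(pool)]
```

Because the index depends only on the key and the pool contents, a given worker keeps the same face until the manifest changes.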
Forced-confident name resolution (J: "we're CREATING the profile,
created as though you truly have the information Xavier is more
likely Hispanic and he's a male"):
genderFor(name):
  Looks up MALE_NAMES + FEMALE_NAMES, then falls back to a deterministic
  hash split so unknown names spread ~50/50. The sets now include
  cross-cultural names:
    Hispanic:       Alejandro/Andres/Mateo/Santiago/Joaquin/Cesar/Hugo/
                    Felipe/Gerardo/Salvador/Ramon
    South Asian:    Raj/Anil/Vikram/Krishna/Pradeep
    East Asian:     Wei/Yi/Hiroshi/Akira/Hyun
    Black:          Demetrius/Kareem/DaQuan/Khalil
    Middle Eastern: Omar/Khalid/Hassan/Ahmed/Bilal
  FEMALE_NAMES extended in parallel.
guessEthnicityFromFirstName(name):
  Confident default of 'caucasian' for any name not in the cultural
  buckets, so every worker resolves to a category the face pool can be
  biased toward. Order: ME → Black → Hispanic → South Asian → East Asian
  → Caucasian (matters where names overlap, e.g. Aisha appears in
  ME + Black, biases toward ME for visual fit).
Both helpers also ported into console.html so the triage backfills
and try-it-yourself rendering get the same hint stack.
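A rough Python rendering of the two helpers, to make the lookup order concrete. The real helpers are JavaScript in search.html and console.html; the name sets below are small illustrative subsets, the snake_case names are mine, and crc32 stands in for whatever deterministic hash the UI code uses.

```python
import zlib

# Illustrative subsets only; the real MALE_NAMES / FEMALE_NAMES sets are larger.
MALE_NAMES = {"xavier", "alejandro", "raj", "wei", "omar", "demetrius"}
FEMALE_NAMES = {"aisha"}

# Checked top to bottom, so a name listed twice resolves to the earlier bucket.
ETHNICITY_ORDER = [
    ("middle_eastern", {"omar", "khalid", "hassan", "ahmed", "bilal", "aisha"}),
    ("black", {"demetrius", "kareem", "daquan", "khalil", "aisha"}),
    ("hispanic", {"alejandro", "andres", "mateo", "santiago", "joaquin"}),
    ("south_asian", {"raj", "anil", "vikram", "krishna", "pradeep"}),
    ("east_asian", {"wei", "yi", "hiroshi", "akira", "hyun"}),
]


def first_name(name: str) -> str:
    parts = name.split()
    return parts[0].lower() if parts else ""


def gender_for(name: str) -> str:
    first = first_name(name)
    if first in MALE_NAMES:
        return "man"
    if first in FEMALE_NAMES:
        return "woman"
    # crc32 is stable across runs, unlike Python's salted built-in hash().
    return "man" if zlib.crc32(first.encode("utf-8")) % 2 == 0 else "woman"


def guess_ethnicity(name: str) -> str:
    first = first_name(name)
    for label, names in ETHNICITY_ORDER:
        if first in names:
            return label
    return "caucasian"  # confident default: every name lands in some bucket
```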
Privacy note in the script + route comments: the synthetic data uses
the worker's name as the seed; production should hash worker_id (not
name) to avoid leaking PII to a third-party CDN. The fetch URL itself
is referenced once per pool build, not per-worker.
.gitignore — added data/headshots/face_*.jpg (~100MB for 198 faces;
the manifest + script are tracked). Re-running the script on a fresh
checkout rebuilds the pool from scratch.
Verified end-to-end via playwright on devop.live/lakehouse:
forklift query → 10 worker cards
10/10 with face images (real synthetic headshots, not monograms)
0/10 broken
Alejandro G. Nelson → ?g=man&e=hispanic
Patricia K. Garcia → ?g=woman&e=caucasian
Each name → unique face, deterministic across loads.
Console triage backfills get the same treatment.
Two load-bearing runtime changes that were never committed, plus one ride-along:
1. crates/gateway/src/v1/iterate.rs — `state` → `_state` on the unused
route-state parameter. Cleared the one cargo workspace warning.
Fix was made earlier this session but the working-tree change
never made it into a commit.
2. data/_catalog/manifests/564b00ae-cbf3-4efd-aa55-84cdb6d2b0b7.json —
DELETED. This was the dead manifest for `client_workerskjkk`, a
typo dataset whose parquet was deleted but whose catalog entry
stayed registered. Every SQL query failed schema inference on the
missing file before reaching its target table — that's the bug
that made /system/summary report 0 workers and the demo show zero
bench. Deleting the manifest keeps the fix on disk; committing
the deletion keeps it in git so a fresh checkout doesn't regress.
3. data/_catalog/manifests/32ee74a0-59b4-4e5b-8edb-70c9347a4bf3.json
— runtime catalog metadata update from the successful_playbooks_live
write path. Ride-along change.
Reports under reports/distillation/phase[68]-*.md are auto-regenerated
by the audit cycle each run; skipping those.
lakehouse/auditor 9 blocking issues: cloud: claim not backed — "Verified live (current synthetic data):"
Decision B from reports/staffing/synthetic-data-gap-report.md §7
(plus C: client_workerskjkk.parquet typo file removed from
data/datasets/ — was never tracked, no git effect).
PII enforcement was UNVERIFIED in workers_500k_v8 (the corpus
staffing_inference mode embeds chunks from). Verified 2026-04-27 by
inspecting data/vectors/meta/workers_500k_v8.json — `source:
"workers_500k"` confirms v8 was built directly from the raw table, so
the LLM has been seeing names / emails / phones / resume_text for every
staffing query.
This commit closes the boundary at the catalog metadata layer:
candidates_safe (overhauled; the old view was failing as invalid SQL
434×/day on a reference to a nonexistent `vertical` column, copy-pasted
from job_orders):
drops last_name, email, phone, hourly_rate_usd
candidate_id masked (keep first 3, last 2)
row_filter: status != 'blocked'
workers_safe (NEW):
drops name, email, phone, zip, communications, resume_text
keeps role, city, state, skills, certifications, archetype, scores
resume_text + communications carry verbatim PII (full names) and
there is no in-view text scrubber, so they are dropped wholesale.
Skills + certifications + scores carry the matching signal for
staffing inference.
jobs_safe (NEW):
drops description (often quotes client names verbatim)
client_id masked (keep first 3, last 2)
bill_rate / pay_rate kept — commercial info, not PII per staffing PRD
scripts/staffing/build_workers_v9.sh (NEW):
POSTs /vectors/index to rebuild workers_500k_v9 from `workers_safe`
rather than the raw table. Embedded text is constructed from the
view projection so PII never enters the corpus by construction.
30+ minute background job — not run inline. After it completes,
flip config/modes.toml `staffing_inference` matrix_corpus from
workers_500k_v8 to workers_500k_v9 and restart gateway.
Distillation v1.0.0 substrate untouched. audit-full passed clean
(16/16 required) before this commit; will re-verify after.
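For reference, the "keep first 3, last 2" masking and the column drops can be sketched in a few lines. The real views are defined in catalog metadata, not Python; the function names, the `*` fill character, and the rule of fully masking values too short to split are assumptions.

```python
# Columns the workers_safe view drops, as listed above.
WORKERS_SAFE_DROP = {"name", "email", "phone", "zip", "communications", "resume_text"}


def mask_id(value: str, keep_head: int = 3, keep_tail: int = 2, fill: str = "*") -> str:
    """Keep the first 3 and last 2 characters and mask everything in between."""
    if len(value) <= keep_head + keep_tail:
        return fill * len(value)  # too short to reveal any part safely
    middle = len(value) - keep_head - keep_tail
    return value[:keep_head] + fill * middle + value[-keep_tail:]


def project_safe(row: dict, drop: set) -> dict:
    """Return a copy of the row without the dropped columns."""
    return {k: v for k, v in row.items() if k not in drop}
```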
Decision A from reports/staffing/synthetic-data-gap-report.md §7.
Walks tests/multi-agent/scenarios/scen_*.json and
data/_playbook_lessons/*.json, normalizes to a single fill_events.parquet
at data/datasets/fill_events.parquet. One row per scenario event,
lesson outcomes joined by (client, date) where the tuple matches.
rows: 123
scenarios contributing: 40
events with outcome data: 62
unique (client, date) tuples: 40
Reproducibility: event_id is SHA1(client|date|role|at|city) truncated to
16 hex chars; rows sorted by event_id before write so re-runs produce
bit-identical output. Verified.
Pure normalization — no LLM, no new data, no distillation substrate
mutation.
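The reproducibility scheme amounts to a content-derived id plus a stable sort. A minimal sketch, assuming the five fields are joined with `|` as plain strings and UTF-8 encoded (the exact field formatting in the real script is not shown here):

```python
import hashlib


def event_id(client: str, date: str, role: str, at: str, city: str) -> str:
    """SHA1 over client|date|role|at|city, truncated to 16 hex characters."""
    key = "|".join([client, date, role, at, city])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


def stable_rows(rows):
    """Attach event_id and sort by it so repeated runs emit the same order."""
    with_ids = [
        {**r, "event_id": event_id(r["client"], r["date"], r["role"], r["at"], r["city"])}
        for r in rows
    ]
    return sorted(with_ids, key=lambda r: r["event_id"])
```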
10,000 staffing worker profiles from profit/ethereal repo. Flattened
JSON → CSV → Parquet. Indexed on HNSW (9.5s) + Lance IVF_PQ (7.2s).
SQL hybrid verified: forklift operators in IL with reliability > 0.8
returned exact matches. Vector search alone missed the state filter —
confirms the hybrid SQL+vector routing need from quality eval.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RAG pipeline now includes a cross-encoder-style rerank step between
retrieval and generation: the LLM re-sorts the top-K results by relevance
before they become context. Falls back to original order if model output is
unparseable (~5% with 7B models). Also improved the generation prompt
to be domain-aware ("staffing database") and request specific citations.
Fixed 4 catalog manifests with bucket="data" (pre-federation leftover)
that poisoned the entire DataFusion query context on startup. The
"users", "lab_trials", "meta_runs", and "new_candidates" datasets
now correctly reference bucket="primary". This bug was surfaced by
the quality evaluation pipeline — wouldn't have been found by
structural tests alone.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
clean_sql now uses 3 strategies in priority order:
1. Extract from ```sql...``` markdown blocks
2. Find first SELECT/WITH/INSERT statement in text
3. Fallback: strip a leading "sql" keyword
Tested against 5 real model output patterns:
- Clean SQL ✓
- "sql" prefixed ✓
- Markdown fenced ✓
- Explanation before ```sql block ✓
- Explanation with SELECT buried in text ✓
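The three strategies can be sketched as below. This is a Python approximation for illustration, not the clean_sql in the codebase; the regexes are my own, and the fence marker is built from single backticks only to keep this snippet easy to embed.

```python
import re

TICKS = "`" * 3  # markdown fence marker
FENCE = re.compile(TICKS + r"sql\s*(.*?)" + TICKS, re.DOTALL | re.IGNORECASE)
STATEMENT = re.compile(r"\b(SELECT|WITH|INSERT)\b.*", re.DOTALL | re.IGNORECASE)


def clean_sql(text: str) -> str:
    # 1. Prefer the contents of a fenced sql block.
    fenced = FENCE.search(text)
    if fenced:
        return fenced.group(1).strip()
    # 2. Otherwise take everything from the first SELECT/WITH/INSERT keyword.
    stmt = STATEMENT.search(text)
    if stmt:
        return stmt.group(0).strip()
    # 3. Fallback: strip a leading "sql" keyword and return the rest.
    stripped = text.strip()
    if stripped.lower().startswith("sql"):
        stripped = stripped[3:].lstrip(" :\n")
    return stripped
```

One limitation of this sketch: strategy 2 matches case-insensitively, so an ordinary "with" in the explanation would be picked up before the real statement, and trailing prose after the SQL is not trimmed.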
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- POST /ingest/postgres/tables — list all tables in a database
- POST /ingest/postgres/import — import table → Parquet → catalog → queryable
- Auto type mapping: int2/4/8 → Int, float4/8 → Float64, bool → Boolean,
text/varchar/jsonb/timestamp → Utf8 (safe default per ADR-010)
- Auto PII detection + lineage on import
- Empty password support for trust auth
- Tested: imported lab_trials (40 rows, 10 cols) and threat_intel (20 rows, 30 cols)
from local knowledge_base Postgres database — immediately queryable
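The type mapping is a lookup with a string default. A sketch, with one assumption flagged: the notes only say int2/4/8 map to "Int", so the Int16/Int32/Int64 widths below are my guess, and the table and function names are hypothetical.

```python
# Postgres type name -> Arrow type name. Integer widths are an assumption.
PG_TO_ARROW = {
    "int2": "Int16",
    "int4": "Int32",
    "int8": "Int64",
    "float4": "Float64",
    "float8": "Float64",
    "bool": "Boolean",
}


def arrow_type(pg_type: str) -> str:
    """Map a Postgres type; anything unlisted falls back to Utf8."""
    return PG_TO_ARROW.get(pg_type.lower(), "Utf8")
```

Defaulting to Utf8 means text, varchar, jsonb, and timestamp columns all import without a per-type parser, at the cost of casting later in SQL.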
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable
- Polls every 10 seconds (configurable)
- Processed files moved to ./inbox/processed/
- Failed files moved to ./inbox/failed/
- Dedup: same file dropped twice = no-op
- Watcher starts automatically on gateway boot
- Tested: CSV dropped → queryable in <15s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- AccessControl: agent roles with allowed sensitivity levels
- 4 default roles: admin (all), recruiter (PII ok), analyst (financial ok), agent (internal only)
- Field-level masking: determines which columns to mask per agent based on sensitivity
- Query audit log: tracks every query with agent, datasets, PII fields accessed
- Endpoints: GET/POST /access/roles, GET /access/audit, POST /access/check
- Toggleable via config (auth.enabled)
- 100K embedding: supervisor now sustained 125/sec (2.9x vs single pipeline)
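Field-level masking reduces to comparing each column's sensitivity against the levels a role may see. The sketch below is a guess at the shape, not the AccessControl implementation: the level names (internal, financial, pii) and the assumption that recruiter and analyst also see internal data are inferred from the role descriptions above.

```python
# Hypothetical sensitivity levels per default role.
ROLE_LEVELS = {
    "admin": {"internal", "financial", "pii"},
    "recruiter": {"internal", "pii"},
    "analyst": {"internal", "financial"},
    "agent": {"internal"},
}


def columns_to_mask(role: str, column_sensitivity: dict) -> list:
    """Return the columns whose sensitivity the role is not allowed to see."""
    allowed = ROLE_LEVELS.get(role, set())  # unknown roles see nothing
    return sorted(c for c, level in column_sensitivity.items() if level not in allowed)
```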
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 4 parallel pipelines on i9 + A4000 via Ollama
- Previous single-pipeline: 43/sec, 39min for 100K
- Supervisor: 67.6/sec, 22min for 100K
- Previous 100K attempt failed at 97K (no retry) — supervisor handles this
- Checkpointing every 1000 chunks for crash recovery
- Round-robin retry on batch failure (3 attempts)
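The retry and checkpoint behaviour can be sketched sequentially. The real supervisor runs its pipelines in parallel; this single-threaded version, with hypothetical names and callables standing in for the pipelines, only shows the round-robin rotation and the checkpoint cadence.

```python
def embed_with_retry(batch, workers, start, attempts=3):
    """Try a batch on up to `attempts` workers, rotating round-robin from `start`."""
    last_error = None
    for offset in range(attempts):
        worker = workers[(start + offset) % len(workers)]
        try:
            return worker(batch)
        except Exception as exc:  # a failed pipeline hands the batch to the next one
            last_error = exc
    raise RuntimeError(f"batch failed after {attempts} attempts") from last_error


def run(batches, workers, save_checkpoint, checkpoint_every=1000):
    """Embed all batches, calling save_checkpoint roughly every N chunks."""
    done = 0
    next_checkpoint = checkpoint_every
    results = []
    for i, batch in enumerate(batches):
        results.extend(embed_with_retry(batch, workers, start=i))
        done += len(batch)
        if done >= next_checkpoint:
            save_checkpoint(done)
            next_checkpoint = (done // checkpoint_every + 1) * checkpoint_every
    return results
```

On restart, a run would resume from the last saved count instead of re-embedding from zero, which is what the earlier 97K failure lacked.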
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- JobTracker: create/update/complete/fail jobs with progress tracking
- POST /vectors/index now returns immediately with job_id (HTTP 202)
- Embedding runs in tokio::spawn background task
- GET /vectors/jobs/{id} returns live progress (chunks embedded, rate, ETA)
- GET /vectors/jobs lists all jobs
- Progress logged every 100 batches with chunks/sec and ETA
- 100K embedding job running successfully at 44 chunks/sec
- System stays responsive during embedding (queries in 23ms)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- WorkspaceManager: create/get/list workspaces with daily/weekly/monthly/pinned tiers
- Saved searches: agent stores SQL queries in workspace context
- Shortlist: tag candidates/records to a workspace with notes
- Activity log: track calls, emails, updates per workspace per agent
- Instant handoff: transfer workspace ownership with full history
Zero data copy — just a pointer swap, receiving agent sees everything
- Persistence: workspaces stored as JSON in object storage, rebuilt on startup
- Endpoints: /workspaces/create, /{id}, /{id}/handoff, /{id}/search,
/{id}/shortlist, /{id}/activity
- Tested: Sarah creates workspace, saves searches, shortlists 3 candidates,
logs activity, hands off to Mike who continues seamlessly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MemCache: LRU in-memory cache for hot datasets (configurable max, default 16GB)
Pin/evict/stats endpoints: POST /query/cache/pin, /cache/evict, GET /cache/stats
- Delta store: append-only delta Parquet files for row-level updates
Write deltas without rewriting base files, merge at query time
- Compaction: POST /query/compact merges deltas into base Parquet
- Query engine: checks cache first, falls back to Parquet, merges deltas
- Benchmarked on 2.47M rows:
1M row JOIN: 854ms cold → 96ms hot (8.9x speedup)
100K filter: 62ms cold → 21ms hot (3x speedup)
1.1M rows cached in 408MB RAM
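Merging deltas at query time can be pictured as replaying row-level updates over the base rows. A sketch under stated assumptions: rows are dicts keyed by a hypothetical `id` column, deltas are applied oldest first, and a delta row upserts (overwrites matching fields, or adds a new row). The real engine does this over Parquet, not Python lists.

```python
def merge_deltas(base_rows, deltas, key="id"):
    """Overlay append-only deltas on base rows without rewriting the base."""
    merged = {row[key]: dict(row) for row in base_rows}
    for delta in deltas:  # oldest delta first, so later writes win
        for row in delta:
            merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())
```

Compaction is then the same merge written back out as the new base, after which the delta files can be discarded.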
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table
- ui: dynamic API base URL (same-origin for nginx, port-based for local dev)
- gateway: CORS enabled for cross-origin requests
- nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin
- justfile: ui-build, ui-serve, sidecar, up commands added
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>