Reset gateway audit substrate after /tmp wipe disabled it on reboot:
- LH_SUBJECT_AUDIT_KEY moved /tmp/lakehouse_audit/ → /etc/lakehouse/
(canonical persistent path per spec line 112; /tmp wipes on reboot
and silently disabled /audit + /biometric endpoints)
- Fresh 32B HMAC + 44-char legal token at /etc/lakehouse/, mode 0400
- Systemd drop-in updated; gateway restarted; both endpoints 200
- Pre-rotation chains for WORKER-{1..5} (backfill data) will now
tamper-detect under the new key — expected and correct on rotation
Anchor wave-table backfilled with 3 commits that landed after the
last STATE_OF_PLAY refresh on 2026-05-03 evening:
- 7e0112b: retention_sweep stray indent fix
- 848a458: Phase 1.6 Gate 5 erasure endpoint POST /biometric/.../erase
- 8ec43e0: Phase 1.6 Gate 3b deepface integration design doc
Phase 1.6 status table: Gate 5 → eng-DONE; Gate 3b → design-doc-shipped
(recommends Option C defer). Calendar bottleneck text updated.
.gitignore extended for runtime ephemera that surfaced this session:
- data/biometric/ (BIPA-quarantined photos, regulated data)
- reports/scrum/ (local-only review forensics per feedback_audit_findings_log.md)
- experiments/ (per "experiments stay out of tracked tree" policy)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
STATE OF PLAY — Lakehouse
Last verified: 2026-05-03 evening CDT
Verified by: live probe (gateway restarted 2x, all 11 catalogd subject tests + 11 biometric tests + 6 audit tests + 4 mcp-server Gate-4 tests green; cross-runtime parity 6/6 byte-identical against live audit logs; live curl roundtrip on /biometric returned 200 + chained audit row), not memory.
Read this FIRST. When the user says "we're working on lakehouse," they mean the working code captured below — NOT what
git log framed as "the cutover" or what memory snapshots from 2 days ago suggest. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.
WHAT LANDED 2026-05-03 (16 commits this wave — local-first audit substrate + Phase 1.6 BIPA gates)
The dominant work today: docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md Steps 1-8 SHIPPED end-to-end + 5 of 7 Phase 1.6 BIPA pre-launch gates + a 6th cross-runtime parity probe. The wave was structured as eight ship-then-scrum cycles — every cycle caught real bugs, and every fix wave landed within the same session.
| Commit | What | Verified |
|---|---|---|
| d259909 | catalogd: Step 1 — SubjectManifest type + Registry CRUD | 17 catalogd subject tests PASS |
| d16131b | catalogd: Step 2 — SubjectAuditWriter HMAC-SHA256 chain + per-subject Mutex + canonical-JSON via BTreeMap (see the chain sketch after this table) | tamper-detection + concurrent-append race tests PASS |
| bce6dfd | catalogd: Step 3 — bin/backfill_subjects (BIPA-defensible defaults: vertical=unknown, consent=pending_backfill_review, retention=4yr) | 100 subjects loaded into live catalog |
| fef1efd | gateway: Step 4 — wire SubjectAuditWriter into /v1/chat tool dispatch + audit_subject_hits_in (inline, not spawn) | tool calls log accessor.kind="gateway_lookup" |
| cd8c59a | gateway: Step 5 — AuditingWorkerLookup decorator wraps validator's WorkerLookup; spawns audit on every find() | live /v1/validate produces audit rows |
| e38f357 | subjects Steps 1-4 fixes from cross-lineage scrum (concurrency race, schema-evolution HMAC drift, hardcoded "success" classifier) | 41 tests green |
| 15cfd76 | catalogd + gateway: Step 6 — /audit/subject/{id} legal-tier endpoint with constant-time-eq token check + tampering detection | live curl returns chain_verified=true |
| 2a4b316 | subjects 2nd scrum fix wave (token min 16→32, chain_root from full chain via chain_tip(), rebuild collision warn, tightened result-state heuristic) | 17 catalogd + 6 gateway tests PASS |
| 8fc6238 | catalogd: Step 7 — bin/retention_sweep (BIPA-aware on biometric clock, idempotent across daily runs, no auto-mutation) | 8 sweep tests PASS, live verified at --as-of 2031-06-01 flagging 100/100 expired |
| 2413c96 | catalogd: Step 8 — bin/parity_subject_audit (Rust side of cross-runtime parity probe) | known-answer + verify modes match Go byte-for-byte |
| 2222227 | parity helper hardening (panic-noise → die() helper, abs path stripped from doc) from scrum | parity probe still 6/6 |
| 4708717 | Phase 1.6 BIPA wave — Gate 4 absence test (4/4 with bypass coverage), §2 attestation script (3/3 evidence checks), Gate 1/2/5 doc scaffolds with ⚖ COUNSEL markers | 4/4 mcp-server Bun tests, 3/3 evidence on live data |
| c7aa607 | Phase 1.6 scrum fixes — schema fingerprint hashes name+type+nullable, Gate 4 catches object-literal + class-field bypasses, pyarrow dep gate, item 7 deferral rationale | 4/4 + 3/3 still pass |
| f1fa6e4 | Phase 1.6 Gate 3a — crates/catalogd/src/biometric_endpoint.rs: POST /biometric/subject/{id}/photo with consent gate, quarantined storage (mode 0700/0600), audit chain link, BiometricCollection field on SubjectManifest | 11 unit tests PASS, live roundtrip 200 |
| 3708e6a | Gate 3a scrum fixes — transactional rollback on audit failure (BIPA convergent BLOCK), Content-Type parameter handling, relative data_path, ts+uuid filename, dead code removed | 11 tests + cross-runtime parity 6/6 |
| 7e0112b | retention_sweep: stray indent fix on biometric_collection field | sweep tests still 8/8 |
| 848a458 | Phase 1.6 Gate 5 — POST /biometric/subject/{id}/erase per BIPA destruction runbook. Two scopes (biometric_only / full); audit row appended BEFORE photo unlink so the chain has legal proof of intent even if file delete fails; manifest rolled back on audit failure. Trigger taxonomy: retention_expiry / consent_withdrawal / rtbf / court_order. | 21 unit tests (10 erasure-specific) PASS |
| 8ec43e0 | Phase 1.6 Gate 3b — deepface integration design doc (Option A subprocess / Option B ONNX-in-Rust / Option C defer). Recommends C: BIPA-safest, classifications field stays None, all load-bearing surfaces (consent + audit + retention + erasure) ship without it. Forces "do we actually need classifications" to be answered by product, not spec inertia. | doc-only |
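Commit d16131b's chain construction is the load-bearing piece of the audit substrate. A minimal sketch of the idea — canonical JSON via BTreeMap so key order is byte-stable, each row's HMAC covering the previous row's MAC — assuming the `hmac`, `sha2`, `hex`, and `serde_json` crates; the field names and genesis value are illustrative, not the shipped schema.

```rust
use std::collections::BTreeMap;

use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

/// One audit row. BTreeMap keeps keys sorted, so serde_json emits a
/// canonical, byte-stable encoding regardless of insertion order.
/// (Field names are illustrative, not the shipped schema.)
fn canonical_row(subject_id: &str, event: &str, prev_mac_hex: &str) -> String {
    let mut row = BTreeMap::new();
    row.insert("event", event.to_string());
    row.insert("prev", prev_mac_hex.to_string());
    row.insert("subject_id", subject_id.to_string());
    serde_json::to_string(&row).expect("a map of strings always serializes")
}

/// MAC over the canonical bytes; chaining comes from embedding `prev`.
fn mac_hex(key: &[u8], canonical: &str) -> String {
    let mut mac = HmacSha256::new_from_slice(key).expect("HMAC accepts any key length");
    mac.update(canonical.as_bytes());
    hex::encode(mac.finalize().into_bytes())
}

fn main() {
    let key = b"demo-key-not-the-real-one"; // real key is the 32-byte file on disk
    let genesis = "0".repeat(64); // chain-root placeholder

    let row1 = canonical_row("WORKER-1", "gateway_lookup", &genesis);
    let tip1 = mac_hex(key, &row1);

    let row2 = canonical_row("WORKER-1", "erase:biometric_only", &tip1);
    let tip2 = mac_hex(key, &row2);

    // Verification walks the chain and recomputes every MAC; any edited row
    // (or a rotated key) breaks the chain from that row forward.
    println!("{row1}\n{tip1}\n{row2}\n{tip2}");
}
```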
Cross-runtime parity (post-this-wave): 6 probes, 38/38 byte-identical assertions —
validator(6/6) + extract_json(12/12) + session_log(4/4) + materializer(2/2) + embed(8/8) + subject_audit(6/6).
Run `cd /home/profit/golangLAKEHOUSE && for p in scripts/cutover/parity/*.sh; do bash "$p"; done` to re-verify.
Three runtime-divergence classes caught + fixed by the parity probe authoring loop (cataloged because they recur):
- Go `omitempty` on string fields strips empty values that Rust serde always emits → broken HMAC
- Go `time.RFC3339Nano` truncates trailing-zero nanoseconds where chrono `AutoSi` keeps 9 digits → broken HMAC
- Go `json.Marshal` HTML-escapes `<`, `>`, `&` where serde keeps literals → broken HMAC on any field with those chars
All three have regression-locked tests; structural impossibility going forward.
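A minimal Rust-side sketch of pinning those three behaviors so both runtimes hash identical bytes: empty strings always emitted, timestamps pre-rendered with a fixed 9-digit fraction, and serde_json's no-HTML-escaping default relied on (the Go side has to disable SetEscapeHTML to match). The crate choices (chrono, serde_json) and the row shape are illustrative, not the shipped parity helper.

```rust
use chrono::{DateTime, SecondsFormat, Utc};
use serde::Serialize;

/// Illustrative row: the point is the serialization contract, not the schema.
#[derive(Serialize)]
struct ParityRow {
    subject_id: String,
    // Class 1: never mark this skip_serializing_if on the Rust side or
    // omitempty on the Go side — an empty string must stay in the hashed bytes.
    note: String,
    // Class 2: pre-render the timestamp with a fixed 9-digit fraction so
    // Go's RFC3339Nano trailing-zero truncation cannot diverge from chrono.
    ts: String,
}

fn fixed_nanos(ts: DateTime<Utc>) -> String {
    // SecondsFormat::Nanos always emits 9 fractional digits.
    ts.to_rfc3339_opts(SecondsFormat::Nanos, true)
}

fn main() {
    let row = ParityRow {
        subject_id: "WORKER-1".into(),
        note: String::new(), // stays as "" in the output
        ts: fixed_nanos(Utc::now()),
    };
    // Class 3: serde_json does not HTML-escape < > & — the Go side must
    // disable SetEscapeHTML to produce these exact bytes.
    println!("{}", serde_json::to_string(&row).unwrap());
}
```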
Phase 1.6 BIPA pre-launch gates — status table:
| Item | Status | Evidence |
|---|---|---|
| Gate 1 — public retention schedule | eng-staged, ⚖ counsel pending | docs/policies/consent/biometric_retention_schedule_v1.md |
| Gate 2 — informed consent template | eng-staged, ⚖ counsel pending | docs/policies/consent/biometric_consent_template_v1.md |
| Gate 3a — photo-upload endpoint | DONE | 11 unit tests + live POST /biometric/subject/{id}/photo |
| Gate 3b — deepface classification | design doc shipped (8ec43e0) — recommends Option C (defer); awaits product confirmation | docs/PHASE_1_6_BIPA_GATES.md Gate 3b section |
| Gate 4 — name→ethnicity removal | DONE | mcp-server/phase_1_6_gate_4.test.ts 4/4 with bypass coverage |
| Gate 5 — destruction runbook + erasure endpoint | eng-DONE (848a458); ⚖ counsel review of runbook still pending (see the ordering sketch below) | docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md + POST /biometric/subject/{id}/erase (21 tests) |
| §2 cryptographic attestation | eng-DONE, signature pending | docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md (SHA-256 evidence hash, 3/3 checks pass on live data) |
| §3 employee training | deferred | conditional on operator population size |
Calendar bottleneck: counsel review of Gates 1/2, the Gate 5 runbook, and the §2 attestation. Engineering long pole is Gate 3b (deepface) — design doc landed (8ec43e0); needs product confirmation that classifications are required before engineering starts. Recommendation in the doc is Option C (defer) on BIPA-safety grounds.
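The ordering invariant behind Gate 5 (848a458) — audit row appended before the photo unlink, nothing deleted if the append fails — in a minimal sketch. `erase_biometric` and `append_audit_row` are hypothetical stand-ins for the shipped endpoint and SubjectAuditWriter; manifest rollback on audit failure is left to the caller, as in the commit description.

```rust
use std::{fs, io, path::Path};

/// Sketch of the Gate 5 ordering: record intent in the audit chain FIRST, so
/// the chain carries legal proof of intent even if the file delete fails.
/// If the audit append itself fails, nothing is deleted.
fn erase_biometric(
    subject_id: &str,
    photo_path: &Path,
    append_audit_row: &mut dyn FnMut(&str, &str) -> io::Result<()>,
) -> io::Result<()> {
    // 1. Fail closed if the chain cannot be extended.
    append_audit_row(subject_id, "erase:biometric_only:retention_expiry")?;

    // 2. Only then touch the filesystem. A failure here is reported, but the
    //    audit row above already documents that destruction was initiated.
    match fs::remove_file(photo_path) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()), // already gone
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let mut rows = Vec::new();
    let mut append = |id: &str, event: &str| -> io::Result<()> {
        rows.push(format!("{id} {event}"));
        Ok(())
    };
    erase_biometric("WORKER-2", Path::new("/tmp/does_not_exist.webp"), &mut append)?;
    println!("audit rows written before unlink: {rows:?}");
    Ok(())
}
```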
Operational state:
- `LH_SUBJECT_AUDIT_KEY=/etc/lakehouse/subject_audit.key` (32-byte HMAC signing key, mode 0400) loaded into the systemd unit. Moved off /tmp 2026-05-05 — /tmp wipes on reboot, which on May 5 disabled the `/audit` + `/biometric` endpoints (gateway fails closed at `crates/gateway/src/main.rs:459` if the signing key is absent). Persistent path is per spec line 112.
- `LH_LEGAL_AUDIT_TOKEN_FILE=/etc/lakehouse/legal_audit.token` (44-char legal-tier token, mode 0400) loaded into the systemd unit.
- Key rotation 2026-05-05: the prior key was lost when /tmp wiped on reboot. New key generated at the canonical path. The 5 pre-rotation audit chains for `WORKER-{1..5}` (backfill data with `consent=pending_backfill_review`) will tamper-detect under the new key — expected and correct behavior on key rotation, not a bug. New chain entries from 2026-05-05 forward verify cleanly.
- `data/_catalog/subjects/` holds 100 backfilled `WORKER-N.json` manifests + per-subject `WORKER-N.audit.jsonl` HMAC chains.
- `data/biometric/uploads/<safe_id>/<ts>_<uuid>.<ext>` quarantined photo storage (mode 0700 dir / 0600 file). 2 photos uploaded for WORKER-2 during live verify.
- `/audit/subject/{id}` mounted on gateway with chain_verified=true on every probe.
- `/biometric/subject/{id}/photo` mounted on gateway; returns 403 without `consent.biometric.status="given"`.
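A minimal sketch of the fail-closed startup behavior described above — refuse to come up if either secret can't be read — plus a constant-time token comparison of the kind the Step 6 legal-tier endpoint uses. Helper names are illustrative; the shipped check lives in crates/gateway/src/main.rs.

```rust
use std::{env, fs, process};

/// Read a secret whose path comes from an env var set in the systemd unit.
/// Fail closed: if it is missing or the wrong size, exit instead of mounting
/// the /audit + /biometric surface without a signing key.
fn load_secret(var: &str, expected_len: Option<usize>) -> Vec<u8> {
    let path = env::var(var).unwrap_or_else(|_| {
        eprintln!("{var} not set — refusing to mount /audit + /biometric");
        process::exit(1);
    });
    let bytes = fs::read(&path).unwrap_or_else(|e| {
        eprintln!("{var}={path}: {e} — refusing to mount /audit + /biometric");
        process::exit(1);
    });
    if let Some(len) = expected_len {
        if bytes.len() != len {
            eprintln!("{var}={path}: expected {len} bytes, got {}", bytes.len());
            process::exit(1);
        }
    }
    bytes
}

/// Constant-time comparison for the legal-tier token: XOR-accumulate every
/// byte so timing does not leak the length of a matching prefix.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

fn main() {
    let hmac_key = load_secret("LH_SUBJECT_AUDIT_KEY", Some(32));
    let legal_token = load_secret("LH_LEGAL_AUDIT_TOKEN_FILE", None);
    let presented = b"token-from-request-header"; // hypothetical request value
    println!(
        "key: {} bytes, token match: {}",
        hmac_key.len(),
        constant_time_eq(&legal_token, presented)
    );
}
```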
WHAT LANDED 2026-05-01 → 2026-05-02 (10 commits — Lance gauntlet + cross-runtime parity wave)
| Commit | What | Verified |
|---|---|---|
| 5d30b3d | lance: auto-build doc_id btree in lance_migrate handler | doc-fetch ~5ms (was ~100ms full scan) on scale_test_10m |
| 044650a | lance-bench: same scalar build post-IVF (matches gateway) | cargo check clean |
| 7594725 | lance: 4-pack — sanitize_lance_err + 7 unit tests + 9-probe smoke + 10M re-bench | smoke 9/9 PASS, tests 7/7 PASS |
| 98b6647 | gateway: IterateResponse.trace_id echoed; session_log_path enabled | parity probes see one unified JSONL |
| 57bde63 | gateway: trace-id propagation + coordinator session JSONL (Rust parity with Go wave) | session_log_parity 4/4 |
| ba928b1 | aibridge: drop Python sidecar from hot path; AiClient → direct Ollama | aibridge tests 32/32 PASS, /ai/embed live 768d |
| 654797a | gateway: pub extract_json + parity_extract_json bin | extract_json_parity 12/12 |
| c5654d4 | docs: pointer to golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md | — |
| 150cc3b | aibridge: LRU embed cache, 236× RPS warm (78ms → 129us p50) (see the cache sketch after this table) | load test |
| 9eed982 | mcp-server: /_go/* pass-through for G5 cutover slice | — |
| 6e34ef7 | gitignore: stop tracking 100+ runtime ephemera (data/_*, lance, logs, node_modules) | untracked dropped 100+ → 0 |
| 41b0a99 | chore: add 33 real items that were sitting untracked (scripts, scenarios, kimi reports, dev UIs) | clean working tree |
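The embed cache in 150cc3b is described here only by its effect (236× warm RPS). A minimal sketch of the idea — embeddings keyed by input text with LRU eviction — using the `lru` crate for illustration; the shipped aibridge implementation may differ in keying, capacity, and locking.

```rust
use std::num::NonZeroUsize;

use lru::LruCache;

/// Toy embedder standing in for the real Ollama call (the ~78ms cold path).
fn embed_uncached(text: &str) -> Vec<f32> {
    text.bytes().map(|b| b as f32 / 255.0).collect()
}

struct CachedEmbedder {
    cache: LruCache<String, Vec<f32>>,
}

impl CachedEmbedder {
    fn new(capacity: usize) -> Self {
        Self { cache: LruCache::new(NonZeroUsize::new(capacity).expect("capacity > 0")) }
    }

    /// Warm hits return the cached vector; misses pay the embed cost once,
    /// and the least-recently-used entry is evicted when capacity is reached.
    fn embed(&mut self, text: &str) -> Vec<f32> {
        if let Some(v) = self.cache.get(text) {
            return v.clone();
        }
        let v = embed_uncached(text);
        self.cache.put(text.to_string(), v.clone());
        v
    }
}

fn main() {
    let mut e = CachedEmbedder::new(1024);
    let cold = e.embed("forklift operator, Milwaukee WI");
    let warm = e.embed("forklift operator, Milwaukee WI"); // served from cache
    assert_eq!(cold, warm);
    println!("dims: {}", warm.len());
}
```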
Cross-runtime parity (post-this-wave): 32/32 across 5 probes — validator(6/6) + extract_json(12/12) + session_log(4/4) + materializer(2/2) + embed(8/8). Run `cd /home/profit/golangLAKEHOUSE && for p in scripts/cutover/parity/*.sh; do bash "$p"; done` to re-verify.
Lance backend (was untested 5 days ago, now gauntlet-ready):
- `cargo test -p vectord-lance --release` → 7/7 PASS
- `./scripts/lance_smoke.sh` → 9/9 PASS against live gateway
- `reports/lance_10m_rebench_2026-05-02.md` — search warm ~20ms / cold ~46ms median, doc-fetch ~5ms post-btree
VERIFIED WORKING RIGHT NOW
The client demo (Staffing Co-Pilot)
Public URL: https://devop.live/lakehouse/ — 200, "Staffing Co-Pilot" (159 KB SPA, leaflet maps, dark theme).
Local URL: http://localhost:3700/ — same page, served by mcp-server/index.ts (PID 1271, started 09:48 CDT today).
The staffers console (the one the client was thoroughly impressed with):
- https://devop.live/lakehouse/console — 200, "Lakehouse — What Your Staffing System Would Do" (26 KB)
- Pulls project index via `/api/catalog/datasets` (36 datasets) + playbook memory via `/api/vectors/playbook_memory/stats` (4,701 entries with embeddings, real ops like "fill: Maintenance Tech x2 in Milwaukee, WI")
Client-visible flow that works end-to-end on the public URL:
| Endpoint | Sample output |
|---|---|
| GET /api/catalog/datasets | 36 datasets indexed: timesheets 1M, call_log 800K, workers_500k 500K, email_log 500K, workers_100k 100K, candidates 100K, placements 50K, job_orders 15K, successful_playbooks_live 2,077 |
| GET /api/vectors/playbook_memory/stats | 4,701 fill operations with embeddings |
| GET /system/summary | 36 datasets, 2.98M rows, 60 indexes, 500K workers loaded, 1K candidates |
| POST /intelligence/staffing_forecast | 744 Production Workers needed in 30d, 11,281 bench (4,687 reliable), coverage 1,444%, risk=ok. Same for Electrician (need 32, bench 2,440) and Maintenance Tech (need 17, bench 5,004). |
| POST /intelligence/permit_contracts | permit 3442956 $500K → 3 Production Workers, 886-candidate pool, 95% fill, $36K gross. 5 more Chicago permits with 8 workers each, same pool, 95% fill, $96K each. |
| POST /intelligence/market | major Chicago permits ranked: $730M O'Hare, $615M 307 N Michigan, $580M casino, $445M Loop transit (real geo coords). |
| POST /intelligence/permit_entities | architects + contractors from permit contacts (e.g. "KACPRZYNSKI, ANDY", "SLS ELECTRICAL SERVICE"). |
| POST /intelligence/activity + /intelligence/arch_signals + /intelligence/chat | all 200 |
The demo tells the story: "upcoming Chicago contracts → workers needed → coverage from the bench → architects/contractors involved → revenue and margin." That's the "live data + anticipating contracts + complete workflow" pitch — working as of right now.
Backend, verified live this session
| Surface | State |
|---|---|
| Gateway :3100 | up, 4 providers configured, /v1/health 200 with 500K workers loaded |
| MCP server :3700 (Co-Pilot demo) | up, all /intelligence/* endpoints respond |
| VCP UI :3950 | started this session, /data/* 200, real numbers |
| Observer :3800 | ring full (2,000/2,000) — older events evicted, query Langfuse for 24h-ago state |
| Sidecar :3200 | up |
| Langfuse :3001 | recording, gw:/log + v1.chat:openrouter traces visible |
| LLM Team UI :5000 | up, only extract mode registered |
| OpenCode fleet | 40 models reachable through one sk-* key (verified live GET https://opencode.ai/zen/v1/models) |
OpenCode catalog (live):
- Claude: opus-4-7, opus-4-6, opus-4-5, opus-4-1, sonnet-4-6, sonnet-4-5, sonnet-4, haiku-4-5
- GPT-5: 5.5-pro, 5.5, 5.4-pro, 5.4, 5.4-mini, 5.4-nano, 5.3-codex-spark, 5.3-codex, 5.2, 5.2-codex, 5.1-codex-max, 5.1-codex, 5.1-codex-mini, 5.1, 5-codex, 5-nano, 5
- Gemini: 3.1-pro, 3-flash
- GLM: 5.1, 5
- Minimax: m2.7, m2.5
- Kimi: k2.6, k2.5
- Qwen: 3.6-plus, 3.5-plus
- Other: BIG-PKL (was a typo-prone name in the catalog, model id starts with "big-pkl-something")
- Free tier: minimax-m2.5-free, hy3-preview-free, ling-2.6-flash-free, trinity-large-preview-free
The substrate (frozen — do not re-architect)
- Distillation v1.0.0 at tag `e7636f2` — 145/145 bun tests pass, 22/22 acceptance, 16/16 audit-full
- Output: `data/_kb/distilled_{facts,procedures,config_hints}.jsonl` + `data/vectors/distilled_{factual,procedural,config_hint}_v20260423102847.parquet`
- Auditor cross-lineage: Kimi K2.6 ↔ Haiku 4.5 alternation, Opus auto-promote on diffs >100k chars, per-PR cap=3 with auto-reset on new head SHA
- Pathway memory: 88 traces, 11/11 successful replays (probation gate crossed)
- Mode runner: 5 native modes; `codereview_isolation` is default; composed-corpus auto-downgrade verified Apr 26 (composed lost 5/5 vs isolation, p=0.031)
Matrix indexer
30+ live corpora including:
- 5 versions of `workers_500k_v1..v9` (50K embedded chunks each)
- 11 batched 2K-row shards `w500k_b3..b17`
- `chicago_permits_v1` (3,420), `resumes_100k_v2` (100K candidates), `ethereal_workers_v1` (10K)
- `lakehouse_arch_v1` (2,119), `lakehouse_symbols_v1` (2,470), `lakehouse_answers_v1` (1,269), `scrum_findings_v1` (1,260)
- `kb_team_runs_v1` (12,693) + `kb_team_runs_agent` (4,407) — LLM-team play history embedded
- `distilled_factual_v20260423102507` (8) — distillation output
Code health
- `cargo check --workspace` → 0 warnings, 0 errors
- `bun test auditor + tests/distillation` → 145/145 pass
- `ui/server.ts` + `auditor.ts` bundle clean
DO NOT RELITIGATE
- PR #11 is merged into `origin/main` as `ed57eda` — do not "still need to merge PR #11."
- Distillation tag `distillation-v1.0.0` at `e7636f2` is FROZEN — do not re-architect schemas, scorer rules, audit fixtures.
- Kimi forensic HOLD verdict (2026-04-27) was 2/8 false + 6/8 latent — do not re-debate, see `reports/kimi/audit-last-week-full.md`.
- `candidates_safe` `vertical` column bug — fixed at catalog metadata layer in commit `c3c9c21`. Do not "discover" it again.
- Decisions A/B/C/D from `synthetic-data-gap-report.md` — all four scripts shipped today (`d56f08e`, `940737d`, `c3c9c21`). Do not "ask J for approval."
- `workers_500k.phone` type fixup — already string. The fixup script is idempotent; running it is a no-op.
- `client_workerskjkk` typo dataset — was breaking every SQL query (catalog had it registered, file didn't exist). Removed via `DELETE /catalog/datasets/by-name/client_workerskjkk` this session. Do not re-add. Adding a startup gate that errors on unrecognized parquet names is the long-term fix per now.md Step 2C.
- Python sidecar dropped from hot path 2026-05-02 (`ba928b1`) — AiClient calls Ollama directly. Do not "wire python embedding back in." `lab_ui.py` + `pipeline_lab.py` keep running as dev-only UIs (not on the runtime path).
- Lance backend gauntlet (2026-05-02) — sanitizer over all 5 routes, 7 unit tests, 9-probe smoke, 10M re-bench. The `doc_id` btree auto-builds inside `lance_migrate` AND `lance-bench`. Do not "discover" the missing scalar index again or the leaked filesystem paths in error bodies.
- Cross-runtime parity = 32/32 across 5 probes in `golangLAKEHOUSE/scripts/cutover/parity/`. Do not "build a parity probe for X" without checking — validator, extract_json, session_log, materializer, and embed are all already covered.
- Decisions tracker is `golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md` — single living source of truth for cross-runtime decisions. As of 2026-05-02 it has 0 open code work items; only 2 strategic items left (Lance vs Parquet+HNSW-with-spilling, Go-vs-Rust primary cutover).
- PRD line 70 is load-bearing — "Everything runs locally, no cloud APIs." Yesterday's PR #13 violated this by routing customer hot-path inference to opencode/openrouter/ollama_cloud. REVERTED 2026-05-03 (`d054c0b`). The customer hot path (modes.toml staffing_inference, doc_drift_check; execution_loop overseer escalation) is now local Ollama (qwen3.5:latest). Cloud providers stay configured in providers.toml for explicit dev-tool opt-in only (scrum, auditor, bot/propose). Do NOT re-add cloud models to the hot-path defaults — the customer demo runs on local + free. `/v1/usage` shows by_provider=ollama only for any customer-shape request. If you see by_provider including kimi/openrouter/opencode/ollama_cloud for normal /v1/iterate or /v1/respond traffic, something has regressed. Verify with: `curl -sf http://127.0.0.1:3100/v1/usage | jq .by_provider`
- `./scrum` is a TOOL, not architecture. Lives at repo root. J runs it manually. Outputs findings to `data/_kb/scrum_findings.jsonl` (the KB). Findings inform development; they do NOT auto-fold into PRDs/design docs/code. If you find yourself proposing to "wire scrum into [some pipeline]" — stop. It's J's tool.
- Test code in main is ACTIVELY being cleaned out. 2026-05-03 commits `6aafd41` + `f4ebd22` removed 12 orphaned dev experiments (~2900 LOC) from `tests/real-world/` and `scripts/`. If you find more zero-reference experimental files, surface them — don't auto-delete unless the pattern is clearly a one-time experiment with zero consumers like the ones already removed.
FIXES MADE THIS SESSION (2026-04-27 evening)
- `crates/gateway/src/v1/iterate.rs:93` — `state` → `_state` (cleared the one cargo warning).
- `lakehouse-ui.service` (Dioxus) — disabled. Was failing 7,242 times against a missing `target/dx/ui/debug/web/public` build dir. `systemctl stop && disable`.
- VCP UI on `:3950` — started `bun run ui/server.ts` (PID 1162212, log `/tmp/lakehouse_ui.log`). `/data/*` endpoints now 200 with real data.
- `client_workerskjkk` catalog entry — `DELETE /catalog/datasets/by-name/client_workerskjkk` removed the dead manifest. This was the actual root cause of `/system/summary` reporting `workers_500k_rows: 0` and the demo showing zero bench. Every SQL query was failing schema inference on the missing file before reaching its target table. Fixed → `workers_500k_rows: 500000`, `candidates_rows: 1000`, demo coverage flipped from "critical 0%" to actual percentages on devop.live/lakehouse.
FIXES MADE THIS SESSION (2026-04-28 early — face pool)
- Synthetic StyleGAN face pool — 1000 faces, gender+race+age tagged. `scripts/staffing/fetch_face_pool.py` fetches from thispersondoesnotexist.com; `scripts/staffing/tag_face_pool.py --min-age 22` runs deepface and excludes minors. `data/headshots/manifest.jsonl` now has gender (494 men / 458 women), race (caucasian 662 · east_asian 128 · hispanic 86 · middle_eastern 59 · black 14 · south_asian 3), age, and 48 minor exclusions. Server pool = 952 servable faces.
- `mcp-server/index.ts:1308` `/headshots/:key` route — gender×race×age intersection bucketing with graceful fallback (gender-only → all). Same key always returns same face; different keys spread evenly.
- `/headshots/_thumbs/` pre-resized 384×384 webp (60× smaller: 587KB → ~11KB). Without this, 40-card grids overran Chrome's parallel-connection budget and ~75% of tiles never finished decoding. Generated via parallel ffmpeg (xargs -P 8); .gitignored.
- `mcp-server/search.html` + `console.html` — dropped `img.loading='lazy'`. With 11KB thumbs, eager load is cheap (~500KB for 50 cards) and avoids the off-screen race that lazy decode produced.
- ComfyUI on-demand uniqueness — `serve_imagegen.py:32` added `seed` to `_cache_key()` (was caching by prompt only — 3 different worker seeds collapsed to 1 cached image). Verified: seed=839185194/195/196 → 3 distinct md5s.
- `mcp-server/index.ts:1234` `/headshots/generate/:key` — ComfyUI hot path that derives a deterministic per-worker seed via a djb2-style hash (see the sketch after this list); cold ~1.5s, cached ~1ms. Worker prompt format: `professional corporate headshot portrait of a {age}-year-old {race} {gender}, {role}, neutral expression, plain studio background, soft natural lighting, sharp focus, photorealistic, dslr`. Cache at `data/headshots_gen/` (gitignored, regeneratable).
- Confidence-default name resolution in `search.html` — `genderFor()` and `guessEthnicityFromFirstName()` lookup tables (FEMALE_NAMES, MALE_NAMES, NAMES_HISPANIC, NAMES_BLACK, NAMES_SOUTH_ASIAN, NAMES_EAST_ASIAN, NAMES_MIDDLE_EASTERN). Xavier → man+hispanic, Aisha → woman+black, etc. Every worker resolves to a face-pool bucket.
End-to-end verified: playwright run on https://devop.live/lakehouse/?q=forklift+operators+IL → 21/21 cards loaded, 0 broken, all 384×384 webp thumbs.
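The deterministic seed derivation lives in mcp-server/index.ts:1234 (TypeScript); a minimal Rust sketch of the djb2-style idea, with illustrative constants and truncation — the shipped derivation may differ.

```rust
/// djb2-style hash: the classic `hash = hash * 33 + byte` loop. Any stable
/// string key (e.g. a worker id) maps to the same seed on every request, so
/// the same worker always regenerates the same headshot; different keys
/// spread across seeds.
fn djb2(key: &str) -> u64 {
    key.bytes().fold(5381u64, |h, b| h.wrapping_mul(33).wrapping_add(b as u64))
}

fn seed_for_worker(key: &str) -> u32 {
    // 31-bit truncation is illustrative; it just keeps the seed positive.
    (djb2(key) & 0x7fff_ffff) as u32
}

fn main() {
    for key in ["WORKER-1", "WORKER-2", "WORKER-812"] {
        // Deterministic: re-running prints the same seeds every time.
        println!("{key} -> seed {}", seed_for_worker(key));
    }
}
```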
⚠ PRODUCTION-READY BLOCKER (2026-05-03)
Audit-trail capability is the gate to client signature. Smoke + parity tests prove the surface compiles; they do NOT prove an audit response can be produced for a specific person. Staffing client won't sign without defensible discrimination-claim response capability.
Authoritative document: docs/AUDIT_TRAIL_PRD.md — drafted 2026-05-03. Defines the worked example (John Martinez at Warehouse B), the per-decision audit row schema, a surface map of where decisions happen today, a current-state-vs-target gap table, and a 9-phase implementation sequence.
Phase 1 (discovery walk) requires NO J approval — it's read-only. Phases 2+ have explicit open questions in §10 of the PRD that need J's call before they can start.
Until phase 9 exit criterion is met, do not claim "production-ready" on customer-facing surfaces. Internal substrate (Lance, sidecar drop, parity probes) is solid; subject-of-record audit story is not.
OPEN — but not blocking the demo
| Item | What | When to act |
|---|---|---|
| modes.toml staffing_inference.matrix_corpus | still says workers_500k_v8. v9 in the vector index is from Apr 17 (raw-sourced, not safe-view). The new build_workers_v9.sh rebuilds from workers_safe. | Run when you have 30+ min for the rebuild. |
| Open PRs #6, #7, #10 | All closed 2026-05-02 — superseded / empty / stale. PR #12 merged 2026-05-03 (a5d9070). PR #13 merged 2026-05-03 (feb638e). | Done. |
| test/enrich-prd-pipeline branch | 35 unmerged commits, includes more-evolved auditor/inference.ts (666 vs main's 580 lines), curation+fact-extractor wiring | Reconcile or formally archive — see memory/project_unmerged_architecture_work.md. |
| federation-hnsw-trials stash | Lance + S3/MinIO prototype, aws-config crate added, 708 insertions | Phase B from EXECUTION_PLAN.md — revisit when the Parquet vector ceiling actually hurts. |
| candidates manifest drift | manifest 100K vs SQL 1K. Cosmetic. | Run a metadata resync if it matters. |
RUNTIME CHEATSHEET
# Verify the demo (public + local both work)
curl -sS https://devop.live/lakehouse/ # Co-Pilot HTML
curl -sS https://devop.live/lakehouse/console # staffers console
curl -sS -X POST https://devop.live/lakehouse/intelligence/staffing_forecast \
-d '{}' -H 'content-type: application/json' \
| jq '.forecast[] | {role, demand_workers, bench_total, coverage_pct, risk}'
# Restart sequence (after Rust changes)
sudo systemctl restart lakehouse.service # gateway :3100
sudo systemctl restart lakehouse-auditor # auditor daemon
sudo systemctl restart lakehouse-observer # observer :3800
# UI bun on :3950 is NOT systemd-managed (lakehouse-ui.service is disabled).
# Restart manually: kill <pid>; nohup bun run ui/server.ts > /tmp/lakehouse_ui.log 2>&1 &
# Health checks
curl -sS http://localhost:3100/v1/health | jq # workers_count, providers
curl -sS http://localhost:3100/vectors/pathway/stats | jq
curl -sS http://localhost:3100/v1/usage | jq # since-restart cost
curl -sS http://localhost:3700/system/summary | jq # dataset counts
VISION — what we're actually building (not what's done)
J's framing for the legacy staffing company:
- Pull live data, anticipate contracts based on Chicago permits → real architect/contractor associations, headcount, time period, money, scope.
- Hybrid + memory index → search large corpora cheaply.
- Email comes in → verify against contract; SMS comes in → alert when index changes.
- Real-time.
- Invent metrics nobody else has using the hybrid index.
- Next stage: workers download an app → geolocation clock-in → automatic responsiveness measurement, no user effort, with incentives for using it.
- Find people getting certificates (passive cert tracking).
- Pull union data → bring contracts that work for employees, not just employers.
- All metrics visible, nothing hidden, value-aligned with what each side actually needs.
If a future session is shaving away from this vision toward "fix the cutover" or "land Phase X," the vision wins. Phases are scaffolding for the vision, not the goal.
CURRENT PLAN — fix the demo for the legacy staffing client
Built from playwright audit of the live demo (2026-04-27 evening). Each item ends in something the client can SEE, not internal cleanups.
Demo state is anchored by git tag demo-2026-04-27 (commit ed57eda, the merge of PR #11). To restore code state: git checkout demo-2026-04-27. To restore runtime state: DELETE /catalog/datasets/by-name/client_workerskjkk (catalog hot-fix is not in git).
P1 — Search box that actually filters (highest visible impact)
Problem: typing in #sq and pressing Enter fires POST /intelligence/chat with body {"message":"<query>"}. The state (#sst) and role (#srl) selects are ignored — never sent in the body. So every search returns a generic chat completion, never a SQL+vector hybrid filter against workers_500k. That is the "cached/generic response" the client sees.
Fix: in mcp-server/search.html, change the search-submit handler to call the real worker search endpoint with {query, state, role, top_k}. The MCP search_workers tool surface already exists; route the form there. Render returned worker rows in the existing card grid.
Done when: typing "forklift" + state IL + role "Forklift Operator" returns ≤ top_k IL Forklift Operators, and changing state to WI returns different workers.
P2 — Contractor-name click → /contractor profile page
Problem: clicking a contractor name in any rendered card stays on /lakehouse/. URL doesn't change.
Fix: wrap contractor names in <a href="/contractor?name=<encoded>">. The page mcp-server/contractor.html (14.8 KB, "Contractor Profile · Staffing Co-Pilot") already exists at /contractor and the data endpoint /intelligence/contractor_profile already returns rich data.
Then check contractor.html actually shows: full history of every record the database has on that contractor + heat map of locations underneath + relevant info (per J 2026-04-27). If the page is incomplete, finish it. Otherwise just wire the link.
Done when: clicking "KACPRZYNSKI, ANDY" opens a profile with: every Chicago permit they're contact_1 or contact_2 on, a leaflet map with markers for each address, and any matched workers from prior placements at their sites.
P3 — Substrate signal at the bottom shows the right numbers
Problem: J reports the bottom panel says "playbook memory empty, 80 traces 0 replies." Reality from the live endpoints: /api/vectors/playbook_memory/stats = 4,701 entries with embeddings; /vectors/pathway/stats = 88 traces, 11/11 replays.
Fix: find the renderer in search.html that builds the substrate signal panel; verify it's hitting the right endpoints and reading the right keys; fix shape mismatches.
Done when: bottom panel shows real numbers (4,701 playbooks, 88 traces, 11/11 replays) and references at least one specific recent operation from the playbook stats sample.
P4 — Top nav reflects today's architecture
Problem: Walkthrough/Architecture/Spec/Onboard/Alerts/Workspaces tabs all return 200 but content is from old architecture. Doesn't mention: gateway scratchpad, memory indexer, ranker, mode runner, OpenCode 40-model fleet, distillation substrate, auditor cross-lineage.
Fix: rewrite mcp-server/proof.html (or add a single new page "What's running" that replaces Architecture+Spec) to describe what's actually shipped as of demo-2026-04-27. Keep one architecture page, drop redundancy. Either complete or hide Onboard/Alerts/Workspaces — J's call which.
Done when: the architecture page tells a non-technical reader, in 2 minutes, what each piece does in coordinator-relatable terms ("intern that read every email", not "3-stage adversarial inference pipeline").
P5 — Caching for the project-index build_signal (J flagged unfinished)
Problem: "we never finished our caching for project index build signal it's not pulling new information." Need to find what build_signal refers to. Likely a scrum/auditor signal that should rebuild the lakehouse_arch_v1 corpus on commit but isn't wired to.
Fix: identify the build-signal pipeline (likely in auditor/ or crates/vectord/), wire its emit to a corpus rebuild, verify by making a test commit and watching the new chunk appear in /vectors/indexes for lakehouse_arch_v1.
Done when: committing a new file to crates/ causes lakehouse_arch_v1 chunk_count to increase within N minutes.
P0 — Anchor the demo state (DONE)
Tagged ed57eda as demo-2026-04-27. Future sessions: git checkout demo-2026-04-27 to land in this exact code state.
EXECUTION ORDER
- P1 first — biggest visible bug, ~30-60 min
- P2 next — contractor click is the second-biggest "doesn't work" the client sees, ~20 min if profile is mostly done
- P3 — small fix, big "looks alive" win
- P4 — biggest scope; might split across sessions
- P5 — feature work, only after the visible bugs are fixed
Each item commits independently with the format demo: P<n> — <one-line> so the commit log doubles as a progress journal. After each merge to main, re-tag demo-latest to point at the new HEAD.
Stop here and let J pick which item to start with. Do not silently extend scope.