lakehouse/STATE_OF_PLAY.md
root 87b034f5f9 phase 1.6: ops dashboard + consent_versions allowlist + subject timeline tool
Closes the afternoon's "all four" wave (per J's request to do all the
items in one pass instead of picking one of the options):

(1) Live demo on WORKER-100 — full lifecycle exercised end-to-end
    against the running gateway. 3 audit rows landed in correct
    order (consent_grant → biometric_collection →
    consent_withdrawal), chain_verified=true, photo on disk at
    data/biometric/uploads/WORKER-100/1778011967957907731_027b6bb1.jpg
    (180 bytes JFIF). retention_until=2026-06-04 (30d from
    withdrawal per consent template v1 §2).
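    The two retention clocks in play here (grant sets retention_until =
    given_at + 18mo per retention schedule v1 §4; withdrawal accelerates it
    to +30d per consent template v1 §2) reduce to simple UTC date math. A
    minimal sketch — function names are illustrative, not the Rust API:

    ```typescript
    // Withdrawal clock: consent template v1 §2 — 30 days from withdrawal.
    function retentionAfterWithdrawal(withdrawnAt: Date): Date {
      const d = new Date(withdrawnAt);
      d.setUTCDate(d.getUTCDate() + 30); // JS Date rolls month-end correctly
      return d;
    }

    // Grant clock: retention schedule v1 §4 — 18 months from consent grant.
    function retentionAfterGrant(givenAt: Date): Date {
      const d = new Date(givenAt);
      d.setUTCMonth(d.getUTCMonth() + 18);
      return d;
    }
    ```

    With the demo's withdrawal date of 2026-05-05 this lands on 2026-06-04,
    matching the value recorded above.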

(2) GET /biometric/stats — read-only aggregate over all subjects.
    Returns counts by biometric.status + subject.status, photo
    count, oldest_active_retention_until, and the last 20
    state-change events (consent_grant / collection / withdrawal /
    erasure — validator_lookup and other noise filtered out).
    Walks per-subject audit logs via the existing writer; cheap
    for 100 subjects, would want an event-stream index at 100k.
    Legal-tier auth (same posture as /audit). 4 unit tests.
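    The recent-events portion of the aggregate can be sketched as a filter
    over per-subject audit rows — kind names below follow the event list
    above; the shape is illustrative, not the actual stats handler:

    ```typescript
    // State-change kinds surfaced by /biometric/stats; validator_lookup and
    // other read-only noise is filtered out before the last-20 cut.
    const STATE_CHANGE = new Set([
      "biometric_consent_grant",
      "biometric_collection",
      "biometric_consent_withdrawal",
      "biometric_erasure",
    ]);

    function recentStateChanges(
      events: { kind: string; ts: string }[],
      limit = 20,
    ): { kind: string; ts: string }[] {
      return events
        .filter((e) => STATE_CHANGE.has(e.kind))
        .sort((a, b) => (a.ts < b.ts ? 1 : -1)) // newest first (ISO ts sorts lexically)
        .slice(0, limit);
    }
    ```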

(3) /biometric/dashboard mcp-server frontend. Auto-refreshes
    /biometric/stats every 15s, neo-brutalist tile layout for
    the per-status counts + retention horizon block + recent
    events table with kind badges + event-kind breakdown pills.
    sessionStorage-backed token; logout button clears state.
    DOM-built throughout (textContent + createElement) — never
    innerHTML on audit-row values, since trace_id et al. could
    in theory carry operator-supplied strings.

(4) consent_versions allowlist. BiometricEndpointState gains
    `allowed_consent_versions: Option<Arc<HashSet<String>>>`,
    loaded at startup from /etc/lakehouse/consent_versions.json
    (override via LH_CONSENT_VERSIONS_FILE). process_consent
    refuses unknown hashes with HTTP 400 consent_version_unknown
    when configured. Resolution semantics:
      - Missing file → permissive (v1 compat, warn-log)
      - Parse error → permissive (error-log; broken config
        silently going strict would be worse)
      - Empty array → strict, refuse all (deliberate freeze
        mode for "counsel hasn't signed v1 yet")
      - Populated → strict, lowercase-normalized comparison
    5 unit tests (known/unknown/case/empty/none-permissive).
    Example template at ops/consent_versions.example.json with
    a counsel-tier deployment note.
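    The resolution semantics above read as a small decision table. A
    TypeScript sketch of that table — the real implementation is Rust in
    `BiometricEndpointState`; names and shapes here are illustrative:

    ```typescript
    // raw = file contents, or null when the file is missing.
    function loadAllowlist(raw: string | null): { strict: boolean; set: Set<string> } {
      if (raw === null) {
        return { strict: false, set: new Set() }; // missing file → permissive (v1 compat)
      }
      let parsed: unknown;
      try {
        parsed = JSON.parse(raw);
      } catch {
        return { strict: false, set: new Set() }; // parse error → permissive, error-log
      }
      if (!Array.isArray(parsed)) {
        return { strict: false, set: new Set() }; // malformed shape treated like parse error
      }
      // Empty array → strict refuse-all (freeze mode); populated → strict,
      // lowercase-normalized comparison.
      return { strict: true, set: new Set(parsed.map((h) => String(h).toLowerCase())) };
    }

    function consentVersionAllowed(raw: string | null, hash: string): boolean {
      const { strict, set } = loadAllowlist(raw);
      return strict ? set.has(hash.toLowerCase()) : true;
    }
    ```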

(5) scripts/staffing/subject_timeline.sh — operator one-shot
    pretty-print of any subject's full BIPA lifecycle. Curls
    /audit/subject/{id} with legal token; renders manifest
    summary + on-disk photo state + chronological audit chain
    with kind badges + chain verification status. Smoke-tested
    on WORKER-100 (3 rows verified).

(6) STATE_OF_PLAY.md refresh. New section "afternoon wave"
    captures all four commits (76cb5ac, 7f0f500, 68d226c, this
    one) + the live demo evidence + the v1 endpoint matrix +
    UI/CLI inventory + the production-cutover blocking set
    (counsel calendar only — eng substrate is done).

Verified live post-restart:
- /audit/health + /biometric/health both 200
- /biometric/stats returns 100 subjects, 2 withdrawn (WORKER-2 from
  earlier scrum + WORKER-100 from today's demo), 1 photo on record,
  6 recent state-change events
- /biometric/intake + /biometric/withdraw + /biometric/dashboard
  all 200 on mcp-server :3700
- subject_timeline.sh on WORKER-100: chain_verified=true,
  chain_root=a47563ff937d50de…
- 88/88 catalogd lib tests + 55/55 biometric_endpoint tests green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:27:52 -05:00


# STATE OF PLAY — Lakehouse
**Last verified:** 2026-05-05 morning CDT
**Verified by:** live probe (`/audit/health` 200, `/biometric/subject/{id}/erase` 21-test substrate + `/audit/subject/{id}` legal-tier endpoint live verified against WORKER-100; new `verify_biometric_erasure.sh` + `biometric_destruction_report.sh` + `bundle_counsel_packet.sh` smoke-tested clean against live data) — not memory.
> **Read this FIRST.** When the user says "we're working on lakehouse," they mean the working code captured below — NOT what `git log` framed as "the cutover" or what memory snapshots from 2 days ago suggest. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.
---
## WHAT LANDED 2026-05-05 (afternoon wave — full BIPA lifecycle endpoints + UIs + ops dashboard)
After the morning's doc-reconciliation wave (described below), the afternoon wave shipped the full **operational lifecycle** for biometric data. Phase 1.6 now has the complete end-to-end runtime: candidate consent flow → photo upload → withdrawal → retention sweep flag → erase. Five commits, all on `demo/post-pr11-polish-2026-04-28`.
| Commit | What | Verified |
|---|---|---|
| `76cb5ac` | `POST /biometric/subject/{id}/consent` — Gate 2 backend. Flips status NeverCollected→Given, computes retention_until = given_at + 18mo (per retention schedule v1 §4), captures consent_version_hash + collection_method (closed set: electronic_signature/paper/click_acceptance) + operator + evidence path to audit row. State machine: 409 already_given, 409 post-withdrawal-requires-erase, 403 subject_inactive. | 12 unit tests; live POST returns 401/400/404 on guards |
| `7f0f500` | Candidate intake UI at `/biometric/intake?candidate_id=XXX` (mcp-server :3700, exposed via nginx at devop.live/lakehouse/biometric/intake). 4-screen flow: operator auth → consent template render + click-accept → photo capture (file or webcam getUserMedia) → confirmation showing audit hmac. SHA-256 of rendered consent block becomes consent_version_hash. sessionStorage-backed token (clears on tab close), neo-brutalist style matching onboard.html. | 22631-byte HTML, mcp-server route returns 200 |
| `68d226c` | Three things in one wave: <br>(a) `POST /biometric/subject/{id}/withdraw` — BIPA right of withdrawal. Flips Given→Withdrawn, accelerates retention_until from 18mo→30d (per consent template v1 §2 SLA), audit kind=biometric_consent_withdrawal. State machine 409s on NeverCollected/Pending (nothing_to_withdraw), Withdrawn (already_withdrawn), Expired (already_expired). 12 unit tests. <br>(b) Withdraw UI at `/biometric/withdraw` — 3-screen operator flow (token+name auth → reason+evidence form → confirmation showing 30-day clock + verify curl recipe). <br>(c) `lakehouse-retention-sweep.{service,timer}` systemd units in `ops/systemd/`. Daily 03:00 UTC, Persistent=true, install.sh updated to handle paired timer+oneshot service. <br>Plus operator_of_record bug fix in intake UI (was hardcoded `'intake_ui_operator'`). | 46/46 biometric_endpoint + 71/71 catalogd lib tests; manual sweep run: 100 subjects, 0 overdue, exit 0 |
| **(current HEAD post-this-wave)** | Stats endpoint `GET /biometric/stats` (legal-tier auth, returns subject counts by status + photo count + oldest active retention + last 20 state-change events with anonymizable trace_ids) + ops dashboard at `/biometric/dashboard` (single-page, polls /stats every 15s, table + status tiles, XSS-safe DOM construction not innerHTML). Plus consent_versions allowlist: `BiometricEndpointState.allowed_consent_versions: Option<Arc<HashSet<String>>>`, loaded from `/etc/lakehouse/consent_versions.json` (`LH_CONSENT_VERSIONS_FILE` override), missing-file = permissive (v1 compat), present + populated = strict mode (refuses unknown hashes with 400 consent_version_unknown). Plus `scripts/staffing/subject_timeline.sh` — pretty-prints any subject's full BIPA lifecycle from /audit/subject/{id} (manifest + on-disk photo + chronological audit chain + verification status). | 5 new allowlist unit tests + 4 stats tests; live demo on WORKER-100 ran end-to-end (consent → photo → withdraw, chain verified=true, chain_root=a47563ff…) |
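The `76cb5ac` consent state machine's guard set can be sketched as below. Status names follow the table; the exact wire error strings are assumptions here, not verified against the Rust handler:

```typescript
type ConsentStatus = "NeverCollected" | "Pending" | "Given" | "Withdrawn" | "Expired";

// Guard outcomes for POST /biometric/subject/{id}/consent (illustrative).
function consentGuard(
  status: ConsentStatus,
  subjectActive: boolean,
): { http: number; error?: string } {
  if (!subjectActive) return { http: 403, error: "subject_inactive" };
  switch (status) {
    case "Given":
      return { http: 409, error: "already_given" };
    case "Withdrawn":
      // Post-withdrawal re-consent requires an explicit erase first.
      return { http: 409, error: "post_withdrawal_requires_erase" };
    default:
      return { http: 200 }; // NeverCollected/Pending → Given
  }
}
```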
### Live demo evidence (WORKER-100, 2026-05-05 20:12 UTC)
The full lifecycle was exercised against the live gateway as a verification artifact. The audit chain on WORKER-100 now contains 3 rows:
```
20:12:33.054 BIOMETRIC_CONSENT_GRANT result=given hmac=9c6f4153341e97d2… trace=live-demo-2026-05-05
20:12:47.957 BIOMETRIC_COLLECTION result=success hmac=856be6173c88277c… trace=live-demo-2026-05-05
20:15:27.298 BIOMETRIC_CONSENT_WITHDRAWAL result=withdrawn hmac=a47563ff937d50de… trace=live-demo-2026-05-05
```
`chain_verified=true`, chain_root = a47563ff937d50de43b09a0c903cff954233836c219a928ee8ca2aa6792272dd. Photo file at `data/biometric/uploads/WORKER-100/1778011967957907731_027b6bb1.jpg` (180 bytes — a minimal real JFIF JPEG), retention_until=2026-06-04 (= 30 days from withdrawal). Retention sweep will flag this subject on or after that date; operator runs `/biometric/subject/WORKER-100/erase` to destroy.
To re-verify: `./scripts/staffing/subject_timeline.sh WORKER-100`.
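The mechanics behind `chain_verified` / `chain_root` can be sketched minimally: each row's HMAC covers the previous row's HMAC plus the row payload, so the tip commits to the entire history. This is a toy fold, not catalogd's actual canonical-JSON row format or keying:

```typescript
import { createHmac } from "node:crypto";

// tip = fold of HMAC(key, prev_tip || row) over rows, hex-encoded.
// Verification recomputes the fold and compares it to the stored chain_root.
function chainTip(key: Buffer, rows: string[]): string {
  let tip = "";
  for (const row of rows) {
    tip = createHmac("sha256", key).update(tip).update(row).digest("hex");
  }
  return tip;
}
```

Editing or reordering any earlier row changes every later tip, which is why a single matching `chain_root` on WORKER-100 vouches for all 3 rows at once.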
### Endpoint matrix (v1 BIPA lifecycle complete)
| Event | Endpoint | Method | Auth | Status flip | retention_until |
|---|---|---|---|---|---|
| consent given | `/biometric/subject/{id}/consent` | POST | legal | NeverCollected/Pending → Given | now + 18mo |
| photo collected | `/biometric/subject/{id}/photo` | POST | legal + consent gate | (no change) | (no change) |
| consent withdrawn | `/biometric/subject/{id}/withdraw` | POST | legal | Given → Withdrawn | now + 30d |
| destruction | `/biometric/subject/{id}/erase` | POST | legal | (manifest cleared) | n/a |
| audit read | `/audit/subject/{id}` | GET | legal | (read-only) | (read-only) |
| ops aggregates | `/biometric/stats` | GET | legal | (read-only) | (read-only) |
UIs:
- `/biometric/intake?candidate_id=X` — operator-driven consent + photo
- `/biometric/withdraw` — operator-driven withdrawal recording
- `/biometric/dashboard` — read-only ops aggregate, auto-refresh
CLI tools:
- `scripts/staffing/verify_biometric_erasure.sh <id>` — post-erasure verification
- `scripts/staffing/biometric_destruction_report.sh --month YYYY-MM` — anonymized monthly report
- `scripts/staffing/subject_timeline.sh <id>` — full lifecycle pretty-print (NEW 2026-05-05)
- `scripts/staffing/bundle_counsel_packet.sh` — counsel review tarball
- `scripts/staffing/attest_pre_identityd_biometric_state.sh` — defense attestation generator
### What's blocking production cutover NOW
**Counsel calendar.** Engineering substrate is done end-to-end: every state transition has a defensible endpoint, every endpoint has tests + live verification, every UI is reachable, retention sweep is scheduled, allowlist hardening is wired. The remaining work is signature/review:
1. Counsel review of consent template v1 (revised for Option C — classifications deferred)
2. Counsel review of retention schedule v1 (revised for Option C)
3. Counsel review of destruction runbook
4. Counsel + J signatures on §2 attestation
5. Once counsel signs the consent template, populate `/etc/lakehouse/consent_versions.json` with the signed hash to flip the gateway from permissive to strict mode
Counsel-review packet at `reports/counsel/counsel_packet_2026-05-05.tar.gz` (regenerable via `bundle_counsel_packet.sh` to pick up the latest doc state).
---
## WHAT LANDED 2026-05-05 (morning wave — doc reconciliation + Gate 3b decision + counsel packet)
This was a **doc-only wave**, not code. Background: J asked for an audit of the BIPA/biometric documentation before production cutover. Audit found moderate fragmentation between docs and shipped code (post-`identityd` collapse, post-Gate-3a-ship, pre-Gate-3b-decision). Closed it in one pass.
| Item | What changed | Status |
|---|---|---|
| **Gate 3b — DECIDED: Option C (defer classifications)** | `BiometricCollection.classifications` stays `Option<JSON> = None` in v1. `docs/specs/GATE_3B_DEEPFACE_DESIGN.md` status flipped from "draft / awaits product" to "DECIDED 2026-05-05". | Locked |
| **Endpoint-path drift** | `PHASE_1_6_BIPA_GATES.md` (3 spots), `BIPA_DESTRUCTION_RUNBOOK.md` (2 spots), `biometric_retention_schedule_v1.md` (1 spot) updated from legacy `/v1/identity/subjects/*` (proposed under separate identityd daemon, never shipped) to actual `/biometric/subject/*` (catalogd-local, shipped `848a458`). Schema block in `PHASE_1_6_BIPA_GATES.md` rewritten to reflect JSON `SubjectManifest.biometric_collection` substrate (not the proposed Postgres `subjects` table). | Reconciled |
| **Consent template + retention schedule** | Both revised for Option C: removed all "automated facial-classification" / "deepface" language so disclosed scope matches implemented scope. Pending counsel review — they were already eng-staged with ⚖ markers. | Eng-staged for counsel |
| **`scripts/staffing/verify_biometric_erasure.sh`** (NEW) | Operator-side verification of an erasure event. Curls `/audit/subject/{id}` with legal-tier token, checks: manifest.biometric_collection null, uploads dir empty, last audit row is `biometric_erasure`/`full_erasure` with `erased`/`success`, chain_verified=true. Writes a hashed report to `reports/biometric/`. | Smoke-tested live |
| **`scripts/staffing/biometric_destruction_report.sh`** (NEW) | Monthly destruction-event aggregation. Anonymizes candidate IDs (sha256-12 prefix), counts by scope + trigger, flags anomalies. Smoke-test on May 2026 data found 1 historical `biometric_erasure`/`consent_withdrawal` event (test fixture). | Smoke-tested live |
| **`docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md`** (NEW) | Captures the rotation procedure operationalized after the 2026-05-05 `/tmp` wipe incident. Covers: when to rotate, pre-rotation snapshot, atomic-swap procedure, post-rotation verification (incl. expected pre-rotation chain tamper-detect under new key), recovery from lost key, ⚖ counsel notes. | Authored |
| **`docs/counsel/COUNSEL_REVIEW_PACKET_2026-05-05.md`** + `bundle_counsel_packet.sh` (NEW) | Cover note bundling all eng-staged BIPA docs for counsel review with per-doc questions, sign-off checklist, recommended review sequence. Bundler script tarballs the 8 referenced files + emits a SHA-256 manifest. Tarball ready for transmission: `reports/counsel/counsel_packet_2026-05-05.tar.gz`. | Bundled, ready to send |
### Eng follow-up that this wave surfaced
- **Double-upload file leak — DIAGNOSED + FIXED** (2026-05-05 same wave). `verify_biometric_erasure.sh` smoked WORKER-2 and surfaced a stranded photo file. Investigation showed:
  - The file was 13 bytes of test fixture (`ff d8 ff d9 + ASCII "TESTBYTES"`), byte-identical to the unit-test fixture at `biometric_endpoint.rs:841`. NO PII, NO biometric content, NO synthetic-face content. Came from manual integration testing on 2026-05-03.
  - Audit log timeline showed two consecutive uploads (09:54, 10:04) followed by one erasure (10:22). The erasure unlinked only the SECOND file (which the manifest pointed at by then); the first file was orphaned because the second upload had silently overwritten `manifest.data_path`.
  - **Real bug found**: the upload handler did NOT refuse a second upload to a subject with `biometric_collection.is_some()`. Patched `process_upload` to return HTTP 409 + `error: "biometric_already_collected"` when a re-upload is attempted; operator must explicitly POST `/biometric/subject/{id}/erase` first.
  - Stranded test file removed (`rm` of the 13-byte fixture).
  - New unit test `second_upload_without_erase_returns_409` asserts the 409 + that the first photo's data_path remains unchanged + that the first file remains untouched on disk.
  - Existing `repeated_uploads_grow_the_chain` replaced with `upload_erase_upload_grows_the_chain_cleanly` (covers the legitimate re-collection cycle: upload → erase → upload, chain grows to 3 rows).
  - Existing `content_type_with_parameters_accepted` test updated to use two distinct subjects (it had used one subject for two uploads to test content-type parsing — now would 409).
  - **22 biometric_endpoint tests + 59 catalogd lib tests all green** post-patch (was 21+58 pre-patch).
  - Production posture: gateway binary needs rebuild (`cargo build --release`) + `systemctl restart lakehouse.service` to pick up the new 409 behavior in live traffic.
- **Pre-rotation chain tamper-detect (expected, not a bug).** WORKER-{1..5} had pre-2026-05-05 audit chains under the prior `LH_SUBJECT_AUDIT_KEY`. Under the new key (post-`/tmp` wipe rotation), those chains correctly tamper-detect. The rotation runbook §4.4 documents this as expected; a §2.2 pre-rotation snapshot is what would prove they were intact pre-rotation if defensibility ever needs it.
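The double-upload fix reduces to a single guard in the upload path — sketched here with illustrative names (the real check is Rust, `biometric_collection.is_some()` in `process_upload`):

```typescript
type Manifest = { biometric_collection: { data_path: string } | null };

// Refuse a second upload while a collection is already on record; the
// operator must erase first, so no manifest pointer is ever silently
// overwritten and no file is orphaned.
function uploadGuard(m: Manifest): { http: number; error?: string } {
  if (m.biometric_collection !== null) {
    return { http: 409, error: "biometric_already_collected" };
  }
  return { http: 200 };
}
```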
### What's blocking production cutover NOW (after this wave)
- **Counsel calendar:** the four sign-off items in `COUNSEL_REVIEW_PACKET_2026-05-05.md` (retention schedule, consent template, destruction runbook, pre-identityd attestation). The packet tarball is ready; ⚖ counsel is the bottleneck.
- **Nothing else.** Engineering is no longer the long pole.
### Phase 1.6 BIPA gates — status table (this is the final post-Option-C state)
| # | Gate | Status |
|---|---|---|
| 1 | Public retention schedule | **eng-staged**, revised for Option C, ready for counsel |
| 2 | Informed consent template | **eng-staged**, revised for Option C, ready for counsel |
| 3a | Photo upload endpoint | **DONE** (shipped `f1fa6e4`, 11 unit tests, live verified) |
| 3b | Deepface classification | **DECIDED 2026-05-05: Option C (defer)** |
| 4 | Name → ethnicity inference removal | **DONE** (shipped, 4/4 mcp-server tests pass) |
| 5 | Destruction runbook + erasure endpoint | **eng-DONE** (`848a458`, 21 tests). Runbook scripts (verify + report) shipped 2026-05-05. Counsel review pending. |
| §2 | Pre-identityd attestation | **eng-DONE** (3/3 evidence checks). Awaits J + counsel signature. |
| §3 | Employee training | **deferred** (consolidated into runbook §7 acknowledgment for current operator population) |
---
## WHAT LANDED 2026-05-03 (16 commits this wave — local-first audit substrate + Phase 1.6 BIPA gates)
The dominant work today: **`docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md` Steps 1-8 SHIPPED end-to-end** + **5 of 7 Phase 1.6 BIPA pre-launch gates** + **6th cross-runtime parity probe**. Wave was structured as eight ship-then-scrum cycles — every wave caught real bugs, every fix wave landed within the same session.
| Commit | What | Verified |
|---|---|---|
| `d259909` | catalogd: Step 1 — `SubjectManifest` type + Registry CRUD | 17 catalogd subject tests PASS |
| `d16131b` | catalogd: Step 2 — `SubjectAuditWriter` HMAC-SHA256 chain + per-subject Mutex + canonical-JSON via BTreeMap | tamper-detection + concurrent-append race tests PASS |
| `bce6dfd` | catalogd: Step 3 — `bin/backfill_subjects` (BIPA-defensible defaults: vertical=unknown, consent=pending_backfill_review, retention=4yr) | 100 subjects loaded into live catalog |
| `fef1efd` | gateway: Step 4 — wire SubjectAuditWriter into `/v1/chat` tool dispatch + `audit_subject_hits_in` (inline, not spawn) | tool calls log accessor.kind="gateway_lookup" |
| `cd8c59a` | gateway: Step 5 — `AuditingWorkerLookup` decorator wraps validator's WorkerLookup; spawns audit on every find() | live `/v1/validate` produces audit rows |
| `e38f357` | subjects Steps 1-4 fixes from cross-lineage scrum (concurrency race, schema-evolution HMAC drift, hardcoded "success" classifier) | 41 tests green |
| `15cfd76` | catalogd + gateway: Step 6 — `/audit/subject/{id}` legal-tier endpoint with constant-time-eq token check + tampering detection | live curl returns chain_verified=true |
| `2a4b316` | subjects 2nd scrum fix wave (token min 16→32, chain_root from full chain via `chain_tip()`, rebuild collision warn, tightened result-state heuristic) | 17 catalogd + 6 gateway tests PASS |
| `8fc6238` | catalogd: Step 7 — `bin/retention_sweep` (BIPA-aware on biometric clock, idempotent across daily runs, no auto-mutation) | 8 sweep tests PASS, live verified at `--as-of 2031-06-01` flagging 100/100 expired |
| `2413c96` | catalogd: Step 8 — `bin/parity_subject_audit` (Rust side of cross-runtime parity probe) | known-answer + verify modes match Go byte-for-byte |
| `2222227` | parity helper hardening (panic-noise → die() helper, abs path stripped from doc) from scrum | parity probe still 6/6 |
| `4708717` | **Phase 1.6 BIPA wave** — Gate 4 absence test (4/4 with bypass coverage), §2 attestation script (3/3 evidence checks), Gate 1/2/5 doc scaffolds with ⚖ COUNSEL markers | 4/4 mcp-server Bun tests, 3/3 evidence on live data |
| `c7aa607` | Phase 1.6 scrum fixes — schema fingerprint hashes name+type+nullable, Gate 4 catches object-literal + class-field bypasses, pyarrow dep gate, item 7 deferral rationale | 4/4 + 3/3 still pass |
| `f1fa6e4` | **Phase 1.6 Gate 3a** — `crates/catalogd/src/biometric_endpoint.rs`: `POST /biometric/subject/{id}/photo` with consent gate, quarantined storage (mode 0700/0600), audit chain link, `BiometricCollection` field on SubjectManifest | 11 unit tests PASS, live roundtrip 200 |
| `3708e6a` | Gate 3a scrum fixes — transactional rollback on audit failure (BIPA convergent BLOCK), Content-Type parameter handling, relative data_path, ts+uuid filename, dead code removed | 11 tests + cross-runtime parity 6/6 |
| `7e0112b` | retention_sweep: stray indent fix on biometric_collection field | sweep tests still 8/8 |
| `848a458` | **Phase 1.6 Gate 5** — `POST /biometric/subject/{id}/erase` per BIPA destruction runbook. Two scopes (biometric_only / full); audit row appended BEFORE photo unlink so the chain has legal proof of intent even if file delete fails; manifest rolled back on audit failure. Trigger taxonomy: retention_expiry / consent_withdrawal / rtbf / court_order. | 21 unit tests (10 erasure-specific) PASS |
| `8ec43e0` | **Phase 1.6 Gate 3b** — deepface integration design doc (Option A subprocess / Option B ONNX-in-Rust / **Option C defer**). Recommends C: BIPA-safest, classifications field stays None, all load-bearing surfaces (consent + audit + retention + erasure) ship without it. Forces "do we actually need classifications" to be answered by product, not spec inertia. | doc-only |
**Cross-runtime parity (post-this-wave):** 6 probes, 38/38 byte-identical assertions —
`validator(6/6) + extract_json(12/12) + session_log(4/4) + materializer(2/2) + embed(8/8) + subject_audit(6/6)`.
Run `cd /home/profit/golangLAKEHOUSE && for p in scripts/cutover/parity/*.sh; do bash "$p"; done` to re-verify.
**Three runtime-divergence classes caught + fixed by the parity probe authoring loop** (cataloged because they recur):
1. Go `omitempty` on string fields strips empty values that Rust serde always emits → broken HMAC
2. `time.RFC3339Nano` truncates trailing-zero nanoseconds where chrono `AutoSi` keeps 9 digits → broken HMAC
3. Go `json.Marshal` HTML-escapes `<>&` where serde keeps literal → broken HMAC on any field with those chars
All three have regression-locked tests; structural impossibility going forward.
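All three classes share one root cause: the HMAC is computed over serialized bytes, so any byte-level divergence between runtimes breaks parity even when the JSON is semantically equal. A minimal illustration of class 1 (the omitted-empty-field case; field names and key are made up for the demo):

```typescript
import { createHmac } from "node:crypto";

// Two renderings of the same event: one emits the empty `result` field
// (serde-style), one drops it (Go omitempty-style).
const withEmpty = JSON.stringify({ kind: "lookup", result: "" }); // {"kind":"lookup","result":""}
const dropped = JSON.stringify({ kind: "lookup" });               // {"kind":"lookup"}

const key = Buffer.from("demo-key-not-the-real-one");
const hmacOf = (s: string) => createHmac("sha256", key).update(s).digest("hex");

// Same event, different bytes ⇒ different HMACs ⇒ broken chain parity.
const divergent = hmacOf(withEmpty) !== hmacOf(dropped);
```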
**Phase 1.6 BIPA pre-launch gates — status table:**
| Item | Status | Evidence |
|---|---|---|
| Gate 1 — public retention schedule | eng-staged, ⚖ counsel pending | `docs/policies/consent/biometric_retention_schedule_v1.md` |
| Gate 2 — informed consent template | eng-staged, ⚖ counsel pending | `docs/policies/consent/biometric_consent_template_v1.md` |
| Gate 3a — photo-upload endpoint | **DONE** | 11 unit tests + live `POST /biometric/subject/{id}/photo` |
| Gate 3b — deepface classification | **design doc shipped** (`8ec43e0`) — recommends Option C (defer); awaits product confirmation | `docs/PHASE_1_6_BIPA_GATES.md` Gate 3b section |
| Gate 4 — name→ethnicity removal | **DONE** | `mcp-server/phase_1_6_gate_4.test.ts` 4/4 with bypass coverage |
| Gate 5 — destruction runbook + erasure endpoint | **eng-DONE** (`848a458`); ⚖ counsel review of runbook still pending | `docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md` + `POST /biometric/subject/{id}/erase` (21 tests) |
| §2 cryptographic attestation | eng-DONE, signature pending | `docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md` (SHA-256 evidence hash, 3/3 checks pass on live data) |
| §3 employee training | deferred | conditional on operator population size |
**Calendar bottleneck:** counsel review of items 1/2/5-runbook/§2 attestation. Engineering long pole is Gate 3b (deepface) — design doc landed (`8ec43e0`); needs product confirmation that classifications are required before engineering starts. Recommendation in doc is Option C (defer) on BIPA-safety grounds.
**Operational state:**
- `LH_SUBJECT_AUDIT_KEY=/etc/lakehouse/subject_audit.key` (32-byte HMAC signing key, mode 0400) loaded into systemd unit. **Moved off /tmp 2026-05-05** — /tmp wipes on reboot, which on May 5 disabled `/audit` + `/biometric` endpoints (gateway fails-closed at `crates/gateway/src/main.rs:459` if signing key is absent). Persistent path is per spec line 112.
- `LH_LEGAL_AUDIT_TOKEN_FILE=/etc/lakehouse/legal_audit.token` (44-char legal-tier token, mode 0400) loaded into systemd unit
- **Key rotation 2026-05-05:** prior key was lost when /tmp wiped on reboot. New key generated at canonical path. The 5 pre-rotation audit chains for `WORKER-{1..5}` (backfill data with `consent=pending_backfill_review`) will tamper-detect under the new key — expected and correct behavior on key rotation, not a bug. New chain entries from 2026-05-05 forward verify cleanly.
- `data/_catalog/subjects/` holds 100 backfilled `WORKER-N.json` manifests + per-subject `WORKER-N.audit.jsonl` HMAC chains
- `data/biometric/uploads/<safe_id>/<ts>_<uuid>.<ext>` quarantined photo storage (mode 0700 dir / 0600 file). 2 photos uploaded for WORKER-2 during live verify.
- `/audit/subject/{id}` mounted on gateway with chain_verified=true on every probe
- `/biometric/subject/{id}/photo` mounted on gateway, returns 403 unless `consent.biometric.status="given"`
---
## WHAT LANDED 2026-05-01 → 2026-05-02 (10 commits — Lance gauntlet + cross-runtime parity wave)
| Commit | What | Verified |
|---|---|---|
| `5d30b3d` | lance: auto-build doc_id btree in `lance_migrate` handler | doc-fetch ~5ms (was ~100ms full scan) on scale_test_10m |
| `044650a` | lance-bench: same scalar build post-IVF (matches gateway) | cargo check clean |
| `7594725` | lance: 4-pack — `sanitize_lance_err` + 7 unit tests + 9-probe smoke + 10M re-bench | smoke 9/9 PASS, tests 7/7 PASS |
| `98b6647` | gateway: `IterateResponse.trace_id` echoed; session_log_path enabled | parity probes see one unified JSONL |
| `57bde63` | gateway: trace-id propagation + coordinator session JSONL (Rust parity with Go wave) | session_log_parity 4/4 |
| `ba928b1` | aibridge: drop Python sidecar from hot path; AiClient → direct Ollama | aibridge tests 32/32 PASS, /ai/embed live 768d |
| `654797a` | gateway: pub `extract_json` + `parity_extract_json` bin | extract_json_parity 12/12 |
| `c5654d4` | docs: pointer to `golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md` | — |
| `150cc3b` | aibridge: LRU embed cache, 236× RPS warm (78ms → 129us p50) | load test |
| `9eed982` | mcp-server: /_go/* pass-through for G5 cutover slice | — |
| `6e34ef7` | gitignore: stop tracking 100+ runtime ephemera (data/_*, lance, logs, node_modules) | untracked dropped 100+ → 0 |
| `41b0a99` | chore: add 33 real items that were sitting untracked (scripts, scenarios, kimi reports, dev UIs) | clean working tree |
**Cross-runtime parity (post-this-wave):** 32/32 across 5 probes — `validator(6/6) + extract_json(12/12) + session_log(4/4) + materializer(2/2) + embed(8/8)`. Run `cd /home/profit/golangLAKEHOUSE && for p in scripts/cutover/parity/*.sh; do bash "$p"; done` to re-verify.
**Lance backend (was untested 5 days ago, now gauntlet-ready):**
- `cargo test -p vectord-lance --release` → 7/7 PASS
- `./scripts/lance_smoke.sh` → 9/9 PASS against live gateway
- `reports/lance_10m_rebench_2026-05-02.md` — search warm ~20ms / cold ~46ms median, doc-fetch ~5ms post-btree
---
## VERIFIED WORKING RIGHT NOW
### The client demo (Staffing Co-Pilot)
**Public URL:** `https://devop.live/lakehouse/` — 200, "Staffing Co-Pilot" (159 KB SPA, leaflet maps, dark theme).
**Local URL:** `http://localhost:3700/` — same page, served by `mcp-server/index.ts` (PID 1271, started 09:48 CDT today).
**The staffers console** (the one the client was thoroughly impressed with):
- `https://devop.live/lakehouse/console` — 200, "Lakehouse — What Your Staffing System Would Do" (26 KB)
- Pulls project index via `/api/catalog/datasets` (36 datasets) + playbook memory via `/api/vectors/playbook_memory/stats` (4,701 entries with embeddings, real ops like *"fill: Maintenance Tech x2 in Milwaukee, WI"*)
Client-visible flow that works end-to-end on the public URL:
| Endpoint | Sample output |
|---|---|
| `GET /api/catalog/datasets` | 36 datasets indexed: timesheets 1M, call_log 800K, workers_500k 500K, email_log 500K, workers_100k 100K, candidates 100K, placements 50K, job_orders 15K, successful_playbooks_live 2,077 |
| `GET /api/vectors/playbook_memory/stats` | 4,701 fill operations with embeddings |
| `GET /system/summary` | 36 datasets, 2.98M rows, 60 indexes, 500K workers loaded, 1K candidates |
| `POST /intelligence/staffing_forecast` | 744 Production Workers needed in 30d, 11,281 bench (4,687 reliable), coverage 1,444%, risk=ok. Same for Electrician (need 32, bench 2,440) and Maintenance Tech (need 17, bench 5,004). |
| `POST /intelligence/permit_contracts` | permit `3442956` $500K → 3 Production Workers, 886-candidate pool, 95% fill, $36K gross. 5 more Chicago permits with 8 workers each, same pool, 95% fill, $96K each. |
| `POST /intelligence/market` | major Chicago permits ranked: $730M O'Hare, $615M 307 N Michigan, $580M casino, $445M Loop transit (real geo coords). |
| `POST /intelligence/permit_entities` | architects + contractors from permit contacts (e.g. "KACPRZYNSKI, ANDY", "SLS ELECTRICAL SERVICE"). |
| `POST /intelligence/activity` + `/intelligence/arch_signals` + `/intelligence/chat` | all 200 |
The demo tells the story: *"upcoming Chicago contracts → workers needed → coverage from the bench → architects/contractors involved → revenue and margin."* That's the "live data + anticipating contracts + complete workflow" pitch — working as of right now.
### Backend, verified live this session
| Surface | State |
|---|---|
| Gateway `:3100` | up, 4 providers configured, `/v1/health` 200 with 500K workers loaded |
| MCP server `:3700` (Co-Pilot demo) | up, all `/intelligence/*` endpoints respond |
| VCP UI `:3950` | started this session, `/data/*` 200, real numbers |
| Observer `:3800` | ring full (2,000/2,000) — older events evicted, query Langfuse for 24h-ago state |
| Sidecar `:3200` | up |
| Langfuse `:3001` | recording, `gw:/log` + `v1.chat:openrouter` traces visible |
| LLM Team UI `:5000` | up, only `extract` mode registered |
| OpenCode fleet | **40 models reachable through one `sk-*` key** (verified live `GET https://opencode.ai/zen/v1/models`) |
OpenCode catalog (live):
- Claude: opus-4-7, opus-4-6, opus-4-5, opus-4-1, sonnet-4-6, sonnet-4-5, sonnet-4, haiku-4-5
- GPT-5: 5.5-pro, 5.5, 5.4-pro, 5.4, 5.4-mini, 5.4-nano, 5.3-codex-spark, 5.3-codex, 5.2, 5.2-codex, 5.1-codex-max, 5.1-codex, 5.1-codex-mini, 5.1, 5-codex, 5-nano, 5
- Gemini: 3.1-pro, 3-flash
- GLM: 5.1, 5
- Minimax: m2.7, m2.5
- Kimi: k2.6, k2.5
- Qwen: 3.6-plus, 3.5-plus
- Other: BIG-PKL (was a typo-prone name in the catalog, model id starts with "big-pkl-something")
- Free tier: minimax-m2.5-free, hy3-preview-free, ling-2.6-flash-free, trinity-large-preview-free
### The substrate (frozen — do not re-architect)
- Distillation v1.0.0 at tag `e7636f2` — **145/145 bun tests pass, 22/22 acceptance, 16/16 audit-full**
- Output: `data/_kb/distilled_{facts,procedures,config_hints}.jsonl` + `data/vectors/distilled_{factual,procedural,config_hint}_v20260423102847.parquet`
- Auditor cross-lineage: Kimi K2.6 ↔ Haiku 4.5 alternation, Opus auto-promote on diffs >100k chars, **per-PR cap=3 with auto-reset on new head SHA**
- Pathway memory: 88 traces, 11/11 successful replays (probation gate crossed)
- Mode runner: 5 native modes; `codereview_isolation` is default; composed-corpus auto-downgrade verified Apr 26 (composed lost 5/5 vs isolation, p=0.031)
### Matrix indexer
30+ live corpora including:
- 5 versions of `workers_500k_v1..v9` (50K embedded chunks each)
- 11 batched 2K-row shards `w500k_b3..b17`
- `chicago_permits_v1` (3,420), `resumes_100k_v2` (100K candidates), `ethereal_workers_v1` (10K)
- `lakehouse_arch_v1` (2,119), `lakehouse_symbols_v1` (2,470), `lakehouse_answers_v1` (1,269), `scrum_findings_v1` (1,260)
- `kb_team_runs_v1` (12,693) + `kb_team_runs_agent` (4,407) — LLM-team play history embedded
- `distilled_factual_v20260423102507` (8) — distillation output
### Code health
- `cargo check --workspace`: **0 warnings, 0 errors**
- `bun test auditor + tests/distillation`: **145/145 pass**
- `ui/server.ts` + `auditor.ts` bundle clean
---
## DO NOT RELITIGATE
- **PR #11 is merged into `origin/main` as `ed57eda`** — do not "still need to merge PR #11."
- **Distillation tag `distillation-v1.0.0` at `e7636f2` is FROZEN** — do not re-architect schemas, scorer rules, audit fixtures.
- **Kimi forensic HOLD verdict (2026-04-27) was 2/8 false + 6/8 latent** — do not re-debate, see `reports/kimi/audit-last-week-full.md`.
- **`candidates_safe` `vertical` column bug** — fixed at catalog metadata layer in commit `c3c9c21`. Do not "discover" it again.
- **Decisions A/B/C/D from `synthetic-data-gap-report.md`** — all four scripts shipped today (`d56f08e`, `940737d`, `c3c9c21`). Do not "ask J for approval."
- **`workers_500k.phone` type fixup** — already string. The fixup script is idempotent; running it is a no-op.
- **`client_workerskjkk` typo dataset** — was breaking every SQL query (catalog had it registered, file didn't exist). Removed via `DELETE /catalog/datasets/by-name/client_workerskjkk` this session. Do not re-add. Adding a startup gate that errors on unrecognized parquet names is the long-term fix per now.md Step 2C.
- **Python sidecar dropped from hot path 2026-05-02 (`ba928b1`)** — AiClient calls Ollama directly. Do not "wire python embedding back in." `lab_ui.py` + `pipeline_lab.py` keep running as dev-only UIs (not on the runtime path).
- **Lance backend gauntlet (2026-05-02)** — sanitizer over all 5 routes, 7 unit tests, 9-probe smoke, 10M re-bench. The `doc_id` btree auto-builds inside `lance_migrate` AND `lance-bench`. Do not "discover" the missing scalar index again or the leaked filesystem paths in error bodies.
- **Cross-runtime parity = 32/32** across 5 probes in `golangLAKEHOUSE/scripts/cutover/parity/`. Do not "build a parity probe for X" without checking — validator, extract_json, session_log, materializer, and embed are all already covered.
- **Decisions tracker is `golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md`** — single living source of truth for cross-runtime decisions. As of 2026-05-02 it has 0 `_open_` code work items; only 2 strategic items left (Lance vs Parquet+HNSW-with-spilling, Go-vs-Rust primary cutover).
- **PRD line 70 is load-bearing — "Everything runs locally, no cloud APIs."** Yesterday's PR #13 violated this by routing customer hot-path inference to opencode/openrouter/ollama_cloud. **REVERTED 2026-05-03 (`d054c0b`).** The customer hot path (modes.toml staffing_inference, doc_drift_check; execution_loop overseer escalation) is now local Ollama (qwen3.5:latest). Cloud providers stay configured in providers.toml for **explicit dev-tool opt-in only** (scrum, auditor, bot/propose). Do NOT re-add cloud models to the hot path defaults — the customer demo runs on local + free.
- **`/v1/usage` shows by_provider=ollama only** for any customer-shape request. If you see by_provider including kimi/openrouter/opencode/ollama_cloud for normal /v1/iterate or /v1/respond traffic, something has regressed. Verify with: `curl -sf http://127.0.0.1:3100/v1/usage | jq .by_provider`.
- **`./scrum` is a TOOL, not architecture.** Lives at repo root. J runs it manually. Outputs findings to `data/_kb/scrum_findings.jsonl` (the KB). Findings inform development; they do NOT auto-fold into PRDs/design docs/code. If you find yourself proposing to "wire scrum into [some pipeline]" — stop. It's J's tool.
- **Test code in main is ACTIVELY being cleaned out.** 2026-05-03 commits `6aafd41` + `f4ebd22` removed 12 orphaned dev experiments (~2900 LOC) from `tests/real-world/` and `scripts/`. If you find more zero-reference experimental files, surface them — don't auto-delete unless the pattern is clearly a one-time-experiment-with-zero-consumers like the ones already removed.
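The `/v1/usage` provider check above is easy to script. A minimal sketch (the guard function is hypothetical; the key shape follows the `by_provider` object the endpoint returns):

```typescript
// Hypothetical guard: customer-shape traffic must only ever bill to ollama.
// Any other provider key under by_provider means the local-only hot path regressed.
function hotPathIsLocal(byProvider: Record<string, unknown>): boolean {
  return Object.keys(byProvider).every((p) => p === "ollama");
}

hotPathIsLocal({ ollama: { requests: 42 } });              // true — healthy
hotPathIsLocal({ ollama: {}, openrouter: { cost: 0.1 } }); // false — regression
```

Feed it the parsed output of `curl -sf http://127.0.0.1:3100/v1/usage | jq .by_provider`.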
---
## FIXES MADE THIS SESSION (2026-04-27 evening)
1. **`crates/gateway/src/v1/iterate.rs:93`** — `state` → `_state` (cleared the one cargo warning).
2. **`lakehouse-ui.service` (Dioxus)** — disabled. Was failing 7,242 times against a missing `target/dx/ui/debug/web/public` build dir. `systemctl stop && disable`.
3. **VCP UI on `:3950`** — started `bun run ui/server.ts` (PID 1162212, log `/tmp/lakehouse_ui.log`). `/data/*` endpoints now 200 with real data.
4. **`client_workerskjkk` catalog entry** — `DELETE /catalog/datasets/by-name/client_workerskjkk` removed the dead manifest. **This was the actual root cause** of `/system/summary` reporting `workers_500k_rows: 0` and the demo showing zero bench. Every SQL query was failing schema inference on the missing file before reaching its target table. Fixed → `workers_500k_rows: 500000`, `candidates_rows: 1000`, demo coverage flipped from "critical 0%" to actual percentages on devop.live/lakehouse.
## FIXES MADE THIS SESSION (2026-04-28 early — face pool)
5. **Synthetic StyleGAN face pool — 1000 faces, gender+race+age tagged.** `scripts/staffing/fetch_face_pool.py` fetches from thispersondoesnotexist.com; `scripts/staffing/tag_face_pool.py --min-age 22` runs deepface and excludes minors. `data/headshots/manifest.jsonl` now has gender (494 men / 458 women), race (caucasian 662 · east_asian 128 · hispanic 86 · middle_eastern 59 · black 14 · south_asian 3), age, and 48 minor exclusions. Server pool = 952 servable faces.
6. **`mcp-server/index.ts:1308` `/headshots/:key` route** — gender×race×age intersection bucketing with graceful fallback (gender-only → all). Same key always returns same face; different keys spread evenly.
7. **`/headshots/_thumbs/` pre-resized 384×384 webp** (~53× smaller: 587 KB → ~11 KB). Without this, 40-card grids overran Chrome's parallel-connection budget and ~75% of tiles never finished decoding. Generated via parallel ffmpeg (`xargs -P 8`); `.gitignore`d.
8. **`mcp-server/search.html` + `console.html`** — dropped `img.loading='lazy'`. With 11KB thumbs, eager load is cheap (~500KB for 50 cards) and avoids the off-screen race that lazy decode produced.
9. **ComfyUI on-demand uniqueness — `serve_imagegen.py:32`** added `seed` to `_cache_key()` (was caching by prompt only — 3 different worker seeds collapsed to 1 cached image). Verified: seed=839185194/195/196 → 3 distinct md5s.
10. **`mcp-server/index.ts:1234` `/headshots/generate/:key`** — ComfyUI hot-path that derives a deterministic-per-worker seed via djb2-style hash; cold ~1.5s, cached ~1ms. Worker prompt format: `professional corporate headshot portrait of a {age}-year-old {race} {gender}, {role}, neutral expression, plain studio background, soft natural lighting, sharp focus, photorealistic, dslr`. Cache at `data/headshots_gen/` (gitignored, regeneratable).
11. **Confidence-default name resolution** in `search.html`: `genderFor()` and `guessEthnicityFromFirstName()` lookup tables (FEMALE_NAMES, MALE_NAMES, NAMES_HISPANIC, NAMES_BLACK, NAMES_SOUTH_ASIAN, NAMES_EAST_ASIAN, NAMES_MIDDLE_EASTERN). Xavier → man+hispanic, Aisha → woman+black, etc. Every worker resolves to a face-pool bucket.
End-to-end verified: playwright run on `https://devop.live/lakehouse/?q=forklift+operators+IL` → 21/21 cards loaded, 0 broken, all 384×384 webp thumbs.
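The deterministic-seed idea behind fixes 9–10 can be sketched as follows (djb2 shown for illustration; the actual hash in `mcp-server/index.ts` may differ in detail):

```typescript
// Illustrative djb2 hash: the same worker key always yields the same 32-bit
// seed, so the ComfyUI cache key (prompt + seed) stays stable per worker while
// distinct workers get distinct seeds and therefore distinct cached images.
function djb2(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    h = (Math.imul(h, 33) + s.charCodeAt(i)) >>> 0; // classic h*33 + c, kept in uint32
  }
  return h;
}

const seed = djb2("WORKER-100"); // deterministic: re-deriving gives the same seed
```

This is why adding `seed` to `_cache_key()` in fix 9 was necessary — without it, three distinct seeds still collapsed to one cached image.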
---
## ⚠ PRODUCTION-READY BLOCKER (2026-05-03)
**Audit-trail capability is the gate to client signature.** Smoke + parity tests prove the surface compiles; they do NOT prove an audit response can be produced for a specific person. Staffing client won't sign without defensible discrimination-claim response capability.
**Authoritative document:** `docs/AUDIT_TRAIL_PRD.md` — drafted 2026-05-03. Defines worked example (John Martinez at Warehouse B), the per-decision audit row schema, the surface map of where decisions happen today, current-state-vs-target gap table, and 9-phase implementation sequence.
**Phase 1 (discovery walk) requires NO J approval — it's read-only.** Phases 2+ have explicit open questions in §10 of the PRD that need J's call before they can start.
Until phase 9 exit criterion is met, **do not claim "production-ready" on customer-facing surfaces.** Internal substrate (Lance, sidecar drop, parity probes) is solid; subject-of-record audit story is not.
---
## OPEN — but not blocking the demo
| Item | What | When to act |
|---|---|---|
| `modes.toml` `staffing_inference.matrix_corpus` | still says `workers_500k_v8`. v9 in vector index is from Apr 17 (raw-sourced, not safe-view). The new `build_workers_v9.sh` rebuilds from `workers_safe`. | Run when you have 30+ min for the rebuild. |
| PRs #6, #7, #10 | All closed 2026-05-02 — superseded / empty / stale. PR #12 merged 2026-05-03 (`a5d9070`). PR #13 merged 2026-05-03 (`feb638e`). | Done. |
| `test/enrich-prd-pipeline` branch | 35 unmerged commits, includes more-evolved auditor/inference.ts (666 vs main's 580 lines), curation+fact-extractor wiring | Reconcile or formally archive — see `memory/project_unmerged_architecture_work.md`. |
| `federation-hnsw-trials` stash | Lance + S3/MinIO prototype, `aws-config` crate added, 708 insertions | Phase B from EXECUTION_PLAN.md — revisit when Parquet vector ceiling actually hurts. |
| `candidates` manifest drift | manifest 100K vs SQL 1K. Cosmetic. | Run a metadata resync if it matters. |
---
## RUNTIME CHEATSHEET
```bash
# Verify the demo (public + local both work)
curl -sS https://devop.live/lakehouse/ # Co-Pilot HTML
curl -sS https://devop.live/lakehouse/console # staffers console
curl -sS -X POST https://devop.live/lakehouse/intelligence/staffing_forecast \
-d '{}' -H 'content-type: application/json' \
| jq '.forecast[] | {role, demand_workers, bench_total, coverage_pct, risk}'
# Restart sequence (after Rust changes)
sudo systemctl restart lakehouse.service # gateway :3100
sudo systemctl restart lakehouse-auditor # auditor daemon
sudo systemctl restart lakehouse-observer # observer :3800
# UI bun on :3950 is NOT systemd-managed (lakehouse-ui.service is disabled).
# Restart manually: kill <pid>; nohup bun run ui/server.ts > /tmp/lakehouse_ui.log 2>&1 &
# Health checks
curl -sS http://localhost:3100/v1/health | jq # workers_count, providers
curl -sS http://localhost:3100/vectors/pathway/stats | jq
curl -sS http://localhost:3100/v1/usage | jq # since-restart cost
curl -sS http://localhost:3700/system/summary | jq # dataset counts
```
---
## VISION — what we're actually building (not what's done)
J's framing for the legacy staffing company:
- Pull live data, anticipate contracts based on Chicago permits → real architect/contractor associations, headcount, time period, money, scope.
- Hybrid + memory index → search large corpora cheaply.
- Email comes in → verify against contract; SMS comes in → alert when index changes.
- Real-time.
- Invent metrics nobody else has using the hybrid index.
- Next stage: workers download an app → geolocation clock-in → automatic responsiveness measurement, no user effort, with incentives for using it.
- Find people getting certificates (passive cert tracking).
- Pull union data → bring contracts that work for **employees**, not just employers.
- All metrics visible, nothing hidden, value-aligned with what each side actually needs.
If a future session is shaving away from this vision toward "fix the cutover" or "land Phase X," the vision wins. Phases are scaffolding for the vision, not the goal.
---
## CURRENT PLAN — fix the demo for the legacy staffing client
Built from playwright audit of the live demo (2026-04-27 evening). Each item ends in something the client can SEE, not internal cleanups.
**Demo state is anchored by git tag `demo-2026-04-27`** (commit `ed57eda`, the merge of PR #11). To restore code state: `git checkout demo-2026-04-27`. To restore runtime state: `DELETE /catalog/datasets/by-name/client_workerskjkk` (catalog hot-fix is not in git).
### P1 — Search box that actually filters (highest visible impact)
**Problem:** typing in `#sq` and pressing Enter fires `POST /intelligence/chat` with body `{"message":"<query>"}`. The state (`#sst`) and role (`#srl`) selects are ignored — never sent in the body. So every search returns a generic chat completion, never a SQL+vector hybrid filter against `workers_500k`. That is the "cached/generic response" the client sees.
**Fix:** in `mcp-server/search.html`, change the search-submit handler to call the real worker search endpoint with `{query, state, role, top_k}`. The MCP `search_workers` tool surface already exists; route the form there. Render returned worker rows in the existing card grid.
**Done when:** typing "forklift" + state IL + role "Forklift Operator" returns ≤ top_k IL Forklift Operators, and changing state to WI returns different workers.
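A minimal sketch of the request-body side of the P1 fix (field names assumed; confirm against the actual `search_workers` tool surface before wiring):

```typescript
// Build the worker-search body from the form controls. The key change vs the
// current handler: state/role are sent as filters instead of being dropped,
// and the target is the worker-search surface, not /intelligence/chat.
function buildSearchRequest(
  query: string,
  state: string,
  role: string,
  topK = 25,
): Record<string, unknown> {
  const body: Record<string, unknown> = { query, top_k: topK };
  if (state) body.state = state; // omit empty filters so server defaults apply
  if (role) body.role = role;
  return body;
}
```

The submit handler would then `fetch` the search endpoint with this JSON body and render the returned worker rows into the existing card grid.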
### P2 — Contractor-name click → `/contractor` profile page
**Problem:** clicking a contractor name in any rendered card stays on `/lakehouse/`. URL doesn't change.
**Fix:** wrap contractor names in `<a href="/contractor?name=<encoded>">`. The page `mcp-server/contractor.html` (14.8 KB, "Contractor Profile · Staffing Co-Pilot") already exists at `/contractor` and the data endpoint `/intelligence/contractor_profile` already returns rich data.
**Then check contractor.html actually shows:** full history of every record the database has on that contractor + heat map of locations underneath + relevant info (per J 2026-04-27). If the page is incomplete, finish it. Otherwise just wire the link.
**Done when:** clicking "KACPRZYNSKI, ANDY" opens a profile with: every Chicago permit they're contact_1 or contact_2 on, a leaflet map with markers for each address, and any matched workers from prior placements at their sites.
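The link-building half of P2 is one line, but the encoding matters because contractor names carry commas and spaces. A sketch (in `search.html` the `<a>` itself would be built via `createElement` + `textContent`, consistent with the no-innerHTML posture elsewhere in the stack):

```typescript
// Encode the contractor name so "KACPRZYNSKI, ANDY" survives the round trip
// to /contractor?name=... without breaking the query string.
function contractorHref(name: string): string {
  return "/contractor?name=" + encodeURIComponent(name);
}

contractorHref("KACPRZYNSKI, ANDY"); // "/contractor?name=KACPRZYNSKI%2C%20ANDY"
```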
### P3 — Substrate signal at the bottom shows the right numbers
**Problem:** J reports the bottom panel says "playbook memory empty, 80 traces 0 replies." Reality from the live endpoints: `/api/vectors/playbook_memory/stats` = 4,701 entries with embeddings; `/vectors/pathway/stats` = 88 traces, 11/11 replays.
**Fix:** find the renderer in search.html that builds the substrate signal panel; verify it's hitting the right endpoints and reading the right keys; fix shape mismatches.
**Done when:** bottom panel shows real numbers (4,701 playbooks, 88 traces, 11/11 replays) and references at least one specific recent operation from the playbook stats sample.
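One defensive pattern worth using while fixing the P3 renderer — tolerate key renames so a server-side shape change degrades visibly instead of silently showing 0 (key names below are assumptions; check the actual stats payloads):

```typescript
// Try several candidate keys for a stat; return null (render as "—") rather
// than a misleading 0 when none of them is a number.
function readStat(
  obj: Record<string, unknown>,
  ...keys: string[]
): number | null {
  for (const k of keys) {
    const v = obj[k];
    if (typeof v === "number") return v;
  }
  return null;
}

readStat({ entries: 4701 }, "entry_count", "entries"); // 4701
```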
### P4 — Top nav reflects today's architecture
**Problem:** Walkthrough/Architecture/Spec/Onboard/Alerts/Workspaces tabs all return 200 but content is from old architecture. Doesn't mention: gateway scratchpad, memory indexer, ranker, mode runner, OpenCode 40-model fleet, distillation substrate, auditor cross-lineage.
**Fix:** rewrite `mcp-server/proof.html` (or add a single new page "What's running" that replaces Architecture+Spec) to describe what's actually shipped as of `demo-2026-04-27`. Keep one architecture page, drop redundancy. Either complete or hide Onboard/Alerts/Workspaces — J's call which.
**Done when:** the architecture page tells a non-technical reader, in 2 minutes, what each piece does in coordinator-relatable terms ("intern that read every email", not "3-stage adversarial inference pipeline").
### P5 — Caching for the project-index build_signal (J flagged unfinished)
**Problem:** "we never finished our caching for project index build signal it's not pulling new information." Need to find what `build_signal` refers to. Likely a scrum/auditor signal that should rebuild the `lakehouse_arch_v1` corpus on commit but isn't wired up.
**Fix:** identify the build-signal pipeline (likely in `auditor/` or `crates/vectord/`), wire its emit to a corpus rebuild, verify by making a test commit and watching the new chunk appear in `/vectors/indexes` for `lakehouse_arch_v1`.
**Done when:** committing a new file to `crates/` causes `lakehouse_arch_v1` chunk_count to increase within N minutes.
### P0 — Anchor the demo state (DONE)
Tagged `ed57eda` as `demo-2026-04-27`. Future sessions: `git checkout demo-2026-04-27` to land in this exact code state.
---
## EXECUTION ORDER
1. **P1 first** — biggest visible bug, ~30-60 min
2. **P2 next** — contractor click is the second-biggest "doesn't work" the client sees, ~20 min if profile is mostly done
3. **P3** — small fix, big "looks alive" win
4. **P4** — biggest scope; might split across sessions
5. **P5** — feature work, only after the visible bugs are fixed
Each item commits independently with the format `demo: P<n> — <one-line>` so the commit log doubles as a progress journal. After each merge to main, re-tag `demo-latest` to point at the new HEAD.
Stop here and let J pick which item to start with. Do not silently extend scope.