Ran cross-lineage scrum on the discovery doc with the new model fleet
(opus + kimi-k2.6 + gemini-3-flash via Go gateway :4110, custom
"senior security architect" prompt). 3/3 reviewers responded with
substantive 800-1200 word reviews. Saved at /tmp/audit_scrum/.
5 convergent findings (≥2 reviewers) added as §10/C1-C5:
C1. §1F matrix-indexer "good for audit defensibility" claim is over-
claimed — walked back in TL;DR. Trace bodies unverified; treat as
SUSPECTED PII sink until §8.1 sampling completes.
C2. §1E (Langfuse) is the most dangerous leak — fix FIRST, ahead of
view-routing. Boundary-crossing leak (GDPR Art. 44 / CPRA sale /
SOC2 disposal). All 3 reviewers converge on this priority.
C3. Discrimination defense requires the FULL CANDIDATE POOL, not just
fills. EEOC UGESP (1978): need adverse-impact stats on everyone
who could have been picked. Phase 1 worked example missed this.
C4. BIPA / biometric exposure understated in findings (in PRD §10.5
but not translated to actionables). $1k-$5k per-violation regime.
C5. candidate_id must be promoted to top-level field in all JSONL
sinks. Grepping natural-language strings is not defensible audit
strategy. 3/3 reviewers converge.
11 single-reviewer high-value catches added as §10 single-reviewer
section: opus on LLM provider egress (8th PII path), Art. 22 right-
to-explanation, special-category data, DPIA/ROPA/DPA inventory; kimi
on sequential ID enumeration risk, Langfuse retention config, CCPA
de-identified-in-place vs crypto-shred, Bun common-mode failure,
cryptographic audit-trail integrity (Merkle/FRE 901), HIPAA BAA,
revised SELECT * effort estimate; gemini on data residency, "culture
fit" reasoning proxies, comparator-pool snapshot.
§9 reordered: sample first → defense-layer second → Langfuse
boundary third (was view-routing first per original draft;
boundary-crossing leak is higher priority per scrum).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
296 lines
34 KiB
Markdown
296 lines
34 KiB
Markdown
# Phase 1 Discovery — Subject + PII Surface Map
|
|
|
|
**Status:** Draft — 2026-05-03 · **Drafted by:** working session 2026-05-03 · **Companion to:** [`AUDIT_TRAIL_PRD.md`](AUDIT_TRAIL_PRD.md)
|
|
|
|
> **Purpose.** Read-only walk of both runtimes. Fills in the "UNKNOWN" cells in `AUDIT_TRAIL_PRD.md` §3 surface map and §7 current-state-vs-target table with file:line evidence. **No code changes.** Output: a complete picture of where subject identifiers + PII flow today, where they leak, and what an audit response could and could not produce right now.
|
|
|
|
---
|
|
|
|
## TL;DR — what the walk found
|
|
|
|
1. **A defense layer EXISTS but is BYPASSED.** Two PII-masked views (`candidates_safe`, `workers_safe`) live in `data/_catalog/views/` and correctly drop name/email/phone. **No production tool query uses them.** The MCP tool registry's `search_candidates` and `get_candidate` SQL templates query the raw `candidates` / `workers_500k` tables and return full PII to the LLM context. Worth considering: the views are functioning policy artifacts that nobody routes through.
|
|
|
|
2. **PII flows freely through the LLM substrate.** In a single fill scenario, candidate names + emails + phones traverse: SQL → tool_result → LogEntry → execution_loop log → `/v1/respond` HTTP response → Langfuse trace input + output → `data/_kb/outcomes.jsonl` (fills with name) → `data/_kb/overseer_corrections.jsonl` (operation + correction text). At least 7 distinct persistence/transmission paths per scenario.
|
|
|
|
3. **`candidate_id` is a stable token but not an isolated one.** It's a column on the same `workers_500k.parquet` that holds name/email/phone — there is no separate identity service. Joining candidate_id back to PII is one DataFusion query away. The audit-trail PRD's §5 identity-service intent is not yet built; the architectural separation does not exist.
|
|
|
|
4. **No `/audit/subject/{id}` endpoint exists.** Reconstructing "every decision about person X" today requires manual cross-correlation across 4+ JSONL files + Langfuse + pathway memory + observer events. There is no canonical query path; the audit response cannot be produced in any reasonable time today.
|
|
|
|
5. **Go side mirrors the Rust pattern.** Go validator's `rosterRow` carries Name; Go SessionRecord carries `Prompt` (truncated to 4000 chars) which contains the natural-language operation including candidate names. Cross-runtime parity in the PII-leak too.
|
|
|
|
6. **Append-only persistence is universal.** `outcomes.jsonl`, `overseer_corrections.jsonl`, `pathway_memory/state.json`, `sessions.jsonl`, Langfuse — all append-only. Right-to-be-forgotten under the current architecture requires the cryptographic-erasure approach from PRD §6 because *no* hot data store supports per-subject deletion today.
|
|
|
|
7. **The matrix-indexer fingerprint is subject-agnostic; trace bodies are UNVERIFIED.** `pathway_memory::PathwayTrace` fingerprints are keyed by `task_class + file_prefix + signal_class` — none of those are subject identifiers, which is structurally defensive. **However:** trace bodies (`reducer_summary`, `final_verdict`) are written from execution-loop output and are highly likely to leak PII. Per §8.1 these are unverified. Treat the matrix indexer as a SUSPECTED PII SINK until sampled — do NOT rely on "matrix can't drive discrimination" framing. (Walked back from earlier draft per cross-lineage scrum §10/C1: that claim was over-stated in 3/3 reviews.)
|
|
|
|
---
|
|
|
|
## §1 — PII flow paths in the Rust stack (file:line evidence)
|
|
|
|
### 1A — MCP tools query raw tables (PII enters the LLM context)
|
|
|
|
| File:line | What | PII columns returned |
|
|
|---|---|---|
|
|
| `crates/gateway/src/tools/registry.rs:96` | `search_candidates` SQL template | `first_name, last_name, phone, email, city, state` |
|
|
| `crates/gateway/src/tools/registry.rs:107` | `get_candidate` SQL template | `SELECT *` — returns ALL columns including hourly_rate_usd, address, anything in the table |
|
|
| `crates/gateway/src/tools/registry.rs:129` | `top_recruiters` SQL | recruiter names (employee PII, not candidate but still PII) |
|
|
| `crates/gateway/src/tools/registry.rs:141` | `engaged_unplaced_candidates` SQL | `c.candidate_id, c.first_name, c.last_name, c.phone, c.vertical` |
|
|
| `mcp-server/index.ts:1722, 2173, 2269` | address concatenation (`street_number + street_direction + street_name`) | full street addresses surfaced to MCP UI clients |
|
|
|
|
**Architectural gap:** the MCP tool registry has no "view-only" mode. Adding `_safe` view routing would be one `sql_template` rewrite per tool. **Today: every fill scenario invokes search_candidates → LLM sees full PII for every candidate in the result set.**
|
|
|
|
### 1B — LLM tool results captured in execution log
|
|
|
|
| File:line | What |
|
|
|---|---|
|
|
| `crates/gateway/src/execution_loop/mod.rs:30-39` | `LogEntry` struct: `{turn, role, model, kind, content: Value, at}`. `content` is the raw tool_result body (carries PII when kind=`tool_result`). |
|
|
| `crates/gateway/src/execution_loop/mod.rs:217` | `self.append(LogEntry::new(..., "tool_result", trimmed))` — every tool result appended to `self.log` |
|
|
| `crates/gateway/src/execution_loop/mod.rs:1319` | per-render cap: `kind == "tool_result"` gets 1200 char cap (vs 200 for other kinds). Trimming is for token economy, **not PII redaction** — names/emails fit in 1200 chars. |
|
|
| `crates/gateway/src/execution_loop/mod.rs:30-50` | `Fill` struct (in same file): `{candidate_id: String, name: String, reason: Option<String>}`. The `name` field is load-bearing per the PRD-defined FillProposal artifact contract. |
|
|
|
|
### 1C — Execution log returned to HTTP caller
|
|
|
|
| File:line | What |
|
|
|---|---|
|
|
| `crates/gateway/src/v1/respond.rs:111-119` | `RespondResponse { status, artifact, log: outcome.into_log(), iterations, error }` — full log array returned in JSON response body to whoever called `POST /v1/respond` |
|
|
| `crates/gateway/src/execution_loop/mod.rs:82-93` | `RespondOutcome::{Ok,Failed,Blocked}` all carry `log: Vec<LogEntry>` — no path drops the log on its way to the caller |
|
|
|
|
**Audit implication:** any HTTP client calling `/v1/respond` today receives full PII in the response. Authentication on `/v1/respond` is the only access control. Authorization is binary (either you can call it or you can't); there is no row-level filtering of which candidates you're allowed to see in the response.
|
|
|
|
### 1D — Persisted JSONL sinks on disk
|
|
|
|
| File:line | Path written | PII shape |
|
|
|---|---|---|
|
|
| `crates/gateway/src/execution_loop/mod.rs:508` | `data/_kb/outcomes.jsonl` (append) | `{operation, fills: [{candidate_id, name, reason?}], error, ...}` — observed live: `operation` carries natural-language fill request including role + city + state; `fills` carries the full Fill struct with name. |
|
|
| `crates/gateway/src/execution_loop/mod.rs:701` | `data/_kb/overseer_corrections.jsonl` (append) | `{operation, correction, sig_hash, ...}` — `correction` is the overseer's free-text guidance which often references specific candidates by name (e.g. "the executor picked Emily Garcia who is at fill capacity, try Maria Rodriguez instead") |
|
|
| `mcp-server/observer.ts:138-139` | `data/_observer/ops.jsonl` (append) | observer event log; PII content depends on what the operation context included. **Empty on this box right now** (file doesn't exist), but the writer is wired. |
|
|
| `mcp-server/observer.ts:161` | `data/_kb/observer_escalations.jsonl` (append) | sampled live: technical analysis fields (no PII observed in current rows), but `analysis` field is free-form LLM output and could include names depending on the escalation trigger. |
|
|
| `crates/gateway/src/v1/session_log.rs` (writer) → `/tmp/lakehouse-validator/sessions.jsonl` | per `lakehouse.toml` `[gateway].session_log_path` | `SessionRecord { ..., prompt, attempts: [{... raw}], artifact }` — **`prompt` carries the operation text (PII), `raw` carries each model attempt's raw text (PII when model reasoned about specific candidates), `artifact` carries the final FillProposal `{candidate_id, name}` shape on success.** |
|
|
|
|
### 1E — Langfuse external persistence
|
|
|
|
| File:line | What gets sent |
|
|
|---|---|
|
|
| `crates/gateway/src/v1/langfuse_trace.rs:166` | `body.input.messages = ev.input` — full message array (system + user + assistant + tool messages). Tool messages contain raw PII for tool_result entries. |
|
|
| `crates/gateway/src/v1/langfuse_trace.rs:191` | `body.output = ev.output` — full model generation. When the model emits a Fill or reasons about candidates, names appear here. |
|
|
|
|
Langfuse runs at `:3001` per memory, with credentials in `/etc/lakehouse/langfuse.env`. The Langfuse storage tier holds these traces. **PII leaves the lakehouse infrastructure boundary at this point** — Langfuse's storage layer (Postgres + ClickHouse, default config) holds it.
|
|
|
|
### 1F — Pathway memory
|
|
|
|
| File:line | What |
|
|
|---|---|
|
|
| `crates/vectord/src/pathway_memory.rs:169` | `PathwayTrace` struct definition |
|
|
| `data/_pathway_memory/state.json` | Persisted state — **need read** to confirm whether real names appear in traces. |
|
|
|
|
Pathway traces are keyed by `pathway_id = SHA256(task_class + file_prefix + signal_class)` — none of which are subject identifiers. **The fingerprint is structurally subject-agnostic.** Whether the trace BODY (kb_chunks, observer_signals, reducer_summary, final_verdict) leaks PII depends on what those fields contain at write time. From the struct shape: `reducer_summary` and `final_verdict` are strings written from execution loop output — **highly likely to leak PII** when summarizing fill outcomes. Need to confirm against live state.json.
|
|
|
|
---
|
|
|
|
## §2 — Defense layer that exists but is bypassed
|
|
|
|
`data/_catalog/views/candidates_safe.json`:
|
|
- Drops: `last_name, email, phone, hourly_rate_usd`
|
|
- Masks: `candidate_id` (keep first 3, last 2 chars — `CAND-...01` shape)
|
|
- Row filter: `status != 'blocked'`
|
|
- Visibility intent (per `description`): "Visible to recruiter / mode-runner agents."
|
|
|
|
`data/_catalog/views/workers_safe.json`:
|
|
- Drops: `name, email, phone, zip, communications, resume_text`
|
|
- Reason given: "resume_text + communications carry verbatim PII (full names) and there's no in-view text scrubber, so they're dropped wholesale"
|
|
- Source for the rebuilt `workers_500k_v9` vector corpus
|
|
|
|
**The `_safe` views are the right policy artifact.** The ARCHITECTURAL gap is at the SQL-template layer in `crates/gateway/src/tools/registry.rs` — the LLM-facing tools query the RAW tables, not the SAFE views. Three out of three candidate-touching tool templates (search_candidates, get_candidate, engaged_unplaced_candidates) bypass the safe view.
|
|
|
|
There is no enforcement preventing this. There is no test that asserts "tool SQL must reference only `_safe` tables." There is no warning logged when raw tables are queried via tool surface.
|
|
|
|
**Quick fix shape (NOT a phase 1 deliverable, just noting the change shape):** rewrite the tool sql_templates to `FROM candidates_safe` (or `FROM workers_safe`); add a build-time check that `crates/gateway/src/tools/registry.rs` only references `*_safe` tables; add a runtime gate in queryd that refuses LLM-attributed queries on raw tables. These would land in Phase 4 (subject tagging across substrates) or possibly Phase 1.5 (defense-layer enforcement) per the AUDIT_TRAIL_PRD.
|
|
|
|
---
|
|
|
|
## §3 — Identity / candidate_id provenance
|
|
|
|
| Question | Finding |
|
|
|---|---|
|
|
| Is `candidate_id` a stable token? | Yes — observed format `CAND-NNNNNN` (e.g. `CAND-000001`). Stable across the table. Used as join key in the schema (per `crates/ui/src/main.rs:160` cross-table join hints). |
|
|
| Is the candidate_id ↔ PII mapping in a separate service? | **No.** Both live in `data/datasets/workers_500k.parquet`. A SQL `SELECT name, email FROM candidates WHERE candidate_id = '...'` resolves the mapping in one query. |
|
|
| Is the mapping itself audited? | **No.** No log records "who looked up the PII for which candidate when." This is the identity-service gap from PRD §5. |
|
|
| Does anything ever generate a *different* token (UUID, opaque hash) instead of using candidate_id directly? | **No** — every tool, every validator, every persistence sink uses candidate_id as-is. |
|
|
|
|
**Production-ready implication:** the staffing client's lawyer asks "show me the access log for candidate X's PII" → we cannot produce one. Every access happens via SQL against `workers_500k`; no row-level access log exists.
|
|
|
|
---
|
|
|
|
## §4 — Go-side parity with the leak surface
|
|
|
|
| Rust file:line | Equivalent Go file:line | Same shape? |
|
|
|---|---|---|
|
|
| `crates/validator/src/lib.rs` (FillProposal carries `{candidate_id, name}`) | `internal/validator/fill.go` + `internal/validator/lookup_jsonl.go:23-31` (`rosterRow` carries `CandidateID, Name, Status, City, State, Role, BlacklistedClients`) | Yes — same PII shape |
|
|
| `crates/gateway/src/v1/session_log.rs` SessionRecord with prompt/artifact | `internal/validator/session_log.go` SessionRecord with `Prompt string`, `Attempts[].Raw string`, `Artifact map[string]any` | Yes — same shape |
|
|
| `outcomes.jsonl` / `overseer_corrections.jsonl` writers | Not present on Go side (Go validatord doesn't currently run the execution_loop with tool dispatch — Phase 4 of Go rewrite is what wires that) | **Asymmetric** — Go side writes less but only because feature parity isn't done |
|
|
| MCP tools registry | Not on Go side yet (mcp-server is still Bun) | **Bun is the surface** for tool dispatch in both runtimes today |
|
|
|
|
**Cross-runtime audit implication:** even though our 5 parity probes (validator/extract_json/session_log/materializer/embed) all pass 32/32, **none of them assert PII handling**. A new `pii_parity.sh` probe would feed identical PII-tagged input through both runtimes and assert identical redaction behavior. As of today, neither runtime *does* redaction at the substrate level, so the probe would just confirm "both leak identically."
|
|
|
|
---
|
|
|
|
## §5 — Mapping back to AUDIT_TRAIL_PRD §3 surface map
|
|
|
|
Updated cells with file:line evidence:
|
|
|
|
| Decision happens at | Currently logged where | Audit-completeness gap (revised) |
|
|
|---|---|---|
|
|
| Ingestion (candidate added to pool) | `data/datasets/workers_*.parquet` rows; ingestd writes via `crates/ingestd/`; no per-subject "added at" event journal found | **GAP** — no subject-tagged ingest event. When + who + how a candidate entered is not auditable per-subject. |
|
|
| Embedding creation | `crates/aibridge/src/client.rs` (post-sidecar-drop direct path); LRU cache `crates/aibridge/src/cache.rs` (commit `150cc3b`); cached entries are keyed by `(model, text)` — **the text is the cached body, which for candidate embeddings IS PII (candidate name + role + skills appear in source text per workers_500k_v9 build SQL).** Cache itself is in-memory, not persisted, but text-as-key means PII is in process memory. | **MAJOR GAP** — no per-embedding audit row. Cache key contains PII. No subject tagging means we cannot answer "what embedding was generated for candidate X." |
|
|
| Search inclusion | Tool result entries in execution log + Langfuse + outcomes.jsonl `fills` field | Partial — fills are logged, but you have to grep across outcomes.jsonl + Langfuse for any mention of a candidate_id. No "all search results that included X" canonical view. |
|
|
| Search rank | Result set in chat traces (Langfuse), not indexed by candidate | Partial — Langfuse trace has the result set; no inversion (candidate → ranks-received). |
|
|
| Fill recommendation | `outcomes.jsonl` `fills` array with name + candidate_id | Partial — present, but mixed with PII in the natural-language `operation` field. |
|
|
| Validation outcome | `sessions.jsonl` per `[gateway].session_log_path` config; per-attempt `verdict_kind`, `error` | Partial — works per-session, not per-subject. To find "all validations that touched candidate X" you'd need to grep the JSONL by candidate_id. |
|
|
| Iterate retry escalations | Same sessions.jsonl `attempts[]` array | Same as above |
|
|
| Observer signals | `data/_observer/ops.jsonl` (file not present today on this box, writer is wired) | UNKNOWN content shape until the writer fires for a fill scenario — needs verification |
|
|
| Matrix-indexer compounding | `data/_pathway_memory/state.json` — fingerprinted by code, NOT by subject | **NO subject leak in fingerprint structure** (good for audit defensibility). Trace bodies (`reducer_summary`, `final_verdict`) MAY carry PII — needs sampling to confirm. |
|
|
|
|
---
|
|
|
|
## §6 — Mapping back to AUDIT_TRAIL_PRD §7 current-state-vs-target gap table
|
|
|
|
| Capability | Phase 1 finding | Status |
|
|
|---|---|---|
|
|
| candidate_id as canonical token | Stable `CAND-NNNNNN` format. Same column lives in same parquet as PII. | Token exists; isolation does not. |
|
|
| Identity service | Doesn't exist. PII + candidate_id co-located in `workers_500k.parquet`. | Real gap — needs new service. |
|
|
| `/audit/subject/{id}` endpoint | Doesn't exist. No audit route in `crates/gateway/src/v1/mod.rs` route listing. | Real gap — needs new endpoint. |
|
|
| Subject-tagged embeddings | LRU cache `(model, text)` keys; text contains PII; no audit row per embed. | Real gap. |
|
|
| Subject-tagged search results | Langfuse trace contains result set; not subject-indexed. | Real gap (queryability, not capture). |
|
|
| Subject-tagged validation outcomes | Yes — sessions.jsonl carries candidate_id (in the attempt's `raw` field) but not as a queryable top-level field. | Partial — needs subject_id top-level promotion. |
|
|
| Subject-tagged matrix indexer entries | Pathway fingerprints are subject-agnostic by design. Trace bodies may leak. | Decision needed (PRD open question 7) — keep code-only OR risk PII surface. |
|
|
| Protected-attribute filter at decision time | **Not enforced anywhere.** SQL templates return whatever columns they SELECT; no protected-attribute removal at the gateway boundary. The `candidates` table schema in the demo SQL includes age (years_experience is a proxy), and the call_log/email_log tables likely contain free-text correspondence. | Real gap — major. Requires both schema audit + boundary enforcement. |
|
|
| Retention policy | None enforced. Append-only files grow indefinitely. | Real gap. |
|
|
| Right to be forgotten | Not implementable today on append-only logs without cryptographic erasure. | Per PRD §6 design, real engineering needed. |
|
|
| Cross-runtime parity | 5 algorithm probes pass; 0 audit/PII probes exist. | Probe set needs extension. |
|
|
|
|
---
|
|
|
|
## §7 — Worked example: John Martinez audit trail TODAY (negative result)
|
|
|
|
If John Martinez (candidate_id `CAND-042195`) requests an audit:
|
|
|
|
1. **Find his candidate_id.** Manual SQL: `SELECT candidate_id FROM candidates WHERE first_name='John' AND last_name='Martinez'` — returns 1+ rows. Already a leak: someone with SQL access can correlate name → candidate_id ad-hoc.
|
|
|
|
2. **Find every fill scenario that included him.** `grep "CAND-042195" data/_kb/outcomes.jsonl` — returns rows where his fill was included. But the row's natural-language `operation` field may NOT contain his candidate_id (it's the fill request, e.g. "fill: Welder x2 in Toledo, OH"), so we'd miss scenarios where he was a *candidate* but not a *fill*. To find those, we'd need to grep Langfuse traces (off-box) or the per-tool result content in execution logs (which aren't persisted as separate JSONL — they live in the HTTP response that already left the building).
|
|
|
|
3. **Find every validation that touched him.** Grep `/tmp/lakehouse-validator/sessions.jsonl` for `CAND-042195` — would catch FillValidator phantom-ID rejections AND successful fills, but the candidate_id would appear in either the prompt or the raw attempt text, not as a top-level queryable field.
|
|
|
|
4. **Find every embedding generated for him.** Cache is in-memory; nothing persisted. Cannot answer.
|
|
|
|
5. **Find every search result that ranked him.** Off-box in Langfuse. Untrieable in lakehouse without a separate Langfuse query pipeline (which doesn't exist).
|
|
|
|
6. **Find pathway memory traces involving him.** Pathway fingerprints don't carry his ID. The `reducer_summary` strings might mention him (need to grep state.json) but the fingerprint search wouldn't surface them.
|
|
|
|
7. **Show what protected attributes were exposed to the model.** No record of input_features per decision — the LLM saw whatever the SQL returned, and we have no per-decision input_features audit row.
|
|
|
|
8. **Format the output for legal.** Even if we collected all the above, there's no signing, no integrity hash, no schema, no template.
|
|
|
|
**Estimated time to produce a complete-and-defensible response today: not possible.** Estimated time to produce an INCOMPLETE response by cobbling JSONLs + Langfuse exports: 2-5 hours per request, manual, error-prone, and the response would over-share (other candidates appearing in the same fill scenarios) AND under-share (embedding events, search rankings, pathway traces missed).
|
|
|
|
This is the Phase 1 result the AUDIT_TRAIL_PRD predicted: today's substrate is not production-ready for a discrimination-defense audit response.
|
|
|
|
---
|
|
|
|
## §8 — What this discovery DID NOT cover
|
|
|
|
Phase 1 was scoped to file:line evidence + sampling of live JSONL state. The following deserve their own subsequent walks before phase 2+ design:
|
|
|
|
1. **Live sample of `data/_pathway_memory/state.json`** — confirm whether `reducer_summary` / `final_verdict` strings actually leak PII or stay generic. Read 3-5 traces and grep for names from `workers_500k`.
|
|
2. **Live sample of Langfuse traces** — confirm input message PII for a real fill scenario from past 7 days. Use Langfuse `:3001` query API.
|
|
3. **observerd event content** — the writer is wired but `data/_observer/ops.jsonl` doesn't exist on this box. Trigger a fill scenario and inspect the resulting event.
|
|
4. **Bun mcp-server tool dispatch** — does the Bun server log tool calls anywhere? `mcp-server/index.ts` is 2900+ lines; partial walk only here.
|
|
5. **bot/propose.ts** — the bot proposal flow likely touches candidates; not walked in this pass.
|
|
6. **`crates/journald` mutation log** — designed for row-level mutations per ADR-012; haven't confirmed whether candidate-table mutations land here with PII.
|
|
7. **Go side observerd + chatd PII surface** — the Go cmd/* binaries likely have analogous logging; confirmed validator parity but didn't walk observer/chat logging.
|
|
8. **Process memory + crash dumps** — if the gateway dumps core, what PII is in it? Out of scope for code walk; comes up in security audit.
|
|
9. **Operator runbook** — who has access to logs/, /tmp/lakehouse-validator/, MinIO buckets, Langfuse Postgres? Out of scope for code walk; comes up in operational security review.
|
|
|
|
These are listed so the next phase doesn't accidentally re-walk what's done OR skip what wasn't covered.
|
|
|
|
---
|
|
|
|
## §9 — Recommended next moves (not commitments)
|
|
|
|
Per AUDIT_TRAIL_PRD §8, Phase 2 is the identity service design doc. Before that doc gets written, the cheapest high-value moves discovered here:
|
|
|
|
1. **Defense-layer enforcement (1-2 hours work).** Rewrite the 3 tool SQL templates in `crates/gateway/src/tools/registry.rs` to use `_safe` views. Add a unit test that asserts no `crates/gateway/src/tools/registry.rs` template references `FROM candidates ` or `FROM workers_500k ` (only `*_safe`). This is one commit. It prevents the most-trafficked PII leak path TODAY without waiting on the identity service. **Cost:** the LLM sees masked candidate_ids (`CAN...95` instead of `CAND-042195`); some downstream tools (validator existence check) would need a "resolve to full ID" path that goes through the identity service — but that's exactly the architectural shape PRD §5 wants anyway.
|
|
|
|
2. **Sample state.json + Langfuse before phase 2 design.** §8.1 + §8.2 above. ~30 minutes. Either confirms or refutes the "matrix indexer is subject-clean" finding from §1F.
|
|
|
|
3. **Document the Bun mcp-server tool surface.** §8.4. The Bun layer is a major PII transit point not fully covered here.
|
|
|
|
4. **Identify whether protected attributes (age proxies, photo features, zip-code → race correlations) are currently in any tool-returned column.** Schema-level audit of `candidates` + `workers_500k`. ~1 hour. Might surface that some "neutral" columns are actually protected-attribute proxies.
|
|
|
|
These four moves give the phase-2 design doc strong evidence to lean on. None are commitments — J's call on what to do next.
|
|
|
|
---
|
|
|
|
## §10 — Cross-lineage scrum review of this discovery doc (2026-05-03)
|
|
|
|
After the discovery walk, this document was reviewed by three independent model lineages via `/v1/chat` against the Go gateway (post-PR-#13 model fleet): **opus** (`opencode/claude-opus-4-7`), **kimi** (`ollama_cloud/kimi-k2.6`), **gemini** (`ollama_cloud/gemini-3-flash-preview`). Custom prompt: senior security architect reviewing a discovery report. (DeepSeek-v3.2 timed out; not included.)
|
|
|
|
Verbatim reviews saved at `/tmp/audit_scrum/{opus,kimi,gemini}_review.md`. Convergent findings (≥2 reviewers) are treated as high signal per `feedback_cross_lineage_review.md`.
|
|
|
|
### Convergent findings — must address before phase 2 design
|
|
|
|
**C1. §1F matrix-indexer claim is OVER-CLAIMED (3/3 reviewers).** The TL;DR #7 line "good for audit defensibility (matrix index can't drive discrimination)" overstates structural subject-agnosticism as behavioral fairness. Per opus: "until §8.1 is executed, the correct framing is 'fingerprint structurally subject-agnostic; body content unverified — treat as suspected PII sink.'" Per kimi: dangerous reasoning — if `reducer_summary` says "candidates named Emily Garcia were rejected for fill capacity," the matrix learns proxy variables and future similar names get downranked = temporal discrimination risk. Per gemini: "you cannot claim audit defensibility until you prove the *content* of the matrix indexer doesn't contain PII." **Action:** walk back the §1F TL;DR claim; reframe as "fingerprint structure is defensive; body content unverified — treat as suspected PII sink until §8.1 confirms."
|
|
|
|
**C2. §1E (Langfuse) is the MOST DANGEROUS leak — fix FIRST (3/3 reviewers).** Opus: "boundary-crossing leak that makes a regulator's eyes light up." Kimi: "Article 44 GDPR transfer if SaaS-hosted; CPRA 'sale/sharing' question; subprocessor notification failure." Gemini: "un-certifiable for SOC2 Type II" + "unauthorized data transfer to third-party storage tier." All three would do Langfuse redaction/sampling BEFORE the §9.1 view-routing fix. **Action:** revise §9 priority order — Langfuse boundary first, view-routing second.
|
|
|
|
**C3. Discrimination defense requires the FULL CANDIDATE POOL, not just fills (3/3 reviewers).** Opus: "you need not just 'what did we do for candidate X' but 'what was the selection rate for protected class Y vs comparator' — that requires capturing protected attributes WITH outcomes." Kimi: "EEOC Uniform Guidelines on Employee Selection Procedures (1978) — matrix index that learns from historical fill outcomes IS a selection procedure under the Guidelines." Gemini: "The system doesn't log the *entire* candidate pool for a specific search — only the fills (§5). To defend a lawsuit, you must show the stats for everyone who *could* have been picked, not just the person who was." **Action:** Phase 1 didn't capture this load-bearing requirement. PRD §1 worked example needs expanding: the audit response must include the comparator pool + adverse-impact statistics, not just the subject's decision row.
|
|
|
|
**C4. BIPA + biometric exposure UNDERSTATED (3/3 reviewers).** Already in `AUDIT_TRAIL_PRD.md` §10.5 jurisdictional checklist but NOT translated into Phase 1 findings actionables. If `workers_500k` columns include photo paths, video interview references, or anything that could yield biometric inference (per gemini: "even descriptors that could be reconstructed into biometric templates"), BIPA's $1k-$5k per-violation regime applies BEFORE the GDPR analysis matters. **Action:** add to §8 (what discovery did NOT cover): explicit photo/video/biometric column audit of `workers_500k` schema.
|
|
|
|
**C5. `candidate_id` must be PROMOTED to top-level field in all JSONL sinks (3/3 reviewers).** Opus + kimi + gemini converge: grepping natural-language strings (operation, raw, prompt, reducer_summary) for candidate_id is not a defensible audit strategy. Even if subject_id appears in those strings TODAY, it appears co-mingled with other candidate names, model reasoning, etc. — making subject filtering unsafe. **Action:** add to PRD §7 target column "subject_id top-level promotion" — change session_log writer + outcomes.jsonl writer + observer event writer to ALL include `subject_id` (or `subject_ids[]`) as a first-class top-level field.
|
|
|
|
### Single-reviewer findings — verified, worth incorporating
|
|
|
|
**OPUS unique:**
|
|
- **8th PII path missing: LLM provider egress.** PR #13 routes models through opencode + ollama_cloud + openrouter — opencode and openrouter are external services. Cross-border data transfer under GDPR Ch. V; third-party processor relationship requiring DPA under Art. 28. Phase 1 did not enumerate this path. **Action:** add §1G "LLM provider egress" to §1.
|
|
- **GDPR Art. 22 / EU AI Act right to explanation.** Audit must capture the model's REASONING CHAIN, not just decision output. Phase 1 §7 worked example doesn't include this. **Action:** add to subject-audit response shape (PRD §2).
|
|
- **Special-category data under GDPR Art. 9.** resume_text + call_log + email_log routinely contain health (accommodation requests), union, religion. Higher legal bar — Art. 9(2) explicit consent required. Phase 1 §6 mentions these tables exist but doesn't flag the special-category exposure. **Action:** call out in §6.
|
|
- **DPIA / ROPA / DPA inventory.** None of these documents referenced. Some may exist outside code; Phase 1 should at minimum note their absence as Phase 1.5 input.
|
|
|
|
**KIMI unique:**
|
|
- **Sequential `CAND-NNNNNN` IDs enable enumeration attacks.** Predictable IDs let an attacker scan the candidate space. Security finding Phase 1 didn't flag.
|
|
- **Langfuse retention config unaudited.** Default trace retention is 30 days in some versions, indefinite in others. Directly impacts RTBF analysis. **Action:** check live Langfuse config.
|
|
- **CCPA "de-identified in place" may be faster than crypto-shred.** Replacing PII with `REDACTED-{hash}` while preserving log structure may satisfy CPRA's de-identified exception. Worth considering vs. crypto-shred.
|
|
- **Bun MCP server is the cross-runtime bridge — likely COMMON-MODE failure.** Phase 1 framed Go side as "mirrors Rust pattern" implying independent failure; it's likely shared infrastructure failure. **Action:** add to §4 — common-mode reframe.
|
|
- **Cryptographic integrity for the audit trail itself.** Merkle trees / signed logs / chain-of-custody under FRE 901. Opposing expert can challenge admissibility without this. **Action:** add to PRD §2 audit response shape — integrity-signed.
|
|
- **HIPAA Business Associate Agreement scoping not done.** If any candidate is healthcare-vertical, BAA analysis required.
|
|
- **`get_candidate` SELECT * has 40+ load-bearing columns.** §9.1 "1-2 hours" estimate for view rewrite is irresponsible without scoping downstream consumers. **Action:** revise §9.1 estimate; flag dependency analysis as prerequisite.
|
|
|
|
**GEMINI unique:**
|
|
- **Data residency — JSONL on US box.** If any candidate is EU resident, GDPR violation without SCC/DPF. Phase 1 didn't ask whether IL+IN target market includes EU residents (probably not, but staffing-co clients sometimes have international placements).
|
|
- **"Culture fit" reasoning strings as discrimination proxies.** Common LLM-generated phrases ("not a culture fit," "communication concerns," "team chemistry") often correlate with protected-attribute discrimination. Phase 1 didn't audit the actual reasoning text in `outcomes.jsonl` for these phrases.
|
|
- **Comparator-pool snapshot for every fill.** Need to capture WHO COULD HAVE BEEN PICKED, not just who was. **Action:** PRD §2 audit response shape needs `comparator_pool` field per decision.
|
|
|
|
### Revised §9 — recommended next moves (reordered by scrum convergence)
|
|
|
|
1. **(NEW PRIORITY 1, was P3) Sample state.json + Langfuse content** — confirm/refute the matrix-indexer subject-clean claim and quantify Langfuse PII exposure. Cheapest move that resolves the over-claim AND informs the boundary-leak fix.
|
|
2. **(NEW PRIORITY 2, was P1) Defense-layer enforcement at SQL template level** — rewrite tool registry to use `_safe` views. **Estimate revised UPWARD per kimi**: scope `get_candidate` SELECT * downstream consumers first; estimate 4-8 hours including the existence-check resolution path through the (not-yet-built) identity service. Stop-gap until then: add LLM-attribution flag to queryd, refuse `FROM candidates `/`FROM workers_500k ` queries that originate from tool dispatch.
|
|
3. **(NEW) Langfuse boundary audit + redaction** — sample retention config, check DPA status, scope a redaction/sampling layer that strips PII from message arrays before the Langfuse POST. This is the boundary-crossing leak — fix BEFORE chasing internal sinks.
|
|
4. **(NEW) Subject_id top-level promotion to all JSONL writers** — single architectural change spanning Rust + Go session_log + observerd event writer + outcomes/corrections appenders. Makes subject-correlation queries defensible (no more grepping natural language strings).
|
|
5. **(was P4) Schema-audit for protected-attribute proxies** — extend to include "culture fit"-shaped reasoning text in outcomes.jsonl + the comparator-pool retention requirement.
|
|
6. **(NEW) BIPA-specific audit of workers_500k schema** — explicit photo/video/biometric column inventory before any production deployment in IL.
|
|
7. **(NEW) Operational discovery** — DPIA, ROPA, DPA inventory, SCC for cross-border, Langfuse retention config. Out-of-code-walk; needs J + counsel input.
|
|
|
|
### What I'm walking back
|
|
|
|
§1F TL;DR claim "matrix indexer is good for audit defensibility" — per all 3 reviewers, this is over-claimed without §8.1 verification. The correct frame is "fingerprint structure is subject-agnostic by design; trace body content unverified — treat as suspected PII sink until sampled."
|
|
|
|
§9 ordering — view-routing was P1; per all 3 reviewers, Langfuse boundary should be P3 in front of it. View-routing is the source-side fix; Langfuse is the boundary-crossing fix; both matter, do them in BOUNDARY-FIRST order.
|
|
|
|
§9.1 effort estimate — kimi's "irresponsible without dependency scoping" critique is right. Revised UP to 4-8 hours.
|
|
|
|
---
|
|
|
|
## Change log
|
|
|
|
- 2026-05-03 — Phase 1 discovery walk complete. Findings cited above with file:line references. No code changes. Companion to `AUDIT_TRAIL_PRD.md`.
|
|
- 2026-05-03 — §10 added: cross-lineage scrum review of the discovery doc (opus + kimi + gemini). 5 convergent findings (matrix-indexer over-claim, Langfuse first-priority, comparator pool gap, BIPA understated, subject_id top-level promotion). Plus single-reviewer high-value catches. §9 reordered.
|