PRD v3: future-proofing roadmap — event journal, rich catalog, tool registry

Phases 9-15 designed based on "future regret" analysis:

- Phase 9: Event journal (append-only mutation history — can't retrofit)
- Phase 10: Rich catalog v2 (ownership, sensitivity, lineage, freshness)
- Phase 11: Embedding versioning (model-proof vector layer)
- Phase 12: Tool registry (governed agent actions via MCP)
- Phase 13: Security & access control (field-level, row-level, audit)
- Phase 14: Schema evolution with AI migration rules
- Phase 15+: Federated query, DB connectors, OCR, fine-tuned models

8 design principles: store truth openly, describe richly, never destroy evidence, secure centrally, expose through tools, version everything, unstructured first-class, separate storage/compute/intelligence.

ADR-012 through ADR-016 document the key future-proofing decisions. Updated benchmarks: 2.47M rows, hot cache 9.8x speedup. Updated operating rules: cheap-now/expensive-later decisions get built first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in: parent 0b9da45647 · commit 354c9c4a04

@@ -54,3 +54,28 @@
**Date:** 2026-03-27

**Decision:** The lakehouse is the analytical layer BEHIND operational systems. It ingests exports, not live data. CRM/ATS stays for daily operations.

**Rationale:** Operational systems need single-record CRUD, permissions, UI workflows. The lakehouse answers cross-cutting questions that no single operational system can. They complement, not compete.

## ADR-012: Event journal — append-only mutation history

**Date:** 2026-03-27

**Decision:** Every data mutation (insert, update, delete) is appended to an immutable event journal. The journal stores: entity, field, old_value, new_value, actor, timestamp, source, workspace_id. Events are never modified or deleted.

**Rationale:** This is the single most important future-proofing decision. AI auditability ("why did the agent recommend this candidate?"), compliance ("who changed this PII field?"), and time-travel queries ("what did this record look like on March 1st?") all require mutation history. This is impossible to retrofit — once history is lost, it's gone. The cost to implement is low (append-only Parquet); the cost of NOT implementing grows every day.

## ADR-013: Rich metadata is a product, not a byproduct

**Date:** 2026-03-27

**Decision:** Every dataset in the catalog carries: owner, sensitivity classification, lineage (source_system → ingest → dataset), freshness SLA, description, and tags. Auto-detected where possible, required on manual ingest.

**Rationale:** Datasets without metadata become "mystery files" within months. As data volume grows and AI agents consume data, the metadata layer is what makes the platform discoverable, governable, and trustworthy. Legacy companies that skip this step end up with expensive data swamps instead of data platforms.

## ADR-014: Embedding versioning — model-proof vector layer

**Date:** 2026-03-27

**Decision:** Every vector index tracks: model_name, model_version, dimensions, created_at. Multiple index versions for the same data coexist. Re-embedding on model upgrade is incremental (only new/changed docs).

**Rationale:** Embedding models improve rapidly. nomic-embed-text today, something better in 6 months. Without version tracking, upgrading means re-embedding the entire corpus. With versioning, you can A/B test new models, migrate incrementally, and maintain backward compatibility.

## ADR-015: Tool registry before raw SQL for agents

**Date:** 2026-03-27

**Decision:** AI agents interact with the system through named, governed tools (search_candidates, update_phone, create_placement) rather than raw SQL. Tools have parameter validation, permission checks, audit logging, and rate limits.

**Rationale:** In 3 years, most data access will be by AI agents, not humans typing SQL. Giving agents raw SQL access is ungovernable — you can't audit, permission, or rate-limit individual operations. Named tools with contracts are the interface that scales. Building the governed interface first prevents the technical debt of retrofitting controls onto raw access.

## ADR-016: Agent workspaces as first-class concept

**Date:** 2026-03-27

**Decision:** Each contract/search gets a named workspace with saved queries, shortlists, activity logs, and delta layers. Workspaces have daily/weekly/monthly tiers and support instant zero-copy handoff between agents.

**Rationale:** Staffing workflows are inherently agent-centric — a recruiter works a contract, builds context, then may need to hand it off. The workspace captures that context in a structured, queryable, transferable format. Without it, handoff means "read the email thread and figure it out."

docs/PRD.md (196 changes)
@@ -173,38 +173,147 @@ Make frequently-accessed data fast, and handle real-time updates without full re

- Compaction (moderate)
- Full ACID isolation (NOT attempting — single-writer model instead)

### Phase 8.5: Agent Workspaces ✅ COMPLETE

Per-contract overlays with daily/weekly/monthly tiers and instant handoff.

- WorkspaceManager with saved searches, shortlists, activity logs
- Zero-copy handoff between agents (pointer swap, not data copy)
- Persisted to object storage, rebuilt on startup

### Phase 9: Event Journal — Never Destroy Evidence

**Principle:** Every mutation is appended, never overwritten. This is the one decision that's impossible to retrofit — once history is lost, it's gone forever.

| Step | Deliverable | Gate |
|---|---|---|
| 9.1 | `journald` crate: append-only event log as Parquet | Every write/update/delete logged with who, when, what, old value, new value |
| 9.2 | Event schema: entity, field, old_value, new_value, actor, timestamp, source, workspace_id | Standardized across all mutations |
| 9.3 | Journal query: `SELECT * FROM journal WHERE entity = 'CAND-001' ORDER BY timestamp` | Full history of any record |
| 9.4 | Replay capability: rebuild any dataset's state at any point in time | Time-travel queries |
| 9.5 | Journal compaction: roll old events into monthly summary Parquet files | Prevent unbounded growth |

**Gate:** Change a candidate's phone number. Query shows the change. Journal shows old value, new value, who changed it, when, and why. Replay to yesterday's state.

**Why now:** In 3 years, compliance, AI auditability, and "why did the agent recommend this candidate" all require mutation history. Adding it later means you only have history from that day forward.

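The append-and-replay mechanics can be sketched in plain Rust. This is a minimal illustration, not the `journald` crate itself: a `Vec` stands in for the append-only Parquet log, and the struct fields follow the 9.2 event schema.

```rust
use std::collections::HashMap;

/// One journal entry, following the Phase 9.2 event schema.
#[derive(Clone, Debug)]
pub struct Event {
    pub entity: String,
    pub field: String,
    pub old_value: Option<String>,
    pub new_value: Option<String>, // None = field deleted
    pub actor: String,
    pub timestamp: u64, // epoch seconds
    pub source: String,
    pub workspace_id: String,
}

/// Append-only journal; in production this would be Parquet on object storage.
#[derive(Default)]
pub struct Journal {
    events: Vec<Event>,
}

impl Journal {
    /// Append is the only write path: events are never modified or deleted.
    pub fn append(&mut self, e: Event) {
        self.events.push(e);
    }

    /// Time-travel (9.4): fold events up to `as_of` into one entity's state.
    pub fn replay(&self, entity: &str, as_of: u64) -> HashMap<String, String> {
        let mut state = HashMap::new();
        for e in self.events.iter().filter(|e| e.entity == entity && e.timestamp <= as_of) {
            match &e.new_value {
                Some(v) => { state.insert(e.field.clone(), v.clone()); }
                None => { state.remove(&e.field); }
            }
        }
        state
    }
}
```

Replaying `CAND-001` at a timestamp before a phone change returns the old number; at a later timestamp, the new one — which is exactly the Phase 9 gate.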
### Phase 10: Rich Catalog v2 — Metadata as Product

**Principle:** Every dataset should be self-describing. A new team member (or AI agent) should understand what data exists, who owns it, how fresh it is, and what's sensitive — without asking anyone.

| Step | Deliverable | Gate |
|---|---|---|
| 10.1 | Catalog schema upgrade: add owner, sensitivity, freshness_sla, description, tags, lineage | `GET /catalog/datasets` returns rich metadata |
| 10.2 | Sensitivity classification: PII, PHI, financial, public, internal | Sensitive fields tagged at ingest |
| 10.3 | Lineage tracking: source_system → ingest_job → dataset → derived_dataset | "Where did this data come from?" answerable |
| 10.4 | Freshness contracts: expected_update_frequency, last_updated, stale_after | Alert when data goes stale |
| 10.5 | Dataset contracts: required columns, type expectations, validation rules | Ingest rejects data that breaks the contract |
| 10.6 | Auto-documentation: AI generates dataset description from schema + sample data | New datasets self-describe on ingest |

**Gate:** Ingest a CSV. System auto-detects PII columns (email, phone, SSN patterns), tags them, generates a description, sets owner, and tracks lineage back to the source file.

**Why now:** Every dataset you ingest without metadata becomes a "mystery file" in 6 months. The metadata layer makes the difference between a searchable knowledge platform and a data graveyard.

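The 10.2 auto-detection can be sketched as a toy heuristic over sampled column values. This is an illustration, not the production classifier; real pattern rules would be more careful than these checks.

```rust
/// Sensitivity classes from step 10.2 (subset for illustration).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Sensitivity { Pii, Internal }

/// Heuristic PII detection over sample values: tag the column PII if a
/// majority of samples look like an email, SSN, or phone number.
pub fn classify_column(samples: &[&str]) -> Sensitivity {
    let looks_pii = |s: &str| {
        let digits = s.chars().filter(|c| c.is_ascii_digit()).count();
        let is_email = s.contains('@') && s.rsplit('.').next().map_or(false, |t| t.len() >= 2);
        // ddd-dd-dddd shape
        let is_ssn = s.len() == 11 && s.as_bytes()[3] == b'-' && s.as_bytes()[6] == b'-' && digits == 9;
        let is_phone = (10..=15).contains(&digits);
        is_email || is_ssn || is_phone
    };
    let hits = samples.iter().filter(|s| looks_pii(s)).count();
    if hits * 2 > samples.len() { Sensitivity::Pii } else { Sensitivity::Internal }
}
```

A column of emails or full phone numbers gets tagged `Pii`; a column of city names stays `Internal`.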
### Phase 11: Embedding Versioning — Model-Proof Vector Layer

**Principle:** Embedding models will change. If you don't track which model created which vectors, upgrading means re-embedding everything from scratch.

| Step | Deliverable | Gate |
|---|---|---|
| 11.1 | Vector index metadata: model_name, model_version, dimensions, created_at | Every index knows its embedding model |
| 11.2 | Multi-version indexes: same data, different models, coexist | Search specifies which model version |
| 11.3 | Incremental re-embed: only new/changed docs get re-embedded on model upgrade | Model swap doesn't require full re-embed |
| 11.4 | A/B search: query both old and new model, compare results | Validate model upgrade before committing |

**Gate:** Upgrade from nomic-embed-text to a new model. Old index still works. New index builds incrementally. Compare search quality. Switch when ready.

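The incremental re-embed decision in 11.3 reduces to a set difference over content hashes. A minimal sketch, with hypothetical names (the real index metadata lives alongside the Parquet files):

```rust
use std::collections::HashMap;

/// Index metadata per step 11.1: every index records the model that built it.
pub struct VectorIndex {
    pub model_name: String,
    pub model_version: u32,
    pub dimensions: usize,
    /// doc id → content hash at the time it was embedded
    pub embedded: HashMap<String, u64>,
}

/// Incremental re-embed (11.3): given current doc hashes, return only the
/// docs this index has never seen or whose content changed since embedding.
pub fn docs_to_reembed(index: &VectorIndex, current: &HashMap<String, u64>) -> Vec<String> {
    let mut out: Vec<String> = current
        .iter()
        .filter(|&(id, hash)| index.embedded.get(id) != Some(hash))
        .map(|(id, _)| id.clone())
        .collect();
    out.sort();
    out
}
```

On a model upgrade, a fresh index starts with an empty `embedded` map, so everything is queued once; afterwards only new or changed docs are re-embedded.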
### Phase 12: Tool Registry — Agent-Safe Business Actions

**Principle:** In 3 years, AI agents won't just query — they'll act. Instead of every agent getting raw SQL access, expose named, governed, audited business actions.

| Step | Deliverable | Gate |
|---|---|---|
| 12.1 | Tool definition: name, description, parameters, permissions, audit_level | `search_candidates(skills, city, min_years)` as a registered tool |
| 12.2 | Tool execution: validates params, checks permissions, logs usage, runs query | Agent calls tool, gets results, action is logged |
| 12.3 | Read vs write tools: read tools are permissive, write tools require confirmation | `get_candidate` = auto-approved, `update_phone` = requires review |
| 12.4 | MCP-compatible interface: expose tools via Model Context Protocol | Any MCP-compatible agent (Claude, GPT, local) can use them |
| 12.5 | Rate limiting + quotas per agent/tool | Prevent runaway agent from overwhelming the system |

**Gate:** An AI agent calls `search_candidates(skills="Python,AWS", city="Chicago", available=true)` → gets results → calls `shortlist_candidate(workspace_id, candidate_id, reason)` → action is logged, auditable, reversible.

**Why now:** The tool interface is cheap to build (it's just named endpoints with validation). But retrofitting audit logging and permission checks onto raw SQL access is a nightmare. Build the governed interface first.

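The validate-then-gate-then-audit flow of 12.1-12.3 can be sketched as follows. All names here are hypothetical stand-ins; the real registry would also carry descriptions, permissions, and rate limits, and speak MCP.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum ToolKind { Read, Write }

/// A registered tool (12.1): a named action with declared parameters.
pub struct Tool {
    pub kind: ToolKind,
    pub required_params: &'static [&'static str],
}

/// Registry that validates parameters, applies the read/write rule from
/// 12.3, and appends every successful call to an audit log (12.2).
#[derive(Default)]
pub struct Registry {
    pub tools: HashMap<&'static str, Tool>,
    pub audit_log: Vec<String>,
}

impl Registry {
    pub fn call(
        &mut self,
        tool: &str,
        params: &HashMap<&str, &str>,
        confirmed: bool,
    ) -> Result<(), String> {
        let t = self.tools.get(tool).ok_or(format!("unknown tool: {tool}"))?;
        for p in t.required_params {
            if !params.contains_key(p) {
                return Err(format!("{tool}: missing param {p}"));
            }
        }
        // Write tools require explicit confirmation; read tools are permissive.
        if t.kind == ToolKind::Write && !confirmed {
            return Err(format!("{tool}: write tool requires confirmation"));
        }
        self.audit_log.push(format!("{tool} {:?}", params));
        Ok(())
    }
}
```

An unconfirmed `update_phone` call is rejected before anything runs, while `search_candidates` goes straight through; both outcomes are explicit, which is the point of tools-over-SQL.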
### Phase 13: Security & Access Control

| Step | Deliverable | Gate |
|---|---|---|
| 13.1 | Field-level sensitivity tags (PII, PHI, financial) in catalog | Sensitive fields identified |
| 13.2 | Row-level access policies (agent A sees their candidates only) | Policy evaluated at query time |
| 13.3 | Column masking (show last 4 of SSN, redact salary for non-managers) | Masked results based on role |
| 13.4 | Query audit log (who queried what, when, which fields) | Every data access recorded |
| 13.5 | Policy-as-code (TOML/YAML rules, not hardcoded) | Non-engineer can update access rules |

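The 13.3 masking rules can be sketched as a single function. The roles and the hardcoded match are placeholders; under 13.5 these rules would come from TOML/YAML policy files instead.

```rust
/// Placeholder roles; real policies come from policy-as-code (13.5).
#[derive(PartialEq, Clone, Copy)]
pub enum Role { Manager, Recruiter }

/// Column masking per 13.3: everyone sees only the last 4 of an SSN,
/// and salary is redacted for non-managers. Other fields pass through.
pub fn mask(field: &str, value: &str, role: Role) -> String {
    match field {
        "ssn" => {
            let digits: Vec<char> = value.chars().filter(|c| c.is_ascii_digit()).collect();
            let last4: String = digits[digits.len().saturating_sub(4)..].iter().collect();
            format!("***-**-{last4}")
        }
        "salary" if role != Role::Manager => "[redacted]".to_string(),
        _ => value.to_string(),
    }
}
```

The masking sits at the query layer, so a recruiter's result set never contains the raw value in the first place.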
### Phase 14: Schema Evolution + AI Migration

| Step | Deliverable | Gate |
|---|---|---|
| 14.1 | Schema diff detection: old schema vs new ingest → list changes | "Column renamed: first_name → full_name" |
| 14.2 | AI-generated migration rules: LLM suggests column mappings | "full_name = concat(first_name, ' ', last_name)" |
| 14.3 | Migration preview: show how old data maps to new schema before applying | Human approves before data transforms |
| 14.4 | Versioned schemas in catalog: v1, v2, v3 coexist | Queries specify version or use latest |

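Step 14.1 is a set comparison over column names. A minimal sketch (rename detection, as in 14.2, would layer on top, e.g. by asking an LLM to map removed columns to added ones):

```rust
use std::collections::HashSet;

/// Schema diff per 14.1: compare cataloged columns against a new ingest
/// and list additions and removals, sorted for stable output.
pub fn schema_diff(old: &[&str], new: &[&str]) -> (Vec<String>, Vec<String>) {
    let old_set: HashSet<&str> = old.iter().copied().collect();
    let new_set: HashSet<&str> = new.iter().copied().collect();
    let mut added: Vec<String> = new.iter()
        .filter(|c| !old_set.contains(**c))
        .map(|c| c.to_string())
        .collect();
    let mut removed: Vec<String> = old.iter()
        .filter(|c| !new_set.contains(**c))
        .map(|c| c.to_string())
        .collect();
    added.sort();
    removed.sort();
    (added, removed)
}
```

A removal paired with an addition (`first_name` gone, `full_name` new) is the signal that triggers the AI mapping suggestion and the human-approved preview in 14.2-14.3.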
### Phase 15+: Horizon

- Federated multi-bucket query (client A's S3 + client B's S3 + yours)
- Database connector ingest (PostgreSQL, MySQL, MSSQL → Parquet via CDC)
- PDF OCR for scanned documents (Tesseract integration)
- Scheduled ingest (cron-based file watching, S3 event triggers)
- Specialized fine-tuned models per domain (staffing matcher, resume parser)
- Multi-node query distribution (DataFusion supports this architecturally)
- Video/audio transcript ingest + multimodal embeddings

---
## Reference Dataset: Staffing Company

Validated with realistic staffing company data:

| Table | Rows | Description |
|---|---|---|
| candidates | 15,000 | Names, phones, emails, zip, skills, resume text, availability |
| clients | 500 | Companies, contacts, verticals, bill rates |
| job_orders | 3,000 | Positions with descriptions, requirements, rates |
| placements | 8,000 | Candidate↔job matches with dates, rates, recruiters |
| timesheets | 120,000 | Weekly hours, bill/pay totals, approvals |
| call_log | 80,000 | Phone CDR — who called whom, duration, disposition |
| email_log | 60,000 | Email tracking — subject, opened, direction |
| **Total** | **286,500** | **7 tables, cross-referenced** |

Scale-tested on a 128GB RAM server:

| Table | Rows | Size | Description |
|---|---|---|---|
| candidates | 100,000 | 10.1 MB | Names, phones, emails, zip, skills, resume text |
| clients | 2,000 | 33 KB | Companies, contacts, verticals |
| job_orders | 15,000 | 0.9 MB | Positions with descriptions, requirements, rates |
| placements | 50,000 | 1.2 MB | Candidate↔job matches with rates, recruiters |
| timesheets | 1,000,000 | 16.7 MB | Weekly hours, bill/pay totals, approvals |
| call_log | 800,000 | 34.3 MB | Phone CDR — who called whom, duration, disposition |
| email_log | 500,000 | 16.0 MB | Email tracking — subject, opened, direction |
| **Total** | **2,467,000** | **79 MB** | **7 tables, cross-referenced** |

Proven queries (initial 286K-row dataset):

- Candidate search by skills + location + availability: 80ms
- Revenue by client with profit margins (JOIN 120K timesheets): 142ms
- Cold lead detection (candidates called 5+ times, never placed): 94ms
- Margin analysis by vertical (JOIN placements → job orders): 53ms
- Natural language → AI-generated SQL → execution → results: ~3s (model inference)

### Benchmarks (2.47M rows)

| Query | Cold (Parquet) | Hot (MemCache) | Speedup |
|---|---|---|---|
| 100K candidate filter (skills+city+status) | 257ms | 21ms | 12x |
| 1M timesheet aggregation + JOIN | 942ms | 96ms | 9.8x |
| 800K call log cross-reference (cold leads) | 642ms | — | — |
| Triple JOIN recruiter performance | 487ms | — | — |
| 500K email open rate aggregation | 259ms | — | — |
| COUNT all 2.47M rows | 84ms | — | — |
| 10K vector semantic search (cosine) | 450ms | — | — |
| Natural language → AI SQL → execute | ~3s | — | (model inference) |

### Vector Search

- 10K candidate resumes embedded in 204s (49 chunks/sec via Ollama)
- Semantic search over 10K vectors: ~450ms (brute-force cosine)
- RAG pipeline: question → embed → search → retrieve → LLM answer with citations
- AI correctly refuses to hallucinate when context doesn't support an answer

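Brute-force cosine search is simple enough to sketch in a few lines; this mirrors the approach benchmarked above (score every vector, keep the top k), though the actual implementation may differ:

```rust
/// Cosine similarity of two equal-length vectors; 0.0 for zero vectors.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Indices of the k most similar corpus vectors, best first.
pub fn top_k(query: &[f32], corpus: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = corpus.iter()
        .map(|v| cosine(query, v))
        .enumerate()
        .collect();
    // Sort descending by similarity. O(n log n) over the whole corpus is
    // fine at 10K vectors; an ANN index (HNSW) comes later if needed.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

The linear scan is the deliberate starting point from the risk table ("start brute-force"); it is exact, trivially correct, and only becomes a problem well past 10K vectors.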
### Agent Workspaces

- Create per-contract workspace with saved searches + shortlists
- Instant handoff between agents — zero data copy
- Full activity timeline preserved across handoffs

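The zero-copy handoff can be sketched with a reference-counted pointer swap. `Rc` here stands in for whatever handle the real WorkspaceManager uses to reference state persisted in object storage; only the assignment moves, never the data.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Workspace state is shared by handle, never duplicated on handoff.
#[derive(Default)]
pub struct WorkspaceState {
    pub saved_searches: Vec<String>,
    pub shortlist: Vec<String>,
    pub activity: Vec<String>,
}

#[derive(Default)]
pub struct WorkspaceManager {
    /// agent id → workspace handle
    pub assignments: HashMap<String, Rc<WorkspaceState>>,
}

impl WorkspaceManager {
    /// Zero-copy handoff: move the handle from one agent to another.
    /// Returns false if `from` holds no workspace.
    pub fn handoff(&mut self, from: &str, to: &str) -> bool {
        match self.assignments.remove(from) {
            Some(ws) => { self.assignments.insert(to.to_string(), ws); true }
            None => false,
        }
    }
}
```

Because the activity log travels with the handle, the receiving agent sees the full timeline immediately — no re-reading of email threads.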
---
@@ -233,16 +342,45 @@ Proven queries:

## Risks

### Technical Risks

| Risk | Severity | Mitigation |
|---|---|---|
| Vector search in Rust at scale | **High** | Start brute-force, evaluate `hora` crate, Qdrant as fallback |
| Incremental updates on Parquet | **High** | Delta files + merge-on-read, NOT full Delta Lake |
| Legacy data messiness | **High** | Conservative schema detection, default to string, user overrides |
| 100K+ embedding timeout | **High** | Async background job with progress, not single HTTP request |
| Schema evolution across ingests | **Medium** | Schema fingerprinting + versioned manifests (Phase 14) |
| Memory pressure from hot cache | **Medium** | LRU eviction, configurable memory limit (tested: 408MB for 1.1M rows) |
| HNSW index persistence | **Medium** | Serialize alongside Parquet, rebuild on startup |
| Python sidecar as bottleneck | **Low** | Can replace with direct Ollama HTTP from Rust later |

### Strategic Risks (Future-Proofing)

| Risk | Impact | Phase |
|---|---|---|
| No mutation history → can't audit AI decisions | **Critical** — compliance, trust | Phase 9 (event journal) |
| No metadata → datasets become mystery files | **High** — onboarding, discovery | Phase 10 (rich catalog) |
| Embeddings locked to one model | **High** — can't upgrade models | Phase 11 (versioning) |
| Raw SQL as only interface → ungoverned agent access | **High** — security, auditability | Phase 12 (tool registry) |
| No sensitivity classification → compliance exposure | **Medium** — grows with data volume | Phase 13 (access control) |
| No schema evolution handling → ingest breaks on format change | **Medium** — grows with source count | Phase 14 (AI migration) |

---
## Design Principles (Future-Proofing)

These are the decisions that still look smart after the stack changes:

1. **Store the truth openly.** Parquet on object storage. No proprietary formats. Any engine can read it.
2. **Describe it richly.** Every dataset has an owner, lineage, sensitivity tags, freshness contract.
3. **Never destroy evidence.** Every mutation is journaled. Rebuild any state at any point in time.
4. **Secure it centrally.** Permissions live in the data layer, not application code.
5. **Expose it through reusable interfaces.** Named tools with contracts, not raw SQL for every consumer.
6. **Version everything.** Schemas, embeddings, models — all versioned, all coexist during migration.
7. **Make unstructured data first-class.** Every document gets: storage, text extraction, entity tags, chunks, embeddings, linkage.
8. **Separate storage from compute from intelligence.** Scale each independently. Replace any layer without touching the others.

---
## Operating Rules

@@ -253,5 +391,7 @@ Proven queries:

4. No silent architecture drift
5. Always work in smallest valid step
6. Always verify before moving on
7. Flag when something is genuinely hard vs just engineering work
8. If a phase reveals the approach is wrong, update the PRD before continuing
9. **Cheap-now, expensive-later decisions get built first** (event journal, metadata, versioning)
10. **Build the governed interface before the raw interface** (tools before SQL for agents)