lakehouse/docs/EXECUTION_PLAN.md
Commit dbe00d018f (root) — Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms → HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00


Execution Plan — Phases B through E

Created: 2026-04-16
Status: Active planning document — update as phases complete or scope shifts
Owner: J

This plan sequences the work J and Claw agreed on during the 2026-04-16 reframe session, after stress-testing the "dual-use substrate" vision and aligning it with llms3.com's architectural patterns.


The four phases, at a glance

| Phase | Work | Prereq | Estimated cost | Risk |
|---|---|---|---|---|
| B | Lance pilot on one vector index | None | 1 focused session | Medium — new dep, unfamiliar surface |
| C | Decoupled embedding refresh pipeline | Benefits from B's outcome | 1 focused session | Low — additive, doesn't break existing |
| D | AI-safe views | Phase 13 (done) | 1 focused session | Low — builds on existing catalog + tool registry |
| E | Soft deletes / tombstones | None | 1-2 focused sessions | Medium — touches query path, compaction |

Each "focused session" ≈ 3-4 hours of coding + verification + doc update.


Phase B — Lance pilot (the storage format question)

Why now

J's LLMS3 knowledge base explicitly positions Lance as alternative_to Parquet for vector workloads. We admitted in the 2026-04-16 stress test that our Parquet-portability argument for vectors is weaker than advertised — our vector Parquet blobs aren't readable by DuckDB/Polars anyway. Lance could unlock: random-row access, disk-resident indexes, time-travel, better compression. Or it could disappoint and we lock in Parquet with a written reason why.

Commit to one answer backed by measurements. No ambiguity after this phase.

Scope

  1. Add the lance crate as a dependency in vectord, behind a feature = "lance" flag initially, so the default build doesn't force the new dependency on anyone (see the sketch after this list)
  2. New module vectord::lance_store — mirror of vectord::store but against Lance format
  3. New endpoint POST /vectors/lance/index — build a Lance index for a named source (parallel to the existing Parquet path)
  4. Benchmark script that runs against resumes_100k_v2 (the existing 100K reference)
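
A minimal sketch of the feature gating from item 1, under stated assumptions: the Cargo.toml lines are illustrative (crate name and version are placeholders), and lance_store is the hypothetical module from item 2, not an existing API.

```rust
// vectord/src/lib.rs — gate the pilot behind a Cargo feature so a build
// without `--features lance` is exactly today's binary (the rollback path).
//
// In vectord/Cargo.toml (illustrative; version is a placeholder):
//
//   [features]
//   lance = ["dep:lance"]
//
//   [dependencies]
//   lance = { version = "0.x", optional = true }

pub mod store; // existing Parquet-backed store, untouched

#[cfg(feature = "lance")]
pub mod lance_store; // hypothetical mirror of vectord::store against Lance format
```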

Measured dimensions

All benchmarks against resumes_100k_v2 (100K × 768d), cold start:

| Metric | Parquet baseline | Lance | Threshold to migrate |
|---|---|---|---|
| Cold load from disk | ~2.8s (measured) | ? | ≥2× faster |
| Search latency p50 | 873us (ec=80 es=30) | ? | Within 50% |
| Disk size | 330MB | ? | Comparable or better |
| Single-row random access | N/A (full scan) | ? | <10ms |
| Append 10K new rows | Full rewrite (~3s) | ? | Incremental <500ms |

Decision rules

  • Lance wins cold-load by ≥2× AND matches search latency: migrate vector storage to Lance. Dataset tables stay Parquet. Update ADR-008 → ADR-019.
  • Lance within 50% across the board: stay Parquet. Document the ceiling honestly in the PRD (already done). Revisit Lance when we have a problem Parquet can't solve.
  • Lance loses: close the door, don't revisit without new evidence. Write ADR-019 as "Lance evaluated, rejected, here's why with numbers." (These rules are sketched mechanically after this list.)
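
A minimal sketch of the rules above applied mechanically; the Bench struct and field names are illustrative, not a real vectord type, and only the two explicitly thresholded dimensions are modeled.

```rust
/// Illustrative inputs: one row of the measured-dimensions table per format.
struct Bench {
    cold_load_ms: f64,
    search_p50_us: f64,
}

enum Verdict {
    MigrateVectorsToLance, // rule 1: dataset tables stay Parquet, ADR-008 -> ADR-019
    StayParquet,           // rule 2: document the ceiling, revisit only with a new problem
    RejectLance,           // rule 3: ADR-019 records "evaluated, rejected" with numbers
}

fn decide(parquet: &Bench, lance: &Bench) -> Verdict {
    let cold_load_wins = lance.cold_load_ms * 2.0 <= parquet.cold_load_ms;   // >= 2x faster
    let search_matches = lance.search_p50_us <= parquet.search_p50_us * 1.5; // within 50%
    if cold_load_wins && search_matches {
        Verdict::MigrateVectorsToLance
    } else if search_matches {
        Verdict::StayParquet
    } else {
        Verdict::RejectLance
    }
}
```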

Success gate

Benchmark output table posted to docs/ADR-019-vector-storage.md with measured numbers in each cell of the table above. Decision rule applied mechanically. No "let's defer the call."

Rollback

The feature = "lance" flag means if the pilot goes badly, cargo build without the flag is unchanged. No production migration happens until ADR-019 commits to the change. Safe experiment.


Phase C — Decoupled embedding refresh

Why now

llms3.com's lakehouse architecture explicitly separates "transactional data mutations" from "asynchronous vector refresh cycles." Today we couple them — an ingest writes rows AND embeddings in one flow. That means:

  • Adding 1K rows to a 100K-row dataset forces re-embedding of ALL rows (or nothing)
  • No notion of "embeddings are stale, schedule a refresh tonight"
  • The embedding cost (Ollama-bound, the bottleneck) is synchronous with ingest

Scope

  1. Add fields to DatasetManifest (see the sketch after this list):
    • last_embedded_at: Option<DateTime>
    • embedding_stale_since: Option<DateTime> (set when data is written but embeddings are not yet refreshed)
    • embedding_refresh_policy: RefreshPolicy — Manual | OnAppend | Scheduled(cron)
  2. Decouple ingest from embed: ingest writes data + marks embeddings stale; embedding runs separately
  3. New endpoint: POST /vectors/refresh/{dataset} — diffs existing vectors vs current rows, only embeds new/changed (keyed by doc_id)
  4. Background scheduler (or cron trigger) — for Scheduled policy, re-runs refresh per schedule
  5. GET /vectors/stale — lists datasets with stale embeddings and how stale
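
A sketch of the manifest additions and policy enum from item 1, assuming chrono + serde; field and variant names follow the list above, everything else is an assumption.

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

#[derive(Clone, Default, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum RefreshPolicy {
    /// Rollback default: existing datasets keep today's behavior until opted in.
    #[default]
    Manual,
    /// Mark stale and refresh as part of handling each append.
    OnAppend,
    /// Cron expression evaluated by the background scheduler, e.g. "0 2 * * *".
    Scheduled(String),
}

#[derive(Serialize, Deserialize)]
pub struct DatasetManifest {
    // ...existing fields (name, row_count, columns, ...) stay as they are
    pub last_embedded_at: Option<DateTime<Utc>>,
    /// Set when data is written but embeddings have not been refreshed yet.
    pub embedding_stale_since: Option<DateTime<Utc>>,
    pub embedding_refresh_policy: RefreshPolicy,
}
```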

Measured success

  • Ingest a 1K-row append to kb_team_runs (currently 586 rows, Postgres-sourced).
  • Catalog shows embedding_stale_since = now.
  • POST /vectors/refresh/kb_team_runs embeds only the 1K new rows, not all 1586.
  • Result: new rows searchable, old embeddings unchanged, total Ollama time ~5s instead of ~30s.

Dependencies on Phase B

If Lance wins Phase B, this gets dramatically easier — Lance supports native append. If we stay Parquet, we need a "vectors delta" Parquet file that merges at read time (same pattern as Phase 8's data delta files) — roughly 100 extra LOC.

Rollback

The refresh_policy field defaults to Manual for all existing datasets, so no behavior change for anything already in the system. Opt-in per dataset.


Phase D — AI-safe views

Why now

llms3.com's framing: "AI-safe views enforcing row/column security + PII tokenization before model exposure." Phase 13 gave us role-based column masking at query time. That's per-query enforcement. "Views" means defined once, up front: create candidates_safe once, bind model X to that view, and model X can never accidentally see raw candidates.

This is also the precondition for Phase 17 (model profiles) to be meaningfully safe. "Bind model to dataset" isn't enough — needs to be "bind model to a safe view of the dataset."

Scope

  1. New catalog entity: AiView with fields name, base_dataset, columns[] (whitelist), row_filter (optional SQL WHERE clause), column_redactions[] (PII tokenization rules) — see the sketch after this list
  2. Persistence: _catalog/views/{name}.json alongside manifests
  3. Query-rewrite layer: when a query references candidates_safe, DataFusion sees an equivalent SELECT (whitelisted cols) FROM candidates WHERE (row_filter) — with redactions applied as expressions
  4. Endpoint: POST /catalog/views to create, GET /catalog/views to list, GET /catalog/views/{name}/preview to see what it looks like
  5. Tool registry integration: tools can bind to an AiView instead of a raw table; agent invocations go through the view
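
A sketch of the AiView entity (item 1) and the query-rewrite idea (item 3); field names follow the list, while the SQL shape and redaction rule names are assumptions, and redaction expressions are omitted for brevity.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
pub struct ColumnRedaction {
    pub column: String, // e.g. "email"
    pub rule: String,   // illustrative rule name, e.g. "tokenize" or "hash"
}

#[derive(Serialize, Deserialize)]
pub struct AiView {
    pub name: String,               // "candidates_safe"
    pub base_dataset: String,       // "candidates"
    pub columns: Vec<String>,       // whitelist — unlisted columns are invisible
    pub row_filter: Option<String>, // SQL WHERE clause, e.g. "status != 'blocked'"
    pub column_redactions: Vec<ColumnRedaction>,
}

impl AiView {
    /// What the query engine would see when a query references the view name
    /// (redactions would be applied as expressions over the whitelisted columns).
    pub fn to_sql(&self) -> String {
        let cols = self.columns.join(", ");
        match &self.row_filter {
            Some(filter) => format!("SELECT {cols} FROM {} WHERE {filter}", self.base_dataset),
            None => format!("SELECT {cols} FROM {}", self.base_dataset),
        }
    }
}
```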

Measured success

  • Create view candidates_safe = SELECT candidate_id, skills, city FROM candidates WHERE status != 'blocked'.
  • Agent (tool registry) calls search_candidates bound to candidates_safe.
  • Agent cannot see email, phone, ssn, or status='blocked' rows, even if it writes raw SQL.
  • Audit log records agent accessed candidates_safe, not candidates.

Dependencies

  • Phase 13 already provides the sensitivity classification layer
  • Phase 12 tool registry already exists
  • This phase is the bridge between them for agent access

Rollback

Views are additive. Dropping the feature = delete view definitions, tool registry falls back to direct table access. No data migration needed.


Phase E — Soft deletes / tombstones

Why now

GDPR/CCPA compliance for staffing data. Today, ops::delete physically deletes a parquet object — fine for whole datasets, useless for "delete one candidate's record." To delete one row we'd have to rewrite the whole candidates.parquet, which at 100K rows is 10MB of churn per deletion.

llms3.com lists "deletion vectors" as a core lakehouse pattern (Iceberg/Delta/Hudi all implement it). This is the single biggest compliance gap in the current system.

Scope

  1. New sidecar per dataset: {dataset}_tombstones.parquet with columns {row_key, deleted_at, actor, reason} (see the sketch after this list)
  2. Delete API: POST /catalog/datasets/{name}/tombstone with {row_keys[], reason, actor}
  3. Query-time filter: queryd automatically LEFT JOINs tombstones and filters out deleted rows
  4. Compaction integration: Phase 8 compaction reads base + delta + tombstones, writes a clean base without tombstoned rows, clears the tombstone sidecar
  5. Event journal integration (Phase 9): every tombstone emits a journal event with full context
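
A sketch of the tombstone sidecar row (item 1) and one way to express the query-time filter from item 3 as an anti-join; the names and SQL shape are assumptions, not the committed implementation.

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// One row of {dataset}_tombstones.parquet.
#[derive(Serialize, Deserialize)]
pub struct Tombstone {
    pub row_key: String,           // e.g. "CAND-123"
    pub deleted_at: DateTime<Utc>,
    pub actor: String,             // e.g. "legal@company"
    pub reason: String,            // e.g. "GDPR request"
}

/// Hypothetical rewrite queryd could apply before planning: hide tombstoned
/// rows at read time, leaving the base parquet untouched until compaction.
pub fn filter_tombstoned(dataset: &str, key_col: &str) -> String {
    format!(
        "SELECT d.* FROM {dataset} d \
         LEFT JOIN {dataset}_tombstones t ON d.{key_col} = t.row_key \
         WHERE t.row_key IS NULL"
    )
}
```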

Measured success

  • POST /catalog/datasets/candidates/tombstone with {row_keys: ["CAND-123"], reason: "GDPR request", actor: "legal@company"}
  • SELECT COUNT(*) FROM candidates drops by 1 immediately
  • SELECT * FROM candidates WHERE candidate_id = 'CAND-123' returns empty
  • GET /journal/history/CAND-123 shows the tombstone event
  • After scheduled compaction, the tombstone is materialized — candidates.parquet no longer contains CAND-123, and the tombstone sidecar entry for that row key is cleared

Dependencies

  • Phase 8 delta/merge-on-read pattern (done) — tombstones are a third layer at read time
  • Phase 9 event journal (done) — tombstones emit journal events

Rollback

If query rewrite becomes too complex, fallback: tombstones stored but applied only during compaction (not at query time). Queries return deleted rows until compaction runs. Less useful but safer.


Cross-phase concerns

Phases that need federation layer 2 (task #5)

Every phase above assumes the federation foundation (shipped 2026-04-16) but NOT federation layer 2 (cross-bucket SQL, profile activation, X-Lakehouse-Bucket header).

Implication: Phases B-E can proceed on the primary bucket without blocking on federation layer 2. Federation layer 2 becomes valuable when we want multi-profile model scoping (Phase 17). Sequence:

A (done) → B → C (+D in parallel) → federation layer 2 → Phase 16 → Phase 17 → E

Phases that need federation layer 2 FIRST

None of B/C/D/E strictly need it. Phase 16 (hot-swap) benefits from it. Phase 17 (model profiles) depends on it heavily.

What NOT to build in B-E

  • Distributed query — wait for a real scale problem
  • Replacement of DataFusion — working fine, stay put
  • Iceberg/Delta Lake migration — explicitly out of scope per ADR-009
  • Live streaming / CDC — explicit non-goal

Definition of done for each phase

Each phase completes when:

  1. Code shipped and building clean
  2. Success gate measurably passed
  3. Relevant ADR added to docs/DECISIONS.md (or updated)
  4. docs/PHASES.md checkbox flipped with measurement data
  5. PRD invariants checked — if a new invariant emerged, add it
  6. One regression test in the crate or HTTP integration test

Session plan (what to do in what order)

Next session

  1. Phase B — Lance pilot. Single session. Answers the biggest open architectural question.
  2. Based on outcome, write ADR-019 with the decision + data.
  3. Update PHASES.md with Phase 18 status (Lance evaluated).

Session after

  1. Phase C — Decoupled embedding refresh. Implementation shaped by B's outcome (append is easy on Lance, requires delta logic on Parquet).

Session after that

  1. Federation layer 2 OR Phase D (AI-safe views) — J decides based on priority. Federation layer 2 unlocks model profiles (Phase 17); AI-safe views deliver standalone value.

Final session for this track

  1. Phase E — Soft deletes. The compliance-driven phase. Fits cleanly after everything else because it touches the query path and wants to be built after query optimizations stabilize.

Milestone checkpoint

After Phase E, stop and reassess. We'll have:

  • Lance decision made and committed
  • Decoupled embedding pipeline
  • AI-safe view enforcement
  • Soft delete semantics

That's a substantial capability increase. Plausible "pause, write a retrospective, decide on Phase 16/17" moment.