lakehouse/auditor all checks passed (3 findings, all info)
Three artifacts in one PR:
1. docs/PYTHON_INVENTORY.md — every .py file in the repo classified:
Production (sidecar routers + 3 systemd services), Documented
(kb_measure, kb_staffer_report), Manual (one-off tools), Dead
(sidecar/sidecar/lab_ui.py + pipeline_lab.py are genuinely
not imported anywhere).
2. docs/COHESION_INTEGRATION_PLAN.md — the "smarter DB" loop J
called out as missing. Six phases A-F. Phase A ships here; B-F
are named + sequenced for follow-up PRs. Each phase adds ONE
wire of the loop; no single PR does them all.
3. Phase A wire (auditor verdicts → observer + KB):
- auditor/audit.ts: after assembleVerdict, fire-and-forget POST
to :3800/event with source="auditor" AND append to
data/_kb/outcomes.jsonl with kind="audit". Errors log + drop
— the verdict is still on disk at _auditor/verdicts/.
- mcp-server/observer.ts: extend source union to include
"auditor" | "bot" (was "mcp" | "scenario" only, which silently
coerced my first auditor POST to source="scenario"). Accept
body.ok OR body.success. Accept body.audit_duration_ms as a
fallback for duration_ms. Uses body.one_liner as
output_summary when set.
Live-verified after observer restart:
re-audit PR #6 → verdict=request_changes, 4 findings (1 warn)
observer: by_source={'auditor': 1} (previously coerced to 'scenario')
_kb/outcomes.jsonl tail: kind=audit sig=pr6-7fe47bab
pr=6 overall=request_changes
The shape of the loop is now visible to downstream consumers. Phase
B (auditor's kb_query check reads these audit rows for history)
lands in a follow-up PR. Phase C-F similar.
NOT in this PR:
- Actually deleting lab_ui.py + pipeline_lab.py (operator decision,
called out in the inventory doc)
- Cleaning up the 5 overlapping Python scripts (same)
- Phases B-F of the cohesion plan (separate PRs per wire)
- Integration test that asserts "smarter DB" across runs (Phase F)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase J keeps asking for: playbooks know which external docs they
used, get flagged when those docs drift. This commit ships the data
model; context7 bridge + drift check endpoints land in follow-ups.
Added to crates/vectord/src/playbook_memory.rs:
- pub struct DocRef { tool, version_seen, snippet_hash, source_url,
seen_at } — one external doc reference
- PlaybookEntry.doc_refs: Vec<DocRef> — empty on legacy entries,
serde default ensures pre-Phase-45 persisted state loads cleanly
- PlaybookEntry.doc_drift_flagged_at: Option<String> — set by the
(future) drift-check code when context7 reports newer version
- PlaybookEntry.doc_drift_reviewed_at: Option<String> — set by
human via /resolve endpoint after reviewing the diagnosis
- impl Default for PlaybookEntry — collapses most test-helper
constructors from 17 explicit fields to 6-9 fields +
..Default::default()
Updated SeedPlaybookRequest + RevisePlaybookRequest (service.rs) to
accept optional doc_refs: the seed/revise endpoints already take the
field, downstream drift detection (Phase 45.2) consumes it.
Docs: docs/CONTROL_PLANE_PRD.md gains full Phase 45 spec with gate
criteria, non-goals, and risk notes.
Tests: 51/51 vectord lib tests green (same count as before, field
additions are backward-compat).
Memory: project_doc_drift_vision.md written so this keeps coming
back to the front of mind.
Next slices (same phase): context7 HTTP bridge in mcp-server,
/vectors/playbook_memory/doc_drift/check/{id} endpoint, overview-
model drift synthesis writing to data/_kb/doc_drift_corrections.jsonl,
boost exclusion for flagged+unreviewed entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
J flagged that a prior version of this stack had Langfuse traces
piping into the observer + Gitea MCP for repo ops — lost. Adding
these as explicit Phase 40 deliverables alongside routing engine
+ Gemini/Claude adapters.
Findings during scope-check:
- Langfuse container is already running (Up 2 days, langfuse:2,
localhost:3001 healthcheck passes)
- mcp-server/tracing.ts + package.json already have SDK wired
- Credentials pk-lf-staffing / sk-lf-staffing-secret (from env)
- Gitea MCP binary still installed at gitea-mcp@0.0.10
So recovery here is mostly re-connecting existing infra:
1. Add Rust-side Langfuse client for /v1/chat tracing (gateway
currently bypasses tracing, mcp-server already has it)
2. Wire Langfuse → observer :3800 pipe
3. Register Gitea MCP in mcp-server/index.ts tool list
Each landing as part of Phase 40 when the routing engine ships.
Phase 21 — Rust port of scratchpad + tree-split primitives (companion to
the 2026-04-21 TS shipment). New crates/aibridge modules:
context.rs — estimate_tokens (chars/4 ceil), context_window_for,
assert_context_budget returning a BudgetCheck with
numeric diagnostics on both success and overflow.
Windows table mirrors config/models.json.
continuation.rs — generate_continuable<G: TextGenerator>. Handles the
two failure modes: empty-response from thinking
models (geometric 2x budget backoff up to budget_cap)
and truncated-non-empty (continuation with partial
as scratchpad). is_structurally_complete balances
braces then JSON.parse-checks. Guards the degen case
"all retries empty, don't loop on empty partial".
tree_split.rs — generate_tree_split map->reduce with running
scratchpad. Per-shard + reduce-prompt go through
assert_context_budget first; loud-fails rather than
silently truncating. Oldest-digest-first scratchpad
truncation at scratchpad_budget (default 6000 t).
TextGenerator trait (native async-fn-in-trait, edition 2024). AiClient
implements it; ScriptedGenerator test double lets tests inject canned
sequences without a live Ollama.
GenerateRequest gained think: Option<bool> — forwards to sidecar for
per-call hidden-reasoning opt-out on hot-path JSON emitters. Three
existing callsites updated (rag.rs x2, service.rs hybrid answer).
Phase 27 — Playbook versioning. PlaybookEntry gained four optional
fields (all #[serde(default)] so pre-Phase-27 state loads as roots):
version u32, default 1
parent_id Option<String>, previous version's playbook_id
superseded_at Option<String>, set when newer version replaces
superseded_by Option<String>, the playbook_id that replaced
New methods:
revise_entry(parent_id, new_entry) — appends new version, stamps
superseded_at+superseded_by on parent, inherits parent_id and sets
version = parent + 1 on the new entry. Rejects revising a retired
or already-superseded parent (tip-of-chain is the only valid
revise target).
history(playbook_id) — returns full chain root->tip from any node.
Walks parent_id back to root, then superseded_by forward to tip.
Cycle-safe.
Superseded entries excluded from boost (same rule as retired): filter
in compute_boost_for_filtered_with_role (both active-entries prefilter
and geo-filtered path), rebuild_geo_index, and upsert_entry's existing-
idx search. status_counts returns (total, retired, superseded, failures);
/status JSON reports active = total - retired - superseded.
Endpoints:
POST /vectors/playbook_memory/revise
GET /vectors/playbook_memory/history/{id}
Doc-sync — PHASES.md + PRD.md drifted from git after Phases 24-26
shipped. Fixes applied:
- Phase 24 marked shipped (commit b95dd86) with detail of observer
HTTP ingest + scenario outcome streaming. PRD "NOT YET WIRED"
rewritten to reflect shipped state.
- Phase 25 (validity windows, commit e0a843d) added to PHASES +
PRD.
- Phase 26 (Mem0 upsert + Letta hot cache, commit 640db8c) added.
- Phase 27 entry added to both docs.
- Phase 19.6 time decay corrected: was documented as "deferred",
actually wired via BOOST_HALF_LIFE_DAYS = 30.0 in playbook_memory.rs.
- Phase E/Phase 8 tombstone-at-compaction limit note updated —
Phase E.2 closed it.
Tests: 8 new version_tests in vectord (chain-metadata stamping,
retired/superseded parent rejection, boost exclusion, history from
root/tip/middle, legacy default round-trip, status counts). 25 new
aibridge tests (context/continuation/tree_split). Workspace total
145 green (was 120).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
J flagged the audit: "make sure everything flows coherently, no
pseudocode or unnecessary patches or ignoring any particular part of
what we built." This is that pass.
PRD.md updates:
- Phase 19 refinement block — geo-filter + role-prefilter WIRED with
citation density numbers (0.32 → 1.38, and 2 → 28 on same scenario).
- Phase 20 rewrite — mistral dropped, qwen3.5 + qwen3 local hot path,
think:false as the key mechanical finding, kimi-k2.6 upgrade path.
- Phase 21 status block — think plumbing + cloud executor routing
added after original commit.
- Phase 22 item B (cloud rescue) — pivot sanitizer, rescue verified
1/3 on stress_01.
- Phase 23 NEW — staffer identity + tool_level + competence-weighted
retrieval + kb_staffer_report. Auto-discovered worker labels called
out with real numbers (Rachel Lewis 12× across 4 staffers).
- Phase 24 NEW — Observer/Autotune integration gap DOCUMENTED, not
fixed. Observer has been idle at 0 ops for 3600+ cycles because
scenarios hit gateway:3100 directly, bypassing MCP:3700 which the
observer wraps. This is the honest "we're not using it in these
tests" signal J surfaced. Fix deferred; gap visible now.
PHASES.md:
- Appended Phases 20-23 as checked, Phase 24 as unchecked gap.
- Updated footer count: 102 unit tests across all layers.
- Latest line updated with 14× citation lift + 46.4pt tool-asymmetry
finding.
scenario.ts:
- snapshotConfig() was defined but never called. Now fires at every
scenario start with a stable sha256 hash over the active model set +
tool_level + cloud flags. config_snapshots.jsonl finally populates,
which the error_corrections diff path needs to work correctly.
kb.test.ts (new): 4 signature invariant tests — stability across
unrelated fields (date, contract, staffer), sensitivity to role/city/
count changes, digest shape. All pass under `bun test`.
service.rs: 6 Rust extractor tests for extract_target_geo +
extract_target_role — basic, missing-state-returns-none, word
boundary (civilian != city), multi-word role, absent role, quoted
value parse. All pass under `cargo test -p vectord --lib extractor_tests`.
Dangling items now honestly documented rather than silently pending:
- Chunking cache (config/models.json SPEC, not wired) — flagged
- Playbook versioning (SPEC, not wired) — flagged
- Observer integration (WIRED but disconnected) — new Phase 24
Meta-layer over Phase 19 playbook_memory. Phase 19 answers "which
WORKERS worked for this event"; KB answers "which CONFIG worked for
this playbook signature" — model choice, budget hints, pathway notes,
error corrections.
tests/multi-agent/kb.ts:
- computeSignature(): stable sha256 hash of the (kind, role, count,
city, state) tuple sequence. Same scenario shape → same sig.
- indexRun(): extracts sig, embeds spec digest via sidecar, appends
outcome record, upserts signature to data/_kb/signatures.jsonl.
- findNeighbors(): cosine-ranks the k most-similar signatures from
prior runs for a target spec.
- detectErrorCorrections(): scans outcomes for same-sig fail→succeed
pairs, diffs the model set, logs to error_corrections.jsonl.
- recommendFor(): feeds target digest + k-NN neighbors + recent
corrections to the overview model, gets back a structured JSON
recommendation (top_models, budget_hints, pathway_notes), appends
to pathway_recommendations.jsonl. JSON-shape constrained so the
executor can inherit it mechanically.
- loadRecommendation(): at scenario start, pulls newest rec matching
this sig (or nearest).
scenario.ts:
- Reads KB recommendation at startup (alongside prior lessons).
- Injects pathway_notes into guidanceFor() executor context.
- After retrospective, indexes the run + synthesizes next rec.
Cold-start behavior: first run with no history writes a low-confidence
"no prior data" rec so the signal that something was attempted is
captured. Second run gets "low confidence, 0 neighbors" until a third
distinct sig gives the embedder something to compare against — hence
the upcoming scenario generator.
VERIFIED:
- data/_kb/ populated after one scenario run: 1 outcome (sig=4674…,
4/5 ok, 16 turns total), 1 signature, 2 recs (cold + post-run).
- Recommendation JSON-parsed cleanly from gpt-oss:20b overview model.
PRD Phase 22 added with file layout, cycle description, and the
rationale for file-based MVP → Rust port progression that matches
how Phase 21 primitives shipped.
What's NOT here yet (batched follow-ups per J's request, tested
between each):
- Lift the k=10 hybrid_search cap to adaptive k=max(count*5, 20)
- Scenario generator to bulk-populate KB with varied signatures
- Rust re-weighting: push playbook_memory success signal INTO
hybrid_search scoring, not just post-hoc boost
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.
MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
mistral's decoder emitted malformed JSON on complex SQL filters
regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
run_e2e_rated.ts
CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
truncated-JSON → continue from partial as scratchpad; bounded by
budget cap + max_continuations. No more "bump max_tokens until it
stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
thinking ate the budget.
think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
emission. For executor/reviewer/draft: think:false. For T3/T4/T5
overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
Ollama's /api/generate.
VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
08:00 baseline_fill 3/3 4 turns
10:30 recurring 2/2 3 turns (1 playbook citation)
12:15 expansion 0/5 drift-aborted (5-fill orchestration
problem, separate work)
14:00 emergency 4/4 3 turns (1 citation)
15:45 misplacement 1/1 3 turns
→ T3 caught Patrick Ross double-booking across events
→ T3 flagged forklift cert drift on the event that failed
→ Cross-day lesson proposed "maintain buffer of ≥3 emergency
candidates, pre-fetch certs for expansion, booking system
cross-check" — real staffing advice, not generic linter output
PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.
scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability,
partial). Phase 21 exists because LLM Team hit this exact wall — running
multi-model ranking on large context silently truncated, rankings
degraded, no pipeline caught it. The stable answer: every agent call
goes through a budget check against the model's declared context_window
minus safety_margin, with a declared overflow_policy when the check
fails.
config/models.json:
- context_window + context_budget per tier
- overflow_policies block: summarize_oldest_tool_results_via_t3,
chunk_lessons_via_cosine_topk, two_pass_map_reduce,
escalate_to_kimi_k2_1t_or_split_decision
- chunking_cache spec (data/_chunk_cache/, corpus-hash keyed)
agent.ts:
- estimateTokens() chars/4 biased safe ~15%
- CONTEXT_WINDOWS table (fallback; prod reads models.json)
- assertContextBudget() — throws on overflow with exact numbers, can
bypass with bypass_budget:true for callers with their own policy
- Wired into generate() and generateCloud() so EVERY call is checked
scenario.ts:
- T3 lesson archive to data/_playbook_lessons/*.json (the old
/vectors/playbook_memory/seed path was silently failing with HTTP 400
because it requires 'fill: Role xN in City, ST' operation shape)
- loadPriorLessons() at scenario start — filters by city/state match,
date-sorted, takes top-3
- prior_lessons.json archived per-run (honest signal for A/B)
- guidanceFor() injects up to 2 prior lessons (≤500 chars each) into
the executor's per-event context
- Retrospective shows explicit "Prior lessons loaded: N" line
Verified: mistral correctly rejects a 150K-char prompt (7532 tokens
over), gpt-oss:120b accepts it with 90K headroom. The enforcement is
in-band on every call now, not an afterthought.
Full chunking service (Rust) remains deferred to the sprint this feeds:
crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs
The staffing company said: 'we don't have any of that data.'
They're right. We showed a demo with 18-field profiles and they
have a name and a phone number.
This ADR documents the trust path:
Phase 1 (Day 1): Work with name + phone + role. That's enough.
Phase 2 (Week 1-4): Timesheets → reliability. Calls → history.
Phase 3 (Month 2+): AI starts helping with real earned data.
Key principles:
- Never show empty fields or 0% bars
- Show what's THERE, not what's missing
- Trust indicators: 'based on 3 placements' not just 'Reliability: 87%'
- The system earns data by being useful, not by demanding it upfront
Also created sparse_workers dataset (200 workers, 74% have role,
34% have notes, 5 have ONLY name+phone) for realistic testing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
THE REAL PROBLEM: Every new data source produces different doc_id
prefixes in vector indexes (W-, W500K-, W5K-, CAND-). Hybrid search
had to hardcode strip_prefix for each one. New datasets broke hybrid
until someone added another prefix. This violates "any data source
without pre-defined schemas."
THE FIX: IndexMeta.id_prefix — the catalog records what prefix each
index uses. Hybrid search reads it and strips automatically. Legacy
indexes fall back to heuristic stripping. New indexes can set
id_prefix=None to use raw IDs (no prefix, no stripping needed).
This means: ingest a new dataset, embed it, hybrid search works
immediately without code changes. The system is truly source-agnostic.
Also: full ADR document at docs/ADR-020-universal-id-mapping.md
with the three options considered and rationale for the chosen approach.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Status updated to reflect hybrid SQL+vector search, IVF_PQ 0.97
recall, 10K Ethereal worker profiles, autonomous agent validation.
Query Paths section updated with the shipped hybrid endpoint and
its verified zero-hallucination results from the staffing simulation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.
delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
BooleanArray keep-mask per batch (True where row_key_value is NOT
in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success
Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
Parquet readers only see the FIRST footer's data; the rest is invisible.
Latent since Phase 8 shipped; triggered by tombstone-filtering produc-
ing multiple batches. Corrupted candidates.parquet on first test run
(restored from UI fixture copy — good argument for test data in repo).
Fix:
- Single ArrowWriter per compaction, writes every batch into one
properly-footered Parquet
- Snappy compression to match ingest defaults (otherwise rewrite
inflated file 3× — 10.5MB → 34MB — because no compression was set)
- Verify-before-swap: parse written buf back to confirm row count
matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
temp; only then delete delta files. Any error along the way leaves
the original base intact.
TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.
QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.
E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
99997 (the 3 IDs are physically gone from disk)
PRD invariant close: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.
Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
POST /query/compact)
- Compaction of tombstone batch files themselves (they grow at
flush_threshold=1 per tombstone; TombstoneStore::compact exists
but not auto-called)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the self-iteration loop from the PRD reframe: an agent can
tune HNSW configs autonomously and the winner flows through to the
next profile activation without human intervention.
Three primitives:
1. PromotionRegistry (vectord::promotion)
- Per-index current + history at _hnsw_promotions/{index}.json
- promote(index, entry) atomically swaps current, pushes prior
onto history (capped at 50)
- rollback() pops history back onto current; clears current if
history exhausted
- config_or(index, default) — the read side used at build time,
returns promoted config if set else caller's default
- Full cache + persistence; writes are durable on return
2. Autotune (vectord::autotune)
- run_autotune(request, ...) — synchronous agent loop
- Default grid: 5 configs covering the practical range
(ec=20/40/80/80/160, es=30/30/30/60/30) with seed=42 for
reproducibility
- Every trial goes through the existing trial-journal pipeline
so autotune runs land alongside manual trials in the
"trials are data" log
- Winner: max recall first, then min p50 latency; must clear
min_recall gate (default 0.9) or no promotion happens
- Config bounds (ec ∈ [10,400], es ∈ [10,200]) reject absurd
values from the request's optional custom grid
- On winner: promote with note "autotune winner: recall=X p50=Y"
3. Wiring
- VectorState gains promotion_registry
- activate_profile now calls promotion_registry.config_or(...)
so newly-promoted configs are picked up on next activation —
the "hot-swap" is: autotune promotes -> profile activates ->
HNSW rebuilt with new config
- New endpoints:
POST /vectors/hnsw/promote/{index}/{trial_id}
?promoted_by=...¬e=...
POST /vectors/hnsw/rollback/{index}
GET /vectors/hnsw/promoted/{index}
POST /vectors/hnsw/autotune { index_name, harness,
min_recall?, grid? }
End-to-end verified on threat_intel_v1 (54 vectors):
- autogen harness 'threat_intel_smoke' (10 queries)
- POST /autotune -> 5 trials in 620ms, winner ec=20 es=30
recall=1.00 p50=64us auto-promoted
- Manual promote of ec=80 es=30 -> history depth 1
- Rollback -> back to ec=20 es=30 autotune winner
- Second rollback -> current cleared
- Re-promote + restart -> persistence verified
- Profile activation after promotion logged:
"building HNSW ef_construction=80 ef_search=30 seed=Some(42)"
proving the hot-swap loop is closed.
Deferred:
- Bayesian optimization (random-grid is fine at this config-space size)
- Append-triggered autotune (Phase 17.5 — refresh OnAppend policy
can schedule autotune after appending sufficient new rows)
- Concurrent autotune per index guard (JobTracker integration)
PRD invariants satisfied: invariant 8 (hot-swappable indexes) is now
real code — promote is atomic, rollback is always available, the
active generation is a persistent pointer not a runtime convention.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements PRD invariant 9 ("every reader gets its own profile") and
completes the multi-model substrate vision. Local models (or agents)
bind to a named set of datasets; activation pre-loads their vector
indexes into memory; search enforces scope.
Schema (shared::types):
- ModelProfile { id, ollama_name, description, bound_datasets,
hnsw_config, embed_model, created_at, created_by }
- ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a
cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15
trial winner.
- bound_datasets can reference raw dataset names OR AiView names
(both register as DataFusion tables with the same name, so mixing
raw tables and PII-redacted views composes naturally)
Catalog (catalogd::registry):
- put_profile validates id is a slug (alphanumeric + -_ only) and
every binding resolves to an existing dataset or view
- Persistence at _catalog/profiles/{id}.json, loaded on rebuild
- get_profile / list_profiles / delete_profile
HTTP endpoints:
- POST /catalog/profiles (create/update)
- GET /catalog/profiles (list)
- GET/DELETE /catalog/profiles/{id}
- POST /vectors/profile/{id}/activate (HNSW hot-load)
- POST /vectors/profile/{id}/search (scope-enforced)
Activation (vectord::service::activate_profile):
- For each bound dataset, find vector indexes with matching source
- Pre-load embeddings into EmbeddingCache
- Build HNSW with profile's config
- Report warmed indexes + per-binding failures + duration
- Failures on individual bindings don't abort — "substrate keeps
working" per ADR-017
Scoped search (vectord::service::profile_scoped_search):
- Look up profile, verify index.source ∈ profile.bound_datasets
- Returns 403 with allowed bindings list if out-of-scope
- Uses HNSW if index is warm, brute-force cosine otherwise (graceful
degradation — no "must activate first" friction)
Bug fix surfaced during testing: vectord::refresh::try_update_index_meta
was a no-op for first-time indexes, so threat_intel_v1 and
kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't
show up in the index registry. Now it auto-infers the source from the
index name convention (`{source}_vN`) and registers new metadata with
reasonable defaults.
End-to-end verified:
- Created security-analyst profile bound to [threat_intel]
- POST /vectors/profile/security-analyst/activate → warmed
threat_intel_v1 (54 vectors) in 156ms, HNSW built
- Within-scope search: method=hnsw, returned relevant IP indicators
- Out-of-scope: tried to search resumes_100k_v2 (source=candidates)
→ 403 "profile 'security-analyst' is not bound to 'candidates' —
allowed bindings: [\"threat_intel\"]"
- staffing-recruiter profile created bound to candidates + placements;
search without activation fell through to brute_force (graceful)
Deferred (Phase 17 followups):
- VRAM-aware activation (unload-then-load via Ollama keep_alive=0)
— Ollama already handles this; we don't need to reinvent
- Model-identity in audit trail — Phase 13 has role-based audit;
adding model_id is ~20 LOC when we want it
- Profile bucket pre-load (profile:user bucket mount) — Phase 17.5
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.
Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
actor, reason }
- All tombstones for a dataset must share one row_key_column —
enforced at write so the query-time filter remains a single
WHERE NOT IN (...) clause
Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
(POST .../tombstones/compact will work once we expose it)
Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd
HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
{ row_key_column, row_key_values[], actor, reason }
Returns rows_tombstoned count + per-value failure list (207 on
partial success).
- GET same path lists active tombstones with full audit info.
Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
becomes DataFusion view with
SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
(Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk
Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.
Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
doesn't read tombstones during merge. Tombstoned rows are still
on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
tombstone records carry their own actor+reason+timestamp so the
audit trail is intact, but cross-referencing with the mutation
event log would help compliance reporting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
they can never accidentally see PII even if they write raw SQL.
Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }
Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup
Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
redaction expressions per column
- Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
- Hash: digest(value, 'sha256')
- Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
real view, not a query rewrite
HTTP (catalogd::service):
- POST /catalog/views (create)
- GET /catalog/views (list)
- GET /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}
End-to-end test on candidates (100K rows, 15 columns):
candidates_safe view:
columns: candidate_id, first_name, city, state, vertical,
skills, years_experience, status
row_filter: status != 'blocked'
redaction: candidate_id mask(prefix=3, suffix=2)
SELECT * FROM candidates_safe LIMIT 5
-> 8 columns only, candidate_id shown as "CAN******01"
(PII fields email/phone/last_name absent from result)
SELECT email FROM candidates_safe
-> fails (column not in projection)
SELECT email FROM candidates
-> succeeds (raw table still accessible by name —
Phase 13 access control is the gate, not the view itself)
Survives restart — view definitions reload from object storage.
Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate
before persisting; only authenticated admin path should call put_view
- Redaction expressions assume column is castable to VARCHAR; numeric
redactions could be misleading (a Hash on Int64 returns a hex string
that won't equi-join with another hash on the same value type)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three pieces of the multi-bucket federation made real:
1. Catalog migration (POST /catalog/migrate-buckets)
- One-shot normalizer for ObjectRef.bucket field
- Empty -> "primary"; legacy "data"/"local" -> "primary"
- Idempotent; re-running on canonical state is no-op
- Ran on existing catalog: 12 refs renamed from "data", 2 already
"primary", all 14 now canonical
2. X-Lakehouse-Bucket header middleware on ingest
- resolve_bucket() helper extracts header, returns
(bucket_name, store) or 404 with valid bucket list
- ingest_file and ingest_db_stream now route writes per-request
- Defaults to "primary" when header absent
- pipeline::ingest_file_to_bucket records the actual bucket on the
ObjectRef so catalog stays the source of truth for "where does this
data live"
- Verified: ingest with X-Lakehouse-Bucket: testing lands in
data/_testing/, ingest without header lands in data/, bad header
returns 404 with hint
3. queryd registers every bucket with DataFusion
- QueryEngine now holds Arc<BucketRegistry> instead of single store
- build_context iterates all buckets, registers each as a separate
ObjectStore under URL scheme "lakehouse-{bucket}://"
- ListingTable URLs include the per-object bucket scheme so
DataFusion routes scans automatically based on ObjectRef.bucket
- Profile bucket names like "profile:user" sanitized to
"lakehouse-profile-user" since URL host segments can't contain ":"
- Tolerant of duplicate manifest entries (pre-existing
pipeline::ingest_file behavior creates a fresh dataset id per
ingest); duplicates skipped with debug log
- Backward compat: legacy "lakehouse://data/" URL still registered
pointing at primary
Success gate: cross-bucket CROSS JOIN
SELECT p.name, p.role, a.species
FROM people_test p (bucket: testing)
CROSS JOIN animals a (bucket: primary)
LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.
No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.
Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
eval sets per-profile)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).
Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
writes combined index, clears stale flag
Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
re-embed); stale_cleared = true
Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
operators can see which datasets expect what, but the cron itself is
separate
Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against
our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions.
Results (see docs/ADR-019-vector-storage.md for full scorecard):
Cold load: Parquet 0.17s vs Lance 0.13s (tie — not ≥2× threshold)
Disk size: 330.3 MB vs 330.4 MB (tie)
Search p50: 873us vs 2229us (Parquet 2.55× faster)
Search p95: 1413us vs 4998us (Parquet 3.54× faster)
Index build: 230s (ec=80) vs 16s (IVF_PQ) (Lance 14× faster)
Random access: 35ms (scan) vs 311us (Lance 112× faster)
Append 10K rows: full rewrite vs 0.08s/+31MB (Lance structural win)
Decision (ADR-019): hybrid, not migrate-or-reject.
- Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is
2.55× faster than Lance IVF_PQ at 100K in-RAM scale
- Lance joins as second backend per-profile for workloads where it wins
architecturally: random row access (RAG text fetch), append-heavy
pipelines (Phase C), hot-swap generations (Phase 16, 14× faster
builds), and indexes past the ~5M RAM ceiling
- Phase 17 ModelProfile gets vector_backend: Parquet | Lance field
- Ceiling table in PRD updated — 5M ceiling now says "switch to Lance"
instead of "migrate" since Lance runs alongside, not instead of
Isolation: lance-bench is a standalone workspace crate with its own dep
tree (Lance pulls DataFusion 52 + Arrow 57 incompatible with main stack
DataFusion 47 + Arrow 55). Kept off the critical path until API is
stable enough to promote into vectord::lance_store.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PHASES.md and project memory updated to reflect actual build state.
Phases 11-14 were built but trackers weren't updated.
Final stats: 11 crates, 30 tests, 16 ADRs, 2.47M rows, 100K vectors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- /read-mem skill: reads PRD, phases, decisions, checks live services
- Updated PHASES.md with all 15 phases tracked
- Updated project_lakehouse.md memory with full context
- Updated CLAUDE.md with project reference
- Skill at ~/.claude/skills/read-mem/ and project level
- Triggers on: "read mem", "project status", "where were we", "catch me up"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table
- ui: dynamic API base URL (same-origin for nginx, port-based for local dev)
- gateway: CORS enabled for cross-origin requests
- nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin
- justfile: ui-build, ui-serve, sidecar, up commands added
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- queryd: SessionContext with custom URL scheme to avoid path doubling with LocalFileSystem
- queryd: ListingTable registration from catalog ObjectRefs with schema inference
- queryd: POST /query/sql returns JSON {columns, rows, row_count}
- queryd→catalogd wiring: reads all datasets, registers as named tables
- gateway: wires QueryEngine with shared store + registry
- e2e verified: SELECT *, WHERE/ORDER BY, COUNT/AVG all correct
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>