24 Commits

Author SHA1 Message Date
root
2cac64636c docs: PHASES tracker — mark Phases 42/43/44/45 complete
Today's work shipped four Phase closures (Truth Layer, Validation
Pipeline, Caller Migration, Doc-Drift Detection); the canonical
tracker now reflects them. This lays the foundation for the production
switchover (real Chicago data replaces synthetic test data soon).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 08:03:40 -05:00
root
4251e94531 Update PHASES.md: Phase 41 + Guard fixes
- Phase 41: ProfileType enum, per-type endpoints
- Guard: scrumaudit.py, fixed watcher.sh + pr-reviewer.md
2026-04-23 03:09:05 -05:00
root
55f8e0fe6e Phase 40: Routing Engine + Policy
- RoutingEngine with RouteDecision (model_pattern → provider)
- config/routing.toml: rules, fallback chain, cost gating
- Per-provider Usage tracking in /v1/usage response
- 12 gateway tests green
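
For illustration, a rough sketch of the model_pattern → provider
resolution; the rule/decision shapes and the prefix-wildcard match below
are assumptions, not the actual gateway types:

  // Hypothetical shapes for illustration only, not the real gateway types.
  struct RouteRule { model_pattern: String, provider: String }
  struct RouteDecision { provider: String, fallback: Vec<String> }

  fn route(rules: &[RouteRule], fallback_chain: &[String], model: &str) -> RouteDecision {
      // First matching rule wins; otherwise fall back to the head of the chain.
      let provider = rules.iter()
          .find(|r| pattern_matches(&r.model_pattern, model))
          .map(|r| r.provider.clone())
          .unwrap_or_else(|| fallback_chain.first().cloned().unwrap_or_default());
      RouteDecision { provider, fallback: fallback_chain.to_vec() }
  }

  fn pattern_matches(pattern: &str, model: &str) -> bool {
      match pattern.strip_suffix('*') {
          Some(prefix) => model.starts_with(prefix), // "openrouter/*" style
          None => pattern == model,                  // exact model name
      }
  }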
2026-04-23 02:36:45 -05:00
root
e27a17e950 Phase 39: Provider Adapter Refactor
- ProviderAdapter trait with chat(), embed(), unload(), health()
- OllamaAdapter wrapping existing AiClient
- OpenRouterAdapter for openrouter.ai API integration
- provider_key() routing by model prefix (openrouter/*, etc)
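
Roughly what that seam looks like as a sketch; only the method names come
from this commit, the request/response types and error handling are
placeholders:

  // Placeholder types; the real request/response structs live in aibridge.
  type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;
  struct ChatRequest { model: String, prompt: String }
  struct ChatResponse { text: String }

  // Assumed shape of the ProviderAdapter trait named above (native
  // async-fn-in-trait); the actual signatures may differ.
  trait ProviderAdapter {
      async fn chat(&self, req: ChatRequest) -> Result<ChatResponse>;
      async fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>;
      async fn unload(&self, model: &str) -> Result<()>;
      async fn health(&self) -> Result<bool>;
  }

  // provider_key() routing by model prefix: "openrouter/..." goes to the
  // OpenRouter adapter, everything else stays on the local Ollama adapter.
  fn provider_key(model: &str) -> &'static str {
      if model.starts_with("openrouter/") { "openrouter" } else { "ollama" }
  }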
2026-04-23 02:24:15 -05:00
root
21e8015b60 Phase 37: Hot-swap async + Phase 38: Universal API skeleton
- JobTracker extended with JobType::ProfileActivation + Embed
- activate_profile returns job_id immediately, work spawns in background
- /v1/chat, /v1/usage, /v1/sessions endpoints (OpenAI-compatible)
- Langfuse trace integration (Phase 40 early deliverable)
- 12 gateway unit tests green, curl gates pass
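
The hot-swap pattern in miniature, as a sketch with a stand-in JobTracker
(assumes tokio; the real job types and ID scheme differ):

  use std::collections::HashMap;
  use std::sync::{Arc, Mutex};

  #[derive(Clone, Debug)]
  enum JobStatus { Running, Done }

  #[derive(Clone, Default)]
  struct JobTracker(Arc<Mutex<HashMap<u64, JobStatus>>>);

  impl JobTracker {
      fn set(&self, id: u64, status: JobStatus) {
          self.0.lock().unwrap().insert(id, status);
      }
  }

  // Return the job_id immediately; the expensive activation work runs on a
  // background task and flips the status when it finishes.
  async fn activate_profile(tracker: JobTracker, profile: String) -> u64 {
      let job_id = next_job_id();
      tracker.set(job_id, JobStatus::Running);
      tokio::spawn(async move {
          // ... load embeddings and rebuild HNSW for `profile` here ...
          let _ = &profile;
          tracker.set(job_id, JobStatus::Done);
      });
      job_id
  }

  fn next_job_id() -> u64 {
      use std::time::{SystemTime, UNIX_EPOCH};
      SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64
  }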
2026-04-23 01:56:17 -05:00
profit
a6f12e2609 Phase 21 Rust port + Phase 27 playbook versioning + doc-sync
Phase 21 — Rust port of scratchpad + tree-split primitives (companion to
the 2026-04-21 TS shipment). New crates/aibridge modules:

  context.rs       — estimate_tokens (chars/4 ceil), context_window_for,
                     assert_context_budget returning a BudgetCheck with
                     numeric diagnostics on both success and overflow.
                     Windows table mirrors config/models.json.
  continuation.rs  — generate_continuable<G: TextGenerator>. Handles the
                     two failure modes: empty-response from thinking
                     models (geometric 2x budget backoff up to budget_cap)
                     and truncated-non-empty (continuation with partial
                     as scratchpad). is_structurally_complete balances
                     braces, then JSON-parse-checks. Guards the degenerate
                     case "all retries empty, don't loop on empty partial".
  tree_split.rs    — generate_tree_split map->reduce with running
                     scratchpad. Per-shard + reduce-prompt go through
                     assert_context_budget first; loud-fails rather than
                     silently truncating. Oldest-digest-first scratchpad
                     truncation at scratchpad_budget (default 6000 tokens).
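
In sketch form, the chars/4 estimator and budget check from the
context.rs entry above (the exact BudgetCheck fields are assumptions):

  // chars/4, rounded up: the cheap token estimate described above.
  fn estimate_tokens(text: &str) -> usize {
      (text.chars().count() + 3) / 4
  }

  // Assumed field names; the real BudgetCheck may differ.
  struct BudgetCheck {
      estimated_tokens: usize,
      context_window: usize,
      reserved_for_output: usize,
      fits: bool,
  }

  // Numeric diagnostics are returned on success as well as overflow, so
  // callers can log headroom instead of a bare pass/fail.
  fn assert_context_budget(prompt: &str, context_window: usize, reserve: usize) -> BudgetCheck {
      let estimated_tokens = estimate_tokens(prompt);
      BudgetCheck {
          estimated_tokens,
          context_window,
          reserved_for_output: reserve,
          fits: estimated_tokens + reserve <= context_window,
      }
  }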

TextGenerator trait (native async-fn-in-trait, edition 2024). AiClient
implements it; ScriptedGenerator test double lets tests inject canned
sequences without a live Ollama.

GenerateRequest gained think: Option<bool> — forwards to sidecar for
per-call hidden-reasoning opt-out on hot-path JSON emitters. Three
existing callsites updated (rag.rs x2, service.rs hybrid answer).

Phase 27 — Playbook versioning. PlaybookEntry gained four optional
fields (all #[serde(default)] so pre-Phase-27 state loads as roots):

  version           u32, default 1
  parent_id         Option<String>, previous version's playbook_id
  superseded_at     Option<String>, set when a newer version replaces it
  superseded_by     Option<String>, the playbook_id that replaced it

New methods:

  revise_entry(parent_id, new_entry) — appends the new version, stamps
    superseded_at+superseded_by on the parent, and sets parent_id plus
    version = parent's version + 1 on the new entry. Rejects revising
    a retired or already-superseded parent (tip-of-chain is the only
    valid revise target).
  history(playbook_id) — returns full chain root->tip from any node.
    Walks parent_id back to root, then superseded_by forward to tip.
    Cycle-safe.
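
As a sketch, the chain-stamping rule the two methods above enforce,
trimmed to just the versioning fields (timestamps and storage elided):

  // Only the Phase 27 chain fields; the real PlaybookEntry is much larger.
  #[derive(Clone, Default)]
  struct PlaybookEntry {
      playbook_id: String,
      version: u32,
      parent_id: Option<String>,
      superseded_at: Option<String>,
      superseded_by: Option<String>,
      retired: bool,
  }

  fn revise_entry(
      entries: &mut Vec<PlaybookEntry>,
      parent_id: &str,
      mut new_entry: PlaybookEntry,
  ) -> Result<(), String> {
      let parent = entries
          .iter_mut()
          .find(|e| e.playbook_id == parent_id)
          .ok_or("parent not found")?;
      if parent.retired || parent.superseded_by.is_some() {
          return Err("only a live tip-of-chain entry can be revised".into());
      }
      new_entry.parent_id = Some(parent.playbook_id.clone());
      new_entry.version = parent.version + 1;
      parent.superseded_at = Some("now".into()); // real code stamps an RFC 3339 timestamp
      parent.superseded_by = Some(new_entry.playbook_id.clone());
      entries.push(new_entry);
      Ok(())
  }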

Superseded entries excluded from boost (same rule as retired): filter
in compute_boost_for_filtered_with_role (both active-entries prefilter
and geo-filtered path), rebuild_geo_index, and upsert_entry's existing-idx
search. status_counts returns (total, retired, superseded, failures);
/status JSON reports active = total - retired - superseded.

Endpoints:
  POST /vectors/playbook_memory/revise
  GET  /vectors/playbook_memory/history/{id}

Doc-sync — PHASES.md + PRD.md drifted from git after Phases 24-26
shipped. Fixes applied:

  - Phase 24 marked shipped (commit b95dd86) with details of observer
    HTTP ingest + scenario outcome streaming. The PRD's "NOT YET WIRED"
    note rewritten to reflect the shipped state.
  - Phase 25 (validity windows, commit e0a843d) added to PHASES +
    PRD.
  - Phase 26 (Mem0 upsert + Letta hot cache, commit 640db8c) added.
  - Phase 27 entry added to both docs.
  - Phase 19.6 time decay corrected: was documented as "deferred",
    actually wired via BOOST_HALF_LIFE_DAYS = 30.0 in playbook_memory.rs.
  - Phase E/Phase 8 tombstone-at-compaction limit note updated —
    Phase E.2 closed it.

Tests: 8 new version_tests in vectord (chain-metadata stamping,
retired/superseded parent rejection, boost exclusion, history from
root/tip/middle, legacy default round-trip, status counts). 25 new
aibridge tests (context/continuation/tree_split). Workspace total
145 green (was 120).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:40:49 -05:00
root
137aed64fb Coherence pass — PRD/PHASES updates, config snapshot wired, unit tests
J flagged the audit: "make sure everything flows coherently, no
pseudocode or unnecessary patches or ignoring any particular part of
what we built." This is that pass.

PRD.md updates:
- Phase 19 refinement block — geo-filter + role-prefilter WIRED with
  citation density numbers (0.32 → 1.38, and 2 → 28 on the same scenario).
- Phase 20 rewrite — mistral dropped, qwen3.5 + qwen3 local hot path,
  think:false as the key mechanical finding, kimi-k2.6 upgrade path.
- Phase 21 status block — think plumbing + cloud executor routing
  added after original commit.
- Phase 22 item B (cloud rescue) — pivot sanitizer, rescue verified
  1/3 on stress_01.
- Phase 23 NEW — staffer identity + tool_level + competence-weighted
  retrieval + kb_staffer_report. Auto-discovered worker labels called
  out with real numbers (Rachel Lewis 12× across 4 staffers).
- Phase 24 NEW — Observer/Autotune integration gap DOCUMENTED, not
  fixed. Observer has been idle at 0 ops for 3600+ cycles because
  scenarios hit gateway:3100 directly, bypassing MCP:3700 which the
  observer wraps. This is the honest "we're not using it in these
  tests" signal J surfaced. Fix deferred; gap visible now.

PHASES.md:
- Appended Phases 20-23 as checked, Phase 24 as unchecked gap.
- Updated footer count: 102 unit tests across all layers.
- Latest line updated with 14× citation lift + 46.4pt tool-asymmetry
  finding.

scenario.ts:
- snapshotConfig() was defined but never called. Now fires at every
  scenario start with a stable sha256 hash over the active model set +
  tool_level + cloud flags. config_snapshots.jsonl finally populates,
  which the error_corrections diff path needs in order to work correctly.

kb.test.ts (new): 4 signature invariant tests — stability across
unrelated fields (date, contract, staffer), sensitivity to role/city/
count changes, digest shape. All pass under `bun test`.

service.rs: 6 Rust extractor tests for extract_target_geo +
extract_target_role — basic, missing-state-returns-none, word
boundary (civilian != city), multi-word role, absent role, quoted
value parse. All pass under `cargo test -p vectord --lib extractor_tests`.

Dangling items now honestly documented rather than silently pending:
- Chunking cache (config/models.json SPEC, not wired) — flagged
- Playbook versioning (SPEC, not wired) — flagged
- Observer integration (WIRED but disconnected) — new Phase 24
2026-04-20 23:29:13 -05:00
root
3bc82833ac Update PRD + PHASES.md — reflect 8-commit 2026-04-17 push
PRD status line: "Phases 0-18 shipped; hybrid operational; scheduled
ingest live; PDF OCR live; entering horizon items."

PHASES.md: federation L2 items marked complete, Phase 16.2 (autotune
agent), Phase 17 VRAM gate, MySQL connector, Phase 18 (hybrid Lance),
scheduled ingest, PDF OCR all documented with dates and measurements.

Stats updated: 52+ unit tests, 13 crates, 19 ADRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:54:05 -05:00
root
4e1c400f5d Phase E.2: Compaction integrates tombstones — physical deletion closes GDPR loop
Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.

delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
  BooleanArray keep-mask per batch (True where row_key_value is NOT
  in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
  for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success
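
The keep-mask step in sketch form, for the Utf8 key case only (the
Tombstone plumbing is reduced to a HashSet of deleted keys here):

  use std::collections::HashSet;

  use arrow::array::{Array, BooleanArray, StringArray};
  use arrow::compute::filter_record_batch;
  use arrow::error::ArrowError;
  use arrow::record_batch::RecordBatch;

  fn apply_tombstone_filter(
      batch: &RecordBatch,
      key_column: &str,
      tombstoned: &HashSet<String>,
  ) -> Result<RecordBatch, ArrowError> {
      let idx = batch.schema().index_of(key_column)?;
      let keys = batch
          .column(idx)
          .as_any()
          .downcast_ref::<StringArray>()
          .ok_or_else(|| ArrowError::CastError("expected a Utf8 key column".into()))?;

      // True where the row's key is NOT in the tombstone set, i.e. keep the row.
      let keep: BooleanArray = (0..keys.len())
          .map(|i| Some(keys.is_null(i) || !tombstoned.contains(keys.value(i))))
          .collect();

      filter_record_batch(batch, &keep)
  }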

Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
Parquet readers only see the FIRST footer's data; the rest is invisible.
Latent since Phase 8 shipped; triggered by tombstone-filtering producing
multiple batches. Corrupted candidates.parquet on first test run
(restored from UI fixture copy — good argument for test data in repo).

Fix:
- Single ArrowWriter per compaction, writes every batch into one
  properly-footered Parquet
- Snappy compression to match ingest defaults (otherwise rewrite
  inflated file 3× — 10.5MB → 34MB — because no compression was set)
- Verify-before-swap: parse written buf back to confirm row count
  matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
  temp; only then delete delta files. Any error along the way leaves
  the original base intact.
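
The single-writer shape, sketched (assumes the arrow/parquet/bytes/anyhow
crates; buffered in memory here, whereas the real code streams to the
object store):

  use std::sync::Arc;

  use arrow::record_batch::RecordBatch;
  use bytes::Bytes;
  use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
  use parquet::arrow::ArrowWriter;
  use parquet::basic::Compression;
  use parquet::file::properties::WriterProperties;

  // One writer, one footer: every surviving batch goes through a single
  // ArrowWriter, and the result is parsed back before anything is swapped.
  fn write_and_verify(batches: &[RecordBatch], expected_rows: usize) -> anyhow::Result<Vec<u8>> {
      anyhow::ensure!(!batches.is_empty(), "nothing to write");
      let schema = batches[0].schema();
      let props = WriterProperties::builder()
          .set_compression(Compression::SNAPPY) // match ingest defaults
          .build();

      let mut buf = Vec::new();
      let mut writer = ArrowWriter::try_new(&mut buf, Arc::clone(&schema), Some(props))?;
      for batch in batches {
          writer.write(batch)?;
      }
      writer.close()?; // single, correct footer

      // Verify-before-swap: refuse to overwrite the base file if the fresh
      // buffer's row count doesn't match what we expected to keep.
      let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf.clone()))?;
      let written = reader.metadata().file_metadata().num_rows() as usize;
      anyhow::ensure!(written == expected_rows, "row count mismatch after compaction");
      Ok(buf)
  }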

TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.

QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.

E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
  99997 (the 3 IDs are physically gone from disk)

PRD invariant close: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.

Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
  POST /query/compact)
- Compaction of tombstone batch files themselves (they grow at
  flush_threshold=1 per tombstone; TombstoneStore::compact exists
  but not auto-called)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:38:30 -05:00
root
4d5c49090c Phase 16: Hot-swap generations + autotune agent loop
Closes the self-iteration loop from the PRD reframe: an agent can
tune HNSW configs autonomously and the winner flows through to the
next profile activation without human intervention.

Three primitives:

1. PromotionRegistry (vectord::promotion)
   - Per-index current + history at _hnsw_promotions/{index}.json
   - promote(index, entry) atomically swaps current, pushes prior
     onto history (capped at 50)
   - rollback() pops history back onto current; clears current if
     history exhausted
   - config_or(index, default) — the read side used at build time,
     returns promoted config if set else caller's default
   - Full cache + persistence; writes are durable on return

2. Autotune (vectord::autotune)
   - run_autotune(request, ...) — synchronous agent loop
   - Default grid: 5 configs covering the practical range
     (ec=20/40/80/80/160, es=30/30/30/60/30) with seed=42 for
     reproducibility
   - Every trial goes through the existing trial-journal pipeline
     so autotune runs land alongside manual trials in the
     "trials are data" log
   - Winner: max recall first, then min p50 latency; must clear
     min_recall gate (default 0.9) or no promotion happens
   - Config bounds (ec ∈ [10,400], es ∈ [10,200]) reject absurd
     values from the request's optional custom grid
   - On winner: promote with note "autotune winner: recall=X p50=Y"

3. Wiring
   - VectorState gains promotion_registry
   - activate_profile now calls promotion_registry.config_or(...)
     so newly-promoted configs are picked up on next activation —
     the "hot-swap" is: autotune promotes -> profile activates ->
     HNSW rebuilt with new config
   - New endpoints:
       POST /vectors/hnsw/promote/{index}/{trial_id}
             ?promoted_by=...&note=...
       POST /vectors/hnsw/rollback/{index}
       GET  /vectors/hnsw/promoted/{index}
       POST /vectors/hnsw/autotune  { index_name, harness,
                                      min_recall?, grid? }
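
The promote/rollback/config_or semantics in miniature (in-memory only;
the real registry persists to _hnsw_promotions/{index}.json and carries
trial metadata):

  #[derive(Clone)]
  struct HnswConfig { ef_construction: usize, ef_search: usize, seed: Option<u64> }

  #[derive(Default)]
  struct IndexPromotions {
      current: Option<HnswConfig>,
      history: Vec<HnswConfig>, // most recent last, capped at 50
  }

  impl IndexPromotions {
      fn promote(&mut self, entry: HnswConfig) {
          // Swap current, pushing the prior config onto history.
          if let Some(prev) = self.current.replace(entry) {
              self.history.push(prev);
              if self.history.len() > 50 {
                  self.history.remove(0);
              }
          }
      }

      fn rollback(&mut self) {
          // Pop history back onto current; clears current once history is empty.
          self.current = self.history.pop();
      }

      // The read side used at activation/build time: promoted config if set,
      // otherwise the caller's default.
      fn config_or(&self, default: HnswConfig) -> HnswConfig {
          self.current.clone().unwrap_or(default)
      }
  }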

End-to-end verified on threat_intel_v1 (54 vectors):
- autogen harness 'threat_intel_smoke' (10 queries)
- POST /autotune -> 5 trials in 620ms, winner ec=20 es=30
  recall=1.00 p50=64us auto-promoted
- Manual promote of ec=80 es=30 -> history depth 1
- Rollback -> back to ec=20 es=30 autotune winner
- Second rollback -> current cleared
- Re-promote + restart -> persistence verified
- Profile activation after promotion logged:
  "building HNSW ef_construction=80 ef_search=30 seed=Some(42)"
  proving the hot-swap loop is closed.

Deferred:
- Bayesian optimization (random-grid is fine at this config-space size)
- Append-triggered autotune (Phase 17.5 — refresh OnAppend policy
  can schedule autotune after appending sufficient new rows)
- Concurrent autotune per index guard (JobTracker integration)

PRD invariants satisfied: invariant 8 (hot-swappable indexes) is now
real code — promote is atomic, rollback is always available, the
active generation is a persistent pointer not a runtime convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:26:21 -05:00
root
a293502265 Phase 17: Model profiles + scoped search — the LLM-brain keystone
Implements PRD invariant 9 ("every reader gets its own profile") and
completes the multi-model substrate vision. Local models (or agents)
bind to a named set of datasets; activation pre-loads their vector
indexes into memory; search enforces scope.

Schema (shared::types):
- ModelProfile { id, ollama_name, description, bound_datasets,
                 hnsw_config, embed_model, created_at, created_by }
- ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a
  cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15
  trial winner.
- bound_datasets can reference raw dataset names OR AiView names
  (both register as DataFusion tables with the same name, so mixing
  raw tables and PII-redacted views composes naturally)

Catalog (catalogd::registry):
- put_profile validates id is a slug (alphanumeric + -_ only) and
  every binding resolves to an existing dataset or view
- Persistence at _catalog/profiles/{id}.json, loaded on rebuild
- get_profile / list_profiles / delete_profile

HTTP endpoints:
- POST /catalog/profiles  (create/update)
- GET  /catalog/profiles  (list)
- GET/DELETE /catalog/profiles/{id}
- POST /vectors/profile/{id}/activate  (HNSW hot-load)
- POST /vectors/profile/{id}/search    (scope-enforced)

Activation (vectord::service::activate_profile):
- For each bound dataset, find vector indexes with matching source
- Pre-load embeddings into EmbeddingCache
- Build HNSW with profile's config
- Report warmed indexes + per-binding failures + duration
- Failures on individual bindings don't abort — "substrate keeps
  working" per ADR-017

Scoped search (vectord::service::profile_scoped_search):
- Look up profile, verify index.source ∈ profile.bound_datasets
- Returns 403 with allowed bindings list if out-of-scope
- Uses HNSW if index is warm, brute-force cosine otherwise (graceful
  degradation — no "must activate first" friction)
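
The scope check itself is small; a sketch with illustrative shapes (the
real types live in shared::types and vectord):

  use std::collections::HashSet;

  struct ModelProfile { id: String, bound_datasets: HashSet<String> }
  struct VectorIndex { name: String, source: String }

  enum ScopeError { OutOfScope { allowed: Vec<String> } }

  // The index's source dataset must be one of the profile's bindings;
  // otherwise the handler turns this into a 403 listing the allowed bindings.
  fn check_scope(profile: &ModelProfile, index: &VectorIndex) -> Result<(), ScopeError> {
      if profile.bound_datasets.contains(&index.source) {
          Ok(())
      } else {
          Err(ScopeError::OutOfScope {
              allowed: profile.bound_datasets.iter().cloned().collect(),
          })
      }
  }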

Bug fix surfaced during testing: vectord::refresh::try_update_index_meta
was a no-op for first-time indexes, so threat_intel_v1 and
kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't
show up in the index registry. Now it auto-infers the source from the
index name convention (`{source}_vN`) and registers new metadata with
reasonable defaults.

End-to-end verified:
- Created security-analyst profile bound to [threat_intel]
- POST /vectors/profile/security-analyst/activate → warmed
  threat_intel_v1 (54 vectors) in 156ms, HNSW built
- Within-scope search: method=hnsw, returned relevant IP indicators
- Out-of-scope: tried to search resumes_100k_v2 (source=candidates)
  → 403 "profile 'security-analyst' is not bound to 'candidates' —
    allowed bindings: [\"threat_intel\"]"
- staffing-recruiter profile created bound to candidates + placements;
  search without activation fell through to brute_force (graceful)

Deferred (Phase 17 followups):
- VRAM-aware activation (unload-then-load via Ollama keep_alive=0)
  — Ollama already handles this; we don't need to reinvent
- Model-identity in audit trail — Phase 13 has role-based audit;
  adding model_id is ~20 LOC when we want it
- Profile bucket pre-load (profile:user bucket mount) — Phase 17.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:09:43 -05:00
root
d87f2ccac6 Phase E: Soft deletes (tombstones) for compliance-grade row deletion
Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.

Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
              actor, reason }
- All tombstones for a dataset must share one row_key_column —
  enforced at write so the query-time filter remains a single
  WHERE NOT IN (...) clause

Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
  are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
  (POST .../tombstones/compact will work once we expose it)

Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd

HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
    { row_key_column, row_key_values[], actor, reason }
  Returns rows_tombstoned count + per-value failure list (207 on
  partial success).
- GET same path lists active tombstones with full audit info.

Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
  becomes DataFusion view with
  SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
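
The constructed view SQL described above, sketched as a plain string
builder (quoting is simplified; keys come from our own tombstone store,
not user input):

  fn tombstone_view_sql(name: &str, key_column: &str, deleted_keys: &[String]) -> String {
      let not_in = deleted_keys
          .iter()
          .map(|k| format!("'{}'", k.replace('\'', "''"))) // escape embedded quotes
          .collect::<Vec<_>>()
          .join(", ");
      format!(
          "SELECT * FROM \"__raw__{name}\" \
           WHERE CAST(\"{key_column}\" AS VARCHAR) NOT IN ({not_in})"
      )
  }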

End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
  (Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk

Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.

Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
  doesn't read tombstones during merge. Tombstoned rows are still
  on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
  tombstone records carry their own actor+reason+timestamp so the
  audit trail is intact, but cross-referencing with the mutation
  event log would help compliance reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 09:40:48 -05:00
root
09fd446c8d Phase D: AI-safe views — capability-surface projections over base data
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
they can never accidentally see PII even if they write raw SQL.

Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
           column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }

Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup

Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
  redaction expressions per column
  - Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
  - Hash: digest(value, 'sha256')
  - Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
  real view, not a query rewrite
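
A sketch of how those redaction expressions expand per column; the
function names substr/repeat/digest come from the bullets above, while the
concatenation operator and quoting are illustrative:

  use std::collections::HashMap;

  enum Redaction {
      Null,
      Hash,
      Mask { keep_prefix: usize, keep_suffix: usize },
  }

  fn redaction_expr(col: &str, redaction: &Redaction) -> String {
      match redaction {
          Redaction::Null => format!("CAST(NULL AS VARCHAR) AS \"{col}\""),
          Redaction::Hash => format!("digest(\"{col}\", 'sha256') AS \"{col}\""),
          Redaction::Mask { keep_prefix, keep_suffix } => format!(
              "substr(\"{col}\", 1, {keep_prefix}) || \
               repeat('*', length(\"{col}\") - {keep_prefix} - {keep_suffix}) || \
               substr(\"{col}\", length(\"{col}\") - {keep_suffix} + 1, {keep_suffix}) \
               AS \"{col}\""
          ),
      }
  }

  // Whitelisted columns pass through untouched; redacted ones are rewritten.
  fn view_select(columns: &[String], redactions: &HashMap<String, Redaction>) -> String {
      columns
          .iter()
          .map(|c| match redactions.get(c) {
              Some(r) => redaction_expr(c, r),
              None => format!("\"{c}\""),
          })
          .collect::<Vec<_>>()
          .join(", ")
  }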

HTTP (catalogd::service):
- POST /catalog/views (create)
- GET  /catalog/views (list)
- GET  /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}

End-to-end test on candidates (100K rows, 15 columns):

  candidates_safe view:
    columns: candidate_id, first_name, city, state, vertical,
             skills, years_experience, status
    row_filter: status != 'blocked'
    redaction: candidate_id mask(prefix=3, suffix=2)

  SELECT * FROM candidates_safe LIMIT 5
    -> 8 columns only, candidate_id shown as "CAN******01"
       (PII fields email/phone/last_name absent from result)

  SELECT email FROM candidates_safe
    -> fails (column not in projection)

  SELECT email FROM candidates
    -> succeeds (raw table still accessible by name —
       Phase 13 access control is the gate, not the view itself)

Survives restart — view definitions reload from object storage.

Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
  separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate
  before persisting; only authenticated admin path should call put_view
- Redaction expressions assume column is castable to VARCHAR; numeric
  redactions could be misleading (a Hash on Int64 returns a hex string
  that won't equi-join with another hash on the same value type)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 09:16:44 -05:00
root
24f1249a62 Federation layer 2: header routing + cross-bucket SQL
Three pieces of the multi-bucket federation made real:

1. Catalog migration (POST /catalog/migrate-buckets)
   - One-shot normalizer for ObjectRef.bucket field
   - Empty -> "primary"; legacy "data"/"local" -> "primary"
   - Idempotent; re-running on canonical state is no-op
   - Ran on existing catalog: 12 refs renamed from "data", 2 already
     "primary", all 14 now canonical

2. X-Lakehouse-Bucket header middleware on ingest
   - resolve_bucket() helper extracts header, returns
     (bucket_name, store) or 404 with valid bucket list
   - ingest_file and ingest_db_stream now route writes per-request
   - Defaults to "primary" when header absent
   - pipeline::ingest_file_to_bucket records the actual bucket on the
     ObjectRef so catalog stays the source of truth for "where does this
     data live"
   - Verified: ingest with X-Lakehouse-Bucket: testing lands in
     data/_testing/, ingest without header lands in data/, bad header
     returns 404 with hint

3. queryd registers every bucket with DataFusion
   - QueryEngine now holds Arc<BucketRegistry> instead of single store
   - build_context iterates all buckets, registers each as a separate
     ObjectStore under URL scheme "lakehouse-{bucket}://"
   - ListingTable URLs include the per-object bucket scheme so
     DataFusion routes scans automatically based on ObjectRef.bucket
   - Profile bucket names like "profile:user" sanitized to
     "lakehouse-profile-user" since URL host segments can't contain ":"
   - Tolerant of duplicate manifest entries (pre-existing
     pipeline::ingest_file behavior creates a fresh dataset id per
     ingest); duplicates skipped with debug log
   - Backward compat: legacy "lakehouse://data/" URL still registered
     pointing at primary
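
The header-routing helper, sketched against axum's HeaderMap (the real
resolve_bucket also returns the bucket's ObjectStore handle, elided here):

  use std::collections::HashMap;

  use axum::http::{HeaderMap, StatusCode};

  struct BucketRegistry { buckets: HashMap<String, ()> } // store handles elided

  // X-Lakehouse-Bucket picks the bucket; absence means "primary"; an unknown
  // name yields a 404 with the list of valid buckets as a hint.
  fn resolve_bucket<'a>(
      registry: &'a BucketRegistry,
      headers: &HeaderMap,
  ) -> Result<&'a str, (StatusCode, String)> {
      let requested = headers
          .get("x-lakehouse-bucket")
          .and_then(|v| v.to_str().ok())
          .unwrap_or("primary");

      match registry.buckets.get_key_value(requested) {
          Some((name, _)) => Ok(name.as_str()),
          None => {
              let valid: Vec<&str> = registry.buckets.keys().map(String::as_str).collect();
              Err((
                  StatusCode::NOT_FOUND,
                  format!("unknown bucket '{requested}'; valid buckets: {valid:?}"),
              ))
          }
      }
  }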

Success gate: cross-bucket CROSS JOIN
  SELECT p.name, p.role, a.species
  FROM people_test p          (bucket: testing)
  CROSS JOIN animals a        (bucket: primary)
  LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.

No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.

Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
  eval sets per-profile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:52:32 -05:00
root
97a376482c Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).

Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled

Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.

Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
  writes combined index, clears stale flag
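
The delta selection at the heart of the refresh, sketched; the embed step
is a placeholder closure standing in for the Ollama round-trip:

  use std::collections::HashSet;

  // Only rows whose doc_id is not already in the index get re-embedded.
  fn select_delta<'a>(
      rows: &'a [(String, String)],       // (doc_id, text) read from the Parquet
      already_embedded: &HashSet<String>, // doc_ids present in the existing index
  ) -> Vec<&'a (String, String)> {
      rows.iter()
          .filter(|(doc_id, _)| !already_embedded.contains(doc_id))
          .collect()
  }

  fn refresh_delta(
      rows: &[(String, String)],
      already_embedded: &HashSet<String>,
      embed_batch: impl Fn(&[&str]) -> Vec<Vec<f32>>, // batches of 32 in the real path
  ) -> Vec<(String, Vec<f32>)> {
      let delta = select_delta(rows, already_embedded);
      let mut out = Vec::with_capacity(delta.len());
      for chunk in delta.chunks(32) {
          let texts: Vec<&str> = chunk.iter().map(|(_, text)| text.as_str()).collect();
          for ((doc_id, _), vector) in chunk.iter().zip(embed_batch(&texts)) {
              out.push((doc_id.clone(), vector));
          }
      }
      out
  }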

Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
  text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set

End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
  last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
  re-embed); stale_cleared = true

Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
  per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
  deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
  operators can see which datasets expect what, but the cron itself is
  separate

Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:00:43 -05:00
root
dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00
root
8282842eaf Sync memory + phases: all 15 phases marked complete
PHASES.md and project memory updated to reflect actual build state.
Phases 11-14 were built but trackers weren't updated.

Final stats: 11 crates, 30 tests, 16 ADRs, 2.47M rows, 100K vectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:34:15 -05:00
root
6d49f81ebf Add read-mem skill + comprehensive project memory
- /read-mem skill: reads PRD, phases, decisions, checks live services
- Updated PHASES.md with all 15 phases tracked
- Updated project_lakehouse.md memory with full context
- Updated CLAUDE.md with project reference
- Skill at ~/.claude/skills/read-mem/ and project level
- Triggers on: "read mem", "project status", "where were we", "catch me up"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:23:01 -05:00
root
01373c0e45 Phase 5: hardening — gRPC, observability, auth, config
- proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService
- proto crate: tonic-build codegen from proto definitions
- catalogd: gRPC CatalogService implementation
- gateway: dual HTTP (:3100) + gRPC (:3101) servers
- gateway: OpenTelemetry tracing with stdout exporter
- gateway: API key auth middleware (toggleable)
- shared: TOML config system with typed structs and defaults
- lakehouse.toml config file
- ADR-006 and ADR-007 documented

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:37:07 -05:00
root
50a8c8013f Phase 4: Dioxus frontend with dataset browser and SQL query editor
- ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table
- ui: dynamic API base URL (same-origin for nginx, port-based for local dev)
- gateway: CORS enabled for cross-origin requests
- nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin
- justfile: ui-build, ui-serve, sidecar, up commands added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:24:15 -05:00
root
239e471223 Phase 3: AI integration with Ollama via Python sidecar
- sidecar: FastAPI app with /embed, /generate, /rerank hitting Ollama
- sidecar: Dockerfile, env var config (EMBED_MODEL, GEN_MODEL, RERANK_MODEL)
- aibridge: reqwest HTTP client with typed request/response structs
- aibridge: Axum proxy endpoints (POST /ai/embed, /ai/generate, /ai/rerank)
- gateway: wires AiClient with SIDECAR_URL env var
- e2e verified: nomic-embed-text returns 768d vectors, qwen2.5 generates text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:53:56 -05:00
root
19bdfab227 Phase 2: DataFusion query engine over Parquet
- queryd: SessionContext with custom URL scheme to avoid path doubling with LocalFileSystem
- queryd: ListingTable registration from catalog ObjectRefs with schema inference
- queryd: POST /query/sql returns JSON {columns, rows, row_count}
- queryd→catalogd wiring: reads all datasets, registers as named tables
- gateway: wires QueryEngine with shared store + registry
- e2e verified: SELECT *, WHERE/ORDER BY, COUNT/AVG all correct

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:48:20 -05:00
root
655b6c0b37 Phase 1: storage + catalog layer
- storaged: object_store backend (LocalFileSystem), PUT/GET/DELETE/LIST endpoints
- shared: arrow_helpers with Parquet roundtrip + schema fingerprinting (2 tests)
- catalogd: in-memory registry with write-ahead manifest persistence to object storage
- catalogd: POST/GET /datasets, GET /datasets/by-name/{name}
- gateway: wires storaged + catalogd with shared object_store state
- Phase tracker updated: Phase 0 + Phase 1 gates passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:15:27 -05:00
root
a52ca841c6 Phase 0: bootstrap Rust workspace
- Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway
- shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum
- gateway: Axum HTTP entrypoint with nested service routers + tracing
- All services expose /health stubs
- justfile with build/test/run recipes
- PRD, phase tracker, and ADR docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 04:59:05 -05:00