Architectural snapshot of the lakehouse codebase at the point where the
full matrix-driven agent loop with Mem0 versioning + deletion was
validated end-to-end.
WHAT THIS REPO IS
A clean single-commit snapshot of the lakehouse code. Heavy test data
(.parquet datasets, vector indexes) excluded — see REPLICATION.md for
regen path. Full lakehouse history at git.agentview.dev/profit/lakehouse.
WHAT WAS PROVEN
- Vector retrieval across multi-corpora matrix (chicago_permits + entity
  briefs + sec_tickers + distilled procedural + llm_team runs)
- Observer hand-review (cloud + heuristic fallback) gating each candidate
- Local-model agent loop (qwen3.5:latest) with tool use + scratchpad
- Playbook seal on success → next-iter retrieval surfaces it as preamble
- Mem0 versioning + deletion in pathway_memory:
  * UPSERT: ADD on new workflow, UPDATE bumps replay_count on identical
  * REVISE: chains versions, parent.superseded_at + superseded_by stamped
  * RETIRE: marks specific trace retired with reason, excluded from retrieval
  * HISTORY: walks chain root→tip, cycle-safe
KEY DIRECTORIES
- crates/vectord/src/pathway_memory.rs — Mem0 ops live here
- crates/vectord/src/playbook_memory.rs — original Mem0 reference
- tests/agent_test/ — local-model agent harness + PRD + session archives
- scripts/dump_raw_corpus.sh — MinIO bucket dump (raw test corpus)
- scripts/vectorize_raw_corpus.ts — corpus → vector indexes
- scripts/analyze_chicago_contracts.ts — real inference pipeline
- scripts/seal_agent_playbook.ts — Mem0 upsert from agent traces
Replication: see REPLICATION.md for Debian 13 clean install + cloud-only
adaptation (no local Ollama).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
108 lines
4.0 KiB
Markdown
# ADR-020: Universal ID Mapping (the hybrid search identity problem)

**Status:** Proposed (2026-04-17)
**Triggered by:** ID mismatch between vector doc_ids and SQL primary keys
**Owner:** J

---

## Problem

The hybrid search endpoint (`POST /vectors/hybrid`) filters SQL results
by primary key, then matches those keys against vector doc_ids. This
requires the two ID spaces to be compatible. Currently they're not:

| Source | SQL primary key | Vector doc_id | Match? |
|---|---|---|---|
| ethereal_workers | `worker_id = 4925` | `W-4925` | Only with W- strip |
| workers_500k | `worker_id = 41566` | `W500K-41566` | Only with W500K- strip |
| candidates | `candidate_id = CAND-055035` | `CAND-055035` | Yes (same format) |
| workers_5k_proof | `worker_id = 3200` | `W5K-3200` | Only with W5K- strip |

Every time a new dataset is ingested and embedded, a new prefix
appears and the hybrid search breaks until someone hardcodes another
`strip_prefix` line.

This violates the PRD invariant: **"Any data source can be ingested
without pre-defined schemas."** The ID format IS a pre-defined schema
that the hybrid search depends on.

## Root cause

When the embedding pipeline creates a vector index, it generates
doc_ids by concatenating a prefix with the value of the source row's
ID column. The prefix is chosen at index creation time and baked into
the Parquet vector file. The SQL dataset knows nothing about this prefix.

The hybrid search then has to reverse-engineer the prefix to match
vector results against SQL rows. This is fragile and breaks on every
new data source.

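The failure mode can be sketched in a few lines of Rust. This is an illustration only: `make_doc_id`, `strip_known_prefix`, and the `KNOWN_PREFIXES` list are hypothetical names, not the actual pipeline or hybrid search code. The prefixes are the ones from the table above.

```rust
/// Hypothetical sketch of doc_id generation at index creation time:
/// a prefix chosen by the pipeline is glued onto the source row's ID.
fn make_doc_id(prefix: &str, source_id: &str) -> String {
    format!("{prefix}{source_id}")
}

/// Hypothetical sketch of the fragile reverse step: try each hardcoded
/// prefix in turn. Every new dataset needs another entry in this list.
fn strip_known_prefix(doc_id: &str) -> &str {
    const KNOWN_PREFIXES: [&str; 3] = ["W500K-", "W5K-", "W-"];
    for prefix in KNOWN_PREFIXES {
        if let Some(rest) = doc_id.strip_prefix(prefix) {
            return rest;
        }
    }
    // Unknown prefix: the doc_id is returned unchanged. That happens to be
    // right for `CAND-055035`, but silently wrong for any new prefixed dataset.
    doc_id
}
```

A doc_id with a prefix the list has never seen falls through untouched, so it never matches its SQL key and the hybrid search returns nothing for that dataset.
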
## Solution: catalog-level ID mapping

### Option A: Normalize doc_ids at embedding time (simplest)

When creating a vector index, don't prefix the doc_id. Use the raw
value from the source column. If the source has `worker_id = 4925`,
the vector doc_id is just `"4925"`, with no `W-` or `W500K-` prefix.

**Pros:** Simplest. Hybrid search just compares strings directly.
**Cons:** Doc_ids across different indexes could collide (two datasets
both have worker_id=1), so lookups must be scoped by index name.

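One way to picture the scoping requirement is a composite key. The sketch below is an assumption about how scoping might look, not existing code: `ScopedDocId`, `scoped`, and `build_lookup` are hypothetical names, and the index names are borrowed from the dataset names above purely for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical composite key: with raw IDs, a doc_id is unique only
/// within the index that produced it.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ScopedDocId {
    index_name: String,
    doc_id: String,
}

fn scoped(index_name: &str, raw_id: &str) -> ScopedDocId {
    ScopedDocId {
        index_name: index_name.to_string(),
        doc_id: raw_id.to_string(),
    }
}

/// Build a cross-index lookup from (index_name, raw_id, payload) rows.
/// Keying by the raw id alone would let the second dataset overwrite the first.
fn build_lookup(rows: &[(&str, &str, &str)]) -> HashMap<ScopedDocId, String> {
    rows.iter()
        .map(|(index, id, payload)| (scoped(index, id), payload.to_string()))
        .collect()
}
```

With this key, `worker_id = 1` from two different datasets yields two distinct entries instead of a collision.
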
### Option B: Catalog stores the mapping (most robust)

The catalog maintains a mapping table:

```
index_name | doc_id_prefix | source_dataset | source_id_column
-----------+---------------+----------------+-----------------
resumes    | CAND-         | candidates     | candidate_id
workers_v1 | W500K-        | workers_500k   | worker_id
```

Hybrid search reads this mapping and applies the prefix/strip
automatically. New datasets register their mapping at index creation.

**Pros:** Handles all cases. Self-describing. No code changes for new data.
**Cons:** One more lookup per hybrid search (trivial perf cost).

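A minimal in-memory sketch of such a mapping follows. The `IdMapping` fields mirror the table columns; the `Catalog` type and its methods are hypothetical and say nothing about how the real catalog is stored. Only the `workers_v1` row is exercised.

```rust
use std::collections::HashMap;

/// One row of the mapping table. Field names mirror its columns.
#[derive(Debug, Clone)]
struct IdMapping {
    doc_id_prefix: String,
    source_dataset: String,
    source_id_column: String,
}

/// Hypothetical in-memory catalog keyed by index name.
#[derive(Default)]
struct Catalog {
    mappings: HashMap<String, IdMapping>,
}

impl Catalog {
    /// Called once, at index creation.
    fn register(&mut self, index_name: &str, mapping: IdMapping) {
        self.mappings.insert(index_name.to_string(), mapping);
    }

    /// SQL key -> vector doc_id. `None` if the index was never registered.
    fn to_doc_id(&self, index_name: &str, sql_id: &str) -> Option<String> {
        let m = self.mappings.get(index_name)?;
        Some(format!("{}{}", m.doc_id_prefix, sql_id))
    }

    /// Vector doc_id -> SQL key. `None` if the index is unknown or the
    /// doc_id does not carry the registered prefix.
    fn to_sql_id<'a>(&self, index_name: &str, doc_id: &'a str) -> Option<&'a str> {
        let m = self.mappings.get(index_name)?;
        doc_id.strip_prefix(m.doc_id_prefix.as_str())
    }
}
```

Returning `None` for an unregistered index or a mismatched prefix makes a missing mapping an explicit error rather than an empty result set.
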
### Option C: Pass the mapping in the request (pragmatic)

The hybrid search request already accepts `id_column`. Extend it
with `id_prefix` so the caller says "vector doc_ids have prefix
W500K- and the SQL column is worker_id."

```json
{
  "index_name": "workers_500k_v1",
  "id_column": "worker_id",
  "id_prefix": "W500K-",
  "sql_filter": "role = 'Forklift Operator' AND state = 'IL'",
  ...
}
```

**Pros:** Zero backend change. Caller already knows the context.
**Cons:** Caller has to know the prefix, so it is not self-service.

## Recommendation

**Option A for new indexes, Option B for the registry.**

1. Change the embedding pipeline to use RAW IDs (no prefix) as the
   default. Existing indexes keep their prefixed IDs.
2. Add an `id_prefix` column to IndexMeta so the catalog knows how
   to map between vector and SQL IDs.
3. Hybrid search reads IndexMeta.id_prefix and applies it automatically.
4. Remove the hardcoded strip_prefix chain.

This means: ingest a new dataset, embed it, hybrid search works
immediately. No code changes. The system is truly "any data source."

## Implementation

1. `IndexMeta` gains `id_prefix: Option<String>` (default None = no prefix)
2. Embedding pipeline: when `id_prefix` is None, use raw IDs from source
3. Hybrid search: read `IndexMeta.id_prefix`, prepend to SQL IDs before matching
4. Migration: existing indexes retain their prefix in IndexMeta

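The steps above can be sketched as follows. Only the `id_prefix: Option<String>` field comes from this ADR; the trimmed-down `IndexMeta`, the `Hit` shape, and the `doc_id_for` and `hybrid_match` functions are assumptions made for illustration and may not match the real types in the codebase.

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down IndexMeta. Only `id_prefix` is from the ADR.
#[derive(Debug, Clone, Default)]
struct IndexMeta {
    /// None means the index stores raw source IDs (the new default).
    id_prefix: Option<String>,
}

impl IndexMeta {
    /// Map a SQL primary key to the doc_id stored in the vector index.
    fn doc_id_for(&self, sql_id: &str) -> String {
        match &self.id_prefix {
            Some(prefix) => format!("{prefix}{sql_id}"),
            None => sql_id.to_string(),
        }
    }
}

/// A vector hit as (doc_id, score). The shape is an assumption.
type Hit = (String, f32);

/// Keep the vector hits whose doc_id corresponds to a SQL-filtered row,
/// preserving the order in which the hits were ranked.
fn hybrid_match(meta: &IndexMeta, sql_ids: &[&str], hits: &[Hit]) -> Vec<Hit> {
    let allowed: HashSet<String> = sql_ids.iter().map(|id| meta.doc_id_for(id)).collect();
    hits.iter()
        .filter(|(doc_id, _)| allowed.contains(doc_id))
        .cloned()
        .collect()
}
```

An existing index would carry `id_prefix: Some("W500K-".to_string())`, while a newly created one uses `IndexMeta::default()` and compares raw IDs directly, so both paths go through the same matching code with no per-dataset branches.
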