Architectural snapshot of the lakehouse codebase at the point where the
full matrix-driven agent loop with Mem0 versioning + deletion was
validated end-to-end.
WHAT THIS REPO IS
A clean single-commit snapshot of the lakehouse code. Heavy test data
(.parquet datasets, vector indexes) is excluded; see REPLICATION.md for
the regeneration path. Full lakehouse history lives at git.agentview.dev/profit/lakehouse.
WHAT WAS PROVEN
- Vector retrieval across a multi-corpus matrix (chicago_permits + entity
briefs + sec_tickers + distilled procedural + llm_team runs)
- Observer hand-review (cloud + heuristic fallback) gating each candidate
- Local-model agent loop (qwen3.5:latest) with tool use + scratchpad
- Playbook seal on success → next-iter retrieval surfaces it as preamble
- Mem0 versioning + deletion in pathway_memory:
* UPSERT: ADD on new workflow, UPDATE bumps replay_count on identical
* REVISE: chains versions, parent.superseded_at + superseded_by stamped
* RETIRE: marks specific trace retired with reason, excluded from retrieval
* HISTORY: walks chain root→tip, cycle-safe
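The version-chain semantics above can be sketched as a minimal in-memory model. This is an illustrative sketch only: the names `Trace` and `PathwayMemory` are stand-ins, not the real types in `pathway_memory.rs`, and fields like `replay_count` and `superseded_at` are omitted for brevity. The key ideas it demonstrates are REVISE stamping `superseded_by` on the parent, RETIRE flagging a trace out of retrieval, and HISTORY walking root→tip with a visited set so a corrupted chain cannot loop forever.

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-in for a stored workflow trace.
#[derive(Debug, Clone)]
struct Trace {
    id: u64,
    superseded_by: Option<u64>, // stamped by REVISE when a newer version exists
    retired: bool,              // set by RETIRE; excluded from retrieval
}

struct PathwayMemory {
    traces: HashMap<u64, Trace>,
    next_id: u64,
}

impl PathwayMemory {
    fn new() -> Self {
        Self { traces: HashMap::new(), next_id: 1 }
    }

    // ADD: insert a brand-new workflow trace.
    fn add(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.traces.insert(id, Trace { id, superseded_by: None, retired: false });
        id
    }

    // REVISE: chain a new version onto `parent`, stamping superseded_by.
    fn revise(&mut self, parent: u64) -> Option<u64> {
        if !self.traces.contains_key(&parent) {
            return None;
        }
        let child = self.add();
        self.traces.get_mut(&parent).unwrap().superseded_by = Some(child);
        Some(child)
    }

    // RETIRE: mark a specific trace as excluded from retrieval.
    fn retire(&mut self, id: u64) {
        if let Some(t) = self.traces.get_mut(&id) {
            t.retired = true;
        }
    }

    fn is_retired(&self, id: u64) -> bool {
        self.traces.get(&id).map_or(false, |t| t.retired)
    }

    // HISTORY: walk root -> tip; the `seen` set makes the walk cycle-safe.
    fn history(&self, root: u64) -> Vec<u64> {
        let mut out = Vec::new();
        let mut seen = HashSet::new();
        let mut cur = Some(root);
        while let Some(id) = cur {
            if !seen.insert(id) {
                break; // cycle guard: already visited
            }
            let Some(t) = self.traces.get(&id) else { break };
            out.push(id);
            cur = t.superseded_by;
        }
        out
    }
}
```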
KEY DIRECTORIES
- crates/vectord/src/pathway_memory.rs — Mem0 ops live here
- crates/vectord/src/playbook_memory.rs — original Mem0 reference
- tests/agent_test/ — local-model agent harness + PRD + session archives
- scripts/dump_raw_corpus.sh — MinIO bucket dump (raw test corpus)
- scripts/vectorize_raw_corpus.ts — corpus → vector indexes
- scripts/analyze_chicago_contracts.ts — real inference pipeline
- scripts/seal_agent_playbook.ts — Mem0 upsert from agent traces
Replication: see REPLICATION.md for Debian 13 clean install + cloud-only
adaptation (no local Ollama).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
79 lines
2.0 KiB
TOML
# Lakehouse Configuration

[gateway]
host = "0.0.0.0"
port = 3100

[storage]
root = "./data"
profile_root = "./data/_profiles"
rescue_bucket = "rescue"

[[storage.buckets]]
name = "primary"
backend = "local"
root = "./data"

[[storage.buckets]]
name = "rescue"
backend = "local"
root = "./data/_rescue"

[[storage.buckets]]
name = "testing"
backend = "local"
root = "./data/_testing"

# S3 bucket via MinIO. The name "s3:lakehouse" is the convention
# lance_backend.rs uses to emit s3:// URIs for Lance datasets.
# Credentials resolved via environment (AWS_ACCESS_KEY_ID etc) or
# the secrets provider.
[[storage.buckets]]
name = "s3:lakehouse"
backend = "s3"
bucket = "lakehouse"
endpoint = "http://localhost:9000"
region = "us-east-1"
secret_ref = "minio-lakehouse"

[catalog]
# Manifests persisted to object storage under this prefix
manifest_prefix = "_catalog/manifests"

[query]
# max_rows_per_query = 10000

[sidecar]
url = "http://localhost:3200"

[ai]
embed_model = "nomic-embed-text"
gen_model = "qwen2.5"
rerank_model = "qwen2.5"

[auth]
enabled = false
# api_key = "changeme"

[observability]
# Export traces to stdout (set to "otlp" for OpenTelemetry collector)
exporter = "stdout"
service_name = "lakehouse"

[agent]
# Phase 16.2 — background autotune agent. Opt-in: set enabled = true to
# let the agent continuously propose + trial HNSW configs and auto-promote
# winners. Defaults are conservative so it stays out of the way of live
# search traffic on shared Ollama.
enabled = true
cycle_interval_secs = 120          # periodic wake if no triggers
cooldown_between_trials_secs = 10  # min gap between trials
min_recall = 0.9                   # never promote below this
max_trials_per_hour = 20           # hard budget cap

# Model roster — available for profile hot-swap
# qwen3: 8.2B, 40K context, thinking+tools, best for reasoning tasks
# qwen2.5: 7B, 8K context, fast, good for SQL generation
# mistral: 7B, 8K context, good for general generation
# nomic-embed-text: 137M, embedding-only, used by all profiles
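The `[agent]` knobs above gate promotion of trialed HNSW configs. A hypothetical sketch of that gate (the function name `should_promote` and the `TrialResult` fields are illustrative, not the agent's actual API): a candidate is promoted only when the hourly trial budget is not exhausted, recall clears the `min_recall` floor, and the candidate beats the incumbent on latency.

```rust
// Mirrors the [agent] config knobs; names here are illustrative.
struct AgentCfg {
    min_recall: f64,
    max_trials_per_hour: u32,
}

struct TrialResult {
    recall: f64,
    latency_ms: f64,
}

// Promote a trialed config only when it clears the recall floor,
// beats the incumbent's latency, and the hourly budget isn't spent.
fn should_promote(
    cfg: &AgentCfg,
    trial: &TrialResult,
    incumbent: &TrialResult,
    trials_this_hour: u32,
) -> bool {
    trials_this_hour < cfg.max_trials_per_hour
        && trial.recall >= cfg.min_recall
        && trial.latency_ms < incumbent.latency_ms
}
```

The recall floor acts as a hard constraint rather than a weighted objective: a config that is faster but falls below `min_recall = 0.9` is never promoted, which keeps the autotuner from trading away retrieval quality for speed.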