lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Setup for the corpus-tightening experiment sweep (J 2026-04-26 — "now
is the only cheap window before the corpus gets large and refactoring
costs go up").
Override params on /v1/mode/execute (additive — old callers unaffected):
force_matrix_corpus — Pass 2: try alternate corpora per call
force_relevance_threshold — Pass 2: sweep filter strictness
force_temperature — Pass 3: variance test
New native mode `staffing_inference_lakehouse` (Pass 4):
- Same composer architecture as codereview_lakehouse
- Staffing framing: coordinator producing fillable|contingent|
unfillable verdict + ranked candidate list with playbook citations
- matrix_corpus = workers_500k_v8
- Validates that modes-as-prompt-molders generalizes beyond code
- Framing explicitly says "do NOT fabricate workers" — the staffing
analog of the lakehouse mode's symbol-grounding requirement
Three sweep harnesses:
scripts/mode_pass2_corpus_sweep.ts — 4 corpora × 4 thresholds × 5 files
scripts/mode_pass3_variance.ts — 3 files × 3 temps × 5 reps
scripts/mode_pass4_staffing.ts — 5 fill requests through staffing mode
Each appends per-call rows to data/_kb/mode_experiments.jsonl which
mode_compare.ts already aggregates with grounding column.
Pass 1 (10 files × 5 modes broad sweep) currently running via the
existing scripts/mode_experiment.ts — gateway restart deferred until
it completes so the new override knobs aren't enabled mid-experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Three fixes after the playbook_only confabulation surfaced in
2026-04-26 experiment (8 'findings' on a 333-line file all citing
lines 378-945 — fully fabricated from pathway-memory pattern names).
(1) Aggregator regex bug — section detection failed on emoji-prefixed
markdown headers like `## 🔎 Ranked Findings`. The original regex
required word chars right after #{1,3}\s+, so the patches table
header `## 🛠️ Concrete Patch Suggestions` was never recognized as
a stop boundary, double-counting every finding. Fix: allow
non-letter chars (emoji/space) between # and the keyword.
(2) Grounding check — for each finding row in the response, extract
backtick-quoted symbols + cited line numbers; verify symbols exist
in the actual focus file and lines fall within EOF. Computes
grounded/total ratio per mode. Surfaces 'OOB' (out-of-bounds) count
explicitly so confabulation is visible at a glance. Confirms what
hand-grading found: codereview_playbook_only's 8 findings on
service.rs were 1/8 grounded with 7 OOB.
(3) Control mode tagging — codereview_null and codereview_playbook_only
are designed as falsifiers (baseline / lossy ceiling) and their
numerical wins should never be read as recommendations. Output
marks them with ⚗ glyph + warning footer.
Per-mode aggregate is now sorted by groundedness, not raw count.
Per-mode-vs-lakehouse comparison uses grounded findings, not raw —
so confabulation can no longer score a "win".
Updated SCRUM_MASTER_SPEC.md with refactor timeline pointing at
the 2026-04-25/26 commits (observer fix, relevance filter, retire
wire, mode router, enrichment runner, parameterized experiment).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
J's directive (2026-04-26): "Create different modes so we can really
dial in the architecture before it gets further along — pinpoint the
failures and strengths equally so I know what direction to go in.
Loop theater happens when we don't pinpoint the most accurate path."
Refactored execute() to switch on mode name → EnrichmentFlags preset.
Five native modes designed as deliberate experiments — each isolates
one architectural axis so the comparison matrix reads off what's
doing work vs what's adding latency for nothing:
codereview_lakehouse — all enrichment on (ceiling)
codereview_null — raw file + generic prompt (baseline)
codereview_isolation — file + pathway only (no matrix)
codereview_matrix_only — file + matrix only (no pathway)
codereview_playbook_only — pathway only, NO file content (lossy ceiling)
Each call appends a row to data/_kb/mode_experiments.jsonl with full
sources + response. LH_MODE_LOG_OFF=1 to suppress.
scripts/mode_experiment.ts — sweeps files × modes serially, prints
live progress with per-call enrichment stats. Defaults to OpenRouter
free model so cloud quota doesn't gate experiments.
scripts/mode_compare.ts — reads the JSONL, outputs per-file matrix
+ per-mode aggregate + mode-vs-baseline win/loss with avg finding
delta. Heuristic finding-count from markdown table rows; pathway
citation count from preamble references.
scrum_master_pipeline.ts gets a mode-runner fast path gated by
LH_USE_MODE_RUNNER=1: try /v1/mode/execute first, fall through to
the existing ladder if response < LH_MODE_MIN_CHARS (default 2000)
or anything errors. Off by default until A/B-validated.
First experiment results (2 files × 5 modes via gpt-oss-120b:free):
- codereview_null produces 12.6KB response with ZERO findings
(proves adversarial framing is load-bearing)
- codereview_playbook_only produces MORE findings than lakehouse
on average (12 vs 9) at 73% the latency — pathway memory is
the dominant signal driver
- codereview_matrix_only underperforms isolation by ~0.5 findings
while costing the same latency — matrix corpus likely
underperforming for scrum_review task class
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Two small fixes surfaced during smoke testing:
analyze_chicago_contracts.ts: permit field is contact_1_name not
contact_1; reported_cost is integer-string. Fixed filter (was rejecting
all 2853 permits) and contractor extraction (was empty).
vectorize_raw_corpus.ts: sanitize() expanded to strip control chars +
ALL backslashes (kills incomplete \uXXXX escapes) + UTF-16 surrogates
(unpaired surrogates from emoji split by truncate boundary). Llm_team
response cache had docs with all three pollution shapes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per J 2026-04-25: pathway_memory was append-only — every agent run added
a new trace, bad/failed runs polluted the matrix forever, no notion of
"this is the canonical evolved playbook." Ported playbook_memory's
Phase 25/27 patterns into pathway_memory so the agent loop's matrix
converges on best-known approaches per task class instead of bloating.
Fields added to PathwayTrace (all #[serde(default)] for back-compat):
- trace_uid: stable UUID per individual trace within a bucket
- version: u32 default 1
- parent_trace_uid, superseded_at, superseded_by_trace_uid
- retirement_reason (paired with existing retired:bool)
Methods added to PathwayMemory:
- upsert(trace) → PathwayUpsertOutcome {Added|Updated|Noop}
Workflow-fingerprint dedup: ladder_attempts + final_verdict hash.
Identical workflow → bumps existing replay_count instead of duplicating.
- revise(parent_uid, new_trace) → PathwayReviseOutcome
Chains versions; rejects retired or already-superseded parents.
- retire(trace_uid, reason) → bool
Marks specific trace retired with reason. Idempotent.
- history(trace_uid) → Vec<PathwayTrace>
Walks parent_trace_uid back to root, then superseded_by forward to tip.
Cycle-safe via visited set.
Retrieval gates updated:
- query_hot_swap skips superseded_at.is_some()
- bug_fingerprints_for skips both retired AND superseded
HTTP endpoints in service.rs:
- POST /vectors/pathway/upsert
- POST /vectors/pathway/retire
- POST /vectors/pathway/revise
- GET /vectors/pathway/history/{trace_uid}
scripts/seal_agent_playbook.ts switched insert→upsert + accepts SESSION_DIR
arg so it can seal any archived session, not just iter4.
Verified live (4/4 ops):
- UPSERT first run: Added trace_uid 542ae53f
- UPSERT identical: Updated, replay_count bumped 0→1 (no duplicate)
- REVISE 542ae53f→87a70a61: parent stamped superseded_at, v2 created
- HISTORY of v2: chain_len=2, v1 superseded, v2 tip
- RETIRE iter-6 broken trace: retired=true, retirement_reason preserved
- pathway_memory.stats: total=79, retired=1, reuse_rate=0.0127
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new pieces, executed in order:
scripts/dump_raw_corpus.sh
- One-shot bash that creates MinIO bucket `raw` and uploads all
testing corpora as a persistent immutable test set. 365 MB total
across 5 prefixes (chicago, entities, sec, staffing, llm_team)
+ MANIFEST.json. Sources: workers_500k.parquet (309 MB),
resumes.parquet, entities.jsonl, sec_company_tickers.json,
Chicago permits last 30d (2,853 records, 5.4 MB), 9 LLM Team
Postgres tables dumped via row_to_json.
scripts/vectorize_raw_corpus.ts
- Bun script that fetches each raw-bucket source via mc, runs a
source-specific extractor into {id, text} docs, posts to
/vectors/index, polls job to completion. Verified results:
chicago_permits_v1: 3,420 chunks
entity_brief_v1: 634 chunks
sec_tickers_v1: 10,341 chunks (after extractor fix for
wrapped {rows: {...}} JSON shape)
llm_team_runs_v1: in flight, 19K+ chunks
llm_team_response_cache_v1: queued
scripts/analyze_chicago_contracts.ts
- Real inference pipeline that picks N high-cost permits with
named contractors from the raw bucket, queries all 6 contract-
analysis corpora in parallel via /vectors/search, builds a
MATRIX CONTEXT preamble, calls Grok 4.1 fast for structured
staffing analysis, hand-reviews each via observer /review,
appends to data/_kb/contract_analyses.jsonl.
tests/real-world/scrum_master_pipeline.ts
- MATRIX_CORPORA_FOR_TASK extended with two new task classes:
contract_analysis (chicago + entity_brief + sec + llm_team_runs
+ llm_team_response_cache + distilled_procedural)
staffing_inference (workers_500k_v8 + entity_brief + chicago
+ llm_team_runs + distilled_procedural)
scrum_review unchanged.
This is the first time the matrix architecture operates on real
ingested data instead of code-review smoke tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matrix-index the "who handled this" dimension so top staffers become
the training signal and juniors inherit their playbooks automatically
via the boost pipeline. Auto-discovered indicators emerge from
comparing trajectories across staffers on similar contracts — that was
always the architectural point; this wires the last piece.
ContractTerms:
- deadline, budget_total_usd, budget_per_hour_max, local_bonus_per_hour,
local_bonus_radius_mi, fill_requirement ("paramount" | "preferred")
- Attached to ScenarioSpec, propagated into T3 checkpoint + cloud
rescue prompts so cloud reasons about trade-offs (pivot within bonus
radius first; respect per-hour cap; split across cities when
fill_requirement=paramount).
Staffer:
- {id, name, tenure_months, role: senior|mid|junior|trainee}
- On ScenarioSpec; logged at scenario start; attached to KB outcome
- Recomputed StafferStats written to data/_kb/staffers.jsonl after
every run: total_runs, fill_rate, avg_turns, avg_citations,
rescue_rate, competence_score.
- Competence formula: 0.45*fill_rate + 0.20*turn_efficiency +
0.20*citation_density + 0.15*rescue_rate. Normalized to 0..1.
findNeighbors now returns weighted_score = cosine × best_staffer_competence
(floored at 0.3 so high-similarity low-competence neighbors still
surface). pathway_recommender prompt shows the top staffer's identity
so cloud knows WHOSE playbook it's synthesizing from.
Demo infrastructure:
- tests/multi-agent/gen_staffer_demo.ts: 4 personas (Maria senior,
James mid, Sam junior, Alex trainee) × 3 contracts (Nashville Welder,
Joliet Warehouse, Indianapolis Assembly). 12 scenarios total.
- scripts/run_staffer_demo.sh: runs the 12 sequentially with
LH_OVERVIEW_CLOUD=1. Post-run calls kb_staffer_report.py.
- scripts/kb_staffer_report.py: leaderboard + cross-staffer worker
overlap (names endorsed by ≥2 staffers → auto-discovered high-value
workers). Top vs bottom differential.
gen_scenarios.ts (Phase 22 generator) also now emits contract terms
on 70% of generated specs — future KB batches populate with realistic
constraint patterns instead of bare role+city+count.
Stress scenario from item A intentionally NOT the production test.
Real staffing has constraints; Nashville contract + staffer demo is
the honest test of whether the architecture produces measurable
differential between coordinator skill levels.
Demo batch launched — 12 runs × ~3min each ≈ 40min unattended. Report
emitted after batch.
ROOT CAUSE (found via instrumentation, not hunch):
After a 20-scenario corpus batch, only 6/40 successful (role, city)
combos ever triggered playbook_memory citations on subsequent runs.
Added `playbook_boost:` tracing::info! line in vectord::service to log
boost map size vs candidate pool vs match count. One query revealed:
boosts=170 sources=50 parsed=50 matched=0
170 endorsed workers came back from compute_boost_for — but zero were
in the 50-candidate Toledo pool. The boost map was pulling globally-
ranked semantic neighbors (top-100 playbooks across ALL cities),
dominated by Kansas City / Chicago / Detroit forklift playbooks the
Toledo SQL filter would never admit. The mechanism was correct at the
per-playbook level; the problem was pool intersection.
FIX (surgical, not cap-tuning):
- playbook_memory::compute_boost_for_filtered(): accepts optional
(city, state) filter. When set, skips playbooks from other geos
BEFORE cosine-ranking, so top-k is within the target city.
- Backwards-compatible: compute_boost_for() calls the filtered variant
with None — existing callers unchanged.
- service::hybrid_search(): extracts target (city, state) from the
executor's SQL filter via a small parser (extract_target_geo),
passes to compute_boost_for_filtered.
VERIFIED:
Before fix: boosts=170 sources=50 parsed=50 matched=0 (0% hit)
After fix: boosts=36 sources=50 parsed=50 matched=11 (22% hit)
Top-k=10 now has 7/10 boosted workers with 2-3 citations each.
Boost values 0.075-0.113 on cosine scores 0.67-0.74 — meaningful
reorder without saturation.
scripts/kb_measure.py:
Aggregator that reads data/_kb/*.jsonl and playbooks/*/results.json,
reports fill rate, citation density, recommender confidence trend,
and zero-citation-ok combos (item 3 target signal). Used to measure
before/after on bigger batches.
Diagnostic logging stays — the class of "boosts computed but not
matched" bug can recur if the SQL filter format ever drifts, and
without the counter it's invisible. Every hybrid_search with
use_playbook_memory=true now logs its boost stats.
ITEM 1 — k CAP + REASON FIELD
The hybrid_search default k was hard-coded to 10. For multi-fill events
(5× expansion, 4× emergency) that's pool=10 → propose 5-of-10, half
the candidates become the answer with no room for rejection. Executor
prompt now instructs k to scale with target_count: k = max(count*5, 20),
cap 80. Default helper bumped 10 → 20.
Fill.reason dropped from required to optional. Nothing downstream ever
consumed it — resolveWorkerIds, sealSale, retrospective all use
candidate_id and name. Models loved to write 100-150 char justifications
per fill; on 4+ fills that blew the JSON budget before the structure
closed. Test 1 run result after this change: FIRST EVER 5/5 on the
Riverfront Steel scenario, 13 total turns across 5 events. The event
that failed last run (emergency 4×Loader with truncated reason-field
continuation) now clears in 2 turns.
Progression:
mistral baseline: 0/5
qwen3.5 + continuation + think:false: 4/5
qwen3.5 + k=20 + no-reason: 5/5 ✓
ITEM 2 — SCENARIO GENERATOR (NOT YET TESTED E2E)
tests/multi-agent/gen_scenarios.ts emits N deterministic ScenarioSpecs
with varied clients (15 companies), cities (20 Midwest cities known
to exist in workers_500k), role mixes (14 industrial staffing roles,
weighted realistic), and event sequences. Each gets a unique sig_hash
so the KB populates with distinct neighbor signatures.
scripts/run_kb_batch.sh runs all generated specs sequentially against
scenario.ts, logs per-scenario outcomes, and reports KB state at the
end. Each run takes ~2-4min; 20-30 scenarios = 1-2hr unattended.
Next: test the generator+batch on a small N (3-5) to verify KB
populates correctly and pathway recommendations start getting neighbor
signal instead of cold-starts. Then item 3 (Rust re-weighting of
hybrid_search by playbook_memory success).
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.
MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
mistral's decoder emitted malformed JSON on complex SQL filters
regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
run_e2e_rated.ts
CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
truncated-JSON → continue from partial as scratchpad; bounded by
budget cap + max_continuations. No more "bump max_tokens until it
stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
thinking ate the budget.
think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
emission. For executor/reviewer/draft: think:false. For T3/T4/T5
overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
Ollama's /api/generate.
VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
08:00 baseline_fill 3/3 4 turns
10:30 recurring 2/2 3 turns (1 playbook citation)
12:15 expansion 0/5 drift-aborted (5-fill orchestration
problem, separate work)
14:00 emergency 4/4 3 turns (1 citation)
15:45 misplacement 1/1 3 turns
→ T3 caught Patrick Ross double-booking across events
→ T3 flagged forklift cert drift on the event that failed
→ Cross-day lesson proposed "maintain buffer of ≥3 emergency
candidates, pre-fetch certs for expansion, booking system
cross-check" — real staffing advice, not generic linter output
PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.
scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
ROOT CAUSE: Python scripts polled status.get("processed", 0) but the
Rust Job struct serialized as "embedded_chunks". Scripts always saw 0,
looped forever printing "unknown: 0/50000" for 8+ hours.
Fix (both sides):
- Rust: added "processed" alias field + "total" field to Job struct,
kept in sync on every update_progress() and complete() call
- Python: fixed autonomous_agent.py and overnight_proof.sh to read
"embedded_chunks" as primary key
The actual embedding pipeline was working the whole time — 673K real
chunks embedded overnight. Only the monitoring was blind.
One-word bug, 8 hours of zombie output. This is why you test the
monitoring, not just the pipeline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs autonomously via cron (every 3 min, state machine):
1. Embed 500K workers through Ollama nomic-embed-text (~40 min)
Real embeddings, not random vectors. This is what matters.
2. Build HNSW + Lance IVF_PQ on real clustered data
3. Measure recall — HNSW vs Lance on real embeddings
4. 100 autonomous operations — local model only, no human steering
Mix: 50 matches + 25 counts + 15 aggregates + 10 lookups
5. 30 min sustained load — 10 concurrent ops/sec continuously
Currently running: Step 1 active, GPU at 43%, Ollama embedding.
Monitor: tail -f /home/profit/lakehouse/logs/overnight_proof.log
Check: cat /tmp/overnight_proof_state
This is the test that proves it's not just architecture — it's
real embeddings, real models, real sustained load, no hand-holding.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7-step autonomous test via cron (every 2 minutes):
1. Register 10M × 768d Parquet (28.8 GB, already generated)
2. Migrate Parquet → Lance (proves Lance handles what HNSW can't)
3. Build IVF_PQ (3162 partitions for √10M, 192 sub_vectors)
4. Search benchmark (10 searches, measure p50/p95)
5. Hot-swap profile test (create scale-10m profile, activate)
6. Agent test (5 contract matches on 500K via gateway, autonomous)
7. Final report
State machine in /tmp/scale_test_state — each cron invocation picks
up where the last one stopped. Lock file prevents concurrent runs.
All output to /home/profit/lakehouse/logs/scale_test.log.
Monitor: tail -f /home/profit/lakehouse/logs/scale_test.log
This is the test that proves Lance handles 10M+ vectors on disk
when HNSW hits its 5M RAM ceiling. No human intervention needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5-layer morning briefing system:
1. Contract scan: sorts by urgency, shows requirements
2. Pre-match: hybrid SQL+vector finds workers per contract BEFORE
the staffer asks. 25/25 positions pre-matched (100%)
3. Alerts: erratic workers flagged, silent workers needing different
channels, thin bench by state/role
4. Suggestions: top available workers not yet assigned, deep bench
roles that could fill larger orders
5. Briefing: qwen3 generates natural language action plan
The staffer's job becomes "review and confirm" not "search and compile."
Action queue: 6 contracts ready for one-click outreach.
Outputs structured JSON at /tmp/copilot_briefing.json — any UI
(Dioxus, React, even a Telegram bot) can render this.
This is the co-pilot: AI anticipates needs, surfaces answers,
staffer focuses on relationships and judgment calls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pulled qwen3 (8.2B, 40K context, thinking, tool-calling). Created
agent-qwen3 profile. Ran structured plan: 5 contracts (16/16 filled
via hybrid), 5 intelligence questions (2/5 — same RAG counting gap).
Key playbook entry generated: "count/aggregation questions must use
/sql not /search. RAG returns 5 chunks from 10K — cannot count the
full dataset." This routing rule is now in the playbooks database
for future agent runs to learn from.
Pattern confirmed across qwen2.5, mistral, AND qwen3: the structured
matching path (hybrid SQL+vector) is production-ready across all
models. The RAG counting gap is a routing problem, not a model
problem — the fix is query classification, not a better model.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MCP server at mcp-server/index.ts — 9 tools exposing the full
lakehouse to any MCP-compatible model:
search_workers (hybrid SQL+vector), query_sql, match_contract,
get_worker, rag_question, log_success, get_playbooks,
swap_profile, vram_status
The "successful playbooks" pattern: log_success writes outcomes
back to the lakehouse as a queryable dataset. Small models call
get_playbooks to learn what approaches worked for similar tasks —
no retraining needed, just data.
generate_workers.py scales to 100K+ with realistic distributions:
- 20 roles weighted by staffing industry frequency
- 44 real Midwest/South cities across 12 states
- Per-role skill pools (warehouse/production/machine/maintenance)
- 13 certification types with realistic probability
- 8 behavioral archetypes with score distributions
- SMS communication templates (20 patterns)
100K worker dataset ingested: 70MB CSV → Parquet in 1.1s. Verified:
11K forklift ops, 27K in IL, archetype distribution matches weights.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifier was checking claims={"name": ""} against actual names,
producing false-positive hallucinations on every RAG source. Fixed
to check worker existence only (does this worker_id exist in golden
data?). Now correctly reports 0 hallucinations on the contract-
matching path, 100% data accuracy.
Full regression clean: 52/52 unit tests, 21/21 stress, 50/50 agent,
16/16 staffing positions with zero hallucinations. Quality eval at
73% (honest baseline for 7B models without few-shot prompting).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10,000 staffing worker profiles from profit/ethereal repo. Flattened
JSON → CSV → Parquet. Indexed on HNSW (9.5s) + Lance IVF_PQ (7.2s).
SQL hybrid verified: forklift operators in IL with reliability > 0.8
returned exact matches. Vector search alone missed the state filter —
confirms the hybrid SQL+vector routing need from quality eval.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three-tier evaluation:
1. NL→SQL with verifiable ground truth (10 questions): 7/10 (70%)
2. RAG with LLM reranker (5 questions): 4/5 (80%)
3. Self-assessment calibration: 2.8/5 avg, NOT calibrated
Real problems surfaced:
- qwen2.5 generates `WHERE vertical = 'Java'` instead of
`WHERE skills LIKE '%Java%'` without few-shot schema examples
- DataFusion-specific SQL quirks (must SELECT the COUNT in
GROUP BY queries) trip the model without explicit instruction
- Vector search can't do structured filtering (city, status) —
needs hybrid SQL+vector routing
- Self-assessment is uncalibrated: wrong answers score higher
than correct ones (3.0 vs 2.8)
Fixes validated:
- Few-shot examples fix NL→SQL accuracy from 70% → ~90%
- Reranker stage works but needs more diversity in results
Also includes lance_tune.py IVF_PQ parameter sweep script.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Python agent that exercises the full Lakehouse substrate as a real
consumer would: ingests 10 Postgres tables (1,356 rows), embeds 5,415
chunks into 2 vector indexes, creates hot-swap profiles (Parquet+HNSW
with qwen2.5 vs Lance IVF_PQ with mistral), runs stress queries
across SQL + vector search + RAG, reads its own error pipeline to
generate recursive test scenarios, and iterates.
50/50 tests pass across 2 iterations with zero errors. Error pipeline
flushes failures back to the lakehouse as a queryable dataset so the
next iteration can target weak spots.
The agent IS the proof that the substrate works end-to-end: ingest →
embed → index → search → generate → profile swap → iterate. Every
capability we built today gets exercised in one script.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lakehouse.service: release gateway on :3100, auto-restart
- lakehouse-sidecar.service: Python FastAPI on :3200, auto-restart
- lakehouse-ui.service: WASM file server on :3300, auto-restart
- All enabled at boot (multi-user.target)
- scripts/serve_ui.py for systemd-compatible file serving
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>