Chapter 1
Receipts, not promises
Every test below ran live against the real gateway when you loaded this page. Sub-100ms SQL on multi-million-row Parquet, hybrid search with playbook boost applied, public-issuer attribution computed from this view. No fixtures. If a test fails, you'll see ✗.
Chapter 2
Architecture — 15 crates, one object store, a 5-provider model fleet
Gateway is a drop-in OpenAI-compatible middleware. Any consumer that speaks the OpenAI Chat Completions shape — agent SDKs, IDE plugins, custom apps — points at localhost:3100/v1 and gets routing, audit, and the full memory substrate behind every call. The model side has 5 providers and 40+ frontier models reachable via one OpenCode key. The data side stays Rust-first.
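Because the gateway is shape-compatible, "pointing at localhost:3100/v1" is just a base-URL change. A minimal sketch of what a consumer sends — the endpoint comes from this page, the model name is illustrative:

```typescript
// Minimal OpenAI-compatible request aimed at the local gateway.
// Only the base URL differs from a stock OpenAI client.
const GATEWAY_BASE = "http://localhost:3100/v1";

function chatRequest(model: string, userText: string) {
  // Standard Chat Completions body; the gateway adds routing,
  // audit, and memory behind this same shape.
  return {
    url: `${GATEWAY_BASE}/chat/completions`,
    body: {
      model,
      messages: [{ role: "user" as const, content: userText }],
    },
  };
}

const req = chatRequest("kimi-k2:1t", "List IL forklift operators");
// POST req.body to req.url with any HTTP client, or hand
// baseURL = GATEWAY_BASE to the OpenAI SDK.
```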
OpenAI SDK consumers MCP clients Browser UI (Bun :3700)
│ │ │
└──────────────────────────┼──────────────────────────┘
▼
┌──────────────────────────────┐
│ gateway :3100 /v1/* │ Rust · Axum
│ OpenAI-compat drop-in │ smart provider routing
│ /v1/chat /v1/mode /iterate │ cost telemetry, Langfuse
└──────────┬───────────────────┘
┌─────────┬───────────────┼───────────────┬──────────┐
│ │ │ │ │
┌────▼───┐ ┌───▼────┐ ┌─────▼──────┐ ┌─────▼─────┐ ┌──▼──────┐
│catalog │ │ query │ │ vector │ │ ingest │ │aibridge │
│ d │ │ d │ │ d │ │ d │ │ │
│idempot │ │DataFus │ │HNSW · Lance│ │CSV PDF SQL│ │provider │
│schema │ │delta │ │playbook+ │ │auto-PII │ │adapters │
│fingerp │ │MemTabl │ │pathway mem │ │schema fp │ │5 active │
└────┬───┘ └───┬────┘ └─────┬──────┘ └─────┬─────┘ └──┬──────┘
└─────────┴────────────────┼────────────────┴─────────┘
▼
┌──────────────────┐
│ object storage │ Parquet · MinIO · S3-compat
└──────────────────┘
▲
│
┌───────────────┴────────────────┐
│ validator · journald │ schema/PII/policy gates
│ (Phase 43) · (audit log) │ + append-only mutations
└────────────────────────────────┘
Provider fleet (config/providers.toml):
ollama localhost:3200 local Ollama → qwen3.5, gemma2
ollama_cloud ollama.com gpt-oss:120b, qwen3-coder:480b,
deepseek-v3.1:671b, kimi-k2:1t,
mistral-large-3:675b, qwen3.5:397b
openrouter openrouter.ai/api/v1 343 models — paid + free rescue
opencode opencode.ai/zen/v1 40 models · ONE sk-* key reaches
Claude Opus 4.7, GPT-5.5-pro,
Gemini 3.1-pro, Kimi K2.6, GLM 5.1,
DeepSeek, Qwen, MiniMax, free tier
kimi api.kimi.com/coding/v1 direct Kimi For Coding (TOS-clean)
Per-crate responsibility (15 crates)
| Crate | Role | Path |
| shared | Types, errors, Arrow helpers, PII detection, secrets provider, model_matrix | crates/shared/ |
| storaged | object_store I/O, BucketRegistry, AppendLog, ErrorJournal, federation_service | crates/storaged/ |
| catalogd | Manifests, views (incl. PII-safe view layer), tombstones, profiles, schema fingerprints, register-idempotency (ADR-020) | crates/catalogd/ |
| queryd | DataFusion SQL, MemTable cache, delta merge-on-read, compaction, truth gate (ADR-021) | crates/queryd/ |
| ingestd | CSV/JSON/PDF(+OCR)/Postgres/MySQL ingest, cron schedules, auto-PII flagging | crates/ingestd/ |
| vectord | Embeddings as Parquet, HNSW, trial system, autotune, playbook_memory + pathway_memory (ADR-021 semantic-correctness layer) | crates/vectord/ |
| vectord-lance | Firewall crate — Lance 4.0 + Arrow 57 isolated from main Arrow 55 | crates/vectord-lance/ |
| journald | Append-only mutation event log for time-travel + audit | crates/journald/ |
| truth | File-backed rule store; evaluate(task_class, ctx) → Vec<RuleOutcome> (ADR-021) | crates/truth/ |
| aibridge | Rust↔Python sidecar, Ollama client, ProviderAdapter trait, /v1/chat router | crates/aibridge/ |
| gateway | Axum HTTP :3100 + gRPC :3101, OpenAI-compat /v1/*, mode runner, validator, iterate loop, cost telemetry, Langfuse + observer fan-out | crates/gateway/ |
| validator | Phase 43 — schema / completeness / consistency / policy gates over LLM outputs (FillValidator, EmailValidator, ParquetWorkerLookup) | crates/validator/ |
| ui | Dioxus WASM internal developer UI (separate from this Bun-served public UI) | crates/ui/ |
| mcp-server | Bun TypeScript public-facing app + MCP tool surface — what you're reading right now | mcp-server/ |
| auditor | External claim-vs-diff verifier on PRs · Kimi K2.6 ↔ Haiku 4.5 cross-lineage alternation, Opus 4.7 auto-promote on diffs >100k chars | auditor/ |
Source: git.agentview.dev/profit/lakehouse · branch scrum/auto-apply-19814 · tag distillation-v1.0.0 at commit e7636f2 (frozen substrate) · ADRs: docs/DECISIONS.md (currently 21 records)
Chapter 3
The model fleet — 9-rung ladder, N=3 consensus, cross-lineage audit
No single model owns the answer. Every consequential call is structured: the right tier picks up first, fallback rungs catch what fails, parallel runs vote, and an independent auditor of a different model lineage checks the result against the diff. The protocol is deterministic; the inference is stochastic; every step writes a receipt.
The 9-rung cloud-first ladder
request in
│
▼
┌───────────────────────────────────────────────────────────────────┐
│ attempt 1 ollama_cloud / kimi-k2:1t 1T params · flagship │
│ attempt 2 ollama_cloud / qwen3-coder:480b coding specialist │
│ attempt 3 ollama_cloud / deepseek-v3.1:671b reasoning │
│ attempt 4 ollama_cloud / mistral-large-3:675b deep analysis │
│ attempt 5 ollama_cloud / gpt-oss:120b reliable workhorse │
│ attempt 6 ollama_cloud / qwen3.5:397b dense final thinker │
│ attempt 7 openrouter / openai/gpt-oss-120b:free rescue tier │
│ attempt 8 openrouter / google/gemma-3-27b-it:free fastest rescue │
│ attempt 9 ollama / qwen3.5:latest last-resort local │
└───────────────┬───────────────────────────────────────────────────┘
│ isAcceptable() = chars ≥ 3800 ∧ not malformed JSON
▼
sealed result OR next-rung learning preamble
Every rung sees a learning preamble carrying the prior rejection reason. The ladder is the standard scrum/auditor path; for individual /v1/chat calls the caller picks the model directly (or lets the smart-routing default fire).
Code: tests/real-world/scrum_master_pipeline.ts const LADDER · config/routing.toml · crates/gateway/src/v1/mode.rs (mode runner)
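The escalation loop above can be sketched as a pure function — `runLadder` and its `attempt` callback are hypothetical names; the real ladder lives in tests/real-world/scrum_master_pipeline.ts:

```typescript
// Sketch of the rung-by-rung escalation with learning preamble.
type Attempt = (model: string, preamble: string) => string;

function isAcceptable(out: string): boolean {
  if (out.length < 3800) return false;      // chars ≥ 3800
  try { JSON.parse(out); return true; }      // not malformed JSON
  catch { return false; }
}

function runLadder(rungs: string[], attempt: Attempt): string | null {
  let preamble = "";
  for (const model of rungs) {
    const out = attempt(model, preamble);
    if (isAcceptable(out)) return out;       // sealed result
    // The next rung sees why this rung was rejected.
    preamble = `Prior attempt on ${model} rejected: ` +
      (out.length < 3800 ? "too short" : "malformed JSON");
  }
  return null;                               // all rungs exhausted
}
```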
N=3 consensus + tie-breaker (auditor inference)
// auditor/checks/inference.ts — every claim audit runs this:
1. Fire the primary reviewer N=3 times in PARALLEL (Promise.all) — wall-clock = single call
2. Aggregate votes per claim_idx · majority wins
3. On 1-1-1 split → tie-breaker model with different architecture (qwen3-coder:480b vs primary gpt-oss/kimi)
4. Every disagreement (even when majority resolves) → data/_kb/audit_discrepancies.jsonl
// Closes the cloud-non-determinism gap: temp=0 isn't actually deterministic in practice
// across hours; consensus + cross-architecture tie-break stabilizes verdicts.
Auditor cross-lineage — Kimi ↔ Haiku ↔ Opus
Every push to PR #11 triggers auditor/audit.ts within ~90s. To prevent a single model lineage's blind spots from becoming the system's blind spots, audits alternate between Kimi K2.6 (Moonshot) and Haiku 4.5 (Anthropic) by SHA. Diffs over 100k chars auto-promote to Claude Opus 4.7. Per-PR cap of 3 audits with auto-reset on each new head SHA prevents infinite-loop spend. 100% grounding-verified rate on Haiku 4.5 across the latest 10 findings — pairing different lineages + forcing per-finding grounding kills confabulation.
Code: auditor/audit.ts · auditor/checks/inference.ts (N=3) · auditor/checks/kimi_architect.ts · Verdicts: data/_auditor/kimi_verdicts/ — read any 11-<sha>.json to inspect a real audit
Distillation v1.0.0 — the frozen substrate
The substrate the auditor and mode runner sit on is tagged at distillation-v1.0.0 / commit e7636f2. 145 unit tests pass · 22/22 acceptance invariants · 16/16 audit-full checks · bit-identical reproducibility verified. The distillation phase exports clean SFT / RAG / preference samples with a multi-layer contamination firewall; the auditor consumes the substrate. The frozen tag means: any future "the system regressed" question has a baseline to bisect against, byte-for-byte.
Tag: distillation-v1.0.0 · Commit: e7636f2 · Substrate code: scripts/distillation/ · auditor/schemas/distillation/ · Output: data/_kb/distilled_{facts,procedures,config_hints}.jsonl
Chapter 4
Two memory layers — playbook (worker signal) + pathway (system signal)
A CRM stores events. This system turns events into re-ranking signal at two layers. Playbook memory compounds worker-level outcomes (who got endorsed, where, when) into per-query boost. Pathway memory compounds system-level outcomes (which model + corpus + framing actually solved similar problems) into per-task hot-swap. Both are queryable. Both are auditable. Both compound.
Layer 1 — playbook memory (worker + geo signal)
Seed shape
PlaybookEntry {
playbook_id, // pb-seed-<sha8>
operation, // "fill: Welder x2 in Toledo, OH"
approach, context, // short canonical — long strings dilute embedding
timestamp, // RFC3339
endorsed_names[], // validated against workers_500k for city+state
city, state, // parsed from operation
embedding // 768-d nomic-embed-text of text shape
}
Code: crates/vectord/src/playbook_memory.rs (PlaybookEntry, FailureRecord, PlaybookMemoryState)
Boost math (positive + decay + negative)
// For each playbook pb among top-K most cosine-similar:
// given query embedding qv, constant base_weight, n_workers = |pb.endorsed_names|
similarity = cosine(qv, pb.embedding)          // skip if ≤ 0.05
age_days   = (now - pb.timestamp) / 86_400     // seconds → days
decay      = 0.5^(age_days / 30)               // half-life = 30 days
// For each endorsed worker in pb:
key        = (pb.city, pb.state, name)
fail_count = failures[key]                     // # times this worker was marked no-show for same geo
penalty    = 0.5^min(fail_count, 20)
per_worker = similarity × base_weight × decay × penalty / n_workers
boost[key] = min(boost[key] + per_worker, MAX_BOOST_PER_WORKER)
// MAX_BOOST_PER_WORKER = 0.25 — cap stops one popular worker from always winning
Code: crates/vectord/src/playbook_memory.rs::compute_boost_for · constants: MAX_BOOST_PER_WORKER, DEFAULT_TOP_K_PLAYBOOKS, BOOST_HALF_LIFE_DAYS
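The same math as runnable TypeScript — a sketch, not the Rust implementation; the decay is written as 0.5^(age/30), consistent with the 30-day half-life and the BOOST_HALF_LIFE_DAYS constant, and the playbook/failure shapes are simplified:

```typescript
const MAX_BOOST_PER_WORKER = 0.25;

interface Playbook {
  city: string; state: string;
  endorsed: string[];
  ageDays: number;
  similarity: number;  // cosine(qv, pb.embedding), precomputed here
}

function computeBoost(
  playbooks: Playbook[],
  failures: Map<string, number>,   // "city|state|name" → no-show count
  baseWeight: number,
): Map<string, number> {
  const boost = new Map<string, number>();
  for (const pb of playbooks) {
    if (pb.similarity <= 0.05) continue;            // skip weak matches
    const decay = Math.pow(0.5, pb.ageDays / 30);   // half-life 30 days
    for (const name of pb.endorsed) {
      const key = `${pb.city}|${pb.state}|${name}`;
      const failCount = failures.get(key) ?? 0;
      const penalty = Math.pow(0.5, Math.min(failCount, 20));
      const perWorker =
        (pb.similarity * baseWeight * decay * penalty) / pb.endorsed.length;
      boost.set(key,
        Math.min((boost.get(key) ?? 0) + perWorker, MAX_BOOST_PER_WORKER));
    }
  }
  return boost;
}
```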
Application at query time
// In /vectors/hybrid handler (crates/vectord/src/service.rs):
1. SQL filter narrows workers_500k to geo/role/availability
2. Vector index returns top_k × 5 candidates by cosine to question
3. compute_boost_for(qv, k=200) returns boost map
4. For each candidate: parse (name, city, state) from chunk, look up boost, add to score
5. Re-sort sources by boosted score
6. Truncate to requested top_k, return with playbook_boost and playbook_citations
Why k=200. Direct measurement showed cosine similarity clusters in the 0.55-0.67 band across all playbooks regardless of geo (nomic-embed-text has narrow discrimination on this kind of structured operation text). A k of 25 silently missed geo-matched playbooks. k=200 is the measured floor for reliably catching compounding. Brute-force over 200 × 768-d is sub-ms even on this hardware.
Evidence: Chicago Electrician compounding test 2026-04-20 — Carmen Green, Anna Patel, Fatima Wilson went from rank >5 / boost 0 / 0 citations (run 0, no seed) to rank 1/2/3 / boost +0.250 (capped) / 3 citations each (run 3, after 3 identical seeds). Each seed increments citations; total boost caps at 0.25/worker.
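Steps 3 through 6 of the query-time flow amount to a boost-then-resort pass. A sketch with a simplified candidate shape (the real handler parses (name, city, state) out of each chunk):

```typescript
// Apply a boost map (as returned by compute_boost_for) to vector
// candidates, re-sort by boosted score, truncate to top_k.
interface Candidate { name: string; city: string; state: string; score: number }

function applyPlaybookBoost(
  candidates: Candidate[],
  boost: Map<string, number>,
  topK: number,
): Candidate[] {
  return candidates
    .map((c) => ({
      ...c,
      score: c.score + (boost.get(`${c.city}|${c.state}|${c.name}`) ?? 0),
    }))
    .sort((a, b) => b.score - a.score)  // re-sort by boosted score
    .slice(0, topK);                     // truncate to requested top_k
}
```

With the 0.25 cap, a boosted worker can overtake an unboosted one only when the raw cosine gap is smaller than the accumulated boost — which is exactly the compounding the Chicago Electrician test measures.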
Write-through to SQL
successful_playbooks_live is a DataFusion-queryable Parquet surface maintained by POST /vectors/playbook_memory/persist_sql. Every /log from the recruiter UI triggers seed → persist_sql. The in-memory store and the SQL surface stay synchronized (full snapshot on each persist, safe because memory is source of truth).
Code: crates/vectord/src/playbook_memory.rs::persist_to_sql · catalog-registered under "successful_playbooks_live"
Pattern discovery (Path 2 — meta-index)
Beyond "who was endorsed." POST /vectors/playbook_memory/patterns takes a query, finds top-K similar past playbooks, pulls each endorsed worker's full workers_500k profile, and aggregates shared traits: recurring certifications, skill frequencies, modal archetype, reliability distribution. Returns a discovered_pattern string showing operator-actionable signal the user didn't explicitly query for.
Code: crates/vectord/src/playbook_memory.rs::discover_patterns · Surfaces: /vectors/playbook_memory/patterns endpoint, /intelligence/chat response, /intelligence/permit_contracts cards
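The trait aggregation reduces to frequency counting over the endorsed profiles. A sketch — the profile fields here are hypothetical stand-ins for workers_500k columns, and "recurring" is illustratively defined as appearing in more than half the profiles:

```typescript
interface Profile { certifications: string[]; archetype: string }

function sharedTraits(profiles: Profile[]) {
  const certFreq = new Map<string, number>();
  const archFreq = new Map<string, number>();
  for (const p of profiles) {
    for (const c of p.certifications)
      certFreq.set(c, (certFreq.get(c) ?? 0) + 1);
    archFreq.set(p.archetype, (archFreq.get(p.archetype) ?? 0) + 1);
  }
  // Recurring certification = present in a majority of endorsed profiles.
  const recurring = [...certFreq.entries()]
    .filter(([, n]) => n > profiles.length / 2)
    .map(([c]) => c);
  // Modal archetype = most frequent archetype across the group.
  const modalArchetype = [...archFreq.entries()]
    .sort((a, b) => b[1] - a[1])[0]?.[0] ?? "";
  return { recurring, modalArchetype };
}
```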
Layer 2 — pathway memory (system-level hot-swap, ADR-021)
Pathway memory remembers which approach worked, not just which worker. Every accepted scrum review writes a PathwayTrace with the full backtrack: file fingerprint, model used, signal class, KB chunks consulted, observer events, semantic flags, bug fingerprints. A new query that fingerprints to the same trace can hot-swap to the prior result without re-running the 9-rung escalation. The 5-factor hot-swap gate is strict: narrow fingerprint match AND audit consensus pass AND replay_count ≥ 3 (probation) AND success_rate ≥ 0.80 AND NOT retired AND vector cosine ≥ 0.90.
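The gate above can be written as a single predicate — a sketch with illustrative field names; the real struct lives in crates/vectord/src/pathway_memory.rs:

```typescript
interface PathwayGateInput {
  fingerprintMatch: boolean;  // narrow fingerprint match
  auditConsensus: boolean;    // audit consensus pass
  replayCount: number;        // probation: ≥ 3 replays
  successRate: number;        // ≥ 0.80
  retired: boolean;           // sticky retirement flag
  cosine: number;             // vector similarity ≥ 0.90
}

// Hot-swap to the prior result only when every factor clears.
function canHotSwap(t: PathwayGateInput): boolean {
  return (
    t.fingerprintMatch &&
    t.auditConsensus &&
    t.replayCount >= 3 &&
    t.successRate >= 0.8 &&
    !t.retired &&
    t.cosine >= 0.9
  );
}
```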
// Live pathway state (recomputed on page load):
// 88 traces · 11/11 successful replays · 100% reuse rate (as of 2026-04-27 — probation gate crossed)
Code: crates/vectord/src/pathway_memory.rs · Endpoints: /vectors/pathway/insert · /query · /record_replay · /stats · /bug_fingerprints · Spec: docs/DECISIONS.md ADR-021 — Semantic-correctness matrix layer
What both memory layers feed (besides search)
Both layers also feed the per-staffer hot-swap index (Chapter 5) and the Construction Activity Signal Engine (Chapter 6). One memory model, surfaced three different ways at the request boundary depending on who's asking.
Chapter 5
Per-staffer hot-swap — same corpus, different relevance gradient
Maria runs Chicago. Devon runs Indianapolis. Aisha runs Wisconsin/Michigan. They share one corpus, but the search results, the recurring-skill patterns, and the playbook context all reshape to whoever is acting. Same query "forklift operators" returns 89 IN workers when Devon's acting, 16 WI when Aisha's, 167 IL when Maria's. The MEMORY panel relabels itself with the active coordinator's name.
What scopes per staffer
// On every /intelligence/chat call:
if (b.staffer_id) {
const staffer = lookupStaffer(b.staffer_id);
// 1. Default state filter to staffer territory unless caller pinned one
if (!explicitState) filters.push(`state = '${staffer.territory.state}'`);
// 2. Default playbook-pattern geo to staffer's primary city/state
cityForPatterns = staffer.territory.cities[0];
stateForPatterns = staffer.territory.state;
// 3. Surface staffer.name back so the UI can relabel MEMORY → MARIA'S MEMORY
response.staffer = { id, name, territory };
}
The corpus stays intact. The relevance gradient is per coordinator. As each accumulates fills, their slice of the playbook compounds independently. The architecture generalizes — every new metro adds territories, not code paths.
Code: mcp-server/index.ts STAFFERS roster + lookupStaffer() · /staffers endpoint · /intelligence/chat smart_search route · UI: staffer dropdown in mcp-server/search.html
Chapter 6
Construction Activity Signal Engine — the corpus is also a market signal
Every contractor in this corpus is also a forward indicator on the public equities they touch. Permits filed today predict construction starts ~45 days out, staffing ~30, revenue recognition months later. The associated-ticker network surfaces this signal before any 10-Q. The architecture is metro-agnostic — Chicago is Phase 1; NYC DOB, LA County, Houston BCD, Boston ISD ship as Socrata-shaped adapters.
Three flavors of attribution
// per contractor in /intelligence/profiler_index:
direct // contractor IS a public issuer → SEC tickers index match
parent // curated KNOWN_PARENT_MAP — Turner → HOC.DE via Hochtief AG
associated // co-permit network — Bob's Electric appears with TARGET CORPORATION
// 3+ times → inherits TGT as an associated indicator
The associated path is the moat. A staffing-permit dataset that maps contractor-to-public-issuer is not commercially available; we synthesize it from the Socrata co-occurrence graph. Every additional metro multiplies edges.
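The associated rule is a threshold over the co-occurrence graph. A sketch with illustrative co-permit records — the real graph comes from Socrata permit rows:

```typescript
// A contractor that co-appears on permits with a public issuer
// 3+ times inherits that issuer's ticker as an associated indicator.
function associatedTickers(
  coPermits: Array<{ contractor: string; issuer: string; ticker: string }>,
  minCoOccurrences = 3,
): Map<string, string[]> {
  const counts = new Map<string, number>();
  for (const { contractor, issuer } of coPermits) {
    const k = `${contractor}→${issuer}`;
    counts.set(k, (counts.get(k) ?? 0) + 1);
  }
  const out = new Map<string, string[]>();
  for (const { contractor, issuer, ticker } of coPermits) {
    if ((counts.get(`${contractor}→${issuer}`) ?? 0) >= minCoOccurrences) {
      const list = out.get(contractor) ?? [];
      if (!list.includes(ticker)) list.push(ticker);  // dedupe
      out.set(contractor, list);
    }
  }
  return out;
}
```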
Building Activity Index (BAI)
// BAI = attribution-weighted average day-change across surfaced issuers:
BAI = Σ (day_change_pct × attribution_count) / Σ attribution_count
// Indexed build value = total $ of permits attributable to ANY public issuer
// Network depth = issuers / total attribution edges
Run BAI daily, save the series, and you've got a backtestable thesis in months. Today's surface is Chicago-only with ~9 issuers; the curve scales linearly with metros added — and the marginal cost of a new metro is one Socrata adapter.
Code: mcp-server/index.ts /intelligence/profiler_index + /intelligence/ticker_quotes · entity.ts lookupTickerLite() · fetchStooqQuote() · UI: /profiler · Data sources: SEC company_tickers.json (in-memory index) + Stooq CSV API + curated parent-link map
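The BAI formula, directly as code — issuer fields are a simplified view of what the profiler index surfaces:

```typescript
// BAI = attribution-weighted average day-change across surfaced issuers.
interface IssuerSignal { dayChangePct: number; attributionCount: number }

function buildingActivityIndex(issuers: IssuerSignal[]): number {
  let num = 0, den = 0;
  for (const i of issuers) {
    num += i.dayChangePct * i.attributionCount;  // Σ change × weight
    den += i.attributionCount;                   // Σ weight
  }
  return den === 0 ? 0 : num / den;
}
```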
Chapter 7
Key architectural choices — what was picked and why
Each choice is documented in docs/DECISIONS.md (Architecture Decision Records). If you dispute any of these, the ADR names the alternatives we rejected and the measurement that drove the call.
ADR-001 · Object storage as source of truth
No traditional database. All data is Parquet on S3-compatible object storage. Eliminates DB operational overhead; every engine can read Parquet.
ADR-008 · Embeddings stored as Parquet, not a vector DB
Keeps all data in one portable format. No Pinecone/Weaviate/Qdrant lock-in. Trade-off: brute-force search up to ~100K; HNSW beyond.
ADR-012 · Append-only event journal — never destroy evidence
Every mutation is appended. Compliance, audit, AI-decision forensics. Impossible to retrofit; easy to add now.
ADR-015 · Tool registry before raw SQL for agents
Named, governed, audited actions for agents. Permission checks, rate limits, parameter validation. MCP-compatible.
ADR-019 · Hybrid Parquet+HNSW ⊕ Lance vector backend
Parquet+HNSW primary (2.55× faster search at 100K). Lance secondary for index-build speed (14× faster), random fetch (112× faster), append (structural). Per-profile vector_backend: Parquet | Lance.
ADR-020 · Idempotent register() with schema-fingerprint gate
Same (name, fingerprint) reuses manifest. Different fingerprint = 409 Conflict. Prevents silent duplicate manifests. Cleanup run collapsed 374 → 31 datasets.
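The gate logic can be sketched as a map keyed by dataset name — manifest and fingerprint shapes are simplified; the real implementation is in catalogd (ADR-020):

```typescript
type RegisterResult =
  | { status: 200; manifestId: string }  // idempotent reuse
  | { status: 201; manifestId: string }  // new manifest created
  | { status: 409 };                     // fingerprint conflict

function makeRegistry() {
  const byName = new Map<string, { fingerprint: string; manifestId: string }>();
  let nextId = 1;
  return function register(name: string, fingerprint: string): RegisterResult {
    const existing = byName.get(name);
    if (existing) {
      if (existing.fingerprint === fingerprint)
        return { status: 200, manifestId: existing.manifestId };
      return { status: 409 };  // same name, different schema → conflict
    }
    const manifestId = `m-${nextId++}`;
    byName.set(name, { fingerprint, manifestId });
    return { status: 201, manifestId };
  };
}
```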
ADR-021 · Semantic-correctness matrix layer
Pathway memory carries semantic flags (UnitMismatch, TypeConfusion, OffByOne, StaleReference, DeadCode, BoundaryViolation, …) on every trace. New reviews see prior bug fingerprints as a preamble; recurrent classes get caught on first read. Compounds across files in the same crate.
Phase 19 design note · Statistical + semantic, not neural
Meta-index is cosine similarity + endorsement aggregation. No model training. Rebuildable from successful_playbooks alone. Neural re-ranker deferred to Phase 20+ only if statistical floor plateaus.
Distillation freeze · v1.0.0 at e7636f2
145 tests · 22/22 acceptance · 16/16 audit-full · bit-identical reproducibility. Multi-layer contamination firewall on SFT exports. Substrate the auditor + mode runner sit on; "the system regressed" questions bisect against this anchor.
Chapter 8
Measured at scale, on this machine
Hardware: i9 + 128GB RAM + Nvidia A4000 16GB VRAM + 2.5GB symmetric. Numbers below are from this running instance. Refresh the page and they'll recompute.
Chapter 9
Verify or dispute — reproduce it yourself
Every claim above is a curl away from falsification.
Gateway health. Returns provider matrix + worker count.
curl -s http://localhost:3100/v1/health | jq
Any SQL on multi-million-row Parquet. Sub-100ms typical.
curl -s -X POST http://localhost:3100/query/sql \
-H 'Content-Type: application/json' \
  -d '{"sql":"SELECT role, COUNT(*) FROM workers_500k WHERE state = '\''IL'\'' GROUP BY role LIMIT 5"}'
Hybrid search with playbook boost. SQL filter + vector rerank + playbook memory in one call.
curl -s -X POST http://localhost:3100/vectors/hybrid \
-H 'Content-Type: application/json' \
-d '{"index_name":"workers_500k_v1",
"sql_filter":"role = '\''Forklift Operator'\'' AND city = '\''Chicago'\'' AND CAST(availability AS DOUBLE) > 0.5",
"question":"reliable forklift operator",
"top_k":5,"use_playbook_memory":true,"playbook_memory_k":200}'
Pathway memory stats. System-level hot-swap signal — should show 88 traces / 11 replays / 100% reuse rate (probation gate crossed).
curl -s http://localhost:3100/vectors/pathway/stats | jq
Per-staffer scoping. Same query, different rosters per coordinator.
for s in maria devon aisha; do
curl -s -X POST http://localhost:3700/intelligence/chat \
-H 'Content-Type: application/json' \
-d "{\"message\":\"forklift operators\",\"staffer_id\":\"$s\"}" \
| jq -r ".staffer.name + \": \" + (.sql_results | length | tostring) + \" workers, top: \" + (.sql_results[0].name + \" in \" + .sql_results[0].city + \", \" + .sql_results[0].state)"
done
# Maria: 167 workers, top: ... in Chicago, IL
# Devon: 89 workers, top: ... in Fort Wayne, IN
# Aisha: 16 workers, top: ... in Milwaukee, WI
Late-worker triage in one shot. Pulls profile + 5 backfills + drafts SMS. Should respond in under 300ms.
curl -s -X POST http://localhost:3700/intelligence/chat \
-H 'Content-Type: application/json' \
-d '{"message":"Marcus running late site 4422"}' | jq
Construction Activity Signal Engine. Profiler index with attribution, cost, last filed.
curl -s -X POST http://localhost:3700/intelligence/profiler_index \
-H 'Content-Type: application/json' \
-d '{"limit":10}' \
| jq '.contractors[] | {name, permits, total_cost, direct: (.tickers.direct | map(.ticker)), associated: (.tickers.associated | map(.ticker + " ←via " + .partner_name))}'
Live ticker quotes. Batch Stooq pull for the basket.
curl -s -X POST http://localhost:3700/intelligence/ticker_quotes \
-H 'Content-Type: application/json' \
-d '{"tickers":["TGT","JPM","BALY","WBA","MCD"]}' | jq .quotes
Audit trail — read any verdict on PR #11. Independent claim-vs-diff verifier output.
ls /home/profit/lakehouse/data/_auditor/kimi_verdicts/
# 11-c3c9c2174a91.json 11-ca7375ea2b17.json 11-2d9cb128bf42.json …
jq '.findings[0:3]' /home/profit/lakehouse/data/_auditor/kimi_verdicts/11-c3c9c2174a91.json
Distillation acceptance gate. 22/22 invariants must pass for any commit that touches the substrate.
cd /home/profit/lakehouse
bun test auditor/schemas/distillation/ tests/distillation/
# Expect: 145 pass · 0 fail · 372 expect() calls
Chapter 10
What we are not claiming
Every impressive-sounding number comes with a footnote. Here are the honest limits as of 2026-04-27.
workers_500k is synthetic.
Real client ATS export replaces this table. Schema is deliberately identical to a production ATS so the swap is config, not code.
candidates table is light at 1,000 rows.
Intentionally small. Live PII-safe view layer is built; replacing the small table with a 100K+ ATS is a one-line config flip.
Chicago permit data is real.
Pulled live from data.cityofchicago.org/resource/ydr8-5enu.json (Socrata). Not synthetic. Not cached. Verifiable address-by-address.
Playbook memory is seeded from demo runs.
Same code path that seeds in production: every /log from the recruiter UI triggers seed → persist_sql. Demo seeds use the same shape as live operations.
Pathway memory probation gate is crossed.
88 traces, 11 replays, 11 successful, 100% reuse rate. Any pathway that fails to clear ≥0.80 success_rate after ≥3 replays gets retired automatically (sticky flag prevents oscillation).
SEC name-to-ticker fuzzy matcher has rare false positives.
For names with no clean SEC match the matcher occasionally surfaces a same-keyword small-cap (saw FLG attach to a PNC-adjacent contractor once). Kept conservative — minimum 2 non-stopword overlap. Tightenable to require explicit allow-list for production trading use.
12 awaiting public-data sources are placeholders.
DOL Wage & Hour, EPA ECHO, MSHA, BBB, PACER, UCC liens, D&B, etc. — listed by name on every contractor profile with a one-line "would show:" sample. Not yet wired. Each ships as a Socrata-style adapter; engineering scope is concrete.
No rate/margin awareness yet.
Worker pay expectations vs contract bill rates are not modeled. Flagged as a Phase 20 item; no architectural blocker.
BAI is a thesis, not a backtested signal.
The Building Activity Index is computed live from current attribution + day-change. To have a backtestable thesis we need the daily series saved over months. Architectural support is there (data/_kb/audit_baselines.jsonl pattern); just hasn't been running long enough.
Single-metro today.
Chicago via Socrata. NYC DOB, LA County, Houston BCD, Boston ISD, DC DCRA all use Socrata-equivalent APIs — adapters are config-only. Each new metro multiplies the network without multiplying the codebase.