root a97881d80c workers corpus + multi-corpus reality test — matrix indexer end-to-end
Lands the second real-data corpus (workers_500k) and the first
multi-corpus reality test through /v1/matrix/search composing both
corpora live.

What's new:
  - scripts/staffing_workers/main.go — parquet driver over
    workers_500k.parquet, multi-chunk arrow handling (workers
    parquet has multiple row groups vs candidates' one). Embed text:
    role + skills + certifications + city + state + archetype +
    resume_text. IDs prefixed "w-".
  - scripts/multi_corpus_e2e.sh — first end-to-end test composing
    both corpora through the matrix indexer.

Real-data multi-corpus result (this commit):
  Query: "Forklift operator with OSHA-30 certification, warehouse
          experience"
  Corpora: workers (5000 rows) + candidates (1000 rows)
  Merged top-8: workers=6, candidates=2

  Top hits:
    w  d=0.327  w-4573  Production Worker
    w  d=0.353  w-1726  Machine Operator
    w  d=0.362  w-3806  Production Worker
    w  d=0.366  w-1000  Machine Operator
    w  d=0.374  w-1436  Assembler
    w  d=0.395  w-162   Machine Operator
    c  d=0.440  c-CAND-00727  C#,.NET,Azure
    c  d=0.446  c-CAND-00031  React,TypeScript,Node

The matrix indexer correctly chose the right domain — manufacturing/
warehouse roles in workers (correct semantic match for the staffing
query) rank ABOVE software-engineer candidates from the candidates
corpus. 0.11 gap between the worst worker (0.395) and the best
candidate (0.440) — clean distance separation.

Compared to the candidates-only e2e run from 0d1553c:
  candidates-only top: c-CAND-00727 at d=0.4404
  multi-corpus top:    w-4573 at d=0.3265 (a Production Worker)

That's the matrix indexer's whole point made visible: composing
domain-distinct corpora surfaces better matches than single-corpus
search. Without workers in the search space, the staffing query
returned software engineers (wrong domain). With workers, it
returns roles in the right ballpark.

What's still imperfect (signal for component 5 + future work):
  - No top-6 worker actually has "Forklift" or "OSHA-30" visible in
    metadata; "Production Worker" is semantically nearest in this
    sample. Likely needs a larger workers ingest (5000 from 500K)
    or skill-keyword boost.
  - Status/availability still not gated. The staffing-side
    structured filtering gap from 0d1553c persists; relevance filter
    (CODE-aware) doesn't address it.

Pipeline timings:
  workers ingest: 5000 rows / 19.2s = 260/sec end-to-end
  candidates ingest: 1000 rows / 3.1s = 322/sec
  multi-corpus query (text → embed → 2 parallel vectord → merge): 14ms

14-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:22:16 -05:00

golangLAKEHOUSE

Go reimplementation of the Lakehouse — a versioned knowledge substrate for staffing analytics + local AI workloads.

Status

Phase G0 complete + G1/G1P/G2 shipped. Six binaries plus a seventh (vectord) and an eighth (embedd) on top, fronted by a single gateway. Acceptance smokes green for D1-D6 + G1 + G1P + G2.

End-to-end staffing co-pilot pipeline functional through the gateway:

text → /v1/embed → /v1/vectors/index/<name>/add
text → /v1/embed → /v1/vectors/index/<name>/search → top-K hits

Plus the SQL path:

CSV  → /v1/ingest    (parses, writes Parquet via storaged, registers
                      manifest with catalogd)
SQL  → /v1/sql       (DuckDB over the registered Parquets via httpfs)

See docs/PHASE_G0_KICKOFF.md for the day-by-day record (D1-D6 + real-scale validation + G1/G1P/G2 pointer at the bottom).

Service inventory

Bin Port Role
gateway 3110 Reverse proxy fronting all backing services
storaged 3211 Object I/O over S3 (MinIO in dev)
catalogd 3212 Parquet manifest registry, ADR-020 idempotency
ingestd 3213 CSV → Parquet → register loop
queryd 3214 DuckDB SELECT over registered Parquets via httpfs
vectord 3215 HNSW vector search (+ optional persistence to storaged)
embedd 3216 Text → vector via Ollama (default nomic-embed-text 768-d)
mcpd stdio Model Context Protocol server (Claude Desktop / Code consumers)

MCP server

bin/mcpd exposes Lakehouse capabilities as MCP tools over stdio: list_datasets, get_manifest, query_sql, embed_text, search_vectors. All tools proxy to the gateway, so the gateway must be up first.

Wire into Claude Desktop / Claude Code by adding to the MCP config:

{
  "mcpServers": {
    "lakehouse": {
      "command": "/path/to/golangLAKEHOUSE/bin/mcpd",
      "args": ["--gateway", "http://127.0.0.1:3110"]
    }
  }
}

Replaces the Bun mcp-server.ts MCP-tool surface from the Rust system. HTTP demo routes (the staffing co-pilot UI) stay Bun until G5.

Acceptance smokes

scripts/d1_smoke.sh   # 5-binary skeleton + chi /health + gateway proxy probes
scripts/d2_smoke.sh   # storaged GET/PUT/LIST/DELETE + 256 MiB cap + concurrency cap
scripts/d3_smoke.sh   # catalogd register/manifest/list + rehydrate-across-restart
scripts/d4_smoke.sh   # ingestd CSV → Parquet round-trip + schema-drift 409
scripts/d5_smoke.sh   # queryd DuckDB SELECT through httpfs over MinIO
scripts/d6_smoke.sh   # full ingest → query through gateway only
scripts/g1_smoke.sh   # vectord HNSW recall + dim mismatch + duplicate-create 409
scripts/g1p_smoke.sh  # vectord state survives kill+restart via storaged
scripts/g2_smoke.sh   # embed → vectord add → search round-trip

Or run the full gate via the task runner (see below):

just verify     # vet + tests + 9 smokes; ~33s wall

Task runner

just                 # show available recipes
just verify          # full Sprint 0 gate (vet + tests + 9 smokes)
just smoke <day>     # single smoke (d1..d6, g1, g1p, g2)
just doctor          # check cold-start deps; --json for CI
just install-hooks   # install pre-push hook that runs just verify

After a fresh clone, run just install-hooks once so git push is gated on the same green chain that ran here. Hook lives in .git/hooks/pre-push (not tracked; recreated by the recipe).

Cold-start dependencies

  • Go 1.25+ at /usr/local/go/bin (arrow-go pulled the 1.25 floor)
  • gcc + libc-dev for the DuckDB cgo binding (ADR-001 §1.1)
  • just task runner (apt install just on Debian 13+)
  • MinIO running on :9000 with bucket lakehouse-go-primary
  • Ollama running on :11434 with nomic-embed-text loaded (G2)
  • /etc/lakehouse/secrets-go.toml with [s3.primary] credentials (storaged + queryd both read this)

just doctor probes all of the above and reports the fix command for each missing dep. CI / scripts can use just doctor --json.

Layout

docs/                         Direction + spec + ADRs + day-by-day
cmd/                          One main package per binary
internal/                     Shared packages — storeclient, catalogclient,
                                secrets, shared, embed, gateway, plus
                                per-service implementation packages
scripts/                      Smokes + ancillary tooling

Reading order

  1. docs/PRD.md — what we're building and why
  2. docs/SPEC.md — how, per-component
  3. docs/DECISIONS.md — ADRs (ADR-001 foundational)
  4. docs/PHASE_G0_KICKOFF.md — day-by-day from D1 through G2
  5. docs/RUST_PATHWAY_MEMORY_NOTE.md — historical reference for the Rust era's pathway memory (not migrated, by ADR-001 #5)

Predecessor

The Rust Lakehouse this rewrite supersedes lives at git.agentview.dev/profit/lakehouse. It remains the live system serving devop.live/lakehouse/ until this Go implementation reaches feature parity per docs/SPEC.md §7. Then Rust enters maintenance-only mode.

Description
Go reimplementation of the Lakehouse — versioned knowledge substrate for staffing analytics + local AI workloads
Readme 3.2 MiB
Languages
Go 79.4%
Shell 20.1%
Just 0.3%
Dockerfile 0.2%