New cmd/mcpd binary using github.com/modelcontextprotocol/go-sdk
v1.5.0 over stdio transport. Exposes Lakehouse capabilities as MCP
tools: list_datasets, get_manifest, query_sql, embed_text,
search_vectors. Each tool proxies to the gateway via HTTP.
Replaces the MCP-tool subset of the Rust system's Bun mcp-server.ts
(the audit's "split this 2520-line empire" finding from R-005). HTTP
demo routes (the staffing co-pilot UI at /api/intelligence/*,
/headshots/*, etc.) stay Bun until G5 cutover — those are demo-
specific and depend on matrix-indexer signals not yet ported.
Architecture:
  cmd/mcpd/main.go (235 LoC)
    main() reads --gateway flag, builds server via buildServer(),
    runs on StdioTransport. Each tool's args is a typed struct with
    jsonschema tags (the SDK's canonical pattern); reflection
    generates the JSON Schema automatically.
    gatewayClient: thin HTTP wrapper over the configured gateway URL.
    30s per-request timeout. 16 MiB tool-response cap. Non-2xx
    surfaces as IsError CallToolResult (NOT as transport error) so
    the LLM caller sees the error text and can decide how to react.
    proxy() handles GET + POST + JSON body uniformly. errorResult()
    + jsonResult() helpers normalize CallToolResult shape.
  cmd/mcpd/main_test.go (13 test funcs)
    Tests the full MCP wire end-to-end without a subprocess: spin
    up a fake gateway via httptest, build the MCP server pointed at
    it, connect a client via in-memory transports (NewInMemoryTransports),
    call each tool. Each tool gets:
      - happy path (gateway returns 200 → tool returns content)
      - input validation (missing required fields → IsError)
      - upstream error (gateway 4xx → tool returns IsError)
    Plus TestListTools verifies all 5 tools register; TestGatewayUnreachable
    verifies network-level failures surface as IsError, not panics.
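The gatewayClient contract described above (30s timeout, 16 MiB cap,
non-2xx and network failures reported as IsError rather than as a Go
error) can be sketched with the standard library alone. This is a
minimal illustration, not the code in cmd/mcpd/main.go: toolResult is a
hypothetical stand-in for the SDK's CallToolResult, and the /ok and
/missing routes exist only in the fake gateway.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// toolResult is a hypothetical stand-in for the SDK's CallToolResult;
// only the two fields this sketch needs are modeled.
type toolResult struct {
	Text    string
	IsError bool
}

const maxResponseBytes = 16 << 20 // 16 MiB tool-response cap

// gatewayClient is a thin wrapper over the configured gateway URL.
type gatewayClient struct {
	base string
	http *http.Client
}

func newGatewayClient(base string) *gatewayClient {
	return &gatewayClient{base: base, http: &http.Client{Timeout: 30 * time.Second}}
}

// proxy forwards one request and never returns a Go error: every failure
// becomes an IsError result so the caller can read the message.
func (g *gatewayClient) proxy(ctx context.Context, method, path string, body []byte) toolResult {
	var rdr io.Reader
	if body != nil {
		rdr = bytes.NewReader(body)
	}
	req, err := http.NewRequestWithContext(ctx, method, g.base+path, rdr)
	if err != nil {
		return toolResult{Text: "build request: " + err.Error(), IsError: true}
	}
	if body != nil {
		req.Header.Set("Content-Type", "application/json")
	}
	resp, err := g.http.Do(req)
	if err != nil {
		return toolResult{Text: "gateway unreachable: " + err.Error(), IsError: true}
	}
	defer resp.Body.Close()
	// Read one byte past the cap so an oversized body is detectable.
	data, err := io.ReadAll(io.LimitReader(resp.Body, maxResponseBytes+1))
	if err != nil {
		return toolResult{Text: "read response: " + err.Error(), IsError: true}
	}
	if len(data) > maxResponseBytes {
		return toolResult{Text: "gateway response exceeds 16 MiB cap", IsError: true}
	}
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return toolResult{Text: fmt.Sprintf("gateway %d: %s", resp.StatusCode, data), IsError: true}
	}
	return toolResult{Text: string(data)}
}

// demo runs proxy against a fake gateway: one 200 route, everything else 404.
func demo() (ok, missing toolResult) {
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/ok" {
			fmt.Fprint(w, `{"datasets":[]}`)
			return
		}
		http.Error(w, "no such route", http.StatusNotFound)
	}))
	defer fake.Close()
	g := newGatewayClient(fake.URL)
	return g.proxy(context.Background(), http.MethodGet, "/ok", nil),
		g.proxy(context.Background(), http.MethodGet, "/missing", nil)
}

func main() {
	ok, missing := demo()
	fmt.Printf("ok: isError=%v text=%s\n", ok.IsError, ok.Text)
	fmt.Printf("missing: isError=%v\n", missing.IsError)
}
```

Returning a result value instead of an error keeps the failure text
inside the tool response, which is the behavior the test cases above
check for.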
Setup for Claude Desktop / Code documented in README:
  {
    "mcpServers": {
      "lakehouse": {
        "command": "/path/to/bin/mcpd",
        "args": ["--gateway", "http://127.0.0.1:3110"]
      }
    }
  }
Verified:
go test -count=1 ./cmd/mcpd/ — 13/13 green
just verify — vet + test + 9 smokes 35s
Out of scope for this commit:
  - Resources (Server.AddResource): not needed yet; tools cover the
    interactive surface. Add when an LLM-side use case shows up.
  - Prompts (Server.AddPrompt): same.
  - Streamable transports (HTTP, SSE): stdio is the universal one;
    streamable HTTP can be added by serving the same server through
    mcp.NewStreamableHTTPHandler instead of srv.Run on StdioTransport,
    if a daemon-mode deploy makes sense.
  - mcpd inside the daemon-supervised stack: it's stdio-only and
    spawned by the MCP client, not run as a service. Adding a
    daemon mode (HTTP transport on a port) is a follow-up if MCP
    consumers want long-lived sessions.
This is a tool-surface-only port. The Bun mcp-server.ts also serves
HTTP demo routes (/api/catalog/datasets, /intelligence/*, /headshots/*)
that depend on the matrix-indexer signals from the Rust system; those
stay Bun until G5 cutover when the staffing co-pilot service ports
to Go.
Direct deps added:
github.com/modelcontextprotocol/go-sdk v1.5.0
Transitive (resolved by go mod tidy):
github.com/google/jsonschema-go v0.4.2
github.com/yosida95/uritemplate/v3 v3.0.2
golang.org/x/oauth2 v0.35.0
github.com/segmentio/encoding v0.5.4
github.com/golang-jwt/jwt/v5 v5.3.1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
137 lines
5.0 KiB
Markdown
# golangLAKEHOUSE

Go reimplementation of the Lakehouse, a versioned knowledge
substrate for staffing analytics + local AI workloads.

## Status

**Phase G0 complete + G1/G1P/G2 shipped.** Six binaries plus a
seventh (vectord) and an eighth (embedd) on top, fronted by a
single gateway. Acceptance smokes green for D1-D6 + G1 + G1P + G2.

End-to-end staffing co-pilot pipeline functional through the
gateway:

```
text → /v1/embed → /v1/vectors/index/<name>/add
text → /v1/embed → /v1/vectors/index/<name>/search → top-K hits
```

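As a rough illustration of the search pipeline, the sketch below chains the two gateway calls from Go. Only the route paths come from this README; the JSON field names (`text`, `vector`, `k`, `hits`, `id`, `score`), the index name `demo`, and the canned responses of the fake gateway are assumptions made for the example, so check the real wire format before reusing it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Hypothetical wire shapes; the real field names may differ.
type embedReq struct {
	Text string `json:"text"`
}
type embedResp struct {
	Vector []float32 `json:"vector"`
}
type searchReq struct {
	Vector []float32 `json:"vector"`
	K      int       `json:"k"`
}
type hit struct {
	ID    string  `json:"id"`
	Score float32 `json:"score"`
}
type searchResp struct {
	Hits []hit `json:"hits"`
}

// postJSON sends in as JSON to url and decodes a 2xx JSON response into out.
func postJSON(url string, in, out any) error {
	body, err := json.Marshal(in)
	if err != nil {
		return err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("%s: status %d", url, resp.StatusCode)
	}
	return json.NewDecoder(resp.Body).Decode(out)
}

// search embeds text, then queries the named index for the top-k hits.
func search(gateway, index, text string, k int) ([]hit, error) {
	var e embedResp
	if err := postJSON(gateway+"/v1/embed", embedReq{Text: text}, &e); err != nil {
		return nil, err
	}
	var s searchResp
	err := postJSON(gateway+"/v1/vectors/index/"+index+"/search",
		searchReq{Vector: e.Vector, K: k}, &s)
	return s.Hits, err
}

// demo runs search against a fake gateway that returns canned responses.
func demo() ([]hit, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/embed", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(embedResp{Vector: []float32{0.1, 0.2}})
	})
	mux.HandleFunc("/v1/vectors/index/demo/search", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(searchResp{Hits: []hit{{ID: "doc-1", Score: 0.9}}})
	})
	fake := httptest.NewServer(mux)
	defer fake.Close()
	return search(fake.URL, "demo", "forklift operator", 1)
}

func main() {
	hits, err := demo()
	fmt.Println(hits, err)
}
```

Pointing `search` at `http://127.0.0.1:3110` instead of the fake server would exercise the real gateway, provided the assumed field names match.
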
Plus the SQL path:
```
CSV → /v1/ingest (parses, writes Parquet via storaged, registers
                  manifest with catalogd)
SQL → /v1/sql    (DuckDB over the registered Parquets via httpfs)
```

See `docs/PHASE_G0_KICKOFF.md` for the day-by-day record (D1-D6 +
real-scale validation + G1/G1P/G2 pointer at the bottom).

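A query through the SQL path might look like the following from Go. The `/v1/sql` route is taken from the diagram above; the `{"sql": ...}` request body, the response shape, and the SELECT-only rule in the fake gateway are all assumptions for the sake of a runnable sketch.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// sqlRequest is a hypothetical request shape; the real field name may differ.
type sqlRequest struct {
	SQL string `json:"sql"`
}

// querySQL posts one statement to the gateway's /v1/sql route and returns
// the raw response body, or an error for transport failures and non-2xx.
func querySQL(gateway, stmt string) (string, error) {
	body, err := json.Marshal(sqlRequest{SQL: stmt})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(gateway+"/v1/sql", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return "", fmt.Errorf("gateway %d: %s", resp.StatusCode, bytes.TrimSpace(data))
	}
	return string(data), nil
}

// demo runs querySQL against a fake gateway that accepts only SELECT
// statements and answers with a canned result (both invented here).
func demo(stmt string) (string, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/sql", func(w http.ResponseWriter, r *http.Request) {
		var q sqlRequest
		if err := json.NewDecoder(r.Body).Decode(&q); err != nil ||
			!strings.HasPrefix(strings.ToUpper(strings.TrimSpace(q.SQL)), "SELECT") {
			http.Error(w, "only SELECT is accepted", http.StatusBadRequest)
			return
		}
		fmt.Fprint(w, `{"rows":[[1]]}`)
	})
	fake := httptest.NewServer(mux)
	defer fake.Close()
	return querySQL(fake.URL, stmt)
}

func main() {
	out, err := demo("SELECT 1")
	fmt.Println(out, err)
}
```
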
## Service inventory

| Bin | Port | Role |
|---|---|---|
| `gateway` | 3110 | Reverse proxy fronting all backing services |
| `storaged` | 3211 | Object I/O over S3 (MinIO in dev) |
| `catalogd` | 3212 | Parquet manifest registry, ADR-020 idempotency |
| `ingestd` | 3213 | CSV → Parquet → register loop |
| `queryd` | 3214 | DuckDB SELECT over registered Parquets via httpfs |
| `vectord` | 3215 | HNSW vector search (+ optional persistence to storaged) |
| `embedd` | 3216 | Text → vector via Ollama (default `nomic-embed-text` 768-d) |
| `mcpd` | stdio | Model Context Protocol server (Claude Desktop / Code consumers) |

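To check which of the HTTP services in the table are up, a small probe can walk the ports. The sketch assumes every service answers `GET /health` on localhost with a 2xx status (the D1 smoke mentions a `/health` route, but its exact behavior per service is not stated here); `mcpd` is left out because it speaks stdio only.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// ports mirrors the HTTP rows of the service inventory table.
var ports = map[string]int{
	"gateway": 3110, "storaged": 3211, "catalogd": 3212, "ingestd": 3213,
	"queryd": 3214, "vectord": 3215, "embedd": 3216,
}

var client = &http.Client{Timeout: 2 * time.Second}

// healthy reports whether GET base+"/health" answers with a 2xx status.
func healthy(base string) bool {
	resp, err := client.Get(base + "/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode <= 299
}

// demo probes a fake service while it is up and again after it is closed.
func demo() (up, down bool) {
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/health" {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	up = healthy(fake.URL)
	fake.Close()
	down = healthy(fake.URL)
	return up, down
}

func main() {
	up, down := demo()
	fmt.Printf("fake service: up=%v, after close=%v\n", up, down)
	// Probe the real ports; services that are not running report false.
	for name, port := range ports {
		fmt.Printf("%-9s %v\n", name, healthy(fmt.Sprintf("http://127.0.0.1:%d", port)))
	}
}
```
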
## MCP server

`bin/mcpd` exposes Lakehouse capabilities as MCP tools over stdio:
`list_datasets`, `get_manifest`, `query_sql`, `embed_text`, `search_vectors`.
All tools proxy to the gateway, so the gateway must be up first.

Wire into Claude Desktop / Claude Code by adding to the MCP config:
```json
{
  "mcpServers": {
    "lakehouse": {
      "command": "/path/to/golangLAKEHOUSE/bin/mcpd",
      "args": ["--gateway", "http://127.0.0.1:3110"]
    }
  }
}
```

Replaces the Bun `mcp-server.ts` MCP-tool surface from the Rust system.
HTTP demo routes (the staffing co-pilot UI) stay Bun until G5.

## Acceptance smokes

```
scripts/d1_smoke.sh    # 5-binary skeleton + chi /health + gateway proxy probes
scripts/d2_smoke.sh    # storaged GET/PUT/LIST/DELETE + 256 MiB cap + concurrency cap
scripts/d3_smoke.sh    # catalogd register/manifest/list + rehydrate-across-restart
scripts/d4_smoke.sh    # ingestd CSV → Parquet round-trip + schema-drift 409
scripts/d5_smoke.sh    # queryd DuckDB SELECT through httpfs over MinIO
scripts/d6_smoke.sh    # full ingest → query through gateway only
scripts/g1_smoke.sh    # vectord HNSW recall + dim mismatch + duplicate-create 409
scripts/g1p_smoke.sh   # vectord state survives kill+restart via storaged
scripts/g2_smoke.sh    # embed → vectord add → search round-trip
```

Or run the full gate via the task runner (see below):
```
just verify    # vet + tests + 9 smokes; ~33s wall
```

## Task runner

```
just                  # show available recipes
just verify           # full Sprint 0 gate (vet + tests + 9 smokes)
just smoke <day>      # single smoke (d1..d6, g1, g1p, g2)
just doctor           # check cold-start deps; --json for CI
just install-hooks    # install pre-push hook that runs just verify
```

After a fresh clone, run `just install-hooks` once so `git push` is
gated on the same green chain that ran here. Hook lives in
`.git/hooks/pre-push` (not tracked; recreated by the recipe).

## Cold-start dependencies

- Go 1.25+ at `/usr/local/go/bin` (arrow-go pulled the 1.25 floor)
- `gcc` + `libc-dev` for the DuckDB cgo binding (ADR-001 §1.1)
- `just` task runner (`apt install just` on Debian 13+)
- MinIO running on `:9000` with bucket `lakehouse-go-primary`
- Ollama running on `:11434` with `nomic-embed-text` loaded (G2)
- `/etc/lakehouse/secrets-go.toml` with `[s3.primary]` credentials
  (storaged + queryd both read this)

`just doctor` probes all of the above and reports the fix command
for each missing dep. CI / scripts can use `just doctor --json`.

## Layout

```
docs/        Direction + spec + ADRs + day-by-day
cmd/         One main package per binary
internal/    Shared packages: storeclient, catalogclient,
             secrets, shared, embed, gateway, plus
             per-service implementation packages
scripts/     Smokes + ancillary tooling
```

## Reading order

1. `docs/PRD.md`: what we're building and why
2. `docs/SPEC.md`: how, per-component
3. `docs/DECISIONS.md`: ADRs (ADR-001 foundational)
4. `docs/PHASE_G0_KICKOFF.md`: day-by-day from D1 through G2
5. `docs/RUST_PATHWAY_MEMORY_NOTE.md`: historical reference for the
   Rust era's pathway memory (not migrated, by ADR-001 #5)

## Predecessor

The Rust Lakehouse this rewrite supersedes lives at
`git.agentview.dev/profit/lakehouse`. It remains the live system
serving `devop.live/lakehouse/` until this Go implementation reaches
feature parity per `docs/SPEC.md` §7. Then Rust enters
maintenance-only mode.