Architectures

A curated set of production reference architectures from regulated-industry work. Each card names the constraint that mattered, the decision I owned, the tradeoff that came with it, and what I'd change today. Cards link to fuller writeups under /work and decision logs under /decisions.

Architecture · 01

Modernization Factory — an agentic COBOL→AWS pipeline

Problem

Turn mainframe JCL + COBOL + copybooks into AWS-native artifacts (TypeScript Lambda, Step Functions, Aurora DDL, parity tests) repeatably — for one program or a hundred — without a wrong architectural decision early in the run cascading into a whole run of wrong code.

Decision

An 8-stage async pipeline (built as an MCP server) where each stage runs against a deliberately chosen Claude tier instead of one model for everything: Haiku gates and triages, Sonnet does the bulk of the structured extraction and codegen, Opus with extended thinking handles only the one architectural-routing stage (Lambda / Aurora SQL / DynamoDB / skip) where deliberation changes the outcome, and Opus again designs the parity tests. Human-in-the-loop by default: the pipeline pauses for schema review before it writes any code. Each stage takes and returns typed Pydantic models and writes its artifact to disk immediately, so the whole run can be checkpointed and resumed.
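A minimal sketch of the tiering and checkpoint/resume idea, not the pipeline's actual code. The stage names, tier labels, file layout, and stub stages are all assumptions made for illustration (the source names the tiers and the routing stage, but not the full list of eight stages), and plain dicts stand in for the typed Pydantic models.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stage names and tier labels, in execution order.
STAGE_TIERS = {
    "triage": "haiku",
    "extract_schema": "sonnet",
    "layer_decision": "opus+extended-thinking",
    "codegen": "sonnet",
    "parity_tests": "opus",
}


def make_stub_stage(name):
    """Build a placeholder stage; a real stage would call the model for its tier."""
    def stage(payload, tier):
        return {**payload, name: f"done by {tier}"}
    return stage


def run_pipeline(stages, out_dir, program):
    """Run stages in order, skipping any stage whose artifact is already on disk.

    Returns the names of the stages that actually executed in this call.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    payload = {"program": program}
    for name, stage_fn in stages:
        artifact = out_dir / f"{name}.json"
        if artifact.exists():
            # Resume: reuse the checkpointed output instead of re-running the stage.
            payload = json.loads(artifact.read_text())
            continue
        payload = stage_fn(payload, STAGE_TIERS[name])
        # Write the artifact immediately so a crash after this point is resumable.
        artifact.write_text(json.dumps(payload))
        executed.append(name)
    return executed


if __name__ == "__main__":
    run_dir = Path(tempfile.mkdtemp())
    stages = [(name, make_stub_stage(name)) for name in STAGE_TIERS]
    print(run_pipeline(stages, run_dir, "PAYROLL01"))  # first run executes every stage
    print(run_pipeline(stages, run_dir, "PAYROLL01"))  # second run resumes from disk
```

Deleting one artifact file and re-running executes only that stage, which is the property that makes a long run cheap to retry.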

Tradeoff

Per-stage model tiering means more configuration and a model-selection rationale to maintain for each stage. In exchange, it runs roughly 70–80% cheaper than Opus-everywhere with no quality loss on structured work, and it's the decision that makes the pipeline economical at hundred-program scale. Extended thinking is enabled on exactly one stage; turning it on everywhere would burn tokens for no gain.

What I'd change today

Attach a golden-set eval to the Layer Decision stage (highest blast radius, currently validated by human review rather than a reproducible gate), and run the parity-test suite against the generated Lambdas in a sandbox before the human review — so review starts from ‘tests pass, here’s the diff’ rather than ‘here’s the code, hope it’s right’.
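One way the golden-set gate could look, sketched under assumptions: the route labels mirror the four outcomes named above, while the case format, the `decide` callable, and the 0.95 threshold are hypothetical placeholders rather than anything the pipeline defines.

```python
# Route labels mirroring the four outcomes of the routing stage.
ROUTES = {"lambda", "aurora_sql", "dynamodb", "skip"}


def golden_gate(golden, decide, min_accuracy=0.95):
    """Compare routing decisions against a labelled golden set.

    `golden` is a list of {"program": ..., "expected": ...} cases and `decide`
    is whatever callable produces a route for a program (hypothetical here).
    Returns (passed, accuracy, misses).
    """
    if not golden:
        raise ValueError("golden set is empty")
    misses = []
    for case in golden:
        got = decide(case["program"])
        if got not in ROUTES:
            raise ValueError(f"unknown route {got!r} for {case['program']}")
        if got != case["expected"]:
            misses.append((case["program"], case["expected"], got))
    accuracy = 1 - len(misses) / len(golden)
    return accuracy >= min_accuracy, accuracy, misses
```

Returning the misses alongside the pass/fail flag means a failed gate comes with the specific programs to look at, so the human review starts from the disagreements.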

Source

Modernization Factory →

Architecture · 02

Metadata-driven cloud ETL

Problem

Migrate 23 mainframe tables to Aurora, then keep a reporting pipeline live across cadences from 15-minute to weekly — without standing up 23 operational surfaces to test, version, and run on-call against.

Decision

3 generalized Glue jobs (extract / transform / load) driven by a metadata table that names each source and its per-source parameters, instead of the obvious 3-per-table design (~69 jobs). The operational surface scales with the number of jobs, not the number of tables — one bug fix, not 23.
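To make the metadata-driven shape concrete, here is a small sketch in plain Python. It does not use the Glue API; the `SourceSpec` fields, table names, and column mappings are invented for illustration and are not the project's actual metadata vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpec:
    """One row of the metadata table (hypothetical fields)."""
    table: str
    cadence_minutes: int
    key_columns: tuple
    rename: dict


# Two illustrative sources: one on a 15-minute cadence, one weekly.
METADATA = [
    SourceSpec("TREATY", 15, ("treaty_id",), {"TRTY_ID": "treaty_id"}),
    SourceSpec("CLAIM", 7 * 24 * 60, ("claim_id",),
               {"CLM_ID": "claim_id", "CLM_AMT": "amount"}),
]


def transform(rows, spec):
    """Generic transform: rename columns per the spec, drop rows missing a key."""
    out = []
    for row in rows:
        renamed = {spec.rename.get(col, col): val for col, val in row.items()}
        if all(renamed.get(key) is not None for key in spec.key_columns):
            out.append(renamed)
    return out


def due_sources(metadata, elapsed_minutes):
    """Names of the sources whose cadence divides the elapsed scheduler time."""
    return [s.table for s in metadata if elapsed_minutes % s.cadence_minutes == 0]
```

Onboarding another table in this sketch is one more `SourceSpec` row; the `transform` function is shared, which is where the "one bug fix, not 23" property comes from.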

Tradeoff

Adding a genuinely unusual source means extending the metadata vocabulary, not just dropping in a new job. That's a clean failure mode; the alternative (69 nearly-identical jobs each accreting their own bugs) is not.

What I'd change today

Write the metadata vocabulary as a versioned first-class artifact on day one, and stand up the generalized pipeline on two or three representative sources end-to-end before scaling to all 23.

Source

Enterprise Reinsurance — Data Migration →

Architecture · 03

Hybrid sync/async serverless reads — DynamoDB-over-Aurora

Problem

Cash-clearance search returning 100K+ rows was timing out at 3–5 minutes on a partitioned Aurora workload, with no tolerance for the user staring at a spinner past five seconds. Concurrent operators on the same treaty's transactions also had to work alongside each other without serializing on a database lock.

Decision

Materialize each search's full result set into a DynamoDB snapshot keyed by user + filter + version token. The cache is load-bearing for correctness, not just speed: the version token doubles as the optimistic-concurrency contract for writes. Aurora gets read once per snapshot; every subsequent grid scroll and sort is served from the cached snapshot, and every write validates against it.
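The version-token contract can be sketched with an in-memory stand-in for the snapshot table. Everything here is an assumption for illustration: the class and method names, the key format, and the integer version counter. In DynamoDB itself the version check would most naturally be a conditional write, but that mapping is my reading, not something the writeup states.

```python
import hashlib


class StaleSnapshot(Exception):
    """Raised when a write carries a version token that is no longer current."""


class SnapshotStore:
    """In-memory stand-in for the DynamoDB snapshot table."""

    def __init__(self):
        self._items = {}  # snapshot key -> {"version": int, "rows": {id: row}}

    @staticmethod
    def key(user, filters):
        # Stable key from user + filter; the version lives on the item here.
        digest = hashlib.sha256(repr(sorted(filters.items())).encode()).hexdigest()[:12]
        return f"{user}#{digest}"

    def materialize(self, user, filters, rows):
        """Store the full result set once (the single Aurora read per snapshot)."""
        k = self.key(user, filters)
        self._items[k] = {"version": 1, "rows": {r["id"]: dict(r) for r in rows}}
        return k, 1

    def page(self, k, offset, limit):
        """Serve a grid page from the snapshot without touching the source database."""
        rows = list(self._items[k]["rows"].values())
        return rows[offset:offset + limit]

    def write(self, k, expected_version, row_id, changes):
        """Apply a change only if the caller saw the current version."""
        item = self._items[k]
        if item["version"] != expected_version:
            raise StaleSnapshot(f"expected v{expected_version}, found v{item['version']}")
        item["rows"][row_id].update(changes)
        item["version"] += 1
        return item["version"]
```

A caller that loses the race gets `StaleSnapshot` and has to re-read before retrying, so no operator waits on a lock held by another.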

Tradeoff

DynamoDB became the workflow's source of truth for ‘what set of rows this user is operating on’. That call had to be named explicitly to the data team and the finance owner before commit; bugs in snapshot semantics now have to be fixed in DynamoDB, not in Aurora.

What I'd change today

Ship the DynamoDB cache layer behind a feature flag from day one rather than after the redesign was the only path left, and build the cache-eviction story earlier. A generous TTL worked in practice, but the better move is a deliberate invalidation strategy, not a default that's just long enough not to matter.

Source

Enterprise Reinsurance — Cash Clearance →

Architecture · 04

Zero-trust SaaS for regulated workloads

Problem

Multi-tenant childcare SaaS with sensitive child + parent + financial data — every tenant's data has to be invisible to every other tenant, by construction, not by hoping the app layer enforces it correctly on every query.

Decision

Row-Level Security on every Supabase table as the primary enforcement layer, with the publishable API key locked out completely so all reads go through server routes using the secret key. Auth is passwordless magic links via Supabase Auth gated by invite codes — no shared credential ever crosses a wire. Private media (photos, PDFs) is served via server-issued presigned URLs that expire per session.

Tradeoff

Every new table costs an RLS policy review — that's the right cost to pay. The alternative (app-layer enforcement on a normal Postgres) means one missed WHERE clause exposes everyone, and the failure is invisible until it isn't.

What I'd change today

Wire a policy-coverage check into CI from day one — a test that asserts every table has at least one RLS policy and that the publishable key actually fails on a representative SELECT against each one. Hand-auditing as the schema grows is the soft spot.
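A sketch of the first half of that check, assuming the CI job has already fetched rows from the Postgres catalog views `pg_tables` (which carries a `rowsecurity` flag) and `pg_policies`. The function name and the dict-shaped rows are illustrative, and the publishable-key SELECT probe is left out because it needs a live project.

```python
def uncovered_tables(tables, policies):
    """Return sorted names of tables with RLS disabled or with no policy at all.

    `tables` rows look like {"tablename": ..., "rowsecurity": bool} and
    `policies` rows like {"tablename": ..., "policyname": ...}.
    """
    with_policy = {p["tablename"] for p in policies}
    return sorted(
        t["tablename"]
        for t in tables
        if not t["rowsecurity"] or t["tablename"] not in with_policy
    )


def assert_full_coverage(tables, policies):
    """Fail the build with the offending table names in the message."""
    missing = uncovered_tables(tables, policies)
    if missing:
        raise AssertionError(f"tables without RLS coverage: {', '.join(missing)}")
```

A table with RLS enabled but no policies denies non-owner access by default in Postgres, so flagging it is stricter than strictly necessary; the sketch follows the stated goal of at least one policy per table.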

Source

BCCS — Daycare Management Portal →

Architecture · 05

MCP as the integration layer for AI-augmented teams

Problem

An LLM-augmented engineering team needs to read legacy databases and write tickets, but giving every developer DB credentials or Jira tokens to feed into ad-hoc scripts means no audit trail, no rate limiting, no defence-in-depth, and a separate attack surface in every engineer's IDE.

Decision

Stand up MCP servers as the integration boundary for every external system the agent touches — read-only DB access through MCP-SQL (defence-in-depth read-only: connection flag + statement-prefix allow-list + hard query caps), ticket workflows through MCP-Jira (one auditable boundary instead of nine ad-hoc HTTP calls). Same agent in every IDE; same audit story; same guardrails; one place to rotate credentials or kill a misbehaving tool.
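Two of the three read-only layers named above (the statement-prefix allow-list and the hard query cap) can be sketched as follows. This is not the mcp-sql implementation: the allowed prefixes, the row cap, and the function names are assumptions.

```python
from itertools import islice

ALLOWED_PREFIXES = ("select", "with", "explain")  # hypothetical allow-list
MAX_ROWS = 500                                    # hypothetical hard cap


class RejectedQuery(ValueError):
    """Raised when a statement fails the read-only guard."""


def guard(sql):
    """Return the cleaned statement, or raise if it is not an allowed single read."""
    stripped = sql.strip().rstrip(";").strip()
    # Conservative: any remaining ';' is treated as a second statement,
    # even if it sits inside a string literal.
    if ";" in stripped:
        raise RejectedQuery("multiple statements are not allowed")
    first = stripped.split(None, 1)[0].lower() if stripped else ""
    if first not in ALLOWED_PREFIXES:
        raise RejectedQuery(f"statement must start with one of {ALLOWED_PREFIXES}")
    return stripped


def cap_rows(rows, max_rows=MAX_ROWS):
    """Return at most max_rows rows plus a flag saying whether more were available."""
    it = iter(rows)
    capped = list(islice(it, max_rows))
    end = object()
    truncated = next(it, end) is not end
    return capped, truncated
```

A prefix check alone is not sufficient (in PostgreSQL, for example, a statement starting with `WITH` can contain a data-modifying CTE), which is why the read-only connection flag sits underneath it as a separate layer.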

Tradeoff

Each new external system needs an MCP server before the agent can use it safely. That's deliberate friction — the alternative is unbounded ad-hoc API access from every developer's machine, which is what the boundary exists to prevent.

What I'd change today

Centralize the credential layer earlier — both MCP-SQL and MCP-Jira resolve creds independently right now (env vars + OS keychain). A single org-level secret broker that issues short-lived tokens to MCP servers is the next move; same pattern, smaller blast radius.

Source

mcp-sql →
mcp-jira →

Architecture · 06

Serverless RAG with a cost ceiling that can't be exceeded

Problem

Recruiter-facing chatbot embedded in a $10/month hobby site — a burst of curious traffic at 1 AM cannot overspend; every claim has to cite back to a source page on the same site; the embedding pipeline cannot break a deploy when the typo fix is unrelated to the chat path.

Decision

Cost guardrails fail before any paid API call: $10/month spend cap + 20 messages/IP/day rate limit, both enforced via atomic PL/pgSQL RPCs that prevent double-spend under concurrent requests. The model only paraphrases retrieved chunks with inline source-path citations — no inventing details, no answer that the recruiter can't click into and verify. Embedding pipeline is fail-soft on prebuild: a Voyage outage logs and exits zero so the deploy still ships; the chatbot stays one revision stale.
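The check-then-reserve logic can be illustrated in Python, with a lock standing in for the atomicity the PL/pgSQL RPCs provide in the database. The class, the cents-based accounting, and the up-front cost estimate per request are assumptions; only the $10/month cap and the 20 messages/IP/day limit come from the description above.

```python
import threading


class Guardrails:
    """In-process stand-in for the atomic rate-limit and spend-cap checks."""

    def __init__(self, monthly_cap_cents=1000, daily_limit=20):
        self.monthly_cap_cents = monthly_cap_cents
        self.daily_limit = daily_limit
        self._lock = threading.Lock()
        self._spent_cents = 0
        self._hits = {}  # (ip, day) -> message count

    def admit(self, ip, day, estimated_cost_cents):
        """Check both limits and reserve budget in one atomic step.

        Returns (allowed, reason). Nothing is reserved when the request is
        rejected, and no paid call should happen unless allowed is True.
        """
        with self._lock:
            hits = self._hits.get((ip, day), 0)
            if hits >= self.daily_limit:
                return False, "rate_limited"
            if self._spent_cents + estimated_cost_cents > self.monthly_cap_cents:
                return False, "spend_cap"
            self._hits[(ip, day)] = hits + 1
            self._spent_cents += estimated_cost_cents
            return True, "ok"
```

Because the check and the increment happen under the same lock, two concurrent requests cannot both claim the last of the budget, which is the double-spend case the RPCs are there to prevent.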

Tradeoff

Three sequential PostgREST round-trips on the cold path (rate-limit, spend, retrieval) instead of one batched RPC — clarity per call is good, first-token latency is worse. Worth revisiting at scale; correct at current traffic.

What I'd change today

Batch the three calls into one PL/pgSQL function returning { allowed, reason, citations }; drop chunks below a cosine similarity floor before emitting citations; log retrieval quality per question (chunk ids, scores, answer hash) so low-similarity retrievals surface in a nightly check instead of by accident.
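The similarity-floor step could look like the sketch below. The 0.75 floor, the chunk shape, and the function names are placeholders; a real floor would need tuning against logged retrievals.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def citable_chunks(query_vec, chunks, floor=0.75):
    """Keep only chunks at or above the floor, best match first.

    Each chunk is {"path": ..., "embedding": [...]}; the result carries the
    source path and a rounded score, ready to log or cite.
    """
    scored = [(cosine(query_vec, c["embedding"]), c) for c in chunks]
    kept = [(score, c) for score, c in scored if score >= floor]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [{"path": c["path"], "score": round(score, 3)} for score, c in kept]
```

An empty result is a useful signal in itself: the chatbot can decline to answer rather than cite a weakly related page.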

Source

Ask My Career →