Designing the Modernization Factory: an 8-stage agentic pipeline for COBOL→AWS, and where each Claude model earns its place

Turning mainframe COBOL into AWS-native code with an LLM is easy to demo and hard to make repeatable. The architecture that makes it a factory rather than a party trick is in the pipeline shape, the per-stage model routing, and the one checkpoint that sits before the irreversible work.

You can paste a COBOL program into a chat window, ask for a TypeScript equivalent, and get something plausible back. That demo has convinced a lot of people that mainframe modernization is now a solved prompt. It isn't. The gap between “plausible TypeScript for one program” and “a system I can point at a hundred JCL jobs and trust the output enough to ship after a light review” is the entire problem, and that gap is filled with architecture, not prompting.

The Modernization Factory is my answer to that gap: an 8-stage agentic pipeline — itself built as an MCP server — that takes mainframe JCL, COBOL, and copybooks and produces AWS Lambda handlers, an Aurora PostgreSQL schema, a Step Functions state machine, and a parity test suite that proves the cloud output matches the mainframe. This post is the architecture: why a pipeline instead of a megaprompt, why a different Claude model runs each stage, and why the whole thing stops and waits for a human in exactly one place.

Why a pipeline, not one big prompt

The single-prompt approach fails for a reason that has nothing to do with model capability: a mainframe migration is not one decision, it's a sequence of decisions where the later ones depend on the earlier ones being right. You have to parse the JCL to know the job structure. You have to parse the COBOL to separate data access from business logic. You have to decide which steps become Lambda functions and which become bulk SQL operations before you can design the schema. You have to have the schema before you can generate code that reads from it. Cramming all of that into one prompt means one wrong inference early poisons everything after it, and you can't see where it went wrong.

So the Factory is 8 stages, and the contract between every stage is a typed Pydantic model — never a raw dict. Stage N takes typed input, returns typed output, writes that output to disk immediately, and Stage N+1 reads it. Three things fall out of that shape for free: every stage is independently testable, the whole run is checkpoint/resume-able from disk, and when a run fails you know exactly which stage produced the bad artifact because it's sitting there in JSON.

The core decision: a different model for each stage

The most consequential architectural call in the whole system is that it does not use one model. Each of the 8 stages runs against a deliberately chosen Claude tier, picked by the shape of that stage's work:

Haiku — Stage 1 (triage + input guard). Pure categorical classification with a tiny output budget. Is this JCL, COBOL, or a copybook? Strip the comments, redact PII, flag injection attempts. Haiku does this accurately and cheaply; spending Sonnet or Opus here would be lighting money on fire.
Sonnet — Stages 2, 3, 5, 6, 7 (the structured bulk). JCL parsing, COBOL parsing, schema design, Lambda code generation, Step Functions composition. These need strong language understanding but not deep architectural reasoning — they're extraction and generation against a well-specified target. Sonnet is excellent at this at a fraction of Opus's cost.
Opus — Stages 4 and 8 (judgment and creativity). Stage 4 is the architectural decision (more on it below). Stage 8 designs the parity tests, which is genuinely creative work — anticipating mainframe numeric-precision edge cases, designing the human-approval workflow tests, covering the failure paths. These are the two stages where the most capable model changes the outcome, so they're the two stages that get it.

The result is roughly 70–80% lower cost than running Opus on every stage, with equal or better quality on the structured stages — Sonnet is not “worse” at structured extraction, it's appropriately matched to it. Model tiering is the pattern that turns “technically possible” into “economical to run on a hundred programs,” and getting it right is most of what separates an engineered pipeline from a demo that happens to use an expensive model for everything.

The one stage that gets to think

Stage 4 — the Layer Decision — is where the migration's architecture actually gets decided. For each step in the JCL job, it routes the work to one of four destinations: a Lambda function (business logic), an Aurora SQL operation (bulk data work), a DynamoDB access (key-value lookups), or nothing at all (a sort or utility step that the cloud version doesn't need). Get this routing wrong and every downstream stage builds on a bad foundation — the schema is wrong, the generated code is wrong, the tests fail.

So Stage 4 is the only stage that runs Opus with extended thinking enabled — a private reasoning budget (up to ~8,000 tokens) where the model deliberates over the trade-offs before producing its routing decision. The thinking blocks are stripped from the output; only the final structured decision is kept. No other stage gets extended thinking, because no other stage benefits from it — extraction and code generation don't improve when you let the model ruminate; a judgment call between competing architectural options does. Enabling it everywhere would be the same mistake as using Opus everywhere: paying for a capability the work doesn't need.

The checkpoint that sits before the irreversible work

The Factory pauses for a human exactly once by default — after Stage 5 (schema design) and before Stage 6 (code generation). The placement is the whole point. A wrong Aurora schema cascades into wrong Lambda code that reads from it, wrong Step Functions that orchestrate the Lambdas, and failing parity tests at the end. The schema is the cheapest place in the entire pipeline to catch an error, because everything expensive happens after it. So that's where the pipeline stops and asks a human to look.

Mechanically, the checkpoint isn't special-cased — it's just the resume protocol exposed as a feature. The pipeline persists its artifacts and a checkpoint file, yields a “pending approval” event, and halts. The human reviews the schema and the routing decisions, then calls resume, which reloads every prior artifact from disk and continues from the next stage with full context restored. A rejection is logged and stops the run with all prior work preserved. Unattended runs are possible (you can tell it not to stop), but the default is human-in-the-loop, because the default should be the safe thing.

Human oversight is also baked into the generated artifacts, not just the pipeline. Every generated Lambda is dry-run by default — it computes its result and returns it without writing to Aurora until someone explicitly enables writes. The Step Functions definition models human-approval gates as first-class callback states. And the parity test suite has a dedicated category of tests for the approval workflow itself. The pipeline doesn't just pause for a human once; it produces code that keeps a human in the loop after it ships.

Parallelism and caching, where they pay

Two of the eight stages run one Claude call per item: Stage 3 parses each COBOL program, Stage 6 generates each Lambda. These fan out concurrently — but Stage 6 fans out behind a concurrency cap of 3, because a job with a dozen Lambda steps firing a dozen simultaneous Opus-adjacent calls is how you trip an API rate limit. The cap is the difference between “parallel” and “parallel until it throttles and half the calls fail.”

Those same two stages are where prompt caching earns the most. The system prompt for a stage is static — it's the same instruction set whether you're parsing the first COBOL program or the fifth. Marking it as a cached block means the first call pays full price to populate the cache and every subsequent call in that stage reads the system prompt at a 90% discount. On the parallel stages, where the same prompt is reused N times in a single run, that's most of the input-token cost gone. The cost tracker records both the discounted spend and the savings, so the economics of the caching decision are visible per run rather than assumed.

Treating untrusted source as untrusted

Mainframe source is not safe input. Comment lines routinely carry decades-old data samples — real SSNs, real account numbers — and occasionally adversarial instructions left by developers who never imagined a language model would read them. So everything passes through five defence layers before it reaches Claude: strip the comment lines (flagging any that look like injection attempts), redact PII patterns, validate structural integrity (real JCL starts a certain way and has an EXEC statement; real COBOL has a PROCEDURE DIVISION), wrap everything in untrusted tags so the model treats it as data to analyze rather than instructions to follow, and run an injection-pattern detector that flags and logs anything suspicious.

And the audit log records metadata only — classification, step counts, routing decisions, elapsed time, token costs, error strings. Never the raw source, never the model's responses, never customer data. The audit trail proves what the pipeline did without becoming a second copy of the regulated data it was processing. (That last decision is the same one I wrote about in the compliant-LLM reference architecture — hash and log metadata, don't store the sensitive payload, or your audit infrastructure inherits every rule that applied to the thing you were protecting.)

What I'd change today

Three things, written down so I revisit them:

The Layer Decision needs a golden-set eval. It's the highest-blast-radius stage in the pipeline, and right now its quality is validated by the human at the Stage 5 checkpoint rather than by a reproducible eval that gates a prompt or model-version change. When a new Sonnet or Opus ships, I want to run the routing stage against a set of known-good migrations and gate promotion on the result — not discover a behavior change in production. This is the gap I'd close first.

Parity tests should run before the human sees the output. Today the pipeline generates the tests as a final artifact; the human reviews code and tests together. The better shape is to execute the parity suite against the generated Lambdas in a sandbox automatically, so the human review starts from “the tests pass, here's the diff against the mainframe” rather than “here's the code, hope it's right.” The information the human needs to approve confidently should be computed before they're asked to approve.

Cost should surface itself. The per-stage token, cache-savings, and USD data is all captured per run — but it lands in JSON files I have to go read. A principal-level system watches its own cost trend and flags the run that cost 4× the median. Right now I find the outliers by looking; I'd rather they find me.

Why it's a factory, not a script

The word “factory” is doing real work in the name. A script transforms one input. A factory runs the same disciplined process over many inputs and produces consistent, inspectable output every time — with the same checkpoint protocol, the same audit format, the same cost ledger, the same safe-by-default generated code, whether you feed it one JCL job or a hundred. The architecture optimizes the per-run operational surface — one way to resume, one way to review, one way to audit — rather than tuning each program by hand. That's the line between a clever use of an LLM and an engineered system, and it's the line the whole design is organized around.

None of the individual pieces here are exotic. Typed contracts between stages, model tiering by task, extended thinking on the one judgment call, a checkpoint before the irreversible work, defence-in-depth on untrusted input, metadata-only audit. What makes it a system is that the pieces are chosen deliberately and fit together — and that the boring parts (resume, audit, cost tracking, safe defaults) are treated as load-bearing rather than afterthoughts. That's what it takes to trust an agent's output enough to ship it.