Where the LLM goes in regulated reinsurance: a reference architecture for the API boundary and the audit trail

Every LLM tutorial assumes you can send the data to the API. In regulated reinsurance that assumption breaks before the first prompt — most of the architecture lives in the boundary, not the model.

Every tutorial for building an LLM application starts at the same place: take your data, send it to the model, render the response. For a regulated reinsurance workload that first step is already a compliance violation, and the rest of the tutorial is irrelevant until you've solved the part the tutorial skipped.

The interesting architecture in a regulated LLM system is almost entirely in the boundary — what crosses the API line, what stays on-prem, how each crossing is sanitized, how it's logged, and how the whole thing survives a model-version upgrade without a surprise. The model itself is the easy part. The boundary is six decisions, and the rest of this piece is each one in turn.

1. The data boundary

Before any code, draw a line. On one side is data that can never leave the corporate network in the body of an API call: treaty contracts with cedent names, claims data with policyholder identifiers, premium and loss postings, anything that would put you on the wrong side of GLBA or a state insurance department if it landed in a third-party log file. On the other side is data that's safe to send: generic patterns, schema names without values, structural questions about already-public artifacts, de-identified examples.

The middle is where the actual design lives. Treaty contracts can't cross the line, but you can send a Claude prompt that says “here is a sanitized schema for a treaty table — give me a TypeScript type that mirrors its structure.” The schema crosses; the values don't. Drawing this line precisely is the first piece of architecture, and it has to be written down explicitly — not as a security policy, but as a code-level boundary that the system enforces.

I keep a single allow-list in code, not in a wiki. The list names the document types and field categories that are permitted to cross the API boundary. Anything not on the list is rejected before the prompt is built, with a log line naming the field that failed. The wiki version of this rule rots inside a quarter; the code version fails closed every time.
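As a minimal sketch of what that code-level allow-list could look like, assuming Python and a payload that arrives as a flat dict of field names to values: the document types, field names, policy version string, and the `check_payload` helper below are all hypothetical, chosen only to illustrate the fail-closed shape.

```python
import logging

logger = logging.getLogger("llm_boundary")

# Hypothetical policy: document types and field categories allowed to cross.
ALLOW_LIST_VERSION = "example-v1"
ALLOWED_FIELDS = {
    "treaty_schema": {"table_name", "column_name", "column_type"},
    "deidentified_example": {"pattern", "description"},
}


class BoundaryViolation(Exception):
    """Raised when a payload contains anything not on the allow-list."""


def check_payload(doc_type: str, fields: dict) -> dict:
    """Return the payload unchanged if every field is allowed; otherwise fail closed."""
    allowed = ALLOWED_FIELDS.get(doc_type)
    if allowed is None:
        logger.warning("rejected doc_type=%s policy=%s", doc_type, ALLOW_LIST_VERSION)
        raise BoundaryViolation(f"document type not allowed: {doc_type}")
    for name in fields:
        if name not in allowed:
            # Log the field name only, never its value.
            logger.warning(
                "rejected field=%s doc_type=%s policy=%s",
                name, doc_type, ALLOW_LIST_VERSION,
            )
            raise BoundaryViolation(f"field not allowed: {doc_type}.{name}")
    return fields
```

The prompt builder would call `check_payload` first and never catch `BoundaryViolation` locally, so an unlisted field stops the request rather than being silently dropped.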

2. The PII strip — three layers, on purpose

The single hardest thing to get right is also the most embarrassing to get wrong. Three layers, each protecting against a different failure of the others:

Layer one, pre-prompt sanitization: before the prompt is assembled, every field is passed through a redactor that replaces named entities (policyholder names, addresses, account numbers, claim references) with stable tokens ({PERSON_1}, {ACCT_3}). The tokens are stable within a conversation so the model can reason about “Person 1” across turns, and unstable across conversations so the API provider can't correlate them.
Layer two, in-prompt guardrails: the system prompt itself tells the model to refuse to repeat any string that looks like identifying data, and to flag the request as a safety violation if it sees one. This is the layer that catches the bug in layer one. Models are surprisingly good at noticing “wait, this shouldn't be here.”
Layer three, post-response audit: every response is run through a separate pattern detector before it's shown to the user, looking for SSN-shaped, account-number-shaped, and name-shaped strings. A hit doesn't just suppress the response; it pages someone, because unless the detector itself misfired, a hit means layers one and two both failed on the same request, and that's a real bug.
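The stable-token idea in the first layer can be sketched as follows. This is an illustration under assumptions, not the redactor itself: it assumes the entity values are already known from structured fields (a real redactor would also need entity detection), and the `ConversationRedactor` class is a hypothetical name. One instance is created per conversation, so the mapping is discarded when the conversation ends; it does not randomize numbering across conversations.

```python
import re


class ConversationRedactor:
    """Maps sensitive values to stable tokens for the lifetime of one conversation."""

    def __init__(self) -> None:
        self._tokens = {}    # (category, normalized value) -> token
        self._counters = {}  # category -> last number issued

    def token_for(self, category: str, value: str) -> str:
        key = (category, value.strip().lower())
        if key not in self._tokens:
            number = self._counters.get(category, 0) + 1
            self._counters[category] = number
            self._tokens[key] = f"{{{category}_{number}}}"
        return self._tokens[key]

    def redact(self, text: str, known_entities: dict) -> str:
        """Replace every known entity value in text with its token."""
        for category, values in known_entities.items():
            # Longest values first, so a longer name is not partially replaced.
            for value in sorted(values, key=len, reverse=True):
                token = self.token_for(category, value)
                text = re.sub(
                    re.escape(value), lambda _m: token, text, flags=re.IGNORECASE
                )
        return text
```

Usage, with made-up values:

```python
redactor = ConversationRedactor()
entities = {"PERSON": ["Jane Roe", "John Poe"], "ACCT": ["12345"]}
print(redactor.redact("Jane Roe wrote to John Poe about 12345.", entities))
# {PERSON_1} wrote to {PERSON_2} about {ACCT_1}.
```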

The three layers exist because the failure modes are independent. Layer one fails when a new field shape gets added without an updated redactor. Layer two fails when the model has a bad day. Layer three is the canary — when it fires, you don't celebrate the catch; you treat it as an outage.
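A rough sketch of the layer-three canary, assuming regex detectors are enough for the SSN-shaped and account-number-shaped cases. The patterns, the 8-to-17-digit account assumption, and the `page` callback are illustrative choices of mine; name-shaped detection is left out because a regex for it would be mostly noise, and would more plausibly be an entity-recognition step.

```python
import re
from typing import Callable, Optional

# Illustrative patterns only; real account formats vary by system.
PATTERNS = {
    "ssn_shaped": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "account_shaped": re.compile(r"\b\d{8,17}\b"),
}


def audit_response(response: str, page: Callable[[list], None]) -> Optional[str]:
    """Return the response if clean; on any hit, page someone and suppress it."""
    hits = [name for name, pattern in PATTERNS.items() if pattern.search(response)]
    if hits:
        # Report which detectors fired, not the matched text itself.
        page(hits)
        return None
    return response
```

The caller shows the user a generic failure message whenever `audit_response` returns `None`; the paging callback is where the "treat it as an outage" behavior would hang.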

3. The audit trail decision: hash, don't store

Regulators want a record. The instinct is to log every prompt and response verbatim. That instinct is wrong, because the moment you write the prompts verbatim to a log file, the log file is itself regulated data — every retention rule that applies to the underlying claims data now applies to your logs, and your audit infrastructure has to inherit the same encryption-at-rest, access-control, and breach-notification posture as the database you were trying to protect.

What I log instead, per LLM call: timestamp, user identity, model and version, prompt SHA-256, response SHA-256, token counts, the business operation that triggered it, and the allow-list policy version that gated it. The raw prompts and responses live in a separate short-retention store, encrypted with a key that's held outside the audit infrastructure, queryable for replay debugging during a defined investigation window and then deleted on schedule.
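One way the per-call record could be shaped, as a sketch: the field names follow the list above, while the `AuditRecord` class, the helper names, and the JSON log-line format are assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AuditRecord:
    timestamp: str
    user_id: str
    model: str
    prompt_sha256: str
    response_sha256: str
    input_tokens: int
    output_tokens: int
    business_operation: str
    allow_list_version: str


def build_audit_record(*, user_id: str, model: str, prompt: str, response: str,
                       input_tokens: int, output_tokens: int,
                       business_operation: str, allow_list_version: str) -> AuditRecord:
    """Hash the prompt and response; the raw text never enters the record."""
    return AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        user_id=user_id,
        model=model,
        prompt_sha256=sha256_hex(prompt),
        response_sha256=sha256_hex(response),
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        business_operation=business_operation,
        allow_list_version=allow_list_version,
    )


def to_log_line(record: AuditRecord) -> str:
    # Sorted keys keep the serialized form stable across runs.
    return json.dumps(asdict(record), sort_keys=True)
```

Writing the raw prompt and response to the short-retention store is a separate code path with its own key, which is the point: nothing in this function can leak text into the audit log.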

The hashes give you a regulator-ready record (“here is evidence the model produced response X for prompt Y at time T, attributed to user U, gated by policy version P”) without making the audit log itself an exfiltration target. The short-retention raw store gives you the debugging window you actually need, scoped tight enough that no one is tempted to mine it.

4. Model-version drift and the eval gate

Anthropic ships a new Sonnet roughly every quarter. In a non-regulated context that's an upgrade opportunity. In a regulated context it's a change-management event, because the system you got compliance approval for is no longer the system in production.

The architecture solves this by treating every production prompt as having an attached golden-set eval. The eval is a small set of inputs with known-correct (or known-acceptable) outputs, curated when the prompt is first cleared for production use. When a new model version is available, the eval runs against the new model before promotion; if any test in the golden set fails, the new model is held back until the prompt is re-tuned or the eval is re-baselined with a written rationale.
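The gate itself is small. Here is a minimal sketch, assuming the model is wrapped in a plain callable and each golden case carries its own acceptance check; `GoldenCase`, `run_eval_gate`, and `may_promote` are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    name: str
    prompt_input: str
    # Acceptance check: exact match, a schema check, or any other predicate.
    is_acceptable: Callable[[str], bool]


def run_eval_gate(model_call: Callable[[str], str], cases: list) -> list:
    """Run every golden case against a candidate model; return failing case names."""
    failures = []
    for case in cases:
        output = model_call(case.prompt_input)
        if not case.is_acceptable(output):
            failures.append(case.name)
    return failures


def may_promote(model_call: Callable[[str], str], cases: list) -> bool:
    # Any single failure holds the new model version back.
    return not run_eval_gate(model_call, cases)
```

In CI, `model_call` would wrap the candidate model version behind the production prompt, and a non-empty failure list fails the job with the case names in the output.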

The cost is latency on model upgrades — the team doesn't get the new Sonnet on the day it ships; they get it the week after, once the evals pass. The benefit is that compliance can answer the question “has the system's behavior changed since the audit?” with a yes-or-no backed by reproducible eval runs, not a hand-wave.

Eval as the contract is the single most useful pattern I'd carry into any regulated LLM workload. It collapses the entire model-version-drift problem into a CI job.

5. The retrieval boundary: pay sanitization at ingest, not at query

RAG over regulated content is where most reference architectures get vague. The honest picture is: you cannot retrieve raw documents. The chunks you retrieve and feed back into the model have to be the output of a one-time sanitization pipeline that ran at ingest, not at query time.

The shape: source documents land in a staging area inside the corporate network. A pipeline reads them, strips PII using the same redactor as layer one of the prompt strip (so the sanitization is consistent), chunks them, and writes the sanitized chunks into the retrieval store. Embeddings are generated against the sanitized chunks, not the originals. The originals stay where they started; the retrieval store sees only the sanitized form.
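The ingest shape above can be sketched in a few lines. The assumptions are loud here: `redact` and `embed` are injected callables standing in for the real redactor and embedding model, the retrieval store is a plain dict, and the word-packing chunker is a placeholder for whatever chunking strategy is actually in use.

```python
from typing import Callable


def chunk_text(text: str, max_chars: int = 200) -> list:
    """Pack whitespace-separated words into chunks of at most max_chars."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def ingest(doc_id: str, raw_text: str, redact: Callable[[str], str],
           embed: Callable[[str], list], store: dict, max_chars: int = 200) -> int:
    """Sanitize first, then chunk and embed; only sanitized text reaches the store."""
    # Redact before chunking so an entity cannot be split across a chunk boundary.
    sanitized = redact(raw_text)
    chunks = chunk_text(sanitized, max_chars)
    for index, chunk in enumerate(chunks):
        store[f"{doc_id}:{index}"] = {"text": chunk, "embedding": embed(chunk)}
    return len(chunks)
```

The ordering is the design choice worth noting: the embedding function is only ever handed sanitized chunks, so the vectors cannot encode the original identifiers.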

The cost: sanitization compute is paid once per document instead of once per query, which is actually a win, because query traffic is unbounded while ingest is bounded by how often the source changes. The real cost is subtler: if the sanitization is wrong in a way you don't notice for a month, you've been retrieving subtly wrong chunks the whole time. That is why the sanitization pipeline gets its own eval, separate from the prompt evals, that runs against a golden set of known-input → expected-sanitized-output examples.

The benefit: retrieval at query time stays cheap. The model sees only data that has already been cleared. The boundary is enforced once, at the right point in the pipeline.

6. What stays on-prem, permanently

The final piece of architecture isn't a system component; it's a written list of decisions the LLM is not allowed to make or materially influence, regardless of how capable the model becomes. Final actuarial pricing, regulatory submissions, claim-denial decisions, treaty cancellation triggers. The LLM can help draft, analyze, summarize, and surface — it cannot decide, and the boundary between “helped draft” and “decided” needs to be enforced by workflow, not by hoping the prompt is worded carefully.
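What "enforced by workflow" might look like at its simplest, as a sketch: the decision kinds mirror the list above, while the `Decision` record and `commit_decision` function are hypothetical. It only enforces that a named human approver is attached to a protected decision; it does not by itself measure how much the draft influenced that human.

```python
from dataclasses import dataclass
from typing import Optional

# Decision kinds the LLM may draft for but never commit.
PROTECTED_DECISIONS = frozenset({
    "final_actuarial_pricing",
    "regulatory_submission",
    "claim_denial",
    "treaty_cancellation_trigger",
})


class DecisionBoundaryError(Exception):
    """Raised when a protected decision arrives without a human approver."""


@dataclass(frozen=True)
class Decision:
    kind: str
    approver_id: Optional[str]          # the human who made the call, if any
    llm_draft_id: Optional[str] = None  # draft the human may have consulted


def commit_decision(decision: Decision) -> str:
    """Commit a decision; protected kinds require a named human approver."""
    if decision.kind in PROTECTED_DECISIONS and not decision.approver_id:
        raise DecisionBoundaryError(
            f"{decision.kind} requires a human approver"
        )
    return f"committed:{decision.kind}"
```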

Naming this list is part of the architecture for two reasons. First, it gives compliance something concrete to sign off on, rather than asking them to evaluate a probability that the system might one day cross a line. Second, it removes the recurring pressure to move that boundary inward as the model gets better. The boundary moves with deliberate review, not with model release notes.

What I'd change today

Three things this reference architecture gets only partially right, written down so I remember to revisit them:

The sanitization eval is the soft spot. The prompt eval gate is well understood: golden set, run on promotion, gate on failure. The sanitization pipeline's eval is harder, because the failure modes are subtle (a partially redacted name, a regional date format that masquerades as an identifier) and the golden set has to keep up with whatever new document types arrive at ingest. I don't have a clean answer for how to make that eval as boring as the prompt eval. Right now it relies on the redactor being audited as code, which works until the next domain-specific document type lands.

The audit hash assumes the prompt is reconstructible. Hashing prompt and response is great until the prompt was partially generated at runtime and the inputs to that generation weren't themselves logged. The hash is then a record without a way to reproduce what it was a hash of. The fix is to log the full prompt-generation trace (the template version, the retrieved chunk IDs, the redactor version), not just the final prompt's hash. I haven't fully wired that.
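A sketch of what that trace record could contain, given that I haven't wired it: the three inputs are the ones named above, and the `PromptTrace` class and helper names are hypothetical. The trace holds only versions, IDs, and a hash, so it can sit alongside the audit record without carrying prompt text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptTrace:
    template_version: str
    redactor_version: str
    retrieved_chunk_ids: tuple
    prompt_sha256: str


def record_trace(template_version: str, redactor_version: str,
                 chunk_ids: list, final_prompt: str) -> PromptTrace:
    """Capture the inputs needed to rebuild a prompt, plus the hash to check it against."""
    return PromptTrace(
        template_version=template_version,
        redactor_version=redactor_version,
        # Order is preserved because chunk order affects the assembled prompt.
        retrieved_chunk_ids=tuple(chunk_ids),
        prompt_sha256=hashlib.sha256(final_prompt.encode("utf-8")).hexdigest(),
    )


def trace_log_line(trace: PromptTrace) -> str:
    return json.dumps(asdict(trace), sort_keys=True)
```

Replay then means re-rendering the named template version over the named chunks with the named redactor version, and checking that the result hashes to `prompt_sha256`.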

The on-prem boundary needs to be reviewed every quarter, not annually. Models get better faster than annual review cycles. The list of “LLM does not decide” items needs a quarterly cadence with a written rationale for any boundary that moves. Otherwise the boundary stays static while the model capability changes, and the architecture drifts from “deliberate” to “inherited.”

Why this lives at the boundary, not at the model

The reason every section above is about the boundary, not about the prompt or the model, is that the model itself is going to keep getting better and the prompt is going to keep getting tuned. Those things move all the time. What doesn't move — and what regulators actually care about — is the answer to “where does the data go, who can see it, and how do you prove it.” That's a property of the architecture, not the model. Get the boundary right and the prompt iteration is safe to do quickly. Get the boundary wrong and no amount of prompt engineering is going to put the data back where it belongs.

I'd also note that this isn't a reinsurance-only pattern. The same six decisions apply to LLM-assisted workflows in claims processing, underwriting, banking operations, and the healthcare contexts I've worked adjacent to. The data labels change. The boundary is the architecture every time.