Enterprise Reinsurance — Data Migration and Reporting
Metadata-driven cloud ETL + event-driven reporting platform
Context
Migrated the Enterprise Reinsurance application from mainframe DB2-LUW to Aurora PostgreSQL on AWS — historical data via AWS DMS, ongoing transformations via Glue/PySpark. Built the reporting layer from scratch as an event-driven Glue pipeline (cadences from 15 minutes to weekly via S3 PutObject and scheduled triggers) that extracts from the application database, transforms millions of rows, and lands curated data in a reporting database feeding PowerBI dashboards.
Constraints
- No write freeze on the source mainframe — the carrier carried $billions in active treaty exposure and ~50 daily actuarial and finance consumers, so an overnight cutover window was off the table.
- Regulated reinsurance workload — every row landing in the reporting layer had to reconcile against the mainframe source of truth, because downstream PowerBI dashboards drove treaty and finance decisions.
- 23 source tables of materially different shapes had to be migrated and continuously transformed without standing up 23 different operational surfaces to test, version, and run on-call against.
- Reporting cadences ran from 15-minute intervals to weekly — the same pipeline had to satisfy near-real-time consumers and slow batch consumers without forking the design.
- Single 4-year program horizon on a Tier-1 carrier — the architecture had to be defensible in front of the carrier's enterprise architecture review, not just internally elegant.
Architecture
Data Model
The source side is a mainframe DB2-LUW schema of 23 tables holding the reinsurance ledger — treaty contracts, cedents, accounting periods, premium and loss postings — with millions of rows per cycle and relationships keyed by treaty + cedent + period. On the target side, Aurora PostgreSQL holds two logical stores: an application database that mirrors the OLTP shape for the live app, and a separate reporting database with a curated, denormalized layer optimized for PowerBI extracts. Between them sits a metadata table that names each of the 23 sources, the per-source transformation parameters, and the consumer cadence — that table is the data structure the generalized Glue jobs actually iterate over. S3 is the event substrate: every PutObject or scheduled trigger names a metadata key, and the Glue pipeline reads that key to know what to extract, transform, and where to land it.
Key Sequence
- An EventBridge schedule or an S3 PutObject fires, carrying a metadata key that identifies which source and cadence to run.
- The generalized Glue extract job reads the metadata row, pulls the configured source slice from Aurora (or, for the one-time historical load, from the DMS-staged data), and writes the raw partition to S3.
- The generalized Glue transform job picks up the raw partition, applies the per-source transformation parameters from the metadata table, and writes the curated partition to S3.
- The generalized Glue load job lands the curated partition in the Aurora reporting database under the target table named in the metadata row.
- PowerBI dashboards pick up the refreshed reporting tables on their next pull, feeding treaty and finance consumers at their configured cadence.
- CloudWatch metrics and job-level logs land in the operational dashboard, so a failure in any of the 3 jobs surfaces with the metadata key that caused it — one bug fix, not 23.
What I owned
- Designed a metadata-driven ETL architecture: 3 generalized Glue jobs (extract, transform, load) covering all 23 source tables — instead of the obvious 3-per-table design (~69 jobs) the team would have built by default
- Architected the dual-pipeline split: AWS DMS for one-time historical migration off mainframe DB2-LUW; Glue/PySpark for ongoing ELT and the reporting layer — no DMS dependency on reporting
- Built the reporting layer end-to-end — event-driven via S3 PutObject and EventBridge schedules, cadences from 15 minutes to weekly per consumer SLA
- Processed millions of rows per cycle across the 23 tables with the generalized pipeline
- Curated reporting datasets imported into PowerBI for downstream treaty and finance consumers
- Owned the design and delivery as Lead Developer across the 4-year arc
Trade-offs
- Chose 3 generalized metadata-driven Glue jobs over ~69 per-table jobs because the operational surface (testing, versioning, on-call) scales with the number of jobs, not the number of tables — the cost is that adding a genuinely unusual source means extending the metadata vocabulary, not just dropping in a new job.
- Split the migration into two pipelines — AWS DMS for the one-time historical lift off the mainframe, Glue/PySpark for ongoing ELT and the reporting layer — rather than running everything through DMS; the cost is two operational tools to learn, but the reporting layer never inherits a DMS dependency it would have to break later.
- Picked an event-driven design over a single nightly batch so consumers from 15-minute to weekly cadence run off the same pipeline; the cost is more EventBridge and S3 plumbing to reason about than a cron-driven monolith.
- Materialized the reporting datasets in Aurora rather than letting PowerBI hit the application database directly, because reporting load on an OLTP shape would have competed with the live app; the cost is a second store to keep reconciled.
- Kept the 23 source tables 1:1 in the application database rather than collapsing them on the way in, because the live app's contract surface had to stay stable while the reporting layer evolved underneath it.
What I'd change today
I'd write the metadata vocabulary down as a first-class artifact on day one — a versioned schema for what a 'source row' in that table means — instead of letting it accrete as the 3 jobs got smarter. We paid for that implicitly every time a new source needed a parameter the existing rows didn't have. I'd also stand up the generalized pipeline on two or three representative sources end-to-end before scaling to all 23; we scaled slightly too early and rediscovered a class of edge cases at table 14 that we could have caught at table 3. And I'd invest earlier in the operational dashboard that maps metadata keys to failures — by the time we built it, we had three months of CloudWatch grep history we shouldn't have needed.