The default mental model for RAG goes: take all the data, embed it, stuff it in a vector DB, retrieve, prompt, done. That model works for the demo. It quietly breaks the moment your system has to be correct about anything that changes — names, counts, prices, dates, statuses, anything a user might check against the live system five minutes after you embedded it.
The interesting design decision in a serverless RAG isn't how you chunk or what model you embed with. It's which piece of context goes where. There are three places, each with a different cost shape and a different staleness story, and the discipline is knowing which one each piece of data belongs in before you run the first embedding job.
Three stores, three jobs
Every serverless RAG I've built (or watched someone else regret) eventually settles into the same three-store shape:
- Vector DB — semantic-retrievable prose. The answer to “what's the most relevant thing I've written about X.” Optimized for fuzzy match by meaning, not by key.
- Structured store — exact-lookup facts. Anything with a stable key that a SQL query would answer faster and more correctly than a cosine-similarity search. Names, dates, IDs, statuses, counts, configuration.
- Live API — anything that has to be authoritative-as-of-now. Account balances, current ticket state, “is the deploy still running,” anything a user can refresh and see change.
Each store has a job. Vector DB's job is fuzzy semantic recall. Structured store's job is exact factual retrieval. Live API's job is freshness. You don't want any of them doing any of the others' jobs, because each is bad at the other two by design.
A working example: the chatbot on this site
The /chat RAG behind this site uses all three. Looking at where each piece actually lives clarifies the pattern faster than any taxonomy:
- In the vector DB (Supabase pgvector, ~60 chunks): every MDX project page, every journal article, every behavioral story, every architecture decision record. These are prose. A recruiter asks “what's the most ambitious project he's led?” — that question has no exact key. Fuzzy-match by meaning is exactly what's needed, and freshness measured in days is fine because these texts change at editorial speed.
- In the structured store (Supabase Postgres tables): the cost ceiling counter, the per-IP rate-limit counter, the system-prompt version, the embedding-model version, the current canonical job title, the contact email. A SQL row lookup keyed by date or version returns in milliseconds and gives back exactly what was last written; embedding any of these would be slower and stochastically wrong some fraction of the time.
- Live, on every request: nothing today, but the architectural slot exists for things like “has the daily rate limit reset since I last counted.” The structured counter sits inside the same transaction as the increment, so today the answer comes from the structured store. If/when there's a piece of context that has to reflect now (current job availability, an updated calendar), it goes here — not in the vector DB just because vector DBs are convenient.
The trap I see most often is the second and third bullets collapsing into the first. Someone notices the vector DB is “already there,” embeds whatever feels relevant, and accepts the stale-data tax because writing a query feels heavier than writing an embedding. That's how a chatbot ends up confidently citing a job title from six months ago.
The deciding question: how stale can this be before it's a lie?
Every piece of context has a freshness window: how long the retrieved value can be wrong before the user notices, or before someone notices for them. The window picks the store:
- Days or weeks — vector DB is fine. Re-embed on content change, ship the cycle through your deploy pipeline.
- Minutes to hours — structured store with a write path. The structured store updates as fast as a SQL insert; embedding round-trips would dominate latency and add inconsistency.
- Seconds, or has-to-be-now — live API. Don't even cache. The model gets the answer at request time, with a source attribution that names the API and the timestamp.
Writing the freshness window down explicitly per data type is the single highest-leverage discipline in serverless RAG design. It's also the one nobody documents. Most systems pick the store by what's familiar, not by what the data requires.
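One way to write the window down is as data that lives next to the ingest code. The sketch below is a minimal illustration, not a prescription: the cut-offs (`LIVE_BELOW`, `STRUCTURED_BELOW`) and the per-type windows are assumed values chosen to mirror the three tiers above, and every name in it is hypothetical.

```python
from datetime import timedelta
from enum import Enum


class Store(Enum):
    VECTOR_DB = "vector_db"
    STRUCTURED = "structured_store"
    LIVE_API = "live_api"


# Assumed cut-offs. The tiers above only say "days or weeks",
# "minutes to hours", and "seconds", so tune these per system.
LIVE_BELOW = timedelta(minutes=1)
STRUCTURED_BELOW = timedelta(days=1)


def pick_store(freshness_window: timedelta) -> Store:
    """Map 'how stale before it's a lie' to the store that can honor it."""
    if freshness_window < LIVE_BELOW:
        return Store.LIVE_API
    if freshness_window < STRUCTURED_BELOW:
        return Store.STRUCTURED
    return Store.VECTOR_DB


# Illustrative windows, declared once per data type.
FRESHNESS_WINDOWS = {
    "project_page": timedelta(days=7),
    "job_title": timedelta(hours=1),
    "deploy_status": timedelta(seconds=5),
}

PLACEMENT = {name: pick_store(window) for name, window in FRESHNESS_WINDOWS.items()}

for name, store in PLACEMENT.items():
    print(f"{name}: {store.value}")
```

The point of the table is less the function than the review it forces: adding a new data type means declaring its window before anything gets embedded.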
The cost shape — each store has a different curve
Cost is the other dimension that pushes data toward one store or another, and the three shapes are dramatically different:
- Vector DB: cost lives at ingest (the embedding API call, once per chunk per revision) and is near-free at retrieval (cosine similarity over a row scan, until the corpus gets big enough to need an ANN index). Linear in corpus size, sublinear in query rate.
- Structured store: cost is essentially free for both writes and reads. A PostgREST row lookup is a Postgres index scan. The database is already there. The marginal cost of one more piece of state in a structured table is rounding error.
- Live API: cost is per-request, paid by the end user's patience and by your provider quota. A GitHub-rate-limited fetch on every chat request is fine until there's a viral hour and you start getting 429s.
The cost-aware version of the staleness rule: if a piece of data is freshness-tolerant but expensive to embed, push it to the structured store rather than re-embedding on every change. If a piece of data is freshness-sensitive but rate-limited at the source, accept a bounded amount of staleness and cache it in the structured store with a short TTL. The store is picked by the intersection of “how stale before it lies” and “how much does each retrieval cost,” not by either one alone.
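The short-TTL case can be sketched as a small wrapper around the rate-limited fetch. This is an assumption-heavy illustration: it caches in process memory to stay self-contained, whereas a serverless deployment would more likely keep the value and its fetch time in a table row, since instances don't share memory. `TTLCachedValue` and the injected `clock` are hypothetical names.

```python
import time
from typing import Any, Callable, Optional


class TTLCachedValue:
    """Caches one value from a rate-limited source for ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            # Only this branch spends upstream quota.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Usage with a fake clock and a counter standing in for the upstream API.
upstream_calls = []
fake_now = [0.0]


def fetch_from_upstream() -> int:
    upstream_calls.append(1)
    return len(upstream_calls)


cached = TTLCachedValue(fetch_from_upstream, ttl_seconds=60, clock=lambda: fake_now[0])
first = cached.get()       # cache is empty, so this calls upstream
fake_now[0] = 30.0
second = cached.get()      # still inside the TTL, served from cache
fake_now[0] = 61.0
third = cached.get()       # TTL elapsed, so this calls upstream again
```

The TTL is the freshness window made explicit: the value can be at most `ttl_seconds` old, and upstream sees at most one call per window per cache.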
The correctness shape — what kind of wrong can each store be?
The failure mode is the part nobody talks about until the bug report lands:
- Vector DB returns the wrong-but-similar answer. Two projects with similar tech stacks; the wrong one comes back as the top result; the model paraphrases credibly. This is a feature when retrieving prose, and a bug when retrieving facts.
- Structured store returns stale data. The row is exactly correct as of its last write. If the write hasn't happened, the row is silently old. The failure is invisible unless you've attached a freshness check.
- Live API returns wrong-or-throws. Network partitions, rate limits, schema drift. The failure is loud — which is in some ways the best failure mode, because you have to handle it explicitly.
Match the failure mode to the data's tolerance. The recruiter's question “what tech stack does Cash Clearance use” can survive a slightly-wrong vector match — the model will still produce a defensible paraphrase and cite the right page. The recruiter's question “what email do I reach you at” cannot survive a fuzzy match; wrong-but-similar is wrong. That one goes in the structured store, retrieved exactly, returned verbatim.
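A minimal sketch of the exact-retrieval side, using in-memory SQLite as a stand-in for the Postgres table so the example runs anywhere. The `site_facts` table, its columns, and the sample row are all hypothetical.

```python
import sqlite3
from typing import Optional

# In-memory SQLite stands in for the structured store; schema and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site_facts (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
conn.execute(
    "INSERT INTO site_facts (key, value) VALUES (?, ?)",
    ("contact_email", "hello@example.com"),
)
conn.commit()


def lookup_fact(db: sqlite3.Connection, key: str) -> Optional[str]:
    """Exact-key lookup: returns the stored value verbatim, or None.

    There is deliberately no closest-match fallback. A missing key should
    surface as a miss, not as a similar-looking fact.
    """
    row = db.execute("SELECT value FROM site_facts WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


print(lookup_fact(conn, "contact_email"))
print(lookup_fact(conn, "contact_emial"))  # typo in the key: a miss, not a near match
```

The second call is the property that matters: a near-miss key returns nothing rather than something plausible, which is the opposite of what a similarity search would do.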
The integration shape — one prompt, three sources
The model only sees one prompt. The architecture's job is to assemble that prompt from the three stores in a way that preserves where each piece came from, so the cite path works and so a future eval can attribute a wrong answer to the right store.
The pattern I land on: a small assembler function in the route handler that takes the user's question, fans out to vector retrieval + structured fetch + (optional) live calls in parallel, then composes the system message with explicit labeled sections (“Relevant prose:”, “Verified facts:”, “Live state at request-time:”). Each section can carry its own citation format — vector chunks cite the source path; structured facts cite the table and key; live calls cite the API and the timestamp. The model knows where each fact came from, and so does the audit trail.
The parallel fanout matters. When the three calls run concurrently, cold-path latency for a serverless chat handler is bounded by the slowest of them, not by their sum. Running them sequentially because the code is simpler to read gives that saving away.
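The assembler and its fanout can be sketched in a few lines with `asyncio.gather`. This is an illustration under stated assumptions, not the site's actual handler: the three fetchers are stubs, the sample content and citation formats are invented, and only the section labels come from the description above.

```python
import asyncio
from typing import Awaitable, Callable, List

Fetcher = Callable[[str], Awaitable[List[str]]]

SECTION_LABELS = ("Relevant prose:", "Verified facts:", "Live state at request-time:")


async def assemble_context(
    question: str,
    vector_search: Fetcher,
    structured_fetch: Fetcher,
    live_fetch: Fetcher,
) -> str:
    """Fan out to the three stores concurrently and label each result set."""
    # gather() schedules all three coroutines concurrently, so the wait is
    # roughly the slowest call rather than the sum of all three.
    results = await asyncio.gather(
        vector_search(question),
        structured_fetch(question),
        live_fetch(question),
    )
    parts = []
    for label, lines in zip(SECTION_LABELS, results):
        if lines:  # drop empty sections, e.g. no live source wired up yet
            parts.append(label + "\n" + "\n".join(f"- {line}" for line in lines))
    return "\n\n".join(parts)


# Stub fetchers with made-up content, just to exercise the assembler.
events: List[str] = []


def make_stub(name: str, lines: List[str]) -> Fetcher:
    async def fetch(question: str) -> List[str]:
        events.append(f"start:{name}")
        await asyncio.sleep(0.01)  # stands in for network latency
        events.append(f"end:{name}")
        return lines

    return fetch


context = asyncio.run(
    assemble_context(
        "how do I contact you?",
        make_stub("vector", ["Example project write-up [source: projects/example.mdx]"]),
        make_stub("structured", ["contact_email = hello@example.com [source: site_facts/contact_email]"]),
        make_stub("live", []),
    )
)
print(context)
```

As written, one failing fetcher fails the whole request. A real handler would decide that per source, for example by passing `return_exceptions=True` to `gather` and turning a failed live call into a labeled "unavailable" line.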
What I'd change today
Three things this pattern gets only partially right, written down so I'll remember to revisit:
The freshness contract isn't machine-checked. Right now the “how stale can this be” rule lives in my head per data type. The principal-level move is to attach a freshness budget to each chunk or row at ingest time and have a nightly job that flags any data whose last-updated timestamp is older than its declared budget. That lets compliance answer “is anything in the retrieval surface lying right now” without a manual audit.
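The core of such a nightly check is small. A sketch, assuming each chunk or row carries a timezone-aware `last_updated` and a declared `freshness_budget`; the `RetrievalItem` shape and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class RetrievalItem:
    item_id: str
    last_updated: datetime        # timezone-aware, written at ingest
    freshness_budget: timedelta   # declared at ingest, per data type


def find_stale(items: Iterable[RetrievalItem], now: datetime) -> List[RetrievalItem]:
    """Return every item whose age exceeds its own declared budget."""
    return [item for item in items if now - item.last_updated > item.freshness_budget]


# Made-up rows and a fixed 'now' so the result is reproducible.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
items = [
    RetrievalItem("project_page", datetime(2024, 5, 28, tzinfo=timezone.utc), timedelta(days=7)),
    RetrievalItem("job_title", datetime(2024, 5, 31, tzinfo=timezone.utc), timedelta(hours=1)),
]

stale = find_stale(items, now)
for item in stale:
    print(f"STALE: {item.item_id} (age {now - item.last_updated}, budget {item.freshness_budget})")
```

Passing `now` in rather than reading the clock inside the function keeps the check deterministic under test.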
The composition order isn't tunable. The prompt always shows vector chunks first, then structured facts, then live state. For most questions that's right. For questions where the structured facts dominate (“how do I contact you”), the vector chunks act as noise the model has to filter past. A small classifier that decides the composition order based on question intent would be cheap and directly observable in retrieval-quality evals.
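As a placeholder for that classifier, even a keyword heuristic makes the ordering decision explicit and testable. The sketch below is a stand-in, not a trained classifier; the hint list and both orderings are assumptions.

```python
from typing import List

PROSE_FIRST = ["Relevant prose:", "Verified facts:", "Live state at request-time:"]
FACTS_FIRST = ["Verified facts:", "Live state at request-time:", "Relevant prose:"]

# Hypothetical phrases that suggest the question wants an exact fact.
FACT_HINTS = ("contact", "email", "reach you", "job title")


def section_order(question: str) -> List[str]:
    """Choose the prompt section order from a crude guess at question intent."""
    lowered = question.lower()
    if any(hint in lowered for hint in FACT_HINTS):
        return list(FACTS_FIRST)
    return list(PROSE_FIRST)


print(section_order("How do I contact you?"))
print(section_order("What's the most ambitious project he's led?"))
```

Because the function returns only an ordering, swapping the heuristic for a real classifier later would not change the assembler that consumes it.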
The live-API slot is theoretical, not exercised. The architecture has a slot for live calls but my current site doesn't use it. That's honest, but it also means the first time I add a live source I'll discover the failure modes in production. Even a placeholder live source (current UTC time as the simplest example) would let me wire and harden the integration path now, before it has to carry real traffic.
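A placeholder live source along those lines might look like the sketch below. It assumes the section label used for live state elsewhere in this post; the function names, the `source` string, and the attribution format are hypothetical. The fetch is injectable so the failure path can be exercised without a real outage.

```python
from datetime import datetime, timezone
from typing import Callable

LIVE_LABEL = "Live state at request-time:"


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


def live_section(fetch: Callable[[], datetime] = utc_now, source: str = "system-clock") -> str:
    """Render the live section, naming the source and the fetch timestamp."""
    try:
        value = fetch()
    except Exception as exc:  # broad on purpose: any live failure becomes a labeled miss
        return f"{LIVE_LABEL}\n- unavailable [source: {source}, error: {type(exc).__name__}]"
    stamp = value.isoformat(timespec="seconds")
    return f"{LIVE_LABEL}\n- current UTC time is {stamp} [source: {source}, fetched at {stamp}]"


def failing_fetch() -> datetime:
    raise TimeoutError("simulated upstream timeout")


print(live_section())
print(live_section(fetch=failing_fetch, source="simulated-api"))
```

The failure branch is the part worth wiring early: it decides what the model sees when the live source is down, which is exactly the behavior that otherwise gets discovered in production.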
Why the answer isn't “put it all in the vector DB”
Putting everything in the vector DB feels architecturally clean. One store, one retrieval mechanism, one mental model. It scales as a demo and falls over as a system, and the failure is diffuse: small inaccuracies, dated answers, occasional confidently-wrong claims that nobody can quite pin to a single bug. Splitting by store is more architecture up front in exchange for failures that are localized and debuggable when they happen.
The discipline is the same one that shows up everywhere in systems design: pick the storage primitive whose semantics match the data's actual properties, not the primitive that's newest or most convenient. A vector DB is good at fuzzy semantic recall. A structured store is good at exact facts. A live API is good at “tell me now.” The interesting work is matching each piece of context to the store whose strengths line up with what the data needs to be.
Companion reading: the earlier post on grounding this site in a vector DB covers the cost ceiling and citation contract pieces in more detail. The compliant-LLM reference architecture applies the same store-selection discipline to a regulated workload, where the freshness and audit constraints push more data toward the structured store than this site needs to.