Grounding a career site in a vector DB: a $10/month chatbot that has to tell the truth

Building /chat on my own site forced three decisions I usually get to defer on someone else's product: where the cost ceiling lives, what counts as a citation, and how to keep an embedding pipeline from breaking a deploy.

The brief was simple enough to be misleading: put a chatbot on the career site so a recruiter who lands at raghunathmanyam.com can ask “what's the most ambitious project he's led?” and get a sourced answer in under five seconds. The hard part isn't the model. The hard parts are the three constraints an “LLM on your site” framing tends to skip:

The site is a $10/month hobby project. A burst of curious traffic at 1 AM cannot overspend.
Every claim has to cite back to the site. A chatbot that hallucinates about my own career is worse than no chatbot.
The embedding pipeline runs on every Vercel build. It cannot break a deploy when the typo fix is unrelated to the chat path.

Where the cost ceiling actually lives

The naive shape is to check spend inside the route handler before the LLM call. That shape only holds up under concurrent requests if the check and the increment are atomic in the database, not in Node.

Two tables in Supabase carry this: chat_usage is a monthly rollup keyed by YYYY-MM with usd_spent, input_tokens, output_tokens, request_count; chat_rate_limit is a per-IP daily counter keyed by (day, ip_hash). Both have atomic PL/pgSQL increment-or-insert functions exposed as PostgREST RPCs, because PostgREST doesn't expose increment-on-conflict on a regular upsert. The route handler reads the current value, decides allow / deny, and then — before the LLM call — reserves the slot via the RPC.

The reserve-before-call order matters. If you increment after the response completes, two concurrent requests from the same IP both pass the “count < 20” check, both stream a paid response, and the counter climbs to 21. Reserving first means an errored mid-stream request still consumes one of the day's 20 slots — which I think is the right failure mode, because errors are rare and the cap is generous. The opposite failure mode is a bot spending real money.
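
The reserve-first ordering can be sketched in a few lines. This is an in-memory stand-in, not the site's code: `RateLimitStore.reserve` plays the role of the atomic increment-or-insert RPC, a lock stands in for the database's atomicity, and the names `reserve`, `handle_request`, and `simulate` are assumptions made for illustration. Python is used only as a compact sketch language.

```python
import threading
from collections import defaultdict

DAILY_CAP = 20  # per-IP daily cap from the text


class RateLimitStore:
    """In-memory stand-in for chat_rate_limit plus its increment-or-insert RPC (hypothetical)."""

    def __init__(self) -> None:
        self._counts: dict[tuple[str, str], int] = defaultdict(int)
        self._lock = threading.Lock()  # stands in for the database's atomicity

    def reserve(self, day: str, ip_hash: str) -> int:
        """Atomically increment the (day, ip_hash) counter and return the new value."""
        with self._lock:
            self._counts[(day, ip_hash)] += 1
            return self._counts[(day, ip_hash)]


def handle_request(store: RateLimitStore, day: str, ip_hash: str) -> bool:
    """Reserve a slot first; only a reservation within the cap may reach the LLM."""
    new_count = store.reserve(day, ip_hash)
    return new_count <= DAILY_CAP


def simulate(concurrent_requests: int) -> int:
    """Fire requests from one IP in parallel and count how many were allowed."""
    store = RateLimitStore()
    results: list[bool] = []
    results_lock = threading.Lock()

    def worker() -> None:
        allowed = handle_request(store, "2025-01-01", "abc123")
        with results_lock:
            results.append(allowed)

    threads = [threading.Thread(target=worker) for _ in range(concurrent_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Because the decision is made on the value the increment returns, no two requests can both observe the last free slot. In this sketch denied requests still bump the counter, which is harmless since the comparison is always against the cap.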

The citation contract is the trust layer

The model isn't doing very much reasoning. By design. The system prompt does three things:

Restrict answers to the provided context. No inventing salary expectations, references, relocation willingness, or anything not present in retrieval.
Force every factual claim to cite via a markdown link with the exact path from the retrieved chunk's source_path field — not a guess at the URL, not a paraphrase of the title.
When the context doesn't cover the question, say so and point to r2manyam@icloud.com instead of papering over the gap.

The pieces that make that contract work aren't in the prompt. They're in retrieval. Embeddings come from Voyage's voyage-3-lite at 512 dimensions, with one asymmetric detail that's easy to miss: the ingest call passes input_type="document" and the query call passes input_type="query". The two embeddings are tuned for different sides of the same retrieval. Recall on a short recruiter question against prose chunks gets measurably better with the asymmetric setting, and forgetting it is the kind of bug that doesn't throw; it just quietly returns the wrong neighbors.
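
The asymmetry comes down to one field per request. The sketch below builds the two request bodies without sending them; the helper name `embedding_request` and the placeholder texts are assumptions, and the field names (`input`, `model`, `input_type`) follow Voyage's documented REST request shape, which is worth verifying against the current API reference before relying on it.

```python
from typing import Literal

MODEL = "voyage-3-lite"  # 512-dimensional embeddings, per the text


def embedding_request(texts: list[str], side: Literal["document", "query"]) -> dict:
    """Build the JSON body for one embeddings call (hypothetical helper).

    `side` must match how the text is used: "document" at ingest, "query" at search.
    """
    if side not in ("document", "query"):
        raise ValueError(f"unknown input_type: {side!r}")
    return {"input": texts, "model": MODEL, "input_type": side}


# Ingest time: corpus chunks are embedded as documents (placeholder text).
ingest_body = embedding_request(["chunk text from a project page"], "document")

# Query time: the recruiter's question is embedded as a query.
query_body = embedding_request(["what's the most ambitious project he's led?"], "query")
```

Routing both call sites through one helper with a required `side` argument makes the silent failure harder to write: there is no default to forget.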

The corpus is everything on the site: every *.mdx project, every TSX journal article, every behavioral story, every ADR, every career-timeline entry. The chatbot reads exactly what a recruiter could read by clicking around. Which means the citation always points at a page that actually exists with the content the model paraphrased.
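
That guarantee holds only if the model copies source_path verbatim, and that part can be checked mechanically. The sketch below is a hypothetical post-hoc validator, not something the site is described as running: it extracts inline markdown link targets from an answer and returns any that are not among the retrieved chunks' source_path values. The function name and the example paths are invented for illustration.

```python
import re

# Matches the target of an inline markdown link: [label](target)
LINK_TARGET = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def invented_links(answer: str, source_paths: set[str]) -> list[str]:
    """Return link targets in the answer that are not exact retrieved source paths."""
    targets = LINK_TARGET.findall(answer)
    return [t for t in targets if t not in source_paths]


# Hypothetical paths, for illustration only.
retrieved = {"/projects/example-project", "/journal/example-article"}
answer = (
    "He led [Example Project](/projects/example-project) "
    "and wrote [a post](/journal/made-up)."
)
print(invented_links(answer, retrieved))  # ['/journal/made-up']
```

An empty result means every link in the answer points at a page that was actually retrieved; a non-empty one is a candidate for logging or for stripping the link before display.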

Why there's no vector index

I deliberately did not put an ANN index on the embeddings column. That looks wrong on paper — pgvector ships with IVFFlat and HNSW; the tutorials assume you'll use one. At the corpus size I'm at — about 60 chunks — sequential scan with cosine distance (embedding <=> query) is faster and exact. IVFFlat with lists=100 on under 100 rows produces poor partitions and starts skipping relevant results. HNSW is the right answer at ~1000 chunks; I put a comment in schema.sql saying so. Until then, the planner does the right thing by default.
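
For intuition, the exact scan amounts to the following: score every row, sort, keep the closest few. This is a pure-Python sketch of the idea rather than what Postgres executes internally; the function names and toy vectors are assumptions, and embeddings are assumed to be non-zero.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)  # assumes neither vector is all zeros


def top_k(query: list[float], rows: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Exact nearest neighbours: score every row, sort by distance, keep the k closest."""
    scored = [(chunk_id, cosine_distance(query, emb)) for chunk_id, emb in rows.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dimensional "embeddings" with hypothetical chunk ids.
rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.0], rows, k=2)
```

At roughly 60 rows of 512 floats this is about 30,000 multiply-adds per query, which is why an approximate index has nothing to buy back at this scale.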

Streaming the answer before retrieval finishes mattering

The response is Server-Sent Events with four event types:

citations — emitted before the LLM is even called, so the source pills render while the model is still thinking.
text — one per delta from the Anthropic stream; the client appends to the assistant message and re-renders inline markdown links as it goes.
done — clean close, with token counts.
error — the failure mode of choice, written into the same stream so the citation list survives even if the model call dies mid-token.
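
The event ordering above can be sketched as a generator. The payload shapes (`citations`, `delta`, `message`) and the function names are assumptions; the real `done` event carries token counts, which this sketch leaves out to stay short.

```python
import json
from typing import Iterable, Iterator


def sse_frame(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event: an event line, a data line, then a blank line."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def chat_stream(citations: list[dict], deltas: Iterable[str]) -> Iterator[str]:
    """Yield citations first, then text deltas, then done; on failure yield error instead."""
    # Citations go out before the model stream is consumed at all.
    yield sse_frame("citations", {"citations": citations})
    try:
        for delta in deltas:
            yield sse_frame("text", {"delta": delta})
    except Exception as exc:  # the model call died mid-stream
        yield sse_frame("error", {"message": str(exc)})
        return
    yield sse_frame("done", {})  # token counts omitted in this sketch


def failing_deltas() -> Iterator[str]:
    """Hypothetical upstream that fails after one delta."""
    yield "partial"
    raise RuntimeError("upstream closed")


ok_events = [f.split("\n")[0] for f in chat_stream([{"source_path": "/x"}], ["Hi", " there"])]
bad_events = [f.split("\n")[0] for f in chat_stream([], failing_deltas())]
```

In the failing case the client has already received the `citations` frame by the time `error` arrives, which is the property the fourth event type exists to protect.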

The user-visible effect is that the moment they press send, five source pills appear and they can already click one. The answer streams in over the next few seconds against the context of knowing what the model is reading from. That ordering does more for perceived trust than any amount of latency tuning.

Reindexing without breaking the build

The reindex script runs on Vercel's prebuild step. It reads every embeddable source, chunks it on paragraph boundaries with ~450 tokens per chunk and 200-char overlap, SHA-256-hashes each chunk's content, and compares against the existing content_hash in documents. Only new or changed chunks get re-embedded. Stale rows (chunks that no longer exist in the source) are deleted; embeddings cascade via FK.
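
The hash comparison is the part that keeps the build cheap. A minimal sketch of that diff, assuming each chunk has a stable key (for example a source path plus chunk index, which is an assumption here, as are the function names); chunking itself is left out.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the chunk text, hex-encoded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_chunks: dict[str, str], stored_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare freshly built chunks against the hashes already stored.

    current_chunks maps chunk key -> chunk text from the source files.
    stored_hashes maps chunk key -> content_hash already in the documents table.
    Returns (keys to embed, keys to delete), both sorted.
    """
    to_embed = sorted(
        key
        for key, text in current_chunks.items()
        if stored_hashes.get(key) != content_hash(text)  # new or changed
    )
    to_delete = sorted(key for key in stored_hashes if key not in current_chunks)  # stale
    return to_embed, to_delete


# Hypothetical keys and texts, for illustration only.
stored = {
    "a#0": content_hash("same"),
    "a#1": content_hash("old"),
    "b#0": content_hash("gone"),
}
current = {"a#0": "same", "a#1": "new", "c#0": "added"}
to_embed, to_delete = plan_reindex(current, stored)
```

A typo fix in one paragraph changes one hash, so a typical deploy sends at most a handful of chunks to the embedding API.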

The script is fail-soft on purpose. If VOYAGE_API_KEY isn't set, if Voyage returns a 5xx, if Supabase is briefly unreachable, it logs and exits zero. The chatbot stays one revision stale. The deploy still ships. Setting REINDEX_STRICT=1 inverts that policy if I ever want the build to gate on corpus freshness. So far I haven't needed to.
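
The policy fits in one wrapper around the real work. A minimal sketch, assuming the reindex step raises on any of the failures listed; `run_reindex` and its `env` parameter are hypothetical names, and only REINDEX_STRICT comes from the text.

```python
import os
from typing import Callable, Mapping


def run_reindex(reindex: Callable[[], None], env: Mapping[str, str] = os.environ) -> int:
    """Run the reindex step and return the process exit code.

    Fail-soft by default: any error is logged and the build continues (exit 0).
    With REINDEX_STRICT=1 the same error fails the build (exit 1).
    """
    strict = env.get("REINDEX_STRICT") == "1"
    try:
        reindex()
    except Exception as exc:
        print(f"reindex failed: {exc}")
        return 1 if strict else 0
    return 0


def failing_step() -> None:
    """Hypothetical reindex step that hits an upstream error."""
    raise RuntimeError("embedding API returned 503")
```

In a real script the return value would be passed to `sys.exit`, so the prebuild step sees a zero exit code unless strict mode is on.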

What I'd do differently

Three things I'd change today, written down so I remember:

Retrieval should be one RPC, not three round-trips. The spend check, the rate-limit check, and match_documents are three sequential PostgREST calls. Clarity per call is good; first-token latency on a cold edge is bad. A single PL/pgSQL function that returns { allowed, reason, citations } would cut the wait before the first byte by about half.

Don't emit citations below a similarity floor. Right now the route always returns top-5 chunks, even when the top result is a weak cosine match. The model handles it (it says “this isn't really covered, email Raghu directly”), but the UI has already rendered five misleading source pills by the time the answer admits the corpus doesn't know. Dropping any chunk whose cosine similarity falls below a floor of 0.45 would fix that.
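
The filter itself is a one-liner. In this sketch the `similarity` and `source_path` field names and the sample scores are assumptions; only the 0.45 threshold comes from the text.

```python
SIMILARITY_FLOOR = 0.45  # threshold proposed in the text


def citations_above_floor(matches: list[dict], floor: float = SIMILARITY_FLOOR) -> list[dict]:
    """Keep only matches whose cosine similarity reaches the floor."""
    return [m for m in matches if m["similarity"] >= floor]


# Hypothetical retrieval results, for illustration only.
matches = [
    {"source_path": "/projects/example-a", "similarity": 0.62},
    {"source_path": "/journal/example-b", "similarity": 0.45},
    {"source_path": "/stories/example-c", "similarity": 0.31},
]
kept = citations_above_floor(matches)
```

When nothing clears the floor the list is empty, and the UI can skip the source pills entirely instead of showing five that the answer is about to disown.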

Log retrieval quality, not just usage. The current tables tell me how much I spent. They don't tell me which questions retrieved garbage. A per-question table with the question, the chunk ids, the cosine scores, and a hash of the answer would let me run a nightly check for low-similarity retrievals and ungrounded sentences. Right now I find the gaps by accident. Principal-level systems don't find their gaps by accident.

Why this is on the site

I could have written this as a Medium post and left it at that. The reason it lives at /chat is the same reason every project I take seriously eventually ends up as a service rather than a slide deck: shipping forces honesty. The cost cap is real because it's real money. The citation contract is enforced because the recruiter clicks the link. The reindex pipeline is fail-soft because the next deploy ships either way. None of those constraints survive contact with a hypothetical chatbot in a Notion doc. They all survive contact with one running at a URL someone is about to paste into a hiring channel.