2026

raghunathmanyam.com — Ask My Career

RAG-grounded career chatbot with cost-capped serverless retrieval

Context

Ask My Career is the personal-scale companion to the Modernization Factory's enterprise-scale agentic work — same RAG-and-guardrails shape, applied to a recruiter-facing chatbot that has to tell the truth on $10/month. The architectural calls: a hard cost ceiling enforced atomically at the database (not in Node) so a 1 AM traffic burst can't overspend; a citation contract that forces every claim back to a source page on the same site, making hallucination unsurvivable by construction; sanitization and embedding at ingest, not at query — chunks SHA-256-diffed and only re-embedded when content actually changes, so retrieval at query time stays cheap and predictable. Voyage AI embeds the question against a Supabase pgvector corpus of ~60 chunks built from every MDX project, journal article, behavioral story, ADR, and CV timeline entry; the top-5 chunks become grounded context for a streaming Claude Haiku 4.5 response over SSE. IPs are never stored raw: a daily-salted SHA-256 hash is the rate-limit key, which means same-day rate limiting works but no cross-day visitor tracking is possible by construction. Sole designer and builder.

Constraints

Hard $10/month spend ceiling — the chatbot is a recruiter convenience, not a budget line item; a burst of curious traffic cannot overspend, even at 1 AM when nobody is watching.
Every factual claim must cite the source page on the same site — a chatbot that invents details about my own career would be worse than no chatbot at all; the citation contract is the trust layer.
No raw IP storage anywhere in the data path — rate limiting must work without durable cross-day visitor tracking; same-day same-IP and different-day same-IP must look different in the DB by design, not by policy.
Build deploys cannot break on embedding-pipeline failures — a transient Voyage outage during prebuild must not block a typo fix from shipping; the chatbot is allowed to stay one revision stale, the deploy is not.
Stream the first token within ~1.5s of the user pressing send — recruiters bounce on a static spinner; latency budget is retrieval + first-token, not retrieval + full answer.

Architecture

Loading diagram...

Data Model

Two tables back the RAG path. `documents` holds one row per ~450-token chunk with (source_type, source_id, source_path, title, chunk_index, content, content_hash, metadata) — composite-unique on (source_type, source_id, chunk_index) so reindex is a clean upsert. `embeddings` is 1:1 with `documents` via FK, holding a vector(512) plus the embedding model name — split out so we can re-embed without rewriting metadata, and so the vector column doesn't bloat document scans that don't need it. The retrieval primitive is a PL/pgSQL function `match_documents(query_embedding, match_count, filter_source_type)` that returns the top-N rows by cosine distance (`1 - (embedding <=> query)`) with an optional source-type filter — called from the /chat route as a single PostgREST RPC. Two more tables enforce cost: `chat_usage` is a monthly rollup keyed by 'YYYY-MM' tracking input_tokens, output_tokens, usd_spent, request_count; `chat_rate_limit` is a per-IP daily counter keyed by (day, ip_hash). Both have atomic increment-or-insert PL/pgSQL functions (`increment_chat_usage`, `increment_rate_limit`) because PostgREST doesn't expose increment-on-conflict — these RPCs are how the cost check stays correct under concurrent requests. RLS is on with no permissive policies on every table; only the secret key reaches the data.

Key Sequence

User types a question on /chat; client POSTs { question } to /api/chat as JSON.
Route handler validates length (2–500 chars), then hashes the client IP with the daily salt + session signing secret (sha256, 32-char prefix).
checkSpendCap reads chat_usage for the current YYYY-MM; if usd_spent ≥ $10 the request 429s with a 'reset on the 1st' message before any paid API call.
checkRateLimit reads chat_rate_limit for (today, ip_hash); if count ≥ 20 the request 429s with a 'reset at UTC midnight' message.
retrieveContext calls Voyage with the question, model voyage-3-lite, input_type='query', output_dimension=512 — returns a 512-dim float vector.
match_documents RPC against Supabase returns the top-5 chunks ordered by cosine distance, each with title, source_path, content, similarity.
Route enqueues an SSE 'citations' event with the chunks' (title, path, similarity) so the UI can render the source list while the LLM is still thinking.
incrementRateLimit reserves the IP's quota slot BEFORE the LLM call to prevent concurrent double-spend; a DB hiccup here is fail-soft so a valid question is never blocked by a transient write error.
Anthropic SDK opens a streaming Messages call to Claude Haiku 4.5 with the system prompt (cache_control: ephemeral) and the user message: 'Context: [Source 1] Title\nPath: …\n\n…' joined by '---' separators, then 'Question: …'.
For each content_block_delta the route writes an SSE 'text' event with the delta string; the browser appends to the assistant message and re-renders markdown inline.
On stream end, incrementSpend computes usd from (input_tokens × $1/M + output_tokens × $5/M), upserts chat_usage; the route then emits an SSE 'done' event and closes the controller.
On any failure (Voyage 5xx, Anthropic timeout, controller error), the route emits an SSE 'error' event with the failure message and closes — the client renders the error inline without losing the citation list.

What I owned

The architectural call: every constraint that matters is enforced at the database boundary, not in application code — the cap is a PL/pgSQL atomic RPC, the citation contract is a system-prompt + source-path data invariant, the corpus freshness is a SHA-256 diff at ingest. The model is the easy part; the boundary is the architecture.
Pattern is identical to the Modernization Factory's: sanitize at ingest, query against pre-sanitized chunks, log per-request cost and citations, fail closed on no-context. Different scale, same shape.
Cost guardrails fail before any paid API call: $10/month spend cap + 20 messages/IP/day rate limit, both enforced via atomic PL/pgSQL RPCs that prevent double-spend under concurrent requests
Streaming SSE response over four event types (citations / text / done / error) — citations arrive before the first token so the UI renders sources while the answer is still streaming
Voyage voyage-3-lite at 512 dimensions for both ingest (input_type=document) and query (input_type=query) — the asymmetric input_type is what makes recall work, easy to miss
Idempotent reindex script chunks every source by section (~450 tokens, 200-char overlap), SHA-256 hashes each chunk for change detection, and only re-embeds chunks that actually changed — runs on prebuild, fails soft so a Voyage outage never blocks a deploy
Row-Level Security enabled on every table with zero permissive policies — the publishable key is locked out completely, all DB access goes through server routes using the secret key
Hashed-IP rate limiting (sha256(ip + daily_salt + secret), 32-char prefix) — same IP appears identically within a day, never across days; no durable cross-day visitor tracking is even possible
Deliberately no ANN index on the 512-dim vector column at ~60-chunk corpus size — sequential scan with cosine distance is faster and exact; IVFFlat would produce poor partitions and skip relevant results until the corpus passes ~1000 chunks

Trade-offs

Picked Voyage voyage-3-lite at 512 dimensions over OpenAI text-embedding-3 or BGE because the asymmetric input_type contract (document at ingest, query at retrieval) measurably improves recall on short questions against prose chunks, and 512 dims is the sweet spot for pgvector at this corpus size; the cost is one more API dependency to monitor and a key-rotation story, but Voyage's per-call price is small enough that the spend cap on Anthropic dominates the budget.
Chose Supabase pgvector over a hosted vector DB (Pinecone / Weaviate / Qdrant Cloud) because the corpus is 60 chunks and the same Supabase project already hosts analytics + cost tables; one Postgres, one set of credentials, one RLS policy story — the cost is that I'm responsible for picking the right index strategy myself, which is why there is no ANN index right now and a documented 'revisit at ~1000 chunks' note.
Chose Claude Haiku 4.5 over Sonnet/Opus because the citation contract does most of the work — the model only needs to paraphrase grounded context with inline links, not reason from scratch — and Haiku at $1/$5 per million tokens lets the $10/month cap absorb ~750+ recruiter questions; the cost is that off-topic / multi-hop questions get shorter, more conservative answers, which is the right failure mode for a hiring conversation.
Built the cost guardrail as two atomic PL/pgSQL RPCs (increment_chat_usage, increment_rate_limit) over a 'select then update' from the route handler because concurrent requests from the same IP would race the check and double-spend the quota; the cost is two RPCs to maintain in schema.sql instead of one route function, but the correctness story now lives in the database where it belongs.
Reserved the rate-limit slot BEFORE the LLM call rather than after the response completes — the failure mode of 'two concurrent questions both pass the check, both stream a response, count gets to 21' was worse than the failure mode of 'a request that errors mid-stream still consumed one of the day's 20 slots'; the cost is that errored requests count against the quota, which I think is fine because errors are rare and the cap is generous.
Hard-failed on no-context (503 'no relevant context yet') rather than falling through to an ungrounded LLM call — an answer the recruiter can't click into and verify is worse than a clear 'email me directly' message; the cost is that very early in a corpus's life (or after a Voyage outage during prebuild) the chatbot is unhelpful by design, which is the trade I want.
Made the prebuild reindex fail-soft (exit 0 on Voyage/Supabase error unless REINDEX_STRICT=1) so a transient API outage during deploy doesn't block a typo fix from shipping; the cost is that the chatbot can be one revision stale without a loud signal, which the admin dashboard is supposed to surface — that's a known gap to close.

What I'd change today

I'd move retrieval to a single PostgREST RPC that batches the rate-limit check, the spend check, and match_documents into one round-trip — three sequential calls is the right shape for clarity but the wrong shape for first-token latency on a cold edge function. I'd also stop emitting citations before retrieval similarity has been validated against a quality floor (e.g. drop the chunk if cosine similarity < 0.45) — right now we always return top-5 even when the top result is a weak match, and the model is well-behaved enough to say 'this isn't really covered' but the UI still renders five misleading source pills. I'd add a per-question evaluation pipeline: log the question, the retrieved chunk ids, the response, and a hash of the answer to a separate table, then run a nightly job that flags low-similarity retrievals and ungrounded sentences for me to review — the corpus has gaps, and right now I find them by accident. And I'd cache the system prompt with a longer TTL than ephemeral — the prompt is stable across thousands of requests, and the prompt-cache savings would meaningfully extend the $10 cap.

Stack

Next.js 16TypeScriptClaude Haiku 4.5Voyage AISupabase pgvectorVercelServer-Sent EventsPostgRESTEdge-safe SHA-256Row-Level SecurityAtomic PL/pgSQL RPCs

Decisions behind this project →