raghunathmanyam.com — Ask My Career
RAG-grounded career chatbot with cost-capped serverless retrieval
Context
Ask My Career is the personal-scale companion to the Modernization Factory's enterprise-scale agentic work — same RAG-and-guardrails shape, applied to a recruiter-facing chatbot that has to tell the truth on $10/month. The architectural calls: a hard cost ceiling enforced atomically at the database (not in Node) so a 1 AM traffic burst can't overspend; a citation contract that forces every claim back to a source page on the same site, making hallucination unsurvivable by construction; sanitization and embedding at ingest, not at query — chunks SHA-256-diffed and only re-embedded when content actually changes, so retrieval at query time stays cheap and predictable. Voyage AI embeds the question against a Supabase pgvector corpus of ~60 chunks built from every MDX project, journal article, behavioral story, ADR, and CV timeline entry; the top-5 chunks become grounded context for a streaming Claude Haiku 4.5 response over SSE. IPs are never stored raw: a daily-salted SHA-256 hash is the rate-limit key, which means same-day rate limiting works but no cross-day visitor tracking is possible by construction. Sole designer and builder.
Constraints
- Hard $10/month spend ceiling — the chatbot is a recruiter convenience, not a budget line item; a burst of curious traffic cannot overspend, even at 1 AM when nobody is watching.
- Every factual claim must cite the source page on the same site — a chatbot that invents details about my own career would be worse than no chatbot at all; the citation contract is the trust layer.
- No raw IP storage anywhere in the data path — rate limiting must work without durable cross-day visitor tracking; same-day same-IP and different-day same-IP must look different in the DB by design, not by policy.
- Build deploys cannot break on embedding-pipeline failures — a transient Voyage outage during prebuild must not block a typo fix from shipping; the chatbot is allowed to stay one revision stale, the deploy is not.
- Stream the first token within ~1.5s of the user pressing send — recruiters bounce on a static spinner; latency budget is retrieval + first-token, not retrieval + full answer.
Architecture
Data Model
Two tables back the RAG path. `documents` holds one row per ~450-token chunk with (source_type, source_id, source_path, title, chunk_index, content, content_hash, metadata) — composite-unique on (source_type, source_id, chunk_index) so reindex is a clean upsert. `embeddings` is 1:1 with `documents` via FK, holding a vector(512) plus the embedding model name — split out so we can re-embed without rewriting metadata, and so the vector column doesn't bloat document scans that don't need it. The retrieval primitive is a PL/pgSQL function `match_documents(query_embedding, match_count, filter_source_type)` that returns the top-N rows by cosine distance (`1 - (embedding <=> query)`) with an optional source-type filter — called from the /chat route as a single PostgREST RPC. Two more tables enforce cost: `chat_usage` is a monthly rollup keyed by 'YYYY-MM' tracking input_tokens, output_tokens, usd_spent, request_count; `chat_rate_limit` is a per-IP daily counter keyed by (day, ip_hash). Both have atomic increment-or-insert PL/pgSQL functions (`increment_chat_usage`, `increment_rate_limit`) because PostgREST doesn't expose increment-on-conflict — these RPCs are how the cost check stays correct under concurrent requests. RLS is on with no permissive policies on every table; only the secret key reaches the data.
Key Sequence
- User types a question on /chat; client POSTs { question } to /api/chat as JSON.
- Route handler validates length (2–500 chars), then hashes the client IP with the daily salt + session signing secret (sha256, 32-char prefix).
- checkSpendCap reads chat_usage for the current YYYY-MM; if usd_spent ≥ $10 the request 429s with a 'reset on the 1st' message before any paid API call.
- checkRateLimit reads chat_rate_limit for (today, ip_hash); if count ≥ 20 the request 429s with a 'reset at UTC midnight' message.
- retrieveContext calls Voyage with the question, model voyage-3-lite, input_type='query', output_dimension=512 — returns a 512-dim float vector.
- match_documents RPC against Supabase returns the top-5 chunks ordered by cosine distance, each with title, source_path, content, similarity.
- Route enqueues an SSE 'citations' event with the chunks' (title, path, similarity) so the UI can render the source list while the LLM is still thinking.
- incrementRateLimit reserves the IP's quota slot BEFORE the LLM call to prevent concurrent double-spend; a DB hiccup here is fail-soft so a valid question is never blocked by a transient write error.
- Anthropic SDK opens a streaming Messages call to Claude Haiku 4.5 with the system prompt (cache_control: ephemeral) and the user message: 'Context: [Source 1] Title\nPath: …\n\n…' joined by '---' separators, then 'Question: …'.
- For each content_block_delta the route writes an SSE 'text' event with the delta string; the browser appends to the assistant message and re-renders markdown inline.
- On stream end, incrementSpend computes usd from (input_tokens × $1/M + output_tokens × $5/M), upserts chat_usage; the route then emits an SSE 'done' event and closes the controller.
- On any failure (Voyage 5xx, Anthropic timeout, controller error), the route emits an SSE 'error' event with the failure message and closes — the client renders the error inline without losing the citation list.
What I owned
- The architectural call: every constraint that matters is enforced at the database boundary, not in application code — the cap is a PL/pgSQL atomic RPC, the citation contract is a system-prompt + source-path data invariant, the corpus freshness is a SHA-256 diff at ingest. The model is the easy part; the boundary is the architecture.
- Pattern is identical to the Modernization Factory's: sanitize at ingest, query against pre-sanitized chunks, log per-request cost and citations, fail closed on no-context. Different scale, same shape.
- Cost guardrails fail before any paid API call: $10/month spend cap + 20 messages/IP/day rate limit, both enforced via atomic PL/pgSQL RPCs that prevent double-spend under concurrent requests
- Streaming SSE response over four event types (citations / text / done / error) — citations arrive before the first token so the UI renders sources while the answer is still streaming
- Voyage voyage-3-lite at 512 dimensions for both ingest (input_type=document) and query (input_type=query) — the asymmetric input_type is what makes recall work, easy to miss
- Idempotent reindex script chunks every source by section (~450 tokens, 200-char overlap), SHA-256 hashes each chunk for change detection, and only re-embeds chunks that actually changed — runs on prebuild, fails soft so a Voyage outage never blocks a deploy
- Row-Level Security enabled on every table with zero permissive policies — the publishable key is locked out completely, all DB access goes through server routes using the secret key
- Hashed-IP rate limiting (sha256(ip + daily_salt + secret), 32-char prefix) — same IP appears identically within a day, never across days; no durable cross-day visitor tracking is even possible
- Deliberately no ANN index on the 512-dim vector column at ~60-chunk corpus size — sequential scan with cosine distance is faster and exact; IVFFlat would produce poor partitions and skip relevant results until the corpus passes ~1000 chunks
Trade-offs
- Picked Voyage voyage-3-lite at 512 dimensions over OpenAI text-embedding-3 or BGE because the asymmetric input_type contract (document at ingest, query at retrieval) measurably improves recall on short questions against prose chunks, and 512 dims is the sweet spot for pgvector at this corpus size; the cost is one more API dependency to monitor and a key-rotation story, but Voyage's per-call price is small enough that the spend cap on Anthropic dominates the budget.
- Chose Supabase pgvector over a hosted vector DB (Pinecone / Weaviate / Qdrant Cloud) because the corpus is 60 chunks and the same Supabase project already hosts analytics + cost tables; one Postgres, one set of credentials, one RLS policy story — the cost is that I'm responsible for picking the right index strategy myself, which is why there is no ANN index right now and a documented 'revisit at ~1000 chunks' note.
- Chose Claude Haiku 4.5 over Sonnet/Opus because the citation contract does most of the work — the model only needs to paraphrase grounded context with inline links, not reason from scratch — and Haiku at $1/$5 per million tokens lets the $10/month cap absorb ~750+ recruiter questions; the cost is that off-topic / multi-hop questions get shorter, more conservative answers, which is the right failure mode for a hiring conversation.
- Built the cost guardrail as two atomic PL/pgSQL RPCs (increment_chat_usage, increment_rate_limit) over a 'select then update' from the route handler because concurrent requests from the same IP would race the check and double-spend the quota; the cost is two RPCs to maintain in schema.sql instead of one route function, but the correctness story now lives in the database where it belongs.
- Reserved the rate-limit slot BEFORE the LLM call rather than after the response completes — the failure mode of 'two concurrent questions both pass the check, both stream a response, count gets to 21' was worse than the failure mode of 'a request that errors mid-stream still consumed one of the day's 20 slots'; the cost is that errored requests count against the quota, which I think is fine because errors are rare and the cap is generous.
- Hard-failed on no-context (503 'no relevant context yet') rather than falling through to an ungrounded LLM call — an answer the recruiter can't click into and verify is worse than a clear 'email me directly' message; the cost is that very early in a corpus's life (or after a Voyage outage during prebuild) the chatbot is unhelpful by design, which is the trade I want.
- Made the prebuild reindex fail-soft (exit 0 on Voyage/Supabase error unless REINDEX_STRICT=1) so a transient API outage during deploy doesn't block a typo fix from shipping; the cost is that the chatbot can be one revision stale without a loud signal, which the admin dashboard is supposed to surface — that's a known gap to close.
What I'd change today
I'd move retrieval to a single PostgREST RPC that batches the rate-limit check, the spend check, and match_documents into one round-trip — three sequential calls is the right shape for clarity but the wrong shape for first-token latency on a cold edge function. I'd also stop emitting citations before retrieval similarity has been validated against a quality floor (e.g. drop the chunk if cosine similarity < 0.45) — right now we always return top-5 even when the top result is a weak match, and the model is well-behaved enough to say 'this isn't really covered' but the UI still renders five misleading source pills. I'd add a per-question evaluation pipeline: log the question, the retrieved chunk ids, the response, and a hash of the answer to a separate table, then run a nightly job that flags low-similarity retrievals and ungrounded sentences for me to review — the corpus has gaps, and right now I find them by accident. And I'd cache the system prompt with a longer TTL than ephemeral — the prompt is stable across thousands of requests, and the prompt-cache savings would meaningfully extend the $10 cap.