
How to Answer "Design a RAG System" in an AI Interview

The most common senior-level AI interview question in 2026 — and the structured 7-step answer that separates strong candidates from the rest.


Capcheck Team

AI Interview Platform



"Walk me through how you would design a RAG system for [legal documents / customer support tickets / medical records]." In 2026, this has replaced "design a URL shortener" as the single most common senior-level system design question in tech interviews. And most candidates answer it badly — not because they lack knowledge, but because they answer as if they are writing a tutorial instead of designing a production system.

What the Interviewer Is Actually Testing

The question looks like a technical design question. It is really a four-part judgment test:

  • Do you know the building blocks of retrieval and generation?
  • Can you reason about failure modes before you reason about architecture?
  • Do you think about evaluation as a first-class concern, or as an afterthought?
  • Can you make concrete tradeoffs under the specific constraints of the domain?

A candidate who names every component but skips evaluation loses to a candidate who names fewer components but says "here is how I would know it is working."

The 7-Layer Answer Framework

Structure your answer around these seven layers, in order, after one clarifying step zero. Name each one out loud as you go; the interviewer is taking notes, and structure is half the battle.

Step 0 — Clarify the Domain and the Failure Tolerance

Before you draw a single box, ask three questions:

  • What is the corpus? Size, update frequency, format (text, PDFs, tables, code)?
  • Who are the users and what are they asking for?
  • What is the cost of a wrong answer? Hallucinations on a marketing assistant are annoying. Hallucinations on a medical Q&A tool can hurt someone.

Senior candidates almost always open with clarifying questions. Junior candidates jump straight to "so I'd use LangChain."

Step 1 — Ingestion

  • Where does the data come from? Batch import, streaming, change-data-capture?
  • How do you parse it? Document parsers differ wildly in quality for PDFs, tables, and scanned docs.
  • How do you detect and handle duplicates, near-duplicates, and stale versions?

Failure modes to name: bad OCR, missing tables, silent data drift when the source updates.

Step 2 — Chunking

  • Fixed-size? Sentence-based? Semantic? Document-structure-aware?
  • How long is each chunk? How much overlap?
  • What metadata travels with each chunk (doc ID, section, timestamp, access control)?

Failure modes to name: an answer split across two chunks, chunks that lose critical context (a footnote, a table header), and chunking so coarse that retrieval brings back mostly irrelevant paragraphs.

Step 3 — Embedding

  • Which embedding model? Open-source or hosted? Domain-adapted?
  • Where are the embeddings stored? Postgres + pgvector? Pinecone? Qdrant? A self-hosted HNSW index?
  • How do you handle re-embedding when the model changes?

Failure modes to name: embedding drift after a model upgrade, cost blowups from re-embedding a large corpus, languages your embedding model handles badly.

Step 4 — Retrieval

  • Pure vector search? Hybrid (BM25 + dense)?
  • Metadata filters (by user, doc type, date)?
  • Access control — critical if users should not see each other's documents.

Failure modes to name: acronym-heavy queries where BM25 beats vectors, freshness bias when new docs are under-indexed, users retrieving documents they should not be able to see.

Step 5 — Reranking

  • Cross-encoder reranker on the top-K results?
  • LLM-based reranking for small K, expensive but high-quality?
  • Do you measure whether it actually helps or just assume?

Naming reranking — and being willing to measure whether it is worth its cost — is a strong signal you have built RAG in production.

Step 6 — Generation with Citations

  • How is the prompt constructed? System role, retrieved context block, user question, citation instruction.
  • Do you require citations in the output? How do you enforce them?
  • Do you have a "grounding-check" step that verifies the generated answer is actually supported by the retrieved context?
  • What is your fallback if the retriever returns nothing useful — do you hallucinate or do you say "I don't know"?

Failure modes to name: confident wrong answers, citations that point to the wrong passage, "plausible but unsupported" claims that a grounding check would catch.

Step 7 — Evaluation and Observability

If you only remember one thing from this tip: do not end your answer without explaining evaluation. This is where most candidates lose the round.

  • A golden set of at least 50 labeled Q&A pairs, ideally hundreds for a serious system.
  • Metrics: faithfulness (is the answer grounded in the context?), context precision (did we retrieve the right stuff?), answer relevance (did we actually answer the question?).
  • LLM-as-judge for scaling, with human review on a sample.
  • Per-query logging: query, retrieved chunks, generated answer, judge score.
  • A dashboard. A rollout gate in CI that blocks prompt changes that cause regressions.

What to Say When You Do Not Know

You will almost certainly be asked a question where you do not have a great answer — "what embedding model would you pick for multilingual legal text in 12 languages?" Senior candidates do not bluff. Try this script:

"Honestly, I'd want to test three options on a labeled set from this exact domain before committing. My priors are X because Y, but I've seen benchmarks swap rankings enough in this space that I wouldn't trust my priors over a head-to-head eval."

This answer wins the round. Confident bluffing loses it.

The 60-Second Mental Template

If you want one thing to memorize for the room: ingest → chunk → embed → retrieve → rerank → generate with citations → evaluate. At each step, name the most common failure mode. End with "and here is how I would know it's working" — your evaluation plan. That structure alone, even with imperfect technical detail, outperforms 80% of candidates.

Red Flags Interviewers Listen For

  • Jumping to "I'd use LangChain" without discussing the design.
  • Never mentioning evaluation.
  • Treating the LLM as a black box that "just knows" things.
  • Overclaiming certainty about model behavior.
  • No mention of cost, latency, or operations.

Practice This Out Loud

Reading this tip is not enough. Before your interview, record yourself answering the question — unprompted, for five minutes, using only the seven layers above. Play it back. Cut the filler. Tighten the structure. Do it three times. The difference between candidates who answer this question well and candidates who struggle is almost entirely practice, not knowledge.

About the Author


Capcheck Team

AI Interview Platform

The Capcheck team analyzes thousands of AI engineer interview loops every quarter and helps candidates prepare for the questions that actually get asked.


