Vector Database
A store that indexes high-dimensional embedding vectors for fast similarity search — the retrieval engine behind semantic search, recommendations, and RAG over your own data.
Also worth naming: Pinecone · Weaviate · Milvus / Zilliz · Qdrant · pgvector (Postgres extension) · Elasticsearch kNN
A vector database answers "what is most similar in meaning?" instead of "what matches these keywords?" It is the piece that lets an LLM answer from your data — embed everything, index the vectors, and retrieve the nearest neighbours in milliseconds.
What it is
A vector database stores and indexes embeddings — fixed-length arrays of floats (often 384–1536 dimensions) produced by an ML model that encode the meaning of text, images, audio, or other content. Items with similar meaning map to vectors that sit close together in that high-dimensional space, so the core query is similarity search: given a query vector, return the k nearest neighbours by distance (cosine, dot product, or Euclidean). That is fundamentally different from a keyword index, which matches exact terms — a vector search for "how do I reset my password" can retrieve a document titled "account recovery steps" even though they share no words.
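To make the core query concrete, here is a minimal sketch of exact top-k search using the three distance measures named above. The 3-dimensional vectors and the `nearest` helper are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Compares direction only: scaling either vector leaves the result unchanged.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical hand-written "embeddings", not the output of any real model.
vectors = {
    "reset my password": [0.9, 0.1, 0.0],
    "account recovery steps": [0.8, 0.2, 0.1],
    "quarterly revenue report": [0.0, 0.1, 0.9],
}

def nearest(query_vec, k=2):
    """Exact (brute-force) top-k by cosine similarity: an O(N) scan."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in vectors.items()]
    return sorted(scored, reverse=True)[:k]

top = nearest(vectors["reset my password"], k=2)
print([name for _, name in top])
```

The scan touches every stored vector, which is exactly the cost an ANN index is built to avoid.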
The hard part the database solves is doing this fast at scale. Exact nearest-neighbour search compares the query against every stored vector: O(N) per query, fine for thousands of vectors but hopeless for billions. So vector databases build Approximate Nearest Neighbour (ANN) indexes, most commonly HNSW (a navigable small-world graph) or IVF/PQ (inverted-file clustering with product quantization), that find nearly all of the true nearest neighbours in sub-linear time, trading a sliver of recall for orders-of-magnitude speed. They also handle the operational parts: storing vectors durably, filtering by metadata alongside the vector search, sharding across nodes, and updating the index as data changes.
The reason this matters now is RAG (retrieval-augmented generation) and semantic features. To make an LLM answer from your private data, you embed your documents, store the vectors, and at query time retrieve the most relevant chunks to feed into the prompt — the vector database is the retrieval engine. In an interview, reach for a vector database when the requirement is semantic similarity, recommendations by embedding, or RAG; do not reach for it for exact lookups, keyword search (that is Elasticsearch), or structured queries (that is a regular database). And know the lightweight on-ramp: pgvector adds vector search to Postgres, so you often do not need a dedicated system until scale demands it.
When to reach for it
Reach for this when…
- Semantic search — match by meaning, not keywords (Q&A, "find similar", dedup)
- RAG: retrieve relevant context to ground an LLM on your own data
- Recommendations / similarity by embedding (similar products, songs, images)
- Multimodal or cross-lingual retrieval where keyword matching fails
Not really this pattern when…
- Exact-match lookups or structured queries (a regular database / key-value store)
- Keyword / full-text search with relevance and facets (Elasticsearch / Postgres FTS)
- Small scale — pgvector inside Postgres avoids a separate system entirely
- You only need filters and aggregations, not similarity (no embeddings involved)
How it works
Four ideas explain how a vector database works and when to use it:
1. Embeddings turn meaning into geometry. A model encodes each item into a vector such that semantic similarity becomes spatial proximity. "Find similar" then means "find nearby vectors," measured by cosine similarity (the usual choice for text — angle, not magnitude), dot product, or Euclidean distance. The whole approach lives or dies on the embedding model: better embeddings put genuinely-similar items closer together.
An embedding model turns text, images, or audio into a fixed-length vector. Semantically similar items land close together in the vector space, so "find similar" becomes "find the nearest vectors" — a geometric query, not a keyword match.
2. ANN indexes make search sub-linear. Comparing a query to every vector is O(N) and does not scale. So the database builds an approximate index — HNSW (a multi-layer proximity graph you greedily traverse, the most popular) or IVF + PQ (cluster into cells, search the nearest cells, compress vectors) — that returns almost the exact top-k far faster. The knob is recall vs latency/cost: tune the index to get, say, 95–99% of the true neighbours at a fraction of the time.
Exact nearest-neighbour search scans every vector — O(N), too slow at scale. Approximate nearest neighbour (ANN) builds an index like HNSW, a layered proximity graph you greedily traverse, trading a little recall for sub-linear search.
3. Metadata filtering combines "similar" with "and". Real queries are "similar to this and in-stock and in this category." Vector databases support filtered ANN — applying metadata predicates alongside the similarity search. Doing this efficiently (pre-filter vs post-filter) is a real differentiator between engines, and a common interview probe.
4. RAG is the headline use, and it is a pipeline. Indexing: chunk documents, embed each chunk, store vectors + metadata. Querying: embed the user's question, retrieve top-k similar chunks, and inject them into the LLM prompt so it answers from grounded context instead of hallucinating. The vector database is the retrieval half; chunking strategy and embedding quality drive how good the answers are.
The user query is embedded and used to retrieve the most relevant chunks from the vector store; those chunks are stuffed into the LLM prompt as context, so the model answers from your data instead of hallucinating.
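The index and query halves of the RAG pipeline can be sketched end to end. This is an illustration under loud assumptions: `embed` is a toy stand-in that maps words onto two hand-picked concept axes (a real system would call an embedding model), retrieval is a brute-force scan instead of an ANN index, and the `kb/...` source ids are invented.

```python
import math

# Toy stand-in for an embedding model: words map to hand-picked concept axes.
CONCEPTS = {
    "password": 0, "reset": 0, "account": 0, "recovery": 0, "login": 0,
    "invoice": 1, "billing": 1, "refund": 1, "payment": 1,
}

def embed(text):
    vec = [0.0, 0.0]
    for word in text.lower().split():
        axis = CONCEPTS.get(word.strip(".,?!"))
        if axis is not None:
            vec[axis] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product = cosine

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_index(chunks):
    """Indexing: embed each chunk and keep its metadata next to the vector."""
    return [{"text": c["text"], "source": c["source"], "vec": embed(c["text"])} for c in chunks]

def retrieve(index, question, k):
    """Querying: embed the question with the same function, return top-k chunks."""
    q = embed(question)
    return sorted(index, key=lambda row: cosine(q, row["vec"]), reverse=True)[:k]

def build_prompt(question, hits):
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

index = build_index([
    {"text": "Account recovery steps: use the emailed link.", "source": "kb/101"},
    {"text": "Refund and billing policy for annual plans.", "source": "kb/202"},
])
question = "How do I reset my password?"
hits = retrieve(index, question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

The question and the retrieved chunk share no words; they match only because both land on the same axis of the toy embedding. The source id travels with the chunk, which is what makes citations possible.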
And the pragmatic on-ramp: the vectors are a derived index of content whose source of truth lives elsewhere, and at modest scale pgvector lets Postgres do this without a new system — you graduate to a dedicated vector database (Pinecone, Milvus, Qdrant) when scale, throughput, or advanced filtering demand it.
Performance envelope
Vector database characteristics — what to reason about.
| Dimension | Reality | Why it matters |
|---|---|---|
| Query | k-nearest-neighbour by vector distance | Similarity, not keyword or exact match |
| Search latency | Milliseconds via ANN (HNSW / IVF) | Exact O(N) scan gets too slow as the collection grows into the millions |
| Scale | Millions to billions of vectors, sharded | ANN index + sharding is the whole game |
| Recall vs speed | Tunable (e.g. 95–99% recall) | Approximate — trade a little accuracy for speed |
| Dimensions | Typically 384–1536 (model-dependent) | Higher dims = more memory + compute per vector |
| Memory | Index often held in RAM (HNSW) | Vectors are big; memory is the cost driver |
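The memory row can be checked with back-of-envelope arithmetic. The sketch below assumes float32 storage (4 bytes per dimension) and 768 dimensions, both illustrative choices, and counts raw vectors only; index overhead such as HNSW graph links comes on top.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_float=4):
    """Raw storage for uncompressed vectors; index overhead is not included."""
    return num_vectors * dims * bytes_per_float

for n in (1_000_000, 100_000_000, 1_000_000_000):
    gib = raw_vector_bytes(n, 768) / 2**30
    print(f"{n:>13,} vectors x 768 dims = {gib:,.1f} GiB")
```

One million such vectors need roughly 3 GB of raw storage and a billion need roughly 3 TB, which is why sharding and quantization appear at the top of the scale.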
Capabilities in interviews
Semantic search
Retrieve by meaning so queries match relevant content that shares no keywords.
Embed the corpus and the query into the same space, then return the nearest vectors:
query "reset my password" → embed → ANN search → "account recovery guide" (top hit)
Because matching is semantic, synonyms, paraphrases, and cross-lingual queries all work without manual synonym lists. This is the upgrade over keyword search when users describe what they want in their own words. It is often combined with keyword search (hybrid search) so exact terms and semantic intent both count.
Choose this variant when
- Natural-language Q&A and help search
- "Find similar" / near-duplicate detection
- Cross-lingual or paraphrase-tolerant retrieval
RAG retrieval for LLMs
Ground an LLM on your private data by retrieving relevant context per query.
The retrieval engine behind retrieval-augmented generation:
index: docs → chunk → embed → store (vector + metadata)
query: question → embed → top-k chunks → prompt = question + chunks → LLM answers
The LLM answers from retrieved context instead of its frozen training data, which reduces hallucination and lets it cite your documents. Retrieval quality (chunking, embedding model, top-k, filtering) usually matters more than the LLM itself, which is exactly why the vector database choice and tuning are interview-worthy.
Choose this variant when
- LLM Q&A over private / domain documents
- Chatbots that must cite internal knowledge
- Reducing hallucination by grounding in real data
Embedding-based recommendations
Recommend items whose embeddings are nearest to a user or item vector.
Represent users and items as vectors (from behaviour or content) and recommend by proximity:
"more like this product" → product embedding → nearest item vectors → recommendations
This powers "similar products," "songs like this," and content-based candidate generation in a recommender. It is the candidate-generation stage: pull a few hundred nearby items fast, then let a heavier ranking model re-score them. Vector search is what makes that first nearest-neighbour pass scale to huge catalogs.
Choose this variant when
- Content-based "similar items" recommendations
- Candidate generation in a two-stage recommender
- Personalisation by user/item embeddings
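The two-stage pattern behind embedding-based recommendations can be sketched as follows. Everything here is hypothetical: the three-item catalog, the popularity feature, and the 50/50 blend standing in for a "heavier" ranking model. Stage 1 uses a brute-force scan where production would use an ANN index.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical catalog: an embedding plus one extra ranking feature per item.
catalog = {
    "song_a": {"vec": [0.9, 0.1], "popularity": 0.2},
    "song_b": {"vec": [0.8, 0.3], "popularity": 0.9},
    "song_c": {"vec": [0.1, 0.9], "popularity": 1.0},
}

def generate_candidates(seed_vec, n):
    """Stage 1: cheap nearest-neighbour pass over the whole catalog."""
    by_similarity = sorted(catalog, key=lambda item: dot(seed_vec, catalog[item]["vec"]), reverse=True)
    return by_similarity[:n]

def rerank(seed_vec, candidates):
    """Stage 2: costlier scoring, applied only to the short candidate list."""
    def score(item):
        return 0.5 * dot(seed_vec, catalog[item]["vec"]) + 0.5 * catalog[item]["popularity"]
    return sorted(candidates, key=score, reverse=True)

seed = [1.0, 0.0]
candidates = generate_candidates(seed, n=2)
recommendations = rerank(seed, candidates)
print(candidates, recommendations)
```

The most popular item never reaches the ranker because it is not near the seed vector: the first stage decides what the second stage is allowed to consider.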
Filtered & hybrid search
Combine similarity with metadata filters and keyword relevance for precise retrieval.
Production queries blend "similar" with hard constraints and exact terms:
similar_to(query_vec) AND category = "docs" AND updated_after = 2025
hybrid: 0.7 × semantic_score + 0.3 × keyword(BM25) score
Filtered ANN applies metadata predicates during the vector search; hybrid search fuses vector similarity with keyword relevance so both exact matches and semantic intent rank well. Making filtering efficient (and fusion sensible) is where engines differ, and it is a frequent follow-up in interviews.
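The weighted fusion can be sketched in a few lines. One assumption is added here that the formula leaves implicit: cosine similarities and BM25 scores live on different scales, so this sketch min-max normalises each score set before blending. The function names and sample scores are illustrative, and real engines may fuse differently (for example by rank instead of score).

```python
def min_max(scores):
    """Rescale a {doc: score} map to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(semantic, keyword, alpha=0.7):
    """Blend normalised scores: alpha * semantic + (1 - alpha) * keyword."""
    sem, kw = min_max(semantic), min_max(keyword)
    docs = set(sem) | set(kw)
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

semantic = {"doc1": 0.82, "doc2": 0.79, "doc3": 0.40}  # made-up cosine similarities
keyword = {"doc1": 1.2, "doc2": 9.5, "doc3": 0.0}      # made-up BM25 scores
ranking = hybrid_rank(semantic, keyword)
print([doc for doc, _ in ranking])
```

Here doc2 is slightly behind on semantic similarity but contains the exact query terms, so the keyword signal lifts it above doc1.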
Choose this variant when
- Similarity plus hard metadata constraints
- Combining exact-term and semantic relevance
- Multi-tenant retrieval (filter by tenant + similar)
Operating knobs
Embedding model & dimensionality
The biggest quality lever — retrieval is only as good as the embeddings. Pick a model suited to the domain and language; higher dimensions can capture more nuance but cost more memory and compute per vector. Critically, the same model must embed both documents and queries, and re-embedding the whole corpus is required if you change models.
ANN index type & parameters
HNSW (graph) gives excellent recall and low latency at higher memory; IVF + PQ clusters and compresses for lower memory at some recall cost. Parameters (HNSW M/efSearch, IVF nprobe) tune the recall-vs-latency trade — higher search effort finds more true neighbours but costs more time. Set them to your accuracy requirement, not by default.
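The effect of a search-effort parameter is easiest to see on a toy IVF index. This is a minimal sketch under stated assumptions: the centroids are hand-picked instead of trained with k-means, vectors are stored uncompressed (no PQ), and `build_ivf` and `ivf_search` are illustrative names, not any engine's API.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign every vector id to the cell of its nearest centroid."""
    cells = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        cells[cell].append(vid)
    return cells

def ivf_search(query, vectors, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in cells[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:k]

def exact_search(query, vectors, k):
    return sorted(range(len(vectors)), key=lambda vid: sq_dist(query, vectors[vid]))[:k]

rng = random.Random(0)
vectors = [[rng.random(), rng.random()] for _ in range(1000)]
centroids = [[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]]  # hand-picked
cells = build_ivf(vectors, centroids)

query = [0.3, 0.3]
approx = ivf_search(query, vectors, centroids, cells, k=5, nprobe=1)  # scans one cell
full = ivf_search(query, vectors, centroids, cells, k=5, nprobe=4)    # scans all cells
exact = exact_search(query, vectors, k=5)
print(len(cells[0]), "of", len(vectors), "vectors scanned with nprobe=1")
```

With `nprobe=1` the search reads about a quarter of the data; raising `nprobe` to the number of cells degenerates into the exact scan. Queries near a cell boundary are where a low `nprobe` starts to miss true neighbours.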
pgvector vs a dedicated vector DB
pgvector adds vector columns + ANN to Postgres — keep vectors transactionally consistent with your data and avoid a new system, ideal up to millions of vectors and moderate QPS. Graduate to a dedicated engine (Pinecone, Milvus, Qdrant, Weaviate) for billions of vectors, very high query throughput, advanced filtering, or managed scaling.
Chunking & metadata (for RAG)
How you split documents into chunks decides what can be retrieved: too large and chunks are noisy and dilute relevance; too small and they lose context. Store rich metadata (source, section, timestamp, tenant) alongside each vector for filtering and citations. Chunking strategy is often the highest-leverage RAG tuning knob.
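A fixed-size chunker with overlap is the usual starting point. The sketch below assumes whitespace-separated words as "tokens" to stay self-contained; a real pipeline would count tokens with the embedding model's tokenizer, and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some tokens twice.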
Versus the alternatives
Vector database vs the alternatives.
| Dimension | Vector DB | Elasticsearch | pgvector (Postgres) |
|---|---|---|---|
| Query type | Semantic similarity (kNN) | Keyword + facets (+ kNN add-on) | Similarity inside SQL |
| Scale | Billions of vectors, purpose-built | Billions of docs (text-first) | Up to millions comfortably |
| Strength | ANN performance + filtering at scale | Full-text relevance, aggregations | No new system; transactional |
| Best for | RAG, semantic search, recsys at scale | Keyword search, logs, hybrid | Adding semantic search to an app |
| When to pick | Similarity is the core, large scale | Text relevance is the core | Modest scale, already on Postgres |
Failure modes & gotchas
The vectors are a derived index of content whose source of truth lives elsewhere (a DB, object storage). Re-embedding (new model) or index corruption means rebuilding from the source. Keep the authoritative documents in a real store and treat the vector index as rebuildable — never the only copy.
Documents and queries must be embedded by the same model (and version) or their vectors live in different spaces and similarity is meaningless. Changing the embedding model requires re-embedding the entire corpus; mixing models silently wrecks retrieval quality.
ANN is approximate — it can miss true neighbours. Default index settings may give poor recall (missing relevant results) or needless latency. Measure recall against an exact baseline on a sample and tune efSearch/nprobe to hit your accuracy target deliberately, rather than assuming the results are exact.
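Measuring recall against an exact baseline takes only a few lines once both result lists exist. This sketch assumes you already have, for each sampled query, the ids from a brute-force search and the ids from the ANN index; the sample ids below are invented.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0
    return len(truth & set(approx_ids[:k])) / len(truth)

def mean_recall(exact_results, approx_results, k):
    """Average recall@k over a sample of queries (the two lists are aligned by query)."""
    scores = [recall_at_k(e, a, k) for e, a in zip(exact_results, approx_results)]
    return sum(scores) / len(scores)

exact_results = [[1, 2, 3, 4], [7, 8, 9, 10]]   # brute-force ground truth
approx_results = [[1, 2, 9, 4], [7, 8, 9, 10]]  # what the ANN index returned
print(mean_recall(exact_results, approx_results, k=4))
```

Re-run the measurement after each change to efSearch or nprobe and stop raising the parameter once the mean meets the target.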
In RAG, retrieval quality is dominated by chunking and embedding, not the LLM. Chunks that are too large dilute relevance and blow the context budget; too small lose meaning. Bad chunking produces irrelevant context and confident-but-wrong answers — tune chunking and top-k before blaming the model.
If users search by exact terms, codes, or structured filters, keyword search (Elasticsearch / Postgres FTS) is simpler and better — embeddings add cost and approximation for no benefit. And at small scale, pgvector avoids a whole separate system. Use vectors specifically when semantic similarity is the requirement.
In production
Spotify
Voyager / ANN for music and podcast recommendations
Spotify pioneered embedding-based recommendations and open-sourced Annoy, an approximate-nearest-neighbour library, years before "vector database" was a category — and now runs newer ANN systems (Voyager) in production. Songs, podcasts, and users are represented as embeddings, and recommendations ("Discover Weekly," "more like this") come from finding the nearest vectors in that space across catalogs of tens of millions of tracks.
It's the canonical pre-LLM vector use case and shows the candidate-generation pattern: ANN search pulls a few hundred nearby items fast, then a heavier ranking model re-scores them. The takeaway maps straight to the page: similarity becomes geometry, exact nearest-neighbour over tens of millions of items is too slow, so an ANN index gives sub-linear retrieval — the foundation of recommendation at scale, long before RAG made vectors famous.
Microsoft / OpenAI
RAG over enterprise data — vectors as the retrieval layer for LLMs
The current wave is RAG: grounding LLMs on private data so they answer from your documents instead of hallucinating. Microsoft's Azure AI Search and OpenAI's assistant tooling both center on a vector store as the retrieval layer — enterprises embed their documents, store the vectors, and at query time retrieve the most relevant chunks to feed the model, with citations. This pattern now underpins countless internal copilots and customer-support bots.
The architecture is exactly this page's RAG pipeline, and the engineering lessons are consistent across the industry: retrieval quality (chunking, embedding model, top-k, filtering) matters more than the LLM choice; the vector index is a derived view rebuildable from the source documents; and filtered/hybrid search (vector + keyword + metadata) is what makes enterprise retrieval precise and secure (e.g. filtering by tenant). It's why vector databases went from niche to a core building block almost overnight.
Good vs bad answer
Interviewer probe
“Build a support assistant that answers user questions from a company's 50,000 internal help articles, with citations. How does retrieval work?”
Weak answer
"Feed the user's question to an LLM and let it answer. If it doesn't know, keyword-search the articles with a LIKE query and paste any matches into the prompt."
Strong answer
"This is RAG with a vector database for retrieval. Indexing: chunk each article into passages, embed every chunk with an embedding model, and store the vectors plus metadata (article id, title, section, url) in a vector store — pgvector if we're already on Postgres and the scale is modest, a dedicated engine like Pinecone or Qdrant if we need billions of vectors or high QPS. Query: embed the user's question with the same model, run a filtered ANN search (HNSW) for the top-k most similar chunks, and inject those chunks into the LLM prompt as grounding context, returning the source urls as citations. Keyword LIKE fails here because users ask in their own words — 'I can't get back into my account' should match an article titled 'Password reset', which shares no keywords; semantic similarity catches that. I'd tune chunking and top-k for retrieval quality (that matters more than the LLM choice), measure recall against an exact baseline, and consider hybrid search so exact product names still rank. The vector index is a derived view I can rebuild from the articles if I re-embed."
Why it wins: Names RAG + vector DB, lays out the index/query pipeline, uses the same embedding model for both sides, does filtered ANN with citations, picks pgvector vs dedicated by scale, explains why keyword search fails, and flags chunking/recall/hybrid as the real tuning levers.
Interview playbook
When it comes up
- RAG / LLM-over-your-data, chatbots that cite internal knowledge
- Semantic search — match by meaning, "find similar", dedup
- Embedding-based recommendations / candidate generation
- The interviewer says "answer from these documents" or "search by meaning"
Order of reveal
1. Frame it as similarity, not keywords. The query is "what is closest in meaning," so I embed content into vectors and do nearest-neighbour search, which is what a vector database provides.
2. Index pipeline. Chunk and embed the corpus, store vectors plus metadata; the source documents stay the system of record.
3. Query pipeline. Embed the query with the same model, run ANN (HNSW) for top-k, filter by metadata, and (for RAG) inject the chunks into the LLM prompt.
4. ANN + recall trade. Exact search is O(N); ANN gives sub-linear search, and I tune recall vs latency to my accuracy target.
5. Right-size it. pgvector if we are already on Postgres at modest scale; a dedicated engine (Pinecone/Milvus/Qdrant) for billions of vectors or high throughput.
Signature phrases
- “Match by meaning, not keywords — nearest vectors, not exact terms.” — States the core difference from a keyword index.
- “ANN trades a little recall for sub-linear search.” — Shows you understand the central performance trade.
- “Same embedding model for documents and queries, or similarity is meaningless.” — Catches the most common correctness mistake.
- “pgvector first; a dedicated vector DB when scale demands it.” — Avoids over-engineering at small scale.
Likely follow-ups
“How do you combine a similarity search with a hard filter like "only this tenant's docs"?”
Use filtered ANN — the vector store applies the metadata predicate alongside the similarity search. There are two strategies: pre-filtering restricts the candidate set to matching vectors before the ANN traversal (accurate, but can be slow if the filter is very selective and fragments the graph), and post-filtering runs ANN first and drops non-matching results (fast, but may return fewer than k if many top hits are filtered out). Good engines do filtered search natively and pick a strategy by selectivity. For strict multi-tenant isolation I store tenant_id as metadata and filter on it every query (or even shard/namespace by tenant), so one tenant can never retrieve another's vectors.
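The difference between the two strategies shows up even on four items. In this sketch a brute-force `top_k` stands in for the ANN traversal, and the items, the `in_stock` field, and the helper names are all made up for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

items = [
    {"id": "a", "vec": [0.10, 0.10], "in_stock": False},
    {"id": "b", "vec": [0.12, 0.09], "in_stock": False},
    {"id": "c", "vec": [0.30, 0.30], "in_stock": True},
    {"id": "d", "vec": [0.90, 0.90], "in_stock": True},
]

def top_k(query, candidates, k):
    # Brute-force stand-in for an ANN search over the given candidates.
    return sorted(candidates, key=lambda item: sq_dist(query, item["vec"]))[:k]

def pre_filter(query, k, predicate):
    """Apply the predicate first, then search only the matching items."""
    return top_k(query, [item for item in items if predicate(item)], k)

def post_filter(query, k, predicate):
    """Search everything first, then drop hits that fail the predicate."""
    return [item for item in top_k(query, items, k) if predicate(item)]

def is_in_stock(item):
    return item["in_stock"]

query = [0.1, 0.1]
pre = [item["id"] for item in pre_filter(query, 2, is_in_stock)]
post = [item["id"] for item in post_filter(query, 2, is_in_stock)]
print(pre, post)
```

Post-filtering returns nothing here because both nearest items fail the predicate, which is the "fewer than k" problem in miniature. A common mitigation is to over-fetch (search for more than k) before filtering.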
“ANN is approximate. How do you make sure you are not missing relevant results?”
You measure and tune. Build an exact k-NN baseline on a sample of queries (brute-force is fine offline), then compute recall@k — what fraction of the true neighbours your ANN index returns — and raise the search effort (efSearch for HNSW, nprobe for IVF) until recall meets your target, accepting the added latency. There is a real recall-vs-latency curve; you pick a point on it deliberately rather than assuming the index is exact. For RAG specifically, slightly imperfect recall is usually fine because the LLM tolerates a few extra/missing chunks, so you can favour latency; for dedup or compliance search you push recall higher.
“When would you just use pgvector or Elasticsearch instead of a dedicated vector DB?”
pgvector when I am already on Postgres, the scale is up to a few million vectors, and I value keeping embeddings transactionally consistent with the rest of my data without operating a new system — it does HNSW/IVF inside Postgres and is often all an app needs. Elasticsearch when keyword relevance and facets are the primary need and vectors are a secondary signal — it has kNN built in, so hybrid (BM25 + vector) search lives in one system. I reach for a dedicated vector database (Pinecone, Milvus, Qdrant, Weaviate) when I have hundreds of millions to billions of vectors, need very high query throughput, advanced filtering, or want managed scaling and purpose-built ANN performance. The progression is pgvector → Elasticsearch-hybrid → dedicated, driven by scale and how central similarity is.
Worked example
Setup. Build a support assistant that answers user questions from a company's 50,000 internal help articles, in natural language, with citations to the source articles.
The move. This is RAG with a vector database for retrieval. Indexing: chunk each article into passages (a few hundred tokens with slight overlap), embed every chunk with an embedding model, and store the vectors plus metadata (article id, title, section, url) in the vector store. Query: embed the user's question with the same model, run an ANN search (HNSW) for the top-k most similar chunks, and inject those chunks into the LLM prompt as grounding context — returning the source urls as citations.
Why not keyword search. A user asks "I can't get back into my account"; the relevant article is titled "Password reset" — zero shared keywords, so a LIKE or even BM25 query misses it. The embeddings of the question and the article land near each other because they mean the same thing, so semantic similarity retrieves it.
Right-sizing the store. At 50K articles → maybe a few hundred thousand chunks, pgvector inside Postgres is plenty and keeps the vectors consistent with the rest of the data — no separate system. I'd graduate to a dedicated engine (Pinecone, Qdrant, Milvus) only at hundreds of millions of vectors or high QPS.
What breaks. Retrieval quality dominates the LLM here, and the top failure is chunking — chunks too large dilute relevance and blow the context budget; too small lose meaning. I tune chunking and top-k first. I also keep document and query embeddings on the same model (mixing models makes similarity meaningless), measure recall against an exact baseline so ANN approximation isn't silently dropping relevant chunks, and add hybrid search so exact product names still rank.
The result. Natural-language answers grounded in the company's real articles with citations, retrieval that matches meaning not keywords, on pgvector at this scale — and a clear path to a dedicated vector DB if it grows.
Cheat sheet
- Vector DB = stores embeddings, answers k-nearest-neighbour similarity queries (meaning, not keywords).
- Embeddings map meaning → geometry; similar items = nearby vectors (cosine / dot / Euclidean).
- ANN indexes (HNSW graph, IVF+PQ) make search sub-linear: approximate, tune recall vs latency.
- Same embedding model must encode both documents and queries, or similarity is meaningless.
- RAG: chunk → embed → store; query → embed → top-k → inject into LLM prompt → grounded answer + citations.
- Filtered ANN combines similarity with metadata predicates; hybrid fuses vector + keyword (BM25).
- Derived index, not source of truth: rebuildable from the source documents (re-embed if model changes).
- pgvector for modest scale on Postgres; dedicated (Pinecone/Milvus/Qdrant) for billions / high QPS.
Drills
Why is a vector search better than a keyword search for a natural-language Q&A assistant?
Because users ask in their own words, which rarely match the exact terms in the source documents. A keyword search for "I can't get back into my account" would miss an article titled "Password reset" — they share no keywords — whereas an embedding of the question lands near the embedding of that article because they mean the same thing. Vector search matches semantic similarity, so synonyms, paraphrases, and even other languages retrieve the right content without hand-maintained synonym lists. Keyword search is still better for exact terms (product codes, error strings), which is why production systems often run hybrid (vector + keyword) search.
Interviewer: "your similarity search is too slow at 100M vectors. What's wrong and what do you do?"
You are almost certainly doing exact nearest-neighbour search, which compares the query against all 100M vectors — O(N), hopeless at that scale. The fix is an ANN index: HNSW (a navigable proximity graph traversed greedily) or IVF+PQ (cluster into cells, search only the nearest cells, compress vectors), which return approximate nearest neighbours in sub-linear time — milliseconds instead of seconds. You tune the index parameters (efSearch/nprobe) to hit your recall target, shard across nodes if it exceeds one machine's memory (HNSW indexes are RAM-resident), and consider product quantization to shrink the vectors. The trade is a small, measurable loss of recall for a massive speedup.
In a RAG system, answer quality is poor: the LLM gives vague or wrong answers. Where do you look first?
At retrieval, not the LLM — in RAG, retrieval quality dominates. Check, in order: (1) chunking — chunks too large dilute relevance and waste the context budget, too small lose meaning; this is usually the highest-leverage knob. (2) embedding model — is it suited to the domain/language, and are queries and documents embedded by the same model? (3) top-k and recall — are you retrieving enough relevant chunks, and is your ANN recall high enough (measure against an exact baseline)? (4) filtering/freshness — are stale or off-topic chunks being retrieved that should be filtered out? Only after retrieval is solid do you tune the prompt or model. Bad chunking producing irrelevant context is the classic root cause of confident-but-wrong RAG answers.