Search over content
Inverted index + ranking service. The hard part isn't indexing — it's relevance, freshness, and a rebuild path.
When to reach for this
Reach for this when…
- Full-text search over a corpus (millions+ of docs)
- Faceted / filtered search on e-commerce
- Log or metric exploration UIs
- Autocomplete / type-ahead
- "Search for…" in the prompt — almost always this
Not really this pattern when…
- Lookup by exact key (that's a primary index)
- Dataset under ~10k rows (SQL with a B-tree is fine)
- Structured-only queries on known columns (SQL/Postgres FTS wins)
- Vector / semantic similarity only (that's an ANN / embedding problem — though hybrid is common)
Good vs bad answer
Interviewer probe
“Design product search for 10M products with filters, ranking, and near-real-time updates.”
Weak answer
"SQL with ILIKE '%term%' and order by a relevance column."
Strong answer
"Postgres stays the source of truth. Below ~10M docs I'd genuinely consider Postgres FTS first, but with facets and custom ranking on the table I'll use Elasticsearch as a derived index. Debezium CDC publishes product changes to Kafka; an idempotent indexer upserts by product_id — never a dual write. Fields: standard + English stemming on name and description; keyword (un-analysed) on category, brand, SKU for exact filters and facets; numeric on price and rating for range and sort. Query: multi_match with field boosts for text relevance, category/price as filter clauses that narrow but don't score (and are cacheable), function_score adding popularity (log1p sales) and recency decay on top of BM25. For scale I run two-phase ranking — BM25 retrieves ~1000 candidates, then a re-ranker applies personalisation to just those. Reindex is the alias pattern: build products-v5 with the new mapping, bulk reindex from the DB, validate against a judged query set, flip the products alias atomically, keep v4 for rollback. Pagination uses search_after cursors, not deep offsets. I monitor CDC/indexer lag and fall back to the DB for read-your-write on a just-created product. If we see vocabulary mismatch I'd add dense-vector retrieval fused with RRF."
Why it wins: Names truth-vs-derived, CDC over dual-write, per-field analyzers, the filter-vs-score split, BM25 + boosts, two-phase ranking, the alias rebuild, cursor pagination, lag monitoring, and the hybrid escape hatch — and offers Postgres FTS as the simpler option. The weak answer reaches for a full table scan with no relevance.
Cheat sheet
- •DB = truth. Search index = derived view you can always rebuild.
- •Never dual-write. CDC / outbox feeds an idempotent indexer.
- •Indexing is async — accept seconds of lag, monitor it, DB fallback for read-your-write.
- •<10M docs: Postgres FTS (tsvector + GIN) is often enough.
- •>10M or facets / custom scoring: Elasticsearch / OpenSearch.
- •Analyzer per field: text+stemming, keyword (exact), numeric (range/sort).
- •Filters narrow and are cacheable; scoring clauses rank. Keep them separate.
- •BM25 is the base score; function_score boosts popularity + recency. Boost, don't replace.
- •Two-phase: cheap BM25 retrieval → expensive re-rank over the top-K.
- •Alias pattern for atomic reindex + rollback (blue/green for search).
- •search_after cursors, never deep from/size offsets.
- •Vocabulary mismatch? Add dense-vector retrieval fused with RRF.
Core concept
Search engines — Elasticsearch, OpenSearch, Meilisearch, Typesense, Vespa, Postgres FTS — are all built on the inverted index: for every term, a posting list of (doc_id, term_frequency, positions). A query looks up the posting lists for its terms, intersects (AND) or unions (OR) the doc_ids, scores each with BM25, sorts, and returns the top K. The index is expensive to build and cheap to query — the exact opposite of a B-tree.
Each term maps to a posting list of (doc_id, term frequency, positions). Query = look up lists, intersect/union, score, sort.
A credible search design separates into three independent layers, and most weak answers collapse them:
1. Ingestion. The primary database is the source of truth. A change-data-capture (CDC) stream or a transactional outbox publishes every change to an indexer, which upserts into the search engine. This is asynchronous on purpose — you accept a few seconds of search lag so that a slow or unavailable indexer never blocks a write. The cardinal sin here is the dual write: writing to the DB and the search engine in the same request path, which leaves them permanently inconsistent the first time one half fails.
App writes one DB transaction; CDC publishes the change; an idempotent indexer upserts into the engine. Search lag is seconds, by design.
2. Index. Sharded by doc_id (or by tenant for multi-tenant isolation), replicated for HA and read throughput. Each field gets its own analysis: free text is tokenised and stemmed, identifiers are stored as un-analysed keyword, numbers as numeric fields for range and sort. Analyzer choice is schema design — the wrong analyzer produces either irrelevant results or zero matches.
3. Query / re-rank. A coordinator fans the query out to every shard, each returns its local top-K, and the coordinator merges them into a global top-K. Business signals — popularity, recency, personalisation — are layered on top of BM25 via function_score or a separate re-ranking phase. You boost BM25; you never replace it.
A coordinator fans the query to every shard, each returns its local top-K, the coordinator merges to the global top-K.
Rebuild-ability is non-optional. An analyzer change, a schema change, or a bug will force a full reindex. Because the DB is truth, that rebuild is routine: use the alias pattern — the products alias points at products-v4; build products-v5 in the background; validate it; flip the alias atomically; keep v4 for rollback. This is the search equivalent of a blue/green deploy.
Build v5 in the background, validate, flip the alias atomically, keep v4 for rollback. The search equivalent of blue/green.
The single sentence that frames the whole pattern: the search index is a derived view, never the source of truth. Everything else — async ingestion, idempotent upserts, the rebuild path — follows from treating it that way.
Interview walkthrough
Worked example: product search for 10M products with facets, ranking, and near-real-time updates
The primary DB is the source of truth; a CDC pipeline keeps a derived inverted index fresh; a query service fans out, scores, and merges.
Step 1 — establish truth and derivation. The product database stays the source of truth. Search is a derived index that can always be rebuilt from it. This single decision unlocks async ingestion, cheap recovery, and routine reindexing.
Step 2 — pick the engine honestly. Ten million docs with faceted filtering and custom ranking is past the comfortable ceiling of Postgres FTS, so a dedicated engine (Elasticsearch/OpenSearch) is justified. If facets and custom scoring weren't required, I'd start with Postgres FTS and avoid operating a cluster.
Step 3 — build the ingestion pipeline. App writes a product in one DB transaction. Debezium tails the WAL and publishes the change to Kafka. An idempotent indexer consumes it and upserts the document by product_id. No dual write; at-least-once delivery is safe because upserts are idempotent.
App writes one DB transaction; CDC publishes the change; an idempotent indexer upserts into the engine. Search lag is seconds, by design.
Step 4 — model the fields. name/description: standard analyzer + English stemming (text, scored). category/brand/sku: keyword (exact filter + facet). price/rating: numeric (range filter + sort). Autocomplete gets its own edge-ngram suggester.
Step 5 — design the query. multi_match across name (boosted ×3) and description for BM25 text relevance; category and price as filter clauses (narrow the set, don't score, cacheable); function_score multiplies BM25 by log1p(sales) popularity and an exponential recency decay. Two-phase at scale: BM25 retrieves the top ~1000, a re-ranker applies personalisation to those.
BM25 retrieves a cheap candidate set; a heavier model (or function_score) re-ranks just the top-K with business and personalisation signals.
Step 6 — pagination. search_after cursors so page 1000 costs the same as page 1; no deep from/size offsets.
Step 7 — the rebuild path. When the analyzer or mapping changes: build products-v5, bulk reindex from the DB, validate relevance against a judged query set, flip the products alias atomically, keep v4 for a day for rollback.
Build v5 in the background, validate, flip the alias atomically, keep v4 for rollback. The search equivalent of blue/green.
Step 8 — operate it. Monitor CDC/indexer lag as a first-class metric; fall back to the DB for read-your-write on a freshly-created product; gate every relevance change behind offline NDCG evaluation and an online A/B test on click-through and conversion.
Result. Fresh-within-seconds search, tunable relevance, faceted filtering, zero-downtime schema changes, and a search tier that scales independently of the transactional database — with a clean recovery story because the index is always rebuildable from truth.
Interview playbook
When it comes up
- The prompt says search, full-text, relevance, filters, facets, autocomplete, or ranking
- Users need more than exact-key lookup over a large corpus
- The corpus is large enough that SQL LIKE is not credible
- A "find products / posts / documents matching…" requirement appears
Order of reveal
- 1Separate truth from index. The DB is the source of truth; the search index is a derived view I can always rebuild.
- 2Describe ingestion. CDC or a transactional outbox feeds an idempotent indexer — never a dual write — so indexing is async and decoupled.
- 3Model the fields. Text fields get analyzers and stemming, identifiers are keyword for exact filter, numbers are numeric — and filters are separate from scoring.
- 4Explain ranking. BM25 base, function_score boosts for popularity and recency, two-phase re-rank for the heavy signals.
- 5Give the rebuild plan. Versioned indexes behind an alias — build, validate, flip, keep the old one for rollback.
- 6Mention scale escape hatches. search_after for pagination, sharding for size, hybrid vector retrieval if vocabulary mismatch hurts.
Signature phrases
- “Search is a derived view” — Prevents the fatal mistake of treating the engine as truth.
- “Analyzer choice is schema design” — Shows you know where relevance silently breaks.
- “Filters narrow, scores rank” — Demonstrates the filter-vs-score split most candidates miss.
- “BM25 base, boost don't replace” — Signals real relevance understanding over hand-waving.
- “The alias flip is the migration plan” — Names the zero-downtime, rollback-able rebuild.
- “Never dual-write” — Identifies the most common ingestion failure up front.
Likely follow-ups
?“What if search results lag DB writes?”Reveal
That lag is inherent to async ingestion and usually a few seconds — I monitor CDC/indexer lag as a first-class metric. For the read-your-write case (a user searching for the item they just created), I fall back to a DB lookup for that specific item, or optimistically merge the just-written doc into the result set. I state an acceptable freshness SLA rather than pretending it's instant.
?“How do you change relevance safely?”Reveal
Like production code. Offline evaluation first — run the candidate ranking against a judged query set and compare NDCG/MRR. Then shadow traffic to compare against production silently. Then an online A/B test measured on click-through and conversion, not eyeballing. Keep the old index and old model for instant rollback, and only ship if the metrics move the right way.
?“BM25 misses obvious matches — "laptop bag" vs "notebook sleeve". What do you do?”Reveal
That's vocabulary mismatch — BM25 matches terms, not meaning. I add dense-vector (semantic) retrieval: embed documents via the same CDC pipeline into an ANN index, run BM25 and k-NN in parallel at query time, and fuse the candidate lists with Reciprocal Rank Fusion before re-ranking. The lexical half keeps exact-term precision for SKUs and names; the vector half adds semantic recall.
?“How do you handle autocomplete?”Reveal
A separate index, because it's a different problem — prefix matching, every keystroke, sub-50 ms. An edge-ngram analyzer or an in-memory trie/FST, ranked by popularity rather than BM25 because users want the common completion. I keep the suggestion corpus small and curated so it stays in memory.
?“How do you paginate to page 1000?”Reveal
I don't offer arbitrary jump-to-page at depth — deep from/size offsets force every shard to over-return and the coordinator to sort an offset-proportional pile. I use search_after cursor pagination: pass the sort values of the last result as the start of the next page, so every page costs the same. "Next page" cursors, not "page 1000" offsets.
Canonical examples
- →Product / catalogue search
- →Wiki and documentation search
- →Log exploration UI (Kibana / OpenSearch Dashboards)
- →Feed / hashtag search
- →Support-ticket and email search
Variants
Postgres full-text search
tsvector + GIN index inside your existing database — no new system to operate.
Each term maps to a posting list of (doc_id, term frequency, positions). Query = look up lists, intersect/union, score, sort.
For corpora up to roughly 10M documents and modest QPS, Postgres full-text search is often the right answer, and reaching for Elasticsearch is premature complexity. You add a tsvector column, build a GIN index on it, and query with @@ and ts_rank. You get stemming, stop words, and ranking without operating a second datastore, and — critically — your search data is already consistent with your truth because it lives in the same transaction.
The ceiling: Postgres FTS ranking is less sophisticated than BM25 + function_score, faceted aggregations are clunky, and it doesn't shard search the way a dedicated engine does. When relevance tuning, facets, or scale become the bottleneck, graduate to a real engine — but start here when you can.
Pros
- +No new system — search is transactionally consistent with truth
- +Good enough relevance for many catalogues and document sets
- +One fewer thing to operate, secure, and keep in sync
Cons
- −Weaker ranking and faceting than a dedicated engine
- −Search load competes with transactional load on the same DB
- −Limited horizontal scaling of the search workload
Choose this variant when
- Up to ~10M docs and modest search QPS
- You want to avoid operating a separate search cluster
- Strong read-your-write consistency matters for search
Dedicated engine (Elasticsearch / OpenSearch)
Sharded inverted index with BM25, facets, and custom scoring for billions of docs.
A coordinator fans the query to every shard, each returns its local top-K, the coordinator merges to the global top-K.
The workhorse for large-scale faceted search. The index is sharded by doc_id and replicated; queries scatter-gather across shards; BM25 plus function_score handles relevance; aggregations power facets and analytics. Ingestion comes from CDC/outbox into an idempotent indexer.
This is the right tool when you need rich aggregations, custom scoring, billions of documents, or log/metric exploration. The cost is real operational weight — cluster management, shard sizing, JVM tuning, snapshot/restore — so don't reach for it before the scale or feature set demands it. Meilisearch and Typesense are lighter-weight alternatives with excellent developer experience at moderate scale.
Pros
- +Scales to billions of docs with sharding + replicas
- +Rich facets, aggregations, and custom scoring
- +Mature ecosystem (Kibana, snapshots, reindex API)
Cons
- −Heavy to operate — sizing, JVM, cluster health
- −Easy to misuse as a source of truth and lose data
- −Relevance tuning is its own ongoing discipline
Choose this variant when
- Billions of docs, facets, or custom scoring
- Log / metric exploration at scale
- Search is a core product surface needing tuning
Hybrid lexical + vector search
BM25 for precision + dense vectors for semantic recall, fused into one ranking.
BM25 retrieves a cheap candidate set; a heavier model (or function_score) re-ranks just the top-K with business and personalisation signals.
Modern relevance increasingly combines lexical retrieval (BM25 — great at exact terms, names, SKUs) with semantic retrieval (dense embeddings + approximate nearest neighbour — great at synonyms and intent). You run both retrievers, then fuse the candidate lists — commonly with Reciprocal Rank Fusion (RRF) — before re-ranking.
This solves the classic BM25 failure where a user searches "laptop bag" and misses a product titled "notebook sleeve." The embeddings capture that they mean the same thing; BM25 alone would not. Engines like Vespa, OpenSearch, and Elasticsearch now support this natively with a knn query alongside the lexical query. The cost is an embedding pipeline (the same CDC stream feeds an embedding model) and a vector index alongside the inverted index.
Pros
- +Catches synonym / intent matches BM25 misses
- +Lexical half keeps exact-term precision (SKUs, names)
- +Fusion (RRF) is simple and robust to tune
Cons
- −Embedding pipeline + vector index to build and keep fresh
- −Harder to explain why a result ranked where it did
- −Embedding model drift needs periodic re-embedding
Choose this variant when
- Natural-language queries with synonyms / intent
- BM25 alone leaves obvious good results unmatched
- You already have an embedding pipeline available
Autocomplete / type-ahead
A separate prefix index (edge-ngram or FST/trie) for sub-50 ms suggestions.
Prefix/edge-ngram or an in-memory trie/FST serves type-ahead in <50 ms — separate from the main full-text index.
Type-ahead is a different problem from full-text search and deserves a different index. Every keystroke fires a query, so it must return in tens of milliseconds, and it matches prefixes rather than whole terms. Two common implementations: an edge-ngram analyzer in the search engine (index "head" as h, he, hea, head) or an in-memory trie / finite-state transducer (FST) for the hottest suggestions.
Rank suggestions by popularity, not BM25 — users want the common completion, not the most textually relevant. Keep the suggestion corpus small and curated (top queries, product names) so it stays in memory and stays fast.
Pros
- +Sub-50 ms suggestions on every keystroke
- +Popularity ranking matches user intent for completions
- +Small in-memory index — cheap and fast
Cons
- −A second index to build and keep fresh
- −Prefix matching ≠ full-text relevance — don't reuse it for search
Choose this variant when
- Search box needs type-ahead suggestions
- You can curate a bounded suggestion corpus
Scaling path
v1 — Postgres LIKE / ILIKE
Ship search on day one with zero new infrastructure.
SELECT * FROM products WHERE name ILIKE '%term%'. It works, it's one line, and for a few thousand rows it's genuinely fine. Be honest that this is a starting point, not a destination.
The fatal flaw is that %term% can't use a B-tree index — the leading wildcard forces a full table scan. Latency grows linearly with the corpus, and it has no relevance ranking, no stemming, and no multi-term scoring. At hundreds of thousands of rows it falls over.
What triggers the next iteration
- Leading-wildcard LIKE forces a full table scan
- No ranking, stemming, or multi-term relevance
- Latency grows linearly with corpus size
v2 — Postgres full-text search (tsvector + GIN)
Real relevance and index-backed queries without a new system.
Add a tsvector column and a GIN index, query with @@ and ts_rank. Now you have stemming, stop words, multi-term ranking, and queries that hit an index instead of scanning. Search stays transactionally consistent with truth because it lives in the same database.
Each term maps to a posting list of (doc_id, term frequency, positions). Query = look up lists, intersect/union, score, sort.
This serves a great many products well past launch. The pressure that pushes you off it: search load competing with transactional load, weak faceting, and ranking that you can't tune as precisely as you'd like.
What triggers the next iteration
- Search load competes with OLTP load on the same DB
- Faceted aggregations are awkward
- Ranking less tunable than BM25 + function_score
v3 — dedicated engine fed by CDC
Decouple search from the DB and get real ranking + facets.
Stand up Elasticsearch/OpenSearch. The primary DB stays truth; a CDC stream (Debezium) or transactional outbox publishes changes to Kafka; an idempotent indexer upserts documents into the engine. Now search scales independently of the OLTP database, you get BM25 + function_score + aggregations, and a slow indexer never blocks a write.
App writes one DB transaction; CDC publishes the change; an idempotent indexer upserts into the engine. Search lag is seconds, by design.
The new responsibilities: monitoring CDC/indexer lag, making upserts idempotent (so replays don't corrupt the index), and — from day one — an alias-based rebuild path.
What triggers the next iteration
- Search lag — index trails the DB by CDC latency
- Indexer must be idempotent to survive replays
- A new cluster to size, monitor, and keep healthy
v4 — two-phase ranking + reindex discipline + autocomplete
Tune relevance safely and operate at scale.
Add a two-phase ranking pipeline — BM25 retrieves a cheap candidate set (top ~1000), then a heavier re-ranker applies popularity, recency, and personalisation (or an ML model) to just those candidates. Add a separate autocomplete index for type-ahead. Make reindex routine with versioned indexes and alias flips, and gate every relevance change behind offline evaluation and an A/B test.
BM25 retrieves a cheap candidate set; a heavier model (or function_score) re-ranks just the top-K with business and personalisation signals.
At this stage relevance is a continuous discipline: offline metrics (NDCG), shadow queries, A/B tests on click-through and conversion, and the alias pattern as your safety net for rollback.
What triggers the next iteration
- Per-query personalisation can blow cache and latency
- Relevance changes need offline eval + A/B gating
- Multiple indexes (search + autocomplete + vectors) to keep fresh
Deep dives
Analyzer design is why search feels good or broken
Each term maps to a posting list of (doc_id, term frequency, positions). Query = look up lists, intersect/union, score, sort.
The single biggest lever on perceived relevance is per-field analysis, and it's where weak answers stop at "put it in Elasticsearch."
Each term maps to a posting list of (doc_id, term frequency, positions). Query = look up lists, intersect/union, score, sort.
The same source field is often indexed several ways for different jobs:
- Free text (
name,description): tokenise, lowercase, stem ("running" → "run"), drop stop words. This is what BM25 scores against. - Keyword (
category,sku,brand): stored un-analysed for exact-match filtering and aggregations. If you tokenise a SKU, exact lookups break. - Numeric (
price,rating): numeric fields for range filters and sorting. - Edge-ngram (autocomplete): index prefixes for type-ahead.
The most important distinction in the query is filter vs score. A category filter or price range is a must clause that narrows the candidate set but does not affect the relevance score — wrapping it in the scoring query both slows things down and pollutes ranking. Filters are also cacheable; scoring clauses are not. Getting this split right is the difference between "search feels precise" and "search returns garbage."
A field with the wrong analyzer fails silently: a search for "T-Shirt" returns nothing because the field was tokenised as "t" and "shirt" but queried as a keyword, or a stemmer collapses "apple" and "apples" when you needed them distinct. Always state which fields are text, which are keyword, and which are numeric.
Ingestion: never dual-write, always CDC or outbox
App writes one DB transaction; CDC publishes the change; an idempotent indexer upserts into the engine. Search lag is seconds, by design.
The most common production failure in search is a dual write — the application writes the DB and the search engine in the same request handler. It works in the demo and breaks the first time the second write fails: now the DB says one thing and the index says another, permanently, because there's no mechanism to reconcile them.
App writes one DB transaction; CDC publishes the change; an idempotent indexer upserts into the engine. Search lag is seconds, by design.
The fix is to make the DB write the only synchronous write and derive the index from it:
- Transactional outbox: the app writes the business row and an outbox row in one transaction; a relay reads the outbox and publishes to Kafka. Atomic with the business write, no extra infrastructure on the write path.
- CDC (Debezium): tail the database's replication log and publish row changes directly. No app changes at all, but you couple to the DB's log format.
Either way, the indexer is an idempotent consumer: it upserts by doc_id, so replaying the same event twice (which will happen — Kafka is at-least-once) produces the same index state. The price is search lag: the index trails the DB by the pipeline latency, usually seconds. You accept that, monitor it as a first-class metric, and for the rare read-your-write case (a user searching for the thing they just created) you fall back to the DB for that specific item.
Relevance: BM25 base, two-phase re-rank, A/B everything
BM25 retrieves a cheap candidate set; a heavier model (or function_score) re-ranks just the top-K with business and personalisation signals.
Relevance is where search stops being a database problem and becomes a product problem.
BM25 retrieves a cheap candidate set; a heavier model (or function_score) re-ranks just the top-K with business and personalisation signals.
BM25 is the base. It scores a document by how often the query terms appear in it (term frequency), damped so the tenth occurrence matters less than the first, and weighted by how rare each term is across the corpus (inverse document frequency), with a normalisation for document length so long documents don't win by sheer size. It's a strong, parameter-light default — do not throw it away.
Layer business signals on top, never instead. function_score multiplies the BM25 score by boosts: log1p(sales) for popularity, an exponential decay on age for recency, a category boost for merchandising. The mental model is "BM25 ranks by textual relevance; boosts nudge toward business relevance." When the two fight — a textually perfect match for a discontinued product — you tune the weights, you don't discard BM25.
Two-phase retrieval keeps it fast. Expensive ranking (ML models, personalisation) can't run over the whole corpus. So phase one uses cheap BM25 to retrieve a candidate set (top ~1000), and phase two re-ranks only those candidates with the heavy signals. This bounds the expensive work to a fixed K regardless of corpus size.
Change relevance like you change production code. Every ranking change goes through offline evaluation (NDCG/MRR against a judged query set), then shadow traffic, then an A/B test measured on click-through and conversion — never "it looked better on my three test queries." Keep the old index and old model for instant rollback.
Every serious search system needs a rebuild path
Build v5 in the background, validate, flip the alias atomically, keep v4 for rollback. The search equivalent of blue/green.
Analyzer changes, mapping changes, and new fields frequently require a full reindex — you can't retroactively re-analyse documents already in the index. If the search engine is the only copy of your data, you're stuck. Because the DB is truth, rebuild is routine.
Build v5 in the background, validate, flip the alias atomically, keep v4 for rollback. The search equivalent of blue/green.
The mechanism is versioned indexes behind an alias:
- 1The application only ever queries the alias
products, which currently points atproducts-v4. - 2Create
products-v5with the new mapping/analyzer. - 3Bulk reindex from the source DB (or the engine's reindex API) into v5.
- 4Validate — run your judged query set against v5, compare relevance metrics to v4.
- 5Atomically flip the alias from v4 to v5 (a single metadata operation — no downtime, no half-state).
- 6Keep v4 around for a day for instant rollback; delete it once confident.
This is blue/green for search. The discipline it enforces — that the index can always be rebuilt from truth — is also your disaster-recovery story: if the cluster is lost or corrupted, you rebuild from the DB rather than restoring a possibly-stale snapshot.
Sharding, fan-out, and the deep-pagination trap
A coordinator fans the query to every shard, each returns its local top-K, the coordinator merges to the global top-K.
At scale the index is sharded, and a query becomes a scatter-gather: the coordinator sends the query to every shard, each computes its local top-K, and the coordinator merges them into the global top-K.
A coordinator fans the query to every shard, each returns its local top-K, the coordinator merges to the global top-K.
Two subtleties matter in interviews:
Local-to-global correctness for ranking. Each shard must return enough candidates (and their scores) for the merge to be correct. Because IDF — the "how rare is this term" component of BM25 — is computed per shard by default, scores can differ slightly across shards for the same document; engines offer a global-IDF mode when consistent scoring matters, at the cost of an extra round trip.
Deep pagination is a tarpit. To return results 10,000–10,010, a naive fan-out makes every shard return its top 10,010 and the coordinator sorts 10,010 × shard_count documents to discard all but ten. Cost grows with the page offset. The fix is cursor / search-after pagination: pass the sort values of the last result as the starting point for the next page, so each page costs the same regardless of depth. Offer "next page" cursors, not arbitrary "jump to page 1000" offsets.
Shard sizing is the other operational lever: too many small shards waste coordination overhead, too few large shards limit parallelism and make rebuilds slow. Size shards to a few tens of GB and scale by adding shards (and replicas for read throughput) as the corpus grows.
Lexical + vector: when keywords aren't enough
BM25 retrieves a cheap candidate set; a heavier model (or function_score) re-ranks just the top-K with business and personalisation signals.
BM25 matches terms. It cannot know that "laptop bag" and "notebook sleeve" mean the same thing, or that a query for "comfortable running shoes" should surface a product described as "cushioned trainers." That's the vocabulary mismatch problem, and it's where dense-vector (semantic) retrieval earns its place.
The hybrid recipe:
- 1Embed documents (the same CDC stream feeds an embedding model) into dense vectors stored in an ANN index (HNSW).
- 2At query time, run both retrievers — BM25 over the inverted index and k-NN over the vector index.
- 3Fuse the two candidate lists. Reciprocal Rank Fusion (RRF) is the robust default: a document's fused score sums
1 / (k + rank)from each list, so it rewards documents that rank well in either retriever without needing the two score scales to be comparable. - 4Re-rank the fused candidates as usual.
The lexical half keeps exact-term precision (a search for a specific SKU or model number still works); the vector half adds semantic recall (synonyms, intent, paraphrase). The costs are an embedding pipeline to build and keep fresh, a vector index alongside the inverted index, and reduced explainability — it's harder to say why a semantically-retrieved result ranked where it did. Reach for hybrid when you can see BM25 leaving obviously-good results unmatched; don't add it speculatively.
Decision levers
Engine choice
Postgres FTS up to ~10M docs and modest QPS — no new system, transactionally consistent. Meilisearch / Typesense for great DX at moderate scale. Elasticsearch / OpenSearch for billions + facets + custom scoring. Vespa when search is the product and you need first-class ML re-ranking.
Ingestion pipeline
Never dual-write. Transactional outbox (atomic with the business write) or CDC via Debezium. Publish to Kafka; an idempotent indexer upserts by doc_id so at-least-once replays are safe. Accept seconds of search lag and monitor it.
Analyzer per field
Free text → standard + language stemming. Identifiers (SKU, category) → keyword, un-analysed, exact match. Numbers → numeric for range + sort. Autocomplete → edge-ngram. Separate filter clauses (cacheable, non-scoring) from scoring clauses.
Relevance strategy
BM25 base, never replaced. function_score multiplies in popularity (log1p) and recency decay. Two-phase: cheap retrieval → expensive re-rank over the top-K. Gate every change behind offline NDCG eval + an online A/B test on CTR / conversion.
Rebuild path
Versioned indexes behind an alias. Build vN+1 in the background, validate against a judged query set, flip the alias atomically, keep vN for rollback. This is also your DR story — rebuild from truth.
Lexical vs hybrid
BM25 alone for exact-term corpora (codes, names). Add dense-vector retrieval fused with RRF when vocabulary mismatch (synonyms, intent) leaves good results unmatched. Costs an embedding pipeline and a vector index.
Failure modes
Treating the search engine as the only copy of data. Engines crash and corrupt; treat the index as a derived view and rebuild from the DB. If you can't rebuild, you can't recover.
Writing DB and search engine in the same handler. The first failed second-write leaves them permanently inconsistent. Use CDC or a transactional outbox so the DB write is the only synchronous write.
Coupling write latency and availability to the indexer. Make indexing async; accept a few seconds of search lag and fall back to the DB for read-your-write cases.
A leading-wildcard LIKE forces a full table scan and dies at hundreds of thousands of rows, with no ranking. Step up to Postgres FTS, then a real engine.
Tokenising a SKU, or keyword-matching free text, produces silently empty or irrelevant results. Choose analyzers per field and separate filter clauses from scoring clauses.
A schema or analyzer change with no alias + reindex pattern means downtime or a stuck migration. Plan the versioned-index alias flip from day one.
Large offsets force every shard to over-return and the coordinator to sort an offset-proportional pile. Use search_after cursor pagination; offer "next page", not "jump to page 1000".
Re-ranking every result per user blows cache hit rate and latency. Personalise at the top-K re-rank stage, not during retrieval/scan.
Case studies
Algolia
Algolia — relevance as a product, ranking as a tie-break ladder
Algolia built a business on the premise that relevance and speed are the product, and their public engineering writing is a masterclass in the ranking layer most candidates hand-wave.
Rather than a single opaque score, Algolia uses a tie-breaking ranking formula: a configurable, ordered list of criteria (number of matched words, proximity of the matched words, attribute importance, exactness, then custom business ranking like popularity) applied in sequence. Two results that tie on the first criterion are decided by the next, and so on. This is deliberately explainable — you can point at exactly which criterion ranked one product above another, which makes relevance tuning a conversation with product managers rather than a black box.
They also push hard on the type-as-you-search experience, treating sub-50 ms response as a hard requirement because every keystroke is a query. That forces the separation of concerns this pattern describes: a prefix-optimised index for instant suggestions, distinct from the main searchable attributes, with results ranked by popularity for the common-completion intent.
The takeaway for an interview: don't describe relevance as "BM25 and done." Describe an ordered, tunable, explainable ranking pipeline, and treat latency as a first-class constraint that shapes the index design.
Elastic
Elasticsearch — BM25, scatter-gather, and the search-after cursor
Elasticsearch (and its OpenSearch fork) is the de-facto reference implementation of this pattern, and its design choices are worth citing directly.
It switched its default similarity from the classic TF-IDF to BM25 because BM25's term-frequency saturation and length normalisation produce better relevance out of the box — a concrete reminder that "use BM25" is the modern default, not TF-IDF. Relevance is then layered with function_score for popularity and recency boosts and bool queries that cleanly separate filter context (cacheable, non-scoring) from must context (scoring).
Its query execution is the canonical scatter-gather: the coordinating node fans the query to a copy of each shard, each shard returns its local top-K with document IDs and scores, and the coordinator merges them. Crucially, Elastic documents the deep-pagination problem explicitly and steers users away from large from/size offsets toward search_after (cursor) pagination and the scroll/PIT APIs for deep traversal — exactly because the naive approach forces every shard to over-return and the coordinator to sort an offset-proportional pile of documents.
And the alias-based reindex is first-class: the reindex API plus index aliases give you the zero-downtime, rollback-able rebuild that the pattern insists on.
Etsy
Etsy — two-phase retrieval and learning-to-rank on the candidate set
Etsy's search serves a huge, long-tail marketplace where pure text relevance is far from enough — buyers care about quality, shipping, price, and recency, not just keyword overlap. Etsy's engineering team has written extensively about moving from hand-tuned scoring to machine-learned ranking (learning-to-rank), and the architecture is the textbook two-phase shape.
Phase one is candidate retrieval: a cheap lexical query (inverted index, BM25-style) pulls a few hundred to a few thousand plausible listings for the query. Running an expensive model over the entire corpus would be impossibly slow, so retrieval's only job is high recall at low cost.
Phase two is re-ranking: a learned model scores just those candidates using rich features — historical click-through and purchase rates, listing quality signals, price competitiveness, recency, and personalisation. Because it runs over a bounded candidate set, the expensive model's cost is independent of corpus size.
Etsy gates ranking changes behind offline evaluation and online A/B tests measured on engagement and conversion, not eyeballing — the discipline this pattern prescribes. The result is the standard production blueprint: cheap recall-oriented retrieval feeding an expensive precision-oriented re-ranker, with every change measured.
Decision table
Search quality comes from ingestion freshness, analyzer choices, and ranking discipline.
| Decision | Default | Trade-off | Robust answer includes |
|---|---|---|---|
| Source of truth | Primary DB; index is derived | Index can always be rebuilt | CDC / outbox and idempotent upserts |
| Engine | Postgres FTS < 10M, else ES/OpenSearch | Operate a second system for scale + facets | Named scale threshold and an alternative |
| Analyzer | Per-field (text / keyword / numeric) | Wrong analyzer → bad relevance or zero matches | Filter-vs-score split named explicitly |
| Ranking | BM25 base + function_score boosts | Business relevance can fight text relevance | Two-phase re-rank, offline eval, A/B test |
| Reindex | Versioned index + alias flip | Background cost + temporary duplicate storage | Validation and rollback plan |
| Pagination | search_after cursor | No arbitrary jump-to-page | Why deep from/size offsets are a tarpit |
- Filters are cacheable and non-scoring; scoring clauses are neither — keep them separate.
- Search lag is a feature of async ingestion; monitor it and fall back to the DB for read-your-write.
Drills
Explain an inverted index in 30 seconds.Reveal
For every term, store a posting list of (doc_id, term_frequency, positions). A search looks up each term's list, intersects (AND) or unions (OR) the doc_ids, scores the survivors with BM25, sorts, and returns the top K. It's expensive to build and cheap to query — the opposite of a B-tree, which is why it's a separate index from your primary key.
Why is `LIKE '%term%'` not search?Reveal
The leading wildcard can't use a B-tree index, so every query is a full table scan whose cost grows linearly with the corpus — it dies at hundreds of thousands of rows. It also has no relevance ranking, no stemming, and no multi-term scoring. Search needs an inverted index and BM25, which is a fundamentally different data structure.
You changed the analyzer. How do you deploy it?Reveal
The alias pattern. Create products-v5 with the new analyzer, bulk reindex from the source DB (the index is a derived view, so this is routine), validate relevance against a judged query set, then atomically flip the products alias from v4 to v5 — a single metadata op with no downtime. Keep v4 for a day for rollback, then delete it.
Why must ingestion be async, and what's the cost?Reveal
Synchronous indexing couples write latency and availability to the search engine — a slow or down indexer would block business writes. Async (CDC/outbox → idempotent indexer) decouples them. The cost is search lag: the index trails the DB by the pipeline latency, usually seconds. You monitor that lag and fall back to the DB for the rare read-your-write case.
What's the difference between a filter clause and a scoring clause?Reveal
A filter (category, price range) narrows the candidate set with a yes/no test and does not affect the relevance score — and because it's boolean it's cacheable. A scoring clause (the text match, the boosts) computes how relevant each surviving document is. Mixing them slows queries and pollutes ranking; keep filters in the filter context and scoring in must.
BM25 returns nothing for "laptop bag" but a "notebook sleeve" exists. Fix it.Reveal
That's vocabulary mismatch — BM25 matches terms, not meaning. Add semantic retrieval: embed documents into dense vectors via the same CDC pipeline, store them in an ANN index, and at query time run BM25 and k-NN in parallel, fusing the candidate lists with Reciprocal Rank Fusion before re-ranking. The lexical half keeps exact-term precision; the vector half adds synonym/intent recall.
When to reach for this