Search indexing
Inverted indexes, analyzers, ranking, reindexing, freshness, and search cluster shape.
Full-text search is an inverted index plus relevance; the hard part is relevance. LIKE '%foo%' on a SQL table is not search — it's a table scan waiting to die.
Read this if your last attempt…
- You said "we'll search with LIKE" and moved on
- You don't know what an inverted index is
- You can't explain BM25 or TF-IDF
- You haven't thought about index-rebuild flows
The concept
A search engine (Elasticsearch / OpenSearch / Solr / Meilisearch / Vespa) is built on an inverted index: for every term, a posting list of (doc_id, positions, frequency). A search for "quick brown fox" becomes: look up each term's posting list, intersect/union doc_ids, score each hit, sort by score, return top K.
Analyzers turn raw text into terms:
- Tokenise (split on whitespace/punctuation).
- Lowercase.
- Remove stop words (optional; "the", "a").
- Stem or lemmatise ("running" → "run").
- Language-specific (Chinese word-segmentation, Japanese, accent folding).
DB is the truth. CDC populates the index. Queries fan out across shards, merge, re-rank.
Search engine trade-offs.
| Option | Scale | Relevance knobs | Ops cost |
|---|---|---|---|
| SQL LIKE / ILIKE | Up to millions of rows | Effectively none | Zero (you already have the DB) |
| Postgres full-text (tsvector) | Millions of rows | Decent | Low — it's a Postgres feature |
| Meilisearch / Typesense | 10M docs | Good, typo-tolerance built in | Low — single binary |
| Elasticsearch / OpenSearch | Billions | Full BM25, aggregations, custom scoring | Non-trivial — cluster operations |
| Vespa / Solr | Billions + ML re-rank | Advanced | High; for teams with search as core product |
How interviewers grade this
- You name the search engine (Elasticsearch / Meilisearch / etc.) and don't reach for LIKE.
- You state the analyzer choice (language, stemming, stop words).
- You pick BM25 as the default relevance and name the re-rank signals.
- Your index is populated from CDC; the DB is the source of truth.
- You have a rebuild plan for schema/analyzer changes.
Variants
Postgres full-text search
Use tsvector + GIN index. No extra infrastructure.
Excellent fit up to single-digit millions of rows. Supports ranking (ts_rank), language-aware stemming. The "right answer" for small and medium products — avoids a whole new system.
Pros
- +No extra infrastructure
- +Transactional consistency with the rest of Postgres
- +Language support built in
- +Works at modest scale
Cons
- −Maxes out around 10M docs or heavy query volume
- −Ranking is BM25-ish but less tunable
- −No typo tolerance without pg_trgm
Choose this variant when
- Up to ~10M docs
- Existing Postgres
- Don't want the ops of Elasticsearch
Elasticsearch / OpenSearch
The industrial-scale full-text engine.
Sharded, replicated, BM25 by default, rich query DSL, aggregations. Expensive to run (cluster ops, JVM tuning). Right choice when scale exceeds Postgres FTS or you need faceted search.
Pros
- +Scales to billions
- +Rich query DSL (nested, phrase, fuzzy)
- +Faceted search / aggregations
- +Custom scoring via function_score
Cons
- −Expensive cluster ops
- −Memory hungry
- −Not a system-of-record — rebuild-ability is non-optional
Choose this variant when
- Large-scale search
- Faceted search
- Need custom relevance
Meilisearch / Typesense
Batteries-included modern search engines.
Typo tolerance built in, instant search (sub-50 ms), simple API. Great DX. Scales to tens of millions of docs comfortably. Less flexible than Elasticsearch for complex queries or huge scale.
Pros
- +Excellent DX
- +Typo tolerance default-on
- +Fast queries on moderate data sizes
Cons
- −Less tunable than Elasticsearch
- −Smaller ecosystem
Choose this variant when
- Product search
- Site search
- Developer experience matters and data size is moderate
Worked example
Design: product search for an e-commerce site, 10M products.
Engine: Elasticsearch (could be Meilisearch if complexity isn't needed).
Pipeline:
- Products table in Postgres is source of truth.
- Debezium CDC publishes
products.change→ Kafka → Elasticsearch indexer consumer. - Indexer does a partial update on the Elasticsearch doc.
Index design:
- Field
name: standard analyzer + shingles for phrase matching. - Field
description: English analyzer with stemming. - Field
categories: keyword (not analyzed — exact match for filtering). - Field
price,rating,popularity: numeric fields for sorting / boosting.
Query:
- User types "running shoes" → multi_match on name^3 + description.
- BM25 gives base score.
- function_score boosts by popularity (log1p(sales_30d)) and recency (newer products slightly boosted).
- Filters (categories, price range, size) applied as must-clauses — don't affect score, just restrict the result set.
Operational:
- Alias pattern:
productsalias points atproducts-v3. Rebuild intoproducts-v4in background; flip the alias atomically. Old index kept for quick rollback. - Index rebuild after an analyzer change takes hours for 10M docs; done in the background from the CDC + snapshot.
Good vs bad answer
Interviewer probe
“How do you implement search over 10M products?”
Weak answer
"SQL query with ILIKE '%term%'."
Strong answer
"Elasticsearch with BM25 as base relevance. Postgres is source of truth; Debezium CDC populates the index. Fields are analysed appropriately — standard analyzer on product names, English-stemming on descriptions, keyword (un-analysed) on categories for exact filtering. Queries are multi_match with field boosts; function_score re-rank by popularity and recency. Filters (price, category, size) apply as must-clauses without affecting score. For rebuilds (analyzer changes, schema changes), I use the alias pattern: build products-v4, flip the products alias, keep v3 for rollback. At 10M docs it's a few hours background rebuild. If complexity were lower I'd consider Meilisearch — same scale, easier ops. LIKE on the primary DB is not search; it's a full table scan that dies at a few hundred thousand rows."
Why it wins: Names engine, pipeline, field analysis, relevance strategy, rebuild pattern, and the wrong alternatives.
When it comes up
- A feature needs full-text search or autocomplete
- The interviewer says "users search for products / posts / messages"
- Someone proposes a LIKE "%term%" scan on the primary DB
- Faceted filtering or relevance ranking is in scope
Order of reveal
- 11. Reject LIKE. A LIKE "%x%" scan reads the whole table and dies at a few hundred thousand rows. Search means an inverted index, so I name a real engine.
- 22. Right-size the engine. Under ~10M docs, Postgres full-text (tsvector + GIN) avoids a new system. Beyond that, or with facets, Elasticsearch / OpenSearch.
- 33. DB stays the truth. The primary DB is the source of truth; the index is a derived view populated via CDC — never the only copy.
- 44. Relevance. BM25 for base relevance, then function_score to boost by popularity and recency so results are not just textually correct but useful.
- 55. Reindex safely. Build into a new index and flip an alias atomically, keeping the old one for one-step rollback after analyzer or schema changes.
Signature phrases
- “LIKE '%x%' is a table scan, not search.” — The line that shows you know why a search engine exists.
- “The DB is the source of truth; the index is a derived view via CDC.” — Captures the correct architecture in one sentence.
- “BM25 for base relevance, then boost by popularity and recency.” — Shows you separate textual match from business ranking.
- “Reindex behind an alias so the flip is atomic and rollback is one step.” — Operational maturity around the routine reindex.
Likely follow-ups
?“When is Postgres full-text search actually enough?”Reveal
Up to roughly 10M documents with modest query volume and no heavy faceting, Postgres FTS (a tsvector column + GIN index, ranked with ts_rank) is genuinely sufficient — and it keeps search transactionally consistent with the rest of your data and avoids running a whole separate cluster. You graduate to Elasticsearch when scale, facets/aggregations, or custom relevance tuning exceed what FTS offers. Reaching for Elasticsearch on a 100k-row table is over-engineering.
?“Results are textually correct but the popular items do not rank first. What do you do?”Reveal
Layer business signals on top of BM25 rather than replacing it. In Elasticsearch, wrap the query in function_score and multiply the relevance by a popularity factor like log1p(sales_30d) and a recency decay like exp(-age/decay). Then validate with A/B tests on click-through and conversion — ranking is an empirical loop, not a one-shot formula.
?“You changed the analyzer and the existing index no longer matches. Walk me through the fix.”Reveal
Create a new index (products-v4) with the new analyzer, bulk-reindex from the source DB (or the old index via the reindex API), and verify it returns expected results. Then flip the products alias from v3 to v4 atomically so live traffic cuts over with zero downtime, and keep v3 for ~24h as a one-command rollback. This alias pattern is why search reindexes are routine rather than scary.
Common mistakes
Elasticsearch has crashed and corrupted. Treat the search index as a derived view that can be rebuilt from the source DB. Never the only copy.
Couples write latency to search indexing. Make indexing async via CDC/outbox. Accept a few seconds of search lag.
A full table scan per query. Dies at hundreds of thousands of rows. Postgres FTS is the first step up; a real search engine is the next.
Analyzer change? Schema change? Bug in indexer? You need to be able to rebuild. Plan the alias pattern from day one.
Practice drills
Explain an inverted index in 30 seconds.Reveal
For every term (word) in the corpus, store a posting list of (doc_id, term_frequency, positions). Search for "quick brown" looks up each term's list, intersects the doc_ids (AND) or unions them (OR), scores each match using BM25 or TF-IDF, sorts, returns top K. The index is expensive to build but cheap to query — the opposite of a B-tree, which is cheap to update but can't answer "which docs contain this word" efficiently.
Your search results are "correct but boring" — popular items don't rank first. What do you do?Reveal
Layer business signals on top of BM25. In Elasticsearch, use function_score to multiply base relevance by a popularity boost: score * log1p(sales_30d). Or a recency boost: score * exp(-age / decay). Don't overwrite BM25; boost it. Measure with A/B tests — click-through rate, conversion, dwell time. Iterate.
Interviewer: "you changed the analyzer. Existing index is now mismatched. What's your process?"Reveal
Create a new index with the new analyzer (e.g., products-v4). Bulk-reindex from the source DB (or Elasticsearch's reindex API from the old index). Verify the new index returns expected results. Flip the alias (products → v4) atomically. Keep v3 for ~24h for rollback. Delete v3 after confidence.
Cheat sheet
- •LIKE is not search.
- •<10M docs: Postgres FTS is often enough.
- •>10M or need facets: Elasticsearch / OpenSearch.
- •Great DX + moderate scale: Meilisearch / Typesense.
- •DB = source of truth. Index = derived view via CDC.
- •Alias pattern for atomic reindex + rollback.
- •BM25 base score + function_score for business signals.
Practice this skill
No problem is tagged directly to Search indexing yet. These published problems still exercise the same interview category.
Read this if