intermediatedeep dive

Search indexing

Inverted indexes, analyzers, ranking, reindexing, freshness, and search cluster shape.

~10 min read

Full-text search is an inverted index plus relevance; the hard part is relevance. LIKE '%foo%' on a SQL table is not search — it's a table scan waiting to die.

Read this if your last attempt…

You said "we'll search with LIKE" and moved on
You don't know what an inverted index is
You can't explain BM25 or TF-IDF
You haven't thought about index-rebuild flows

The concept

A search engine (Elasticsearch / OpenSearch / Solr / Meilisearch / Vespa) is built on an inverted index: for every term, a posting list of (doc_id, positions, frequency). A search for "quick brown fox" becomes: look up each term's posting list, intersect/union doc_ids, score each hit, sort by score, return top K.

Analyzers turn raw text into terms:
- Tokenise (split on whitespace/punctuation).
- Lowercase.
- Remove stop words (optional; "the", "a").
- Stem or lemmatise ("running" → "run").
- Language-specific (Chinese word-segmentation, Japanese, accent folding).

Architecture diagram· Inverted index + fan-out query

DB is the truth. CDC populates the index. Queries fan out across shards, merge, re-rank.

Search engine trade-offs.

Option	Scale	Relevance knobs	Ops cost
SQL LIKE / ILIKE	Up to millions of rows	Effectively none	Zero (you already have the DB)
Postgres full-text (tsvector)	Millions of rows	Decent	Low — it's a Postgres feature
Meilisearch / Typesense	10M docs	Good, typo-tolerance built in	Low — single binary
Elasticsearch / OpenSearch	Billions	Full BM25, aggregations, custom scoring	Non-trivial — cluster operations
Vespa / Solr	Billions + ML re-rank	Advanced	High; for teams with search as core product

How interviewers grade this

You name the search engine (Elasticsearch / Meilisearch / etc.) and don't reach for LIKE.
You state the analyzer choice (language, stemming, stop words).
You pick BM25 as the default relevance and name the re-rank signals.
Your index is populated from CDC; the DB is the source of truth.
You have a rebuild plan for schema/analyzer changes.

Variants

Postgres full-text search

Use tsvector + GIN index. No extra infrastructure.

Excellent fit up to single-digit millions of rows. Supports ranking (ts_rank), language-aware stemming. The "right answer" for small and medium products — avoids a whole new system.

Pros

+No extra infrastructure
+Transactional consistency with the rest of Postgres
+Language support built in
+Works at modest scale

Cons

−Maxes out around 10M docs or heavy query volume
−Ranking is BM25-ish but less tunable
−No typo tolerance without pg_trgm

Choose this variant when

Up to ~10M docs
Existing Postgres
Don't want the ops of Elasticsearch

Elasticsearch / OpenSearch

The industrial-scale full-text engine.

Sharded, replicated, BM25 by default, rich query DSL, aggregations. Expensive to run (cluster ops, JVM tuning). Right choice when scale exceeds Postgres FTS or you need faceted search.

Pros

+Scales to billions
+Rich query DSL (nested, phrase, fuzzy)
+Faceted search / aggregations
+Custom scoring via function_score

Cons

−Expensive cluster ops
−Memory hungry
−Not a system-of-record — rebuild-ability is non-optional

Choose this variant when

Large-scale search
Faceted search
Need custom relevance

Meilisearch / Typesense

Batteries-included modern search engines.

Typo tolerance built in, instant search (sub-50 ms), simple API. Great DX. Scales to tens of millions of docs comfortably. Less flexible than Elasticsearch for complex queries or huge scale.

Pros

+Excellent DX
+Typo tolerance default-on
+Fast queries on moderate data sizes

Cons

−Less tunable than Elasticsearch
−Smaller ecosystem

Choose this variant when

Product search
Site search
Developer experience matters and data size is moderate

Worked example

Design: product search for an e-commerce site, 10M products.

Engine: Elasticsearch (could be Meilisearch if complexity isn't needed).

Pipeline:

Products table in Postgres is source of truth.
Debezium CDC publishes products.change → Kafka → Elasticsearch indexer consumer.
Indexer does a partial update on the Elasticsearch doc.

Index design:

Field name: standard analyzer + shingles for phrase matching.
Field description: English analyzer with stemming.
Field categories: keyword (not analyzed — exact match for filtering).
Field price, rating, popularity: numeric fields for sorting / boosting.

Query:

User types "running shoes" → multi_match on name^3 + description.
BM25 gives base score.
function_score boosts by popularity (log1p(sales_30d)) and recency (newer products slightly boosted).
Filters (categories, price range, size) applied as must-clauses — don't affect score, just restrict the result set.

Operational:

Alias pattern: products alias points at products-v3. Rebuild into products-v4 in background; flip the alias atomically. Old index kept for quick rollback.
Index rebuild after an analyzer change takes hours for 10M docs; done in the background from the CDC + snapshot.

Good vs bad answer

Interviewer probe

“How do you implement search over 10M products?”

Weak answer

"SQL query with ILIKE '%term%'."

Strong answer

"Elasticsearch with BM25 as base relevance. Postgres is source of truth; Debezium CDC populates the index. Fields are analysed appropriately — standard analyzer on product names, English-stemming on descriptions, keyword (un-analysed) on categories for exact filtering. Queries are multi_match with field boosts; function_score re-rank by popularity and recency. Filters (price, category, size) apply as must-clauses without affecting score. For rebuilds (analyzer changes, schema changes), I use the alias pattern: build products-v4, flip the products alias, keep v3 for rollback. At 10M docs it's a few hours background rebuild. If complexity were lower I'd consider Meilisearch — same scale, easier ops. LIKE on the primary DB is not search; it's a full table scan that dies at a few hundred thousand rows."

Why it wins: Names engine, pipeline, field analysis, relevance strategy, rebuild pattern, and the wrong alternatives.

Interview playbook2–3 min when search is a first-class feature of the system

When it comes up

A feature needs full-text search or autocomplete
The interviewer says "users search for products / posts / messages"
Someone proposes a LIKE "%term%" scan on the primary DB
Faceted filtering or relevance ranking is in scope

Order of reveal

1
1. Reject LIKE. A LIKE "%x%" scan reads the whole table and dies at a few hundred thousand rows. Search means an inverted index, so I name a real engine.
2
2. Right-size the engine. Under ~10M docs, Postgres full-text (tsvector + GIN) avoids a new system. Beyond that, or with facets, Elasticsearch / OpenSearch.
3
3. DB stays the truth. The primary DB is the source of truth; the index is a derived view populated via CDC — never the only copy.
4
4. Relevance. BM25 for base relevance, then function_score to boost by popularity and recency so results are not just textually correct but useful.
5
5. Reindex safely. Build into a new index and flip an alias atomically, keeping the old one for one-step rollback after analyzer or schema changes.

Signature phrases

“LIKE '%x%' is a table scan, not search.”

“The DB is the source of truth; the index is a derived view via CDC.”

“BM25 for base relevance, then boost by popularity and recency.”

“Reindex behind an alias so the flip is atomic and rollback is one step.”

“LIKE '%x%' is a table scan, not search.” — The line that shows you know why a search engine exists.
“The DB is the source of truth; the index is a derived view via CDC.” — Captures the correct architecture in one sentence.
“BM25 for base relevance, then boost by popularity and recency.” — Shows you separate textual match from business ranking.
“Reindex behind an alias so the flip is atomic and rollback is one step.” — Operational maturity around the routine reindex.

Likely follow-ups

?“When is Postgres full-text search actually enough?”Reveal

Up to roughly 10M documents with modest query volume and no heavy faceting, Postgres FTS (a tsvector column + GIN index, ranked with ts_rank) is genuinely sufficient — and it keeps search transactionally consistent with the rest of your data and avoids running a whole separate cluster. You graduate to Elasticsearch when scale, facets/aggregations, or custom relevance tuning exceed what FTS offers. Reaching for Elasticsearch on a 100k-row table is over-engineering.

?“Results are textually correct but the popular items do not rank first. What do you do?”Reveal

Layer business signals on top of BM25 rather than replacing it. In Elasticsearch, wrap the query in function_score and multiply the relevance by a popularity factor like log1p(sales_30d) and a recency decay like exp(-age/decay). Then validate with A/B tests on click-through and conversion — ranking is an empirical loop, not a one-shot formula.

?“You changed the analyzer and the existing index no longer matches. Walk me through the fix.”Reveal

Create a new index (products-v4) with the new analyzer, bulk-reindex from the source DB (or the old index via the reindex API), and verify it returns expected results. Then flip the products alias from v3 to v4 atomically so live traffic cuts over with zero downtime, and keep v3 for ~24h as a one-command rollback. This alias pattern is why search reindexes are routine rather than scary.

Common mistakes

Search as source of truth

Elasticsearch has crashed and corrupted. Treat the search index as a derived view that can be rebuilt from the source DB. Never the only copy.

Indexing for every write synchronously

Couples write latency to search indexing. Make indexing async via CDC/outbox. Accept a few seconds of search lag.

Using LIKE at scale

A full table scan per query. Dies at hundreds of thousands of rows. Postgres FTS is the first step up; a real search engine is the next.

No rebuild storyAdvanced

Analyzer change? Schema change? Bug in indexer? You need to be able to rebuild. Plan the alias pattern from day one.

Practice drills

Explain an inverted index in 30 seconds.Reveal

For every term (word) in the corpus, store a posting list of (doc_id, term_frequency, positions). Search for "quick brown" looks up each term's list, intersects the doc_ids (AND) or unions them (OR), scores each match using BM25 or TF-IDF, sorts, returns top K. The index is expensive to build but cheap to query — the opposite of a B-tree, which is cheap to update but can't answer "which docs contain this word" efficiently.

Your search results are "correct but boring" — popular items don't rank first. What do you do?Reveal

Layer business signals on top of BM25. In Elasticsearch, use function_score to multiply base relevance by a popularity boost: score * log1p(sales_30d). Or a recency boost: score * exp(-age / decay). Don't overwrite BM25; boost it. Measure with A/B tests — click-through rate, conversion, dwell time. Iterate.

Interviewer: "you changed the analyzer. Existing index is now mismatched. What's your process?"Reveal

Create a new index with the new analyzer (e.g., products-v4). Bulk-reindex from the source DB (or Elasticsearch's reindex API from the old index). Verify the new index returns expected results. Flip the alias (products → v4) atomically. Keep v3 for ~24h for rollback. Delete v3 after confidence.

Cheat sheet

•LIKE is not search.
•<10M docs: Postgres FTS is often enough.
•>10M or need facets: Elasticsearch / OpenSearch.
•Great DX + moderate scale: Meilisearch / Typesense.
•DB = source of truth. Index = derived view via CDC.
•Alias pattern for atomic reindex + rollback.
•BM25 base score + function_score for business signals.

Practice this skill

No problem is tagged directly to Search indexing yet. These published problems still exercise the same interview category.

search autocomplete ticket booking news feed

Read this if

Option

Scale

Relevance knobs

Ops cost

SQL LIKE / ILIKE

Up to millions of rows

Effectively none

Zero (you already have the DB)

Postgres full-text (tsvector)

Millions of rows

Decent

Low — it's a Postgres feature

Meilisearch / Typesense

10M docs

Good, typo-tolerance built in

Low — single binary

Elasticsearch / OpenSearch

Billions

Full BM25, aggregations, custom scoring

Non-trivial — cluster operations

Vespa / Solr

Billions + ML re-rank

Advanced

High; for teams with search as core product