Content recommendation

Feature store + candidate generation + ranking. Offline training, online serving. Separate the cheap retrieval from the expensive scoring.

Recommendation is a ranking problem layered on a retrieval problem — and the interview signal is separating the two. Candidates who say "train a neural net on clicks" have skipped the entire system.

Architecture diagram· Candidate generation + ranking

Cheap retrieval narrows millions of items to ~1000; an expensive model ranks those; a feature store feeds both, and serving logs become tomorrow's training data.

You’re looking at this pattern when

"Recommended for you" / personalised feeds
Personalised home screens and rows
E-commerce cross-sell / up-sell

Shows up in

YouTube watch-next
Spotify Discover Weekly

What most people get wrong

Small catalogue (< ~1000 items) — just popularity-sort or rules

When to reach for this

Reach for this when…

"Recommended for you" / personalised feeds
Personalised home screens and rows
E-commerce cross-sell / up-sell
"People you may know" / friend suggestions
Music / video autoplay and watch-next

Not really this pattern when…

Small catalogue (< ~1000 items) — just popularity-sort or rules
A pure chronological feed with no personalisation
Regulatory or product constraints prohibit personalisation

Good vs bad answer

Interviewer probe

“Design a "Recommended for you" feed.”

Weak answer

"Train a neural network on click data and use it to recommend the highest-scoring items."

Strong answer

"Two-stage. Candidate generation blends channels — a two-tower embedding ANN (top 500), collaborative filtering (top 200), and trending + fresh items (top 100) — merged and deduped to ~1000 candidates, so the slate stays diverse and no single channel's bias dominates. Ranking is a GBDT scoring each candidate with features from a feature store: user 7-day activity, item age and CTR, creator affinity, and context like time and device — and crucially the feature store serves the same definitions to training and to the online ranker, so there's no train/serve skew. I return the top ~20 with a ~10% epsilon-greedy exploration slice so fresh and uncertain items keep earning signal and the feedback loop doesn't collapse. Training data is logged impressions + clicks; I retrain daily and optimise watch time, not raw CTR, validated by A/B test. Cold-start: new users get popularity-in-segment then progressive personalisation; new items get content-based ANN from metadata. Latency budget is ~50 ms p99, so I cap candidate count and batch the feature fetch."

Why it wins: Names the two stages explicitly, multi-channel candidate gen with dedupe, the feature store as the anti-skew mechanism, exploration against feedback-loop collapse, separate user/item cold-start, the right objective with online validation, and a latency budget. The weak answer is "one big model on clicks" — which can't meet latency, has no exploration, no cold-start, and optimises clickbait.

Cheat sheet

•Two stages: candidate gen (cheap, ~1000) + ranker (expensive, top K).
•Candidate gen is recall; ranking is precision.
•Never a single candidate source — blend ANN + CF + trending + fresh, dedupe.
•Feature store: identical definitions for train and serve (kills skew).
•Train/serve feature skew is the #1 production failure mode.
•Start with GBDT; move to a neural ranker only when it plateaus.
•Always explore 5–10% — no impressions, no labels.
•Cold-start is two problems: users (popularity-segment) and items (content ANN).
•Optimise watch time / retention, not raw CTR; validate with A/B tests.
•Retrain on a cadence matching drift (often daily).
•Latency budget ~50 ms p99 — cap candidates, batch feature fetches.
•Log serving-time features so offline debugging reproduces real decisions.

Core concept

The single architectural idea that runs the whole field is the two-stage funnel: a cheap candidate generation stage narrows millions (or billions) of items to ~1000 in milliseconds, then an expensive ranking stage scores just those candidates with a rich model and returns the top K. You can't run a heavyweight model over the entire catalogue inside a 50 ms request, and you can't get good personalisation from cheap retrieval alone — so you do both, each tuned for its job.

Architecture diagram· Why two stages: funnel from millions to a slate

Each stage trades recall for precision — cheap retrieval over the whole catalogue, then expensive scoring over a bounded candidate set.

Candidate generation (cheap retrieval, high recall):

Collaborative filtering / matrix factorisation — "users who liked X also liked Y", batch-computed (ALS) or learned as two-tower embeddings.
Content-based embeddings + ANN — embed items and users, then approximate-nearest-neighbour search (FAISS, ScaNN, Pinecone) returns similar items in milliseconds over 100M+ items.
Heuristic channels — "trending now", "new in your city", "friends of friends" — simple SQL/Redis aggregations. Always blend a few in.

Architecture diagram· Multi-channel candidate generation

Never one source — blend embedding ANN, collaborative filtering, trending, fresh, and social channels, then dedupe into one candidate pool.

The rule is never a single source: blend embedding ANN, collaborative filtering, trending, and a fresh/exploration channel, then dedupe into one candidate pool. A single channel collapses diversity and amplifies whatever bias it carries.

Ranking (expensive scoring, high precision): a model — GBDT (LightGBM/XGBoost) as the strong baseline, graduating to a neural ranker (DeepFM, two-tower) later — scores each candidate against rich features pulled from a feature store: user features (recent activity, demographics), item features (age, category, CTR), and context features (time of day, device). Training labels come from implicit feedback — clicks, watch time, dwell.

Architecture diagram· Feature store: one definition, train and serve

A single feature pipeline writes an offline store (training) and an online store (serving) so the model sees identical features in both — defeating train/serve skew.

The feedback loop closes the system. Every served slate logs impressions and clicks; those logs are the training data for tomorrow's model. You retrain on a cadence that matches catalogue drift. But the loop has a pathology — a greedy model only ever shows what it already favours, only gets feedback on those items, and reinforces itself until coverage collapses. The fix is a small exploration budget (epsilon-greedy or Thompson sampling, ~5–10%) that keeps fresh and uncertain items flowing so the model never stops learning.

Interview walkthrough

Worked example: personalised home feed for a video app

Architecture diagram· Candidate generation + ranking

Cheap retrieval narrows millions of items to ~1000; an expensive model ranks those; a feature store feeds both, and serving logs become tomorrow's training data.

Step 0 — frame it. Don't score every video with a giant model — that can't meet the ~50 ms budget. It's a two-stage funnel: cheap candidate generation, then expensive ranking, then a small re-rank for diversity and exploration.

Step 1: multi-channel candidate generation. Blend channels, each with a top-N: two-tower embedding ANN (personalised similarity, top 500), collaborative filtering (taste neighbourhoods, top 200), trending-in-region (social proof, top 100), subscriptions/follows (relationship, top 100), and a fresh/exploration channel (new uploads, top 100). The quotas sum to 1000, so after merging and deduping the pool holds at most ~1000 candidates. An item retrieved by several channels is a strong signal.
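The merge-and-dedupe step can be sketched in a few lines of Python. This is a minimal illustration, assuming each channel has already returned an ordered list of item IDs; the channel names, the quotas, and the `merge_channels` helper are hypothetical and not taken from any particular library.

```python
from collections import defaultdict

def merge_channels(channel_results, quotas, pool_size=1000):
    """Merge per-channel candidate lists into one deduplicated pool.

    channel_results: {channel_name: [item_id, ...]} ordered best-first.
    quotas: {channel_name: top_n} cap per channel (illustrative values).
    Returns [(item_id, sources)], items found by more channels first;
    ties keep first-seen order.
    """
    sources = defaultdict(list)   # item_id -> channels that retrieved it
    first_seen = {}               # item_id -> insertion order for stable ties
    for channel, items in channel_results.items():
        for item_id in items[: quotas.get(channel, 0)]:
            if channel not in sources[item_id]:
                sources[item_id].append(channel)
            first_seen.setdefault(item_id, len(first_seen))
    ranked = sorted(sources, key=lambda i: (-len(sources[i]), first_seen[i]))
    return [(i, sources[i]) for i in ranked[:pool_size]]

channel_results = {
    "two_tower_ann": ["v1", "v2", "v3"],
    "collab_filter": ["v2", "v4"],
    "trending": ["v5", "v2"],
}
quotas = {"two_tower_ann": 500, "collab_filter": 200, "trending": 100}
pool = merge_channels(channel_results, quotas)
```

Here `v2` is retrieved by all three channels, so it appears once in the pool and sorts first. Keeping the list of source channels per item also gives the ranker a cheap "number of channels" feature.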

Architecture diagram· Multi-channel candidate generation

Never one source — blend embedding ANN, collaborative filtering, trending, fresh, and social channels, then dedupe into one candidate pool.

Step 2 — rank with a feature store. A GBDT scores each candidate using features pulled in one batched multi-get from the feature store: user features (recent watches, watch-time profile), item features (age, category, CTR), creator affinity, and context (time, device). The feature store serves the same definitions offline (training) and online (serving), so there's no skew.

Architecture diagram· Feature store: one definition, train and serve

A single feature pipeline writes an offline store (training) and an online store (serving) so the model sees identical features in both — defeating train/serve skew.

Step 3 — re-rank, explore, return. Apply diversity constraints (don't fill the slate with one creator/genre) and swap ~10% of slots for exploration — fresh or high-uncertainty items via epsilon-greedy/bandit — so the catalogue keeps getting impressions. Return the top K. Optimise watch time, not clicks.
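A rough sketch of the re-rank step, assuming the ranker hands back a best-first list of `(item_id, creator_id)` pairs and a separate pool of fresh or uncertain items exists. The `rerank` function, the per-creator cap of 2, and the epsilon of 0.1 are illustrative assumptions, not a prescribed implementation.

```python
import random

def rerank(ranked, explore_pool, k=20, max_per_creator=2, epsilon=0.1, rng=None):
    """Build a slate of up to k items from a score-ordered candidate list.

    Each slot is filled from explore_pool with probability epsilon,
    otherwise from the next ranked item that respects the per-creator cap.
    """
    rng = rng or random.Random()
    explore = list(explore_pool)
    rng.shuffle(explore)
    greedy = iter(ranked)
    slate, seen, per_creator = [], set(), {}

    def try_add(item):
        item_id, creator = item
        if item_id in seen or per_creator.get(creator, 0) >= max_per_creator:
            return False
        slate.append(item_id)
        seen.add(item_id)
        per_creator[creator] = per_creator.get(creator, 0) + 1
        return True

    while len(slate) < k:
        if explore and rng.random() < epsilon:
            try_add(explore.pop())      # exploration slot
            continue
        item = next(greedy, None)
        if item is None:
            break                       # ranked list exhausted
        try_add(item)                   # skipped silently if it breaks the cap
    return slate

ranked = [(f"v{i}", f"c{i % 10}") for i in range(30)]
fresh = [(f"new{i}", f"n{i}") for i in range(5)]
slate = rerank(ranked, fresh, k=10, rng=random.Random(0))
```

With `epsilon=0.0` the function degrades to pure greedy ranking under the diversity cap, which makes the exploration cost easy to isolate in an A/B test.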

Architecture diagram· The closed loop — and where it collapses

Serving logs train the next model, which shapes what is shown, which shapes the next logs. Without exploration the loop reinforces itself and coverage shrinks.

Step 4 — close the loop. Log every impression and click; those logs train tomorrow's model. Retrain daily to track drift. The exploration budget keeps the logs honest by ensuring new items get shown.
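Turning logs into labels is a join between impressions and clicks. The sketch below assumes a simple log schema (`request_id`, `user_id`, `item_id`, `position`, `features`); those field names and the `build_training_rows` helper are hypothetical.

```python
def build_training_rows(impressions, clicks):
    """Join impression logs with click logs into labelled examples.

    Shown-but-not-clicked items become negatives. Items that were never
    shown produce no row at all, which is why exploration matters.
    """
    clicked = {(c["request_id"], c["item_id"]) for c in clicks}
    rows = []
    for imp in impressions:
        key = (imp["request_id"], imp["item_id"])
        rows.append({
            "user_id": imp["user_id"],
            "item_id": imp["item_id"],
            "position": imp["position"],
            "features": imp["features"],   # values logged at serving time
            "label": 1 if key in clicked else 0,
        })
    return rows

impressions = [
    {"request_id": "r1", "user_id": "u1", "item_id": "v1", "position": 0,
     "features": {"item_ctr": 0.12}},
    {"request_id": "r1", "user_id": "u1", "item_id": "v2", "position": 1,
     "features": {"item_ctr": 0.03}},
]
clicks = [{"request_id": "r1", "item_id": "v2"}]
rows = build_training_rows(impressions, clicks)
```

The rows carry the feature values that were actually served rather than values recomputed later, so an offline run can reproduce the real decision.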

Step 5 — cold-start.

New user: popularity-in-segment/context for the first session, then personalise progressively as interactions arrive.
New item: content-based ANN from metadata so it's retrievable immediately, plus exploration impressions to earn real signal; the behavioural embedding takes over once data accrues.
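The two fallbacks above can be expressed as two small routing functions. This is a sketch under assumptions: the history threshold of 5, the impression threshold of 100, the segment key, and the toy content encoder are all hypothetical stand-ins for whatever a real system would use.

```python
def candidates_for_user(user, popular_by_segment, personalised, min_history=5):
    """New users get segment popularity; others get personalised retrieval."""
    if len(user["history"]) < min_history:
        segment = (user["region"], user["device"])
        return popular_by_segment.get(segment, popular_by_segment["default"])
    return personalised(user)

def item_vector(item, behavioural_vectors, content_vector, min_impressions=100):
    """Pick the embedding used to index an item for retrieval."""
    if item["impressions"] >= min_impressions and item["id"] in behavioural_vectors:
        return behavioural_vectors[item["id"]]   # learned from engagement
    return content_vector(item["metadata"])      # metadata-only fallback

def toy_content_vector(metadata):
    # Toy stand-in for a real text or image encoder.
    return [1.0, 0.0] if metadata["genre"] == "cooking" else [0.0, 1.0]

popular_by_segment = {("GB", "mobile"): ["p1", "p2"], "default": ["p9"]}
new_user = {"history": [], "region": "GB", "device": "mobile"}
regular = {"history": ["a", "b", "c", "d", "e", "f"], "region": "GB", "device": "mobile"}
new_item = {"id": "v77", "impressions": 3, "metadata": {"genre": "cooking"}}

first_session = candidates_for_user(new_user, popular_by_segment, lambda u: ["x1"])
later_session = candidates_for_user(regular, popular_by_segment, lambda u: ["x1"])
indexed_as = item_vector(new_item, {}, toy_content_vector)
```

Keeping the two functions separate mirrors the point of this step: the user-side fallback changes what is retrieved, while the item-side fallback changes how an item becomes retrievable.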

Architecture diagram· Cold-start fallbacks for users and items

New users and new items have no interaction history — fall back to content embeddings and popularity-in-segment until signal accrues.

Step 6 — meet the latency budget. ~50 ms p99: cap candidate count, batch the feature fetch into one multi-get, cache per-user features for the session, and keep the ranker lean (add a pre-ranker only if needed).
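To see why batching matters, the sketch below counts round trips against an in-memory stand-in for an online feature store. `FakeOnlineStore`, its `get`/`multi_get` methods, and `fetch_features` are hypothetical names for illustration; a real store's client API will differ.

```python
class FakeOnlineStore:
    """In-memory stand-in for an online feature store that counts round trips."""

    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.rows.get(key, {})

    def multi_get(self, keys):
        self.round_trips += 1            # one network call for the whole batch
        return {k: self.rows.get(k, {}) for k in keys}

def fetch_features(store, user_id, item_ids, session_cache):
    """Fetch user features once per session and item features in one batch."""
    if user_id not in session_cache:
        session_cache[user_id] = store.get(("user", user_id))
    item_rows = store.multi_get([("item", i) for i in item_ids])
    return session_cache[user_id], {i: item_rows[("item", i)] for i in item_ids}

item_ids = [f"v{i}" for i in range(1000)]
rows = {("item", i): {"item_ctr": 0.05} for i in item_ids}
rows[("user", "u1")] = {"watch_seconds_7d": 5400}
store = FakeOnlineStore(rows)
session_cache = {}

user_feats, item_feats = fetch_features(store, "u1", item_ids, session_cache)
first_request_trips = store.round_trips
fetch_features(store, "u1", item_ids, session_cache)
second_request_trips = store.round_trips - first_request_trips
```

The first request costs two round trips and later requests in the same session cost one, compared with 1001 if every candidate and the user were read individually.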

Architecture diagram· Latency budget across the funnel

A ~50 ms p99 budget is split across candidate gen, batched feature fetch, and ranking — caps on candidate count keep the ranker affordable.

Result. Diverse, personalised recall; precise ranking on consistent features; continuous learning with healthy exploration; explicit cold-start for both users and items; the right objective validated online; and a slate that returns within budget.

Interview playbook

8-9 minutes in a 45-minute round: 1-2 min framing the two-stage funnel, 2 min on multi-channel candidate gen + ranking, 2 min on the feature store/skew, 2 min on feedback loop + exploration + cold-start, 1 min on the latency budget.

When it comes up

The prompt says recommended, personalised, home feed, watch-next, or people-you-may-know
The catalogue is too large to rank exhaustively online
User behaviour logs drive future ranking quality
A media, social, or e-commerce discovery surface is in scope

Order of reveal

1. Split retrieval and ranking. Candidate generation narrows millions to ~1000 cheaply; the ranker scores those expensively.
2. Use multiple channels. Blend embedding ANN, collaborative filtering, trending, and fresh, then dedupe.
3. Unify features. A feature store serves identical definitions to training and serving, killing skew.
4. Log impressions. What was shown and how users reacted is the training data: no impressions, no labels.
5. Add exploration. A 5–10% budget prevents feedback-loop collapse and warms cold items.
6. Plan cold-start and metric. Separate user/item fallbacks; optimise watch time/retention, validated by A/B test.

Signature phrases

“Candidate gen is recall; ranking is precision” — Separates the two jobs and their objectives cleanly.
“No impressions, no labels” — Explains how training data is created and why exploration matters.
“Train/serve skew kills recommender quality” — Names the dominant production ML failure mode.
“Never a single candidate source” — Captures the multi-channel diversity principle.
“Optimise watch time, not clicks” — Shows objective awareness beyond raw CTR.
“Cold-start is two problems, not one” — Distinguishes user-side from item-side fallbacks.

Likely follow-ups

“How do you handle a brand-new user?”

Start with popularity-in-segment/context — the best items for their coarse attributes (region, device, declared interests at signup) — and optionally collect a couple of lightweight preference signals. Then personalise progressively as the first interactions land; the system should visibly adapt within the first session rather than pretending CF works with no history. This is distinct from new-item cold-start, which uses content embeddings.

“Offline AUC improves but online CTR is flat. Why?”

Top suspects in order: (1) train/serve feature skew — features computed differently offline vs online, so the model sees a shifted distribution at serving; (2) selection bias in the logs — training only on shown items teaches "what the old system surfaced", not true preference; (3) stale or low-freshness features; (4) metric mismatch — AUC isn't CTR isn't retention. I'd run shadow evaluation and an A/B test rather than trusting offline AUC, and log serving-time feature values to reproduce real decisions.

“How do you stop the feed becoming a filter bubble?”

Exploration plus the right metrics. A greedy ranker only learns about what it shows, so I reserve ~5–10% of the slate for fresh/high-uncertainty items (epsilon-greedy or Thompson sampling), which keeps new content earning impressions and the logs honest. I also enforce diversity constraints in re-ranking (cap per-creator/genre) and evaluate on long-horizon retention and catalogue coverage, not just immediate CTR — because optimising raw clicks accelerates the collapse.

“Your rec call must return in 50 ms. Where does the time go and how do you protect it?”

The two blow-ups are candidate count × per-item feature lookups, and model cost. I cap the candidate count (more candidates rarely help the final slate proportionally), batch the feature fetch into one multi-get against the online store instead of per-item reads, and cache per-user features for the session so they're fetched once. I keep the ranker lean (GBDT), warm model/feature caches, and if needed add a lightweight pre-ranker to trim 1000→200 before the expensive model — a mini two-stage within ranking.

“When would you move from GBDT to a neural ranker?”

Only when GBDT has clearly plateaued and I have the data scale and serving infra to justify the cost. GBDTs are a strong, fast, interpretable baseline that handle tabular cross features well, so they're the right starting point; jumping straight to deep learning pays latency and operational cost without first proving the tree model is the bottleneck. The trigger is measured: GBDT improvements flatten, there's lots of data, and a neural model wins in offline + online tests by enough to pay for its serving cost.

Canonical examples

→YouTube watch-next
→Spotify Discover Weekly
→Amazon "customers also bought"
→Instagram / TikTok feed
→Netflix home rows

Variants

Two-tower embeddings + ANN retrieval

Learn user and item embeddings in a shared space; retrieve nearest items by ANN in milliseconds.

Architecture diagram· Multi-channel candidate generation

Never one source — blend embedding ANN, collaborative filtering, trending, fresh, and social channels, then dedupe into one candidate pool.

The modern default for candidate generation. A two-tower model trains a user tower and an item tower to map both into the same embedding space, so a user's vector is close to the vectors of items they'd engage with. Item embeddings are computed offline and indexed in an ANN store (FAISS, ScaNN, Pinecone); at request time you compute the user vector and do a nearest-neighbour lookup to retrieve the top few hundred candidates from a catalogue of hundreds of millions — all in single-digit milliseconds.

The towers are trained on engagement pairs (user, item-they-engaged-with) with in-batch negatives, meaning the other items in the same training batch serve as negatives for each user. The big win is that retrieval cost largely decouples from catalogue size: ANN lookup is sub-linear, so latency grows far more slowly than the catalogue as items are added. The watch-outs are embedding staleness (item embeddings drift as content and behaviour change, so re-index on a cadence) and the fact that the towers can't see cross features between a specific user and item (that's the ranker's job).
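The retrieval step itself is just "find the item vectors with the largest dot product against the user vector". The sketch below does this exactly, by brute force, purely to show the contract. It is a stand-in and not an ANN index: a production system would swap `retrieve_top_k` (a hypothetical helper) for a library such as FAISS or ScaNN, and the two-dimensional vectors are made up.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_vec, item_vecs, k):
    """Exact top-k by dot product over every item (linear scan).

    An ANN index answers the same query approximately without
    touching every item.
    """
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in item_vecs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Item embeddings are computed offline by the item tower (toy values here).
item_vecs = {
    "cooking": [0.9, 0.1],
    "travel": [0.2, 0.8],
    "baking": [0.8, 0.3],
}
# The user vector is computed at request time by the user tower.
user_vec = [1.0, 0.0]
top2 = retrieve_top_k(user_vec, item_vecs, k=2)
```

Because each score depends only on one user vector and one item vector, nothing in this step can express a user×item cross feature, which is exactly the limitation the ranker covers.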

Pros

+Millisecond retrieval over 100M+ items (ANN is sub-linear)
+Personalised recall, not just popularity
+Decouples retrieval cost from catalogue size

Cons

−No user×item cross features — ranker must add them
−Embeddings go stale; need periodic re-indexing
−Training infra (negatives, towers) is non-trivial

Choose this variant when

Large catalogue needing personalised recall
You have engagement data to train towers
Sub-10ms candidate retrieval is required

Collaborative filtering / matrix factorisation

"Users who liked X also liked Y" from the interaction matrix — a strong, explainable baseline.

Architecture diagram· Candidate generation + ranking

Cheap retrieval narrows millions of items to ~1000; an expensive model ranks those; a feature store feeds both, and serving logs become tomorrow's training data.

Classic and still effective. Factorise the sparse user×item interaction matrix (via ALS or SGD) into low-rank user and item factors; the dot product predicts affinity, and you retrieve the top items for a user's factor vector. Item-item CF ("people who bought this also bought…") is the workhorse behind Amazon-style co-purchase recommendations and is cheap to precompute and serve.

CF excels when you have dense interaction data and want a transparent, well-understood baseline — it captures collaborative signal (taste neighbourhoods) that content features alone miss. Its weaknesses are the textbook ones: cold-start (new users/items have no row/column in the matrix) and popularity bias (heavily-interacted items dominate). In practice CF is one channel in the blend, not the whole system — complementing embedding ANN and content-based retrieval.
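Item-item CF in its simplest form is co-occurrence counting. The sketch below assumes each user's history is available as a list of item IDs; `item_item_neighbours` and the sample baskets are hypothetical, and a real system would normally normalise the raw counts (for example by item popularity) rather than use them directly.

```python
from collections import Counter, defaultdict
from itertools import combinations

def item_item_neighbours(baskets, top_n=3):
    """Count how often two items share a user's history and return each
    item's most frequent co-occurring items."""
    co_counts = defaultdict(Counter)
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return {item: [other for other, _ in counts.most_common(top_n)]
            for item, counts in co_counts.items()}

baskets = {
    "u1": ["tent", "stove", "lamp"],
    "u2": ["tent", "stove"],
    "u3": ["tent", "lamp"],
    "u4": ["stove", "kettle"],
}
neighbours = item_item_neighbours(baskets)
```

In this toy data `tent` is the top neighbour of both `stove` and `lamp` simply because it appears in the most histories, a small-scale view of the popularity bias described above. The result is a plain dictionary, so a nightly job can write it to a key-value store and serve it by lookup.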

Pros

+Captures collaborative taste signal cheaply
+Explainable ("also bought") and easy to precompute
+Strong baseline with dense interaction data

Cons

−Cold-start: no factors for new users/items
−Popularity bias toward heavily-interacted items
−Static between batch recomputes

Choose this variant when

Dense interaction data and a co-engagement signal
You want an explainable, cheap baseline channel
Co-purchase / "also viewed" surfaces

GBDT ranker over a feature store

Gradient-boosted trees scoring candidates with rich features — the right ranking baseline.

Architecture diagram· Latency budget across the funnel

A ~50 ms p99 budget is split across candidate gen, batched feature fetch, and ranking — caps on candidate count keep the ranker affordable.

For the ranking stage, start with gradient-boosted decision trees (LightGBM, XGBoost), not deep learning. GBDTs are fast to train and serve, handle heterogeneous tabular features (numeric, categorical, counts) with little preprocessing, are relatively interpretable (feature importances), and deliver a strong baseline that's hard to beat without significant investment. They score each of the ~1000 candidates against user, item, context, and crucially user×item cross features pulled from the feature store, producing a ranked slate.

Graduate to a neural ranker (DeepFM, wide-and-deep, two-tower with attention) only when GBDT plateaus and you have the data scale and serving infra to justify it. Jumping straight to deep learning is the classic over-engineering mistake — you pay the latency and operational cost without first proving GBDT is the bottleneck.

Pros

+Strong baseline, fast to train and serve
+Handles tabular/cross features with little prep
+Interpretable via feature importances

Cons

−Plateaus on very large data vs deep models
−Less able to learn raw embeddings end-to-end

Choose this variant when

You need a strong ranking baseline quickly
Features are tabular with cross terms
Latency budget is tight (trees are cheap)

Bandit / exploration layer

Reserve a slice of the slate for exploration to break the feedback loop and warm cold items.

Architecture diagram· The closed loop — and where it collapses

Serving logs train the next model, which shapes what is shown, which shapes the next logs. Without exploration the loop reinforces itself and coverage shrinks.

Pure greedy ranking creates a self-reinforcing loop: the model shows what it already favours, only learns from those impressions, and starves new or uncertain items of the signal they need — coverage collapses into a filter bubble. An exploration layer fixes this by deliberately allocating a small fraction of slate slots (≈5–10%) to items the model is uncertain about or that are fresh.

Epsilon-greedy is the simple version (with probability ε, show a random/fresh item instead of the greedy pick). Thompson sampling / contextual bandits is the principled version, sampling from the posterior so exploration concentrates where uncertainty is highest and naturally decays as confidence grows. Either way you trade a little short-term CTR for long-term learning, coverage, and resistance to feedback-loop collapse — and you measure success on long-horizon retention and catalogue coverage, not just immediate clicks.
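As a sketch of the Thompson sampling idea for a single exploration slot, assume each item's click behaviour is a Bernoulli variable with a Beta(1, 1) prior. The `thompson_pick` helper and the click counts are illustrative assumptions; a contextual bandit in production would condition on user and context features as well.

```python
import random
from collections import Counter

def thompson_pick(stats, rng):
    """Pick one item by sampling a click rate from each item's Beta posterior.

    stats: {item_id: (clicks, impressions)}. With a Beta(1, 1) prior the
    posterior is Beta(1 + clicks, 1 + impressions - clicks).
    """
    best_item, best_draw = None, -1.0
    for item_id, (clicks, impressions) in stats.items():
        draw = rng.betavariate(1 + clicks, 1 + impressions - clicks)
        if draw > best_draw:
            best_item, best_draw = item_id, draw
    return best_item

stats = {
    "proven": (300, 1000),   # well-measured, ~30% click rate
    "weak": (20, 1000),      # well-measured, ~2% click rate
    "fresh": (0, 0),         # no data yet, posterior is uniform
}
rng = random.Random(0)
counts = Counter(thompson_pick(stats, rng) for _ in range(2000))
```

The fresh item wins many draws because its posterior is wide, while the item that is confidently bad is effectively never picked. As the fresh item accumulates impressions its posterior narrows and its share of the slot settles toward what its real click rate deserves.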

Pros

+Breaks feedback-loop collapse and filter bubbles
+Warms cold items by giving them impressions
+Bandits focus exploration where uncertainty is high

Cons

−Small short-term CTR cost from non-greedy picks
−Needs long-horizon metrics to evaluate correctly

Choose this variant when

Catalogue has constant fresh content (news, UGC)
You see coverage shrinking or filter bubbles
You can measure long-term retention, not just CTR

Scaling path

v1 — popularity + simple collaborative filtering

Ship useful recommendations without ML infrastructure.

Start with popularity-by-segment and precomputed item-item collaborative filtering ("also viewed/bought"). A nightly batch job computes co-engagement and writes top lists to Redis; serving is a key lookup. This is genuinely useful for a small-to-medium catalogue and needs no model serving, feature store, or ANN index.

Architecture diagram· Candidate generation + ranking

Cheap retrieval narrows millions of items to ~1000; an expensive model ranks those; a feature store feeds both, and serving logs become tomorrow's training data.

It plateaus when the catalogue grows past what precomputed lists cover well, when you need real personalisation beyond co-engagement, and when cold-start items get no exposure. That's the trigger to add learned retrieval and a ranker.

What triggers the next iteration

Popularity bias drowns the long tail
No true personalisation beyond co-engagement
Cold-start items never surface

v2 — two-stage: ANN candidate gen + GBDT ranker

Personalise at scale by splitting retrieval from ranking.

Introduce the two-stage funnel. Train two-tower embeddings, index items in an ANN store, and blend ANN retrieval with CF and trending into a ~1000-candidate pool. A GBDT ranker scores candidates with features and returns the top K.

Architecture diagram· Multi-channel candidate generation

Never one source — blend embedding ANN, collaborative filtering, trending, fresh, and social channels, then dedupe into one candidate pool.

This is the architecture most production systems run. The new responsibilities it creates — consistent features across training and serving, and a latency budget for the ranker — define the next steps.

What triggers the next iteration

Train/serve feature skew degrades online quality
Ranker latency grows with candidate count
Embeddings drift and need re-indexing

v3 — feature store + real-time features

Eliminate train/serve skew and react to in-session behaviour.

Add a feature store that serves the same feature definitions to offline training and online serving, with an online layer fed by a stream (Kafka) for real-time features like "last 5 clicks". This kills the single biggest production failure mode — features computed one way in training and another at serving — and lets the model react within a session.

Architecture diagram· Feature store: one definition, train and serve

A single feature pipeline writes an offline store (training) and an online store (serving) so the model sees identical features in both — defeating train/serve skew.

Now the system reacts to fresh behaviour and offline metrics start predicting online ones. Remaining concerns: keeping the feedback loop healthy and handling cold-start.

What triggers the next iteration

Streaming feature freshness vs cost
Feature versioning across model versions
Online store read latency in the budget

v4 — exploration, cold-start, and continuous retraining

Keep the model learning, fair, and fresh over time.

Add an exploration budget (epsilon-greedy or bandits) to break feedback-loop collapse, explicit cold-start fallbacks for new users (popularity-in-segment) and new items (content-based ANN), and a retraining pipeline on a cadence matching drift (often daily). Diversity/business constraints re-rank the final slate.

Architecture diagram· The closed loop — and where it collapses

Serving logs train the next model, which shapes what is shown, which shapes the next logs. Without exploration the loop reinforces itself and coverage shrinks.

This is the mature system: it learns continuously, surfaces new content, recovers from drift, and balances engagement against diversity and long-term retention.

What triggers the next iteration

Tuning exploration vs short-term CTR
Retraining cost and pipeline reliability
Balancing diversity constraints against relevance

Deep dives

Why two stages — and not one big model

Architecture diagram· Why two stages: funnel from millions to a slate

Each stage trades recall for precision — cheap retrieval over the whole catalogue, then expensive scoring over a bounded candidate set.

The instinct to "just train one model to score everything" dies on latency arithmetic, and walking through it is the core signal.

A rich ranking model costs, say, tens of microseconds to score one (user, item) pair with all its features. Even at 10 µs per pair, scoring a 100-million-item catalogue takes about 1,000 seconds (roughly 17 minutes) per request, which is impossible inside a ~50 ms budget. So you split the problem by what each stage optimises:

Candidate generation optimises recall, cheaply. ANN over embeddings, CF lookups, and trending lists each return their best few hundred items in milliseconds using sub-linear or precomputed structures. The goal is "don't miss anything good" while keeping the set small (~1000). Errors here are omissions.
Ranking optimises precision, expensively. Now that the set is bounded, you can afford a heavy model with rich user×item cross features to order those ~1000 precisely and pick the top K. Errors here are mis-orderings.

This is the same cheap-recall-then-expensive-rank shape as search and geospatial dispatch, and naming that parallel signals pattern fluency. A subtle point: the two stages are trained on different objectives (retrieval vs ranking) and can disagree — the ranker may down-rank a candidate the retriever loved — which is healthy; the funnel is designed to be generous early and selective late.
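The latency arithmetic is worth being able to do out loud. The snippet below works it through, assuming a per-pair scoring cost of 10 µs (the low end of "tens of microseconds") and purely sequential scoring; both are simplifying assumptions.

```python
def scoring_time_ms(num_items, micros_per_pair):
    """Time to score num_items sequentially at a fixed per-pair cost."""
    return num_items * micros_per_pair / 1000.0

PER_PAIR_US = 10   # assumed cost of scoring one (user, item) pair

full_catalogue_ms = scoring_time_ms(100_000_000, PER_PAIR_US)
candidate_set_ms = scoring_time_ms(1_000, PER_PAIR_US)

print(f"full catalogue:  {full_catalogue_ms / 60_000:.1f} min")
print(f"1000 candidates: {candidate_set_ms:.0f} ms")
```

This prints roughly 16.7 minutes for the full catalogue and 10 ms for the candidate set. Even with heavy parallelism the full scan is several orders of magnitude over a ~50 ms budget, while 1000 candidates leave room for retrieval and the feature fetch.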

Multi-channel candidate generation — never a single source

Architecture diagram· Multi-channel candidate generation

Never one source — blend embedding ANN, collaborative filtering, trending, fresh, and social channels, then dedupe into one candidate pool.

A recommender that retrieves candidates from one source inherits that source's blind spots, and the fix is to blend complementary channels.

Each channel contributes a different kind of signal:

Embedding ANN — personalised similarity ("things like what you engage with").
Collaborative filtering — taste-neighbourhood signal ("people like you liked").
Trending / popular-in-segment — social proof and timeliness.
Fresh / exploration — new items with no history yet, so the catalogue doesn't ossify.
Social ("from people you follow") — relationship signal where it applies.

You pull the top-N from each channel, merge and dedupe into a single ~1000-candidate pool (an item retrieved by several channels is a strong signal), and hand it to the ranker. The blend is what keeps the slate diverse: relying on embeddings alone amplifies whatever the embeddings over-favour; adding trending and fresh channels guarantees timely and new content gets a shot. The channel weights themselves become a tuning knob — and a place to enforce product goals (e.g. guarantee a fresh slot, or cap any one creator). The phrase to land: candidate generation is recall; ranking is precision — and recall is a portfolio, not a single bet.

Train/serve skew is THE production failure mode

Architecture diagram· Feature store: one definition, train and serve

A single feature pipeline writes an offline store (training) and an online store (serving) so the model sees identical features in both — defeating train/serve skew.

The most common way a recommender that "works in the notebook" fails in production isn't a bad model — it's train/serve feature skew.

The trap: training features are computed in a batch job (SQL joins over a warehouse, "30-day watch time" defined one way, timestamps rounded one way), while serving features are computed in the online path (cached values, a slightly different window, different rounding). The model learns the relationship between the training feature distribution and the label, then at serving time sees subtly different numbers — so its predictions degrade in ways offline metrics never reveal. Offline AUC goes up; online CTR stays flat.

The fix is a feature store with a single source of truth for feature definitions:

One pipeline computes each feature, materialising it to an offline store (for training) and an online store (for low-latency serving), guaranteeing the same definition in both.
Streaming features (last 5 clicks, current session) flow through Kafka into the online layer so the model can react in-session.
Feature definitions are versioned, and the exact feature values used at serving time are logged, so an offline debugging run can reproduce the real decision.
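A minimal sketch of "one definition, two paths", assuming watch events are dictionaries with a timestamp and a duration. The function names and the 7-day window are illustrative, and `skewed_serving` is a deliberately wrong online reimplementation included only to show what skew looks like.

```python
from datetime import datetime, timedelta

def watch_seconds_7d(events, as_of):
    """Single definition of 7-day watch time, shared by both paths."""
    start = as_of - timedelta(days=7)
    return sum(e["seconds"] for e in events if start <= e["ts"] < as_of)

def training_row(events, impression_ts):
    # Offline: evaluate the feature as of the logged impression time.
    return {"watch_seconds_7d": watch_seconds_7d(events, impression_ts)}

def serving_row(events, now):
    # Online: the same function, evaluated at request time.
    return {"watch_seconds_7d": watch_seconds_7d(events, now)}

def skewed_serving(events, now):
    # Anti-pattern: a separate online codepath that drifted to a 30-day window.
    start = now - timedelta(days=30)
    return sum(e["seconds"] for e in events if start <= e["ts"] < now)

events = [
    {"ts": datetime(2024, 1, 1, 12), "seconds": 600},
    {"ts": datetime(2024, 1, 5, 9), "seconds": 300},
    {"ts": datetime(2023, 12, 20), "seconds": 900},
]
as_of = datetime(2024, 1, 7)

offline = training_row(events, as_of)
online = serving_row(events, as_of)
drifted = skewed_serving(events, as_of)
```

The shared definition yields 900 seconds on both paths, while the drifted online version reports 1800 for the same user at the same moment. The model was trained on the first number and served the second, which is the skew the feature store exists to prevent.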

"Train/serve skew kills recommender quality" and "one codepath generates features for both train and serve" are the sentences that show you've operated one of these systems rather than just read about the model.

The feedback loop and why you must explore

Architecture diagram· The closed loop — and where it collapses

Serving logs train the next model, which shapes what is shown, which shapes the next logs. Without exploration the loop reinforces itself and coverage shrinks.

A recommender is a closed loop, and left greedy it eats itself.

Every slate you serve generates the impressions and clicks that become training data for the next model. But the model can only get feedback on items it showed. A purely greedy ranker shows what it already believes users like, so it only ever learns about those items, reinforces its existing beliefs, and progressively buries everything else — new content, niche tastes, the long tail. Coverage shrinks into a filter bubble, and the selection bias in the logs (you only see labels for shown items) means the model is learning "what the old system showed," not "what users actually want."

Two mechanisms counter this:

Exploration. Reserve ~5–10% of slate slots for non-greedy picks — fresh items, high-uncertainty items, or bandit-selected variants. Epsilon-greedy is the simple form; Thompson sampling / contextual bandits concentrate exploration where uncertainty is highest and decay it as confidence grows. You trade a little immediate CTR for the signal that keeps the model learning and the catalogue alive.
Right metrics. Optimising pure short-term CTR accelerates the collapse (clickbait, repetition). Track long-horizon retention, catalogue coverage, and diversity alongside CTR so the loop's health is visible.
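
The exploration slice can be sketched as a per-slot coin flip. This is one simple way to do epsilon-greedy slate filling, not a prescribed implementation: `build_slate`, its parameters, and the `"explore"`/`"greedy"` labels are illustrative names, and the exploration pool is assumed to be supplied by the caller (fresh or high-uncertainty items).

```python
import random
from typing import List, Optional, Tuple


def build_slate(ranked: List[str], explore_pool: List[str], k: int = 20,
                epsilon: float = 0.1,
                rng: Optional[random.Random] = None) -> List[Tuple[str, str]]:
    """Fill k slots; each slot explores with probability epsilon.

    Returns (item, source) pairs so the serving log can record which
    impressions came from exploration.
    """
    rng = rng or random.Random()
    pool = list(explore_pool)
    rng.shuffle(pool)              # uniform random exploration picks
    greedy = iter(ranked)          # best-first order from the ranker
    slate: List[Tuple[str, str]] = []
    used = set()
    while len(slate) < k:
        if pool and rng.random() < epsilon:
            item, source = pool.pop(), "explore"
        else:
            item, source = next(greedy, None), "greedy"
            if item is None:       # ranker list exhausted: backfill from pool
                if not pool:
                    break
                item, source = pool.pop(), "explore"
        if item not in used:       # never show the same item twice
            used.add(item)
            slate.append((item, source))
    return slate
```

Tagging each slot with its source matters for training: exploration impressions were not chosen by the model, so they are the least biased labels in the log.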

The interview line: no impressions, no labels — and a greedy system stops generating impressions for anything new, so exploration isn't a nicety, it's what keeps the training data honest.

Cold-start: plan separately for users and items

Architecture diagram· Cold-start fallbacks for users and items

New users and new items have no interaction history — fall back to content embeddings and popularity-in-segment until signal accrues.

Collaborative signal needs history, and two situations have none — a new user and a new item — which require different fallbacks. Conflating them is a common miss.

New user (no interaction history). CF and personalised embeddings have nothing to work with, so fall back to popularity-in-segment / context: best items for the user's coarse attributes (region, device, referral, declared interests at signup). Optionally collect a few lightweight preference signals up front. Then progressively personalise as the first interactions arrive — the system should visibly adapt within the first session, not pretend it knows the user.

New item (no clicks yet). It has no CF column and no learned embedding from behaviour, but it does have content: title, description, category, creator, media. Use a content-based embedding (from item metadata) to place it near similar items in the ANN space, so it can be retrieved for users who like that neighbourhood — and give it exploration impressions so it can earn real engagement signal. Once it accumulates interactions, the behavioural embedding takes over.

The general principle: cold-start is solved by falling back to a different signal that doesn't require history (content, popularity) and then graduating to behavioural signal as data accrues. Naming both the user-side and item-side fallbacks distinctly is the senior tell.
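
A rough sketch of the two fallbacks, kept deliberately separate. Everything named here is an assumption for illustration: the `MIN_EVENTS` threshold, the `"global"` segment key, and the brute-force cosine search, which stands in for a real ANN index. It also assumes content vectors for new and existing items come from the same metadata encoder, so distances between them are meaningful.

```python
import math
from typing import Dict, List, Sequence

MIN_EVENTS = 5  # hypothetical threshold for "enough history"


def candidates_for_user(n_events: int, segment: str, personalised: List[str],
                        popular_by_segment: Dict[str, List[str]]) -> List[str]:
    """New users get popularity-in-segment; warmed-up users get personalised picks."""
    if n_events >= MIN_EVENTS:
        return personalised
    return popular_by_segment.get(segment, popular_by_segment.get("global", []))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_items(content_vec: Sequence[float],
                  catalogue: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Place a new item by content: its k most similar existing items."""
    ranked = sorted(catalogue, key=lambda i: cosine(content_vec, catalogue[i]),
                    reverse=True)
    return ranked[:k]


catalogue = {
    "thriller_a": [0.9, 0.1, 0.0],
    "thriller_b": [0.8, 0.2, 0.1],
    "cooking_a": [0.0, 0.1, 0.9],
}
new_item_vec = [0.85, 0.15, 0.05]  # built from metadata only, no clicks yet
neighbours = nearest_items(new_item_vec, catalogue, k=2)
```

Users who like the neighbourhood of `neighbours` become the first audience for the new item, and the exploration slice supplies the impressions that let behavioural signal take over.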

Serving inside a latency budget

Architecture diagram· Latency budget across the funnel

A ~50 ms p99 budget is split across candidate gen, batched feature fetch, and ranking — caps on candidate count keep the ranker affordable.

A rec call has a hard latency budget — often ~50 ms p99, because it sits on a page load — and the whole architecture is shaped by it.

The budget is split across the funnel: candidate generation (ANN lookup + channel merges), feature fetching, and ranking. The two places it blows up:

Candidate count × feature lookups. Ranking N candidates means fetching features for N items and N user×item pairs. If you naively fetch features one item at a time, latency explodes. The fixes: cap the candidate count (more candidates rarely improve the final slate proportionally), batch the feature fetch into one multi-get against the online store, and precompute/cache user features per request so they're fetched once, not per candidate.
Model cost. A heavy neural ranker over 1000 candidates can exceed the budget. Mitigations: keep the ranker lean (GBDT is fast), cap candidates, warm model and feature caches, and consider a lightweight pre-ranker to trim 1000→200 before the expensive model — a mini two-stage within ranking.

Caching helps broadly: per-user candidate sets and feature vectors can be cached for the session, and popular contexts (logged-out home, trending rows) can be cached outright. The discipline is to treat the budget as a design constraint, allocate it across stages explicitly, and make every per-candidate operation batched — the same way you'd budget a search query.
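
To make the "batch every per-candidate operation" point concrete, here is a small sketch that counts round trips. `FakeOnlineStore` is a hypothetical in-memory stand-in, not a real client library; real stores expose their own multi-get calls with their own names. The request fetches user features once, caps the candidate list, and issues a single multi-get for item features.

```python
from typing import Dict, Hashable, List, Optional, Tuple


class FakeOnlineStore:
    """In-memory stand-in for an online feature store; counts round trips."""

    def __init__(self, rows: Dict[Hashable, dict]):
        self.rows = rows
        self.round_trips = 0

    def get(self, key: Hashable) -> Optional[dict]:
        self.round_trips += 1
        return self.rows.get(key)

    def multi_get(self, keys: List[Hashable]) -> List[Optional[dict]]:
        self.round_trips += 1          # one network call for the whole batch
        return [self.rows.get(k) for k in keys]


def fetch_features(store: FakeOnlineStore, user_id: str, candidates: List[str],
                   max_candidates: int = 1000) -> List[Tuple[str, dict, dict]]:
    capped = candidates[:max_candidates]          # cap before paying for features
    user_feats = store.get(("user", user_id))     # once per request, not per item
    item_feats = store.multi_get([("item", c) for c in capped])
    return [(c, user_feats or {}, f or {}) for c, f in zip(capped, item_feats)]


rows = {("user", "u1"): {"clicks_7d": 12}}
rows.update({("item", "i%d" % n): {"age_days": n} for n in range(2000)})
store = FakeOnlineStore(rows)
batch = fetch_features(store, "u1", ["i%d" % n for n in range(2000)])
```

Here 2000 retrieved candidates are capped to 1000 and served with two round trips; the naive per-item loop would have made 2001.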

Decision levers

Candidate-gen source mix

Never one source. Blend embedding ANN (top 500) + CF (top 200) + trending (top 100) + a fresh/explore channel (top 100), then merge and dedupe into a pool capped at ~1000. The blend keeps the slate diverse and resilient to any single channel's bias.
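
A sketch of the merge-and-dedupe step, using the per-channel quotas quoted above as defaults. The function name, the channel keys, and the choice to keep the list of contributing channels per item are illustrative assumptions; the overlap list is kept because it can be passed to the ranker as a feature.

```python
from typing import Dict, List, Optional, Tuple

# Quotas taken from the text; channel names are hypothetical keys.
DEFAULT_QUOTAS = {"ann": 500, "cf": 200, "trending": 100, "fresh": 100}


def merge_channels(channels: Dict[str, List[str]],
                   quotas: Optional[Dict[str, int]] = None,
                   cap: int = 1000) -> Tuple[List[str], Dict[str, List[str]]]:
    """Take each channel's top-N, dedupe, and remember who retrieved what."""
    quotas = DEFAULT_QUOTAS if quotas is None else quotas
    sources: Dict[str, List[str]] = {}   # insertion-ordered: first seen wins position
    for name, items in channels.items():
        for item in items[:quotas.get(name, 0)]:
            hit = sources.setdefault(item, [])
            if name not in hit:
                hit.append(name)
    merged = list(sources)[:cap]
    return merged, {item: sources[item] for item in merged}


merged, sources = merge_channels(
    {"ann": ["a", "b", "c"], "cf": ["b", "d"], "trending": ["a", "e"]},
    quotas={"ann": 2, "cf": 2, "trending": 2},
)
```

In this toy run `c` is dropped by the ANN quota, and `a` and `b` each arrive from two channels, which is the overlap signal worth logging.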

Offline vs online features

Batch features (30-day activity) computed daily and served from the feature store; real-time features (last 5 clicks) streamed via Kafka into the online layer. Skew between offline-training and online-serving features is the #1 cause of "great offline, dead online".

Ranking model

Start with GBDT (LightGBM/XGBoost) — fast, interpretable, strong. Graduate to a neural ranker (DeepFM, two-tower) only when GBDT plateaus and the infra justifies it. Don't open with deep learning.

Exploration strategy

Epsilon-greedy (simple) or Thompson sampling / contextual bandits (principled). ~5–10% of the slate. Without exploration, coverage collapses and cold items never earn signal.
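
For contrast with epsilon-greedy, a minimal Beta-Bernoulli Thompson sampling sketch. It assumes clicks are the reward, a uniform Beta(1, 1) prior per item, and a non-empty `stats` dict; `thompson_pick` is an illustrative name. It is a plain bandit, not a contextual one.

```python
import random
from typing import Dict, Tuple


def thompson_pick(stats: Dict[str, Tuple[int, int]], rng: random.Random) -> str:
    """Pick the item whose sampled click rate is highest.

    stats maps item -> (clicks, impressions), with clicks <= impressions.
    Each item's rate is drawn from Beta(1 + clicks, 1 + non_clicks), so items
    with few impressions have wide posteriors and win draws more often.
    """
    best_item, best_draw = "", -1.0
    for item, (clicks, impressions) in stats.items():
        draw = rng.betavariate(1 + clicks, 1 + impressions - clicks)
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item


rng = random.Random(0)
stats = {"proven": (50, 1000), "fresh": (0, 0)}
picks = [thompson_pick(stats, rng) for _ in range(1000)]
```

With no impressions yet, `fresh` wins most draws; as it accumulates impressions its posterior narrows and exploration of it decays without any schedule being tuned by hand.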

Objective / metric

Optimise the business outcome — watch time, retention, conversions — not raw clicks (clickbait) or offline AUC. Validate online with A/B tests; track diversity and coverage alongside the headline metric.
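
Catalogue coverage is one of the cheaper health metrics to compute from serving logs. The sketch below uses one common definition (share of the catalogue that appeared in at least one served slate over some window); the function name and the windowing are assumptions, and other definitions exist.

```python
from typing import Iterable, List


def catalogue_coverage(slates: Iterable[List[str]], catalogue_size: int) -> float:
    """Fraction of the catalogue shown at least once across the given slates."""
    if catalogue_size <= 0:
        raise ValueError("catalogue_size must be positive")
    shown = {item for slate in slates for item in slate}
    return len(shown) / catalogue_size


coverage = catalogue_coverage([["a", "b"], ["b", "c"]], catalogue_size=10)
```

A coverage figure that falls release after release while the headline metric holds steady is an early sign of the feedback loop tightening.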

Cold-start plan

Separate fallbacks: new users → popularity-in-segment/context + progressive personalisation; new items → content-based ANN from metadata + exploration impressions. Graduate to behavioural signal as data accrues.

Retraining cadence

Match catalogue/behaviour drift — often daily for fast-moving catalogues (news, social), weekly for slower ones. Stale models miss preference drift and new content.

Failure modes

Train/serve feature skew (Advanced)

Training features come from warehouse joins; serving features from cached online values computed slightly differently. They diverge, offline metrics lie, online quality drops. Fix: one feature pipeline (feature store) feeding both, versioned, with serving-time values logged.
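
Logging serving-time values makes skew measurable. A hedged sketch of such a check: compare the values the ranker actually saw with an offline recomputation for the same keys and report the mismatch rate. The key shape, the tolerance, and the `skew_rate` name are assumptions for illustration.

```python
import math
from typing import Dict, Hashable


def skew_rate(served: Dict[Hashable, float], recomputed: Dict[Hashable, float],
              rel_tol: float = 1e-6) -> float:
    """Share of shared keys whose logged serving value differs from offline."""
    keys = served.keys() & recomputed.keys()
    if not keys:
        return 0.0
    mismatches = sum(
        1 for k in keys
        if not math.isclose(served[k], recomputed[k], rel_tol=rel_tol)
    )
    return mismatches / len(keys)


served_log = {("u1", "watch_time_30d"): 120.0, ("u2", "watch_time_30d"): 30.0}
offline_recompute = {("u1", "watch_time_30d"): 120.0, ("u2", "watch_time_30d"): 45.0}
rate = skew_rate(served_log, offline_recompute)
```

A non-zero rate on a feature that is supposed to share one definition points at a second codepath somewhere, which is exactly the failure described above.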

Feedback-loop collapse (Advanced)

A greedy model only shows what it favours, only learns from those impressions, and reinforces itself until coverage shrinks into a filter bubble. Fix: a 5–10% exploration budget (epsilon-greedy or bandits).

Cold-start ignored

New users and new items have no history, so CF/embeddings produce nothing. Fix: popularity-in-segment for users, content-based ANN for items, and progressive personalisation as data accrues, with the two cases handled separately.

Stale models

A model trained months ago misses preference drift and catalogue changes. Fix: retrain on a cadence matching drift (often daily) and monitor for distribution shift.

P99 blown by the ranker (Advanced)

A heavy model scoring N candidates × K per-item feature lookups explodes latency. Fix: cap candidate count, batch feature fetches into one multi-get, warm caches, and consider a lightweight pre-ranker.

Optimising the wrong metric (Advanced)

Maximising raw CTR rewards clickbait and accelerates feedback-loop collapse. Fix: optimise watch time/retention, validate with online A/B tests, and track diversity/coverage — not offline AUC alone.

Single candidate source

One channel (say embeddings) inherits its blind spots and amplifies its bias, collapsing diversity. Fix: blend ANN + CF + trending + fresh and dedupe.

Case studies

YouTube

YouTube — the canonical two-stage deep recommender

YouTube's recommendation system is the textbook reference for the two-stage architecture, described in their 2016 paper "Deep Neural Networks for YouTube Recommendations." Faced with a corpus of hundreds of millions of videos and a hard serving latency budget, they explicitly split the problem into candidate generation and ranking.

The candidate-generation network treats recommendation as extreme multiclass classification — it learns user and video embeddings and, at serving time, performs an approximate nearest-neighbour lookup of the user vector against video vectors to retrieve a few hundred candidates from the millions, in milliseconds. This is the two-tower/ANN retrieval pattern at planet scale. The ranking network then scores those few hundred candidates with a much richer feature set — including features that only make sense on a small candidate set, like the user's history with that specific channel, time since last watch, and impression-frequency capping — and orders them.

Two details the paper emphasises that map directly onto this pattern: they rank on expected watch time, not click probability, because clicks reward clickbait while watch time rewards genuine satisfaction (the "right metric" lesson); and they take great care with train/serve consistency and feature freshness, including using the age of a training example as a feature so the model doesn't over-favour stale popular videos. The architecture is the blueprint: cheap deep-retrieval candidate gen → rich deep ranking → careful objective and feature engineering.

Takeaway: YouTube's two networks — embedding/ANN candidate generation then a rich ranker on the small candidate set — are the canonical two-stage design, and ranking on watch-time rather than clicks is the textbook "optimise the right metric" lesson.

Spotify

Spotify Discover Weekly — blending CF, content (NLP/audio), and exploration

Spotify's Discover Weekly is a celebrated example of multi-channel candidate generation producing a personalised, diverse slate. It blends three complementary signals rather than betting on one: collaborative filtering over listening behaviour and the billions of user-curated playlists (tracks that co-occur in similar playlists are similar — exactly the taste-neighbourhood signal CF captures); NLP over text scraped from the web, blogs, and playlist titles to build a semantic profile of artists and tracks; and raw audio models (CNNs over spectrograms) that let brand-new tracks with no listening history be placed near sonically similar songs — a direct content-based cold-start solution for items.

That combination is why Discover Weekly can surface obscure and new music a pure CF system would never reach: CF supplies the collaborative backbone, while the content (text + audio) channels rescue the long tail and cold items. The weekly cadence is itself a form of structured exploration — a fresh slate every Monday keeps users discovering rather than looping the same favourites, and Spotify watches engagement with these recommendations to feed the next cycle.

For an interview, Discover Weekly is the perfect illustration of two pattern principles at once: never a single candidate source (CF + NLP + audio), and content-based embeddings solve item cold-start (audio models for tracks with no plays yet).

Takeaway: Discover Weekly blends collaborative filtering with content channels (NLP on text, CNNs on raw audio) so brand-new and long-tail tracks with no listening history can still be recommended — the multi-channel + content-cold-start principles in one product.

Netflix

Netflix — personalising the whole page, and optimising for retention

Netflix treats the home screen as a personalised ranking problem at two levels: which rows (genres/themes) to show, and which titles to rank within each row — both personalised per member. Their long-running work (rooted in the Netflix Prize era and far beyond it) established several lessons this pattern encodes.

First, the business metric matters more than offline accuracy: Netflix optimises for long-term member retention, not raw predicted-rating error, and has publicly noted that the rating-prediction accuracy that won the Prize was less valuable than improvements to ranking, page construction, and diversity. This is the "optimise the right metric" lesson taken to its conclusion — they evaluate via online A/B tests on retention, not offline AUC. Second, diversity and page construction are first-class: a row of ten near-identical thrillers underperforms a varied row, so re-ranking enforces diversity rather than pure relevance. Third, even the artwork/thumbnail is personalised and bandit-optimised — a contextual-bandit problem that surfaces different imagery to different members, a concrete instance of the exploration layer.

Netflix is the case study for the parts of recommendation beyond the model: optimise the real objective (retention), construct and diversify the whole page, run everything through online experiments, and use bandits for exploration even at the presentation layer.

Takeaway: Netflix optimises long-term retention (validated by online A/B tests, not offline AUC), personalises and diversifies the entire page rather than a single list, and uses contextual bandits down to artwork — the recommendation lessons that live beyond the ranking model.

Decision table

Recommendation splits cheap retrieval from expensive ranking, with a learning loop.

Layer	Default	Trade-off	Robust answer includes
Candidate generation	Multi-channel: ANN + CF + trending + fresh	Recall vs diversity	Channel mix, dedupe, weighting
Ranking	GBDT over the feature store	Quality vs latency	Feature freshness, p99 budget, cross features
Features	Feature store, shared train/serve	Freshness vs cost	Skew prevention, versioning, value logging
Exploration	Epsilon-greedy / bandit slot	Short-term CTR vs learning	Coverage, feedback-loop control
Cold start	Popularity-segment (users) + content ANN (items)	Less personalisation early	Separate user and item fallbacks

Never a single candidate source — blend channels and dedupe.
Optimise the business metric (watch time / retention), not just CTR or offline AUC.

Drills

Why not just one big model that scores everything?

Latency arithmetic. A rich ranking model costs tens of microseconds per (user, item) pair; even at ~10 µs per pair, scoring a 100M-item catalogue is about 1,000 seconds of compute per request, far beyond a ~50 ms budget. So you split it: candidate generation is cheap retrieval (ANN, CF lookups, trending) that narrows millions to ~1000 in milliseconds and optimises recall; ranking is expensive scoring over just those candidates and optimises precision. Two stages, each tuned for its job.
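
The arithmetic behind that answer, worked through under stated assumptions: 10 µs per pair (the low end of "tens of microseconds"), sequential scoring with no parallelism, and the catalogue and candidate sizes quoted in this pattern. The function name is illustrative.

```python
def scoring_seconds(n_candidates: int, micros_per_pair: float) -> float:
    """Sequential scoring time in seconds for n (user, item) pairs."""
    return n_candidates * micros_per_pair / 1_000_000


BUDGET_SECONDS = 0.050                                # ~50 ms p99

full_catalogue = scoring_seconds(100_000_000, 10.0)   # score every item
two_stage = scoring_seconds(1_000, 10.0)              # score ~1000 candidates

print(f"full catalogue: {full_catalogue:.0f} s")
print(f"two-stage ranker: {two_stage * 1000:.0f} ms")
```

Under these assumptions the single-model design needs about 1,000 seconds per request, while ranking 1000 candidates takes about 10 ms, leaving the rest of the 50 ms for retrieval and the feature fetch.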

Why blend multiple candidate channels instead of one?

A single source inherits its blind spots and amplifies its bias, collapsing diversity. Embedding ANN gives personalised similarity but can ossify; CF gives taste-neighbourhood signal but suffers popularity bias and cold-start; trending gives timeliness; a fresh channel gives new items a chance. Blending them and deduping into one pool keeps the slate diverse and resilient — and an item retrieved by several channels is a strong signal. "Candidate gen is recall, and recall is a portfolio, not a single bet."

Offline AUC is up but online CTR is flat. Diagnose in order.

(1) Train/serve feature skew — features computed differently in training (warehouse joins) vs serving (cached online values), so the model sees a shifted distribution online. (2) Selection bias — training on logged impressions teaches "what the old system showed", not true preference. (3) Stale / low-freshness features. (4) Metric mismatch — AUC ≠ CTR ≠ retention. Fix with a feature store (one definition, both paths), serving-time feature logging, shadow eval, and an A/B test rather than trusting offline AUC.

What is the feedback loop and why must you explore?

A recommender is closed-loop: served slates generate the impressions/clicks that train the next model, which shapes the next slates. A greedy model only gets feedback on items it shows, so it reinforces its existing beliefs and buries everything else — coverage shrinks into a filter bubble, and the logs become biased toward what was already shown ("no impressions, no labels"). A 5–10% exploration budget (epsilon-greedy or Thompson sampling) keeps fresh/uncertain items getting impressions, so the model keeps learning and the training data stays honest.

How is new-user cold-start different from new-item cold-start?

They need different fallbacks. A new user has no interaction history, so CF/embeddings produce nothing — fall back to popularity-in-segment/context (best items for their region/device/declared interests) and personalise progressively as interactions arrive. A new item has no clicks, but it does have content (title, category, media) — use a content-based embedding to place it near similar items in the ANN space so it's retrievable immediately, plus exploration impressions to earn real signal. Both then graduate to behavioural signal as data accrues.

Why optimise watch time/retention instead of clicks?

Raw CTR rewards clickbait and accelerates feedback-loop collapse: the model learns to bait clicks rather than satisfy users, and short-term clicks don't predict long-term value. YouTube ranks on expected watch time and Netflix optimises retention precisely because those align with genuine satisfaction. You validate with online A/B tests (not offline AUC, which can move opposite to the business metric) and track diversity and catalogue coverage alongside the headline number so you can see the loop's health.
