Load Balancer

The component that spreads incoming traffic across a pool of servers, removes unhealthy ones, and gives clients one stable endpoint — the foundation of horizontal scaling and availability.

Also worth naming: AWS Elastic Load Balancer (ALB / NLB) · NGINX · HAProxy · Envoy · Google Cloud Load Balancing

~25 min read·15 sections

A load balancer is what turns "one server" into "a fleet you can grow, shrink, and lose nodes from without anyone noticing." In most designs you draw one box at the front and reason about L4 vs L7 and the balancing algorithm.

What it is

A load balancer sits in front of a pool of servers and distributes incoming requests across them, so no single machine is overwhelmed and the system can scale horizontally. Clients connect to one stable endpoint (a DNS name / virtual IP); the load balancer decides which backend actually serves each request, continuously health-checks the pool, and stops routing to any server that fails — so a dead node is invisible to users.

That single role unlocks the two pillars of scalable systems. Scalability: you add capacity by adding identical stateless servers behind the LB rather than buying a bigger machine, and the LB spreads load across them. Availability: because the LB routes only to healthy nodes and you run several, any one server (or a whole availability zone) can fail without downtime — the LB drains it and the rest absorb the traffic.

The two decisions that matter in an interview are the layer and the algorithm. A layer-4 LB forwards TCP/UDP by IP and port (fast, protocol-agnostic, great for persistent connections like WebSockets); a layer-7 LB understands HTTP, so it can route by path/host, terminate TLS, and apply per-request logic. The algorithm (round-robin, least-connections, IP-hash) decides which backend gets each request. You will almost always have a load balancer in a design; the substance is choosing these and knowing that backends should be stateless so any server can handle any request.

When to reach for it

Reach for this when…

You run more than one server and need to spread traffic across them
You want horizontal scaling — add/remove identical servers behind one endpoint
You need availability — route around failed nodes and whole zones automatically
You need TLS termination, path/host routing, or sticky persistent connections

Not really this pattern when…

A truly single-instance service with no scaling or HA needs (rare in interviews)
You actually need API-gateway features — auth, rate limiting, request transformation (that is an API gateway, often in front of or fused with the LB)
Pure global geographic routing — that is DNS/anycast (often paired with, not replacing, the LB)

How it works

Three ideas cover almost every interview use:

1. It gives you one endpoint over a changing fleet, with health checks. Clients see a single stable address; behind it the LB tracks a pool of backends, health-checks them continuously, and routes only to healthy ones. You scale by registering more servers and recover from failure by the LB draining dead ones — all invisible to clients. For this to work the backends must be stateless, so any server can serve any request.

Architecture diagram· A load balancer spreads traffic across healthy backends

Clients hit one stable endpoint; the load balancer distributes requests across a pool of identical, stateless servers and removes any that fail health checks. Add or remove servers behind it without clients noticing.

2. L4 vs L7 — the layer decides what it can route on. A layer-4 LB operates on TCP/UDP: it forwards by IP and port without reading the payload, so it is fast, cheap, and protocol-agnostic — the right choice for raw throughput and persistent connections (WebSockets, gRPC streams) where you want a connection pinned to one backend. A layer-7 LB parses HTTP, so it can route by path or host, terminate TLS, do sticky sessions by cookie, retry failed requests, and apply per-request rules — at slightly higher cost. Rule of thumb: persistent connections → L4; flexible HTTP routing → L7.

Architecture diagram· L4 routes by connection; L7 routes by request content

A layer-4 load balancer forwards TCP/UDP by IP and port — fast and protocol-agnostic, ideal for persistent connections. A layer-7 load balancer understands HTTP, so it can route by path or host, terminate TLS, and apply per-request logic.

3. The algorithm decides who gets each request. Round-robin (even rotation) is the simple default; least-connections sends to the least-busy backend (better when request durations vary); least-response-time factors in latency; IP-hash / consistent-hash pins a client (or key) to a backend for cache locality or session affinity. Most systems are fine with round-robin or least-connections; you reach for hashing when you need stickiness.
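As a rough illustration of the three algorithm families named above, here is a minimal Python sketch. The class names, the `pick`/`release` methods, and the use of CRC32 for IP hashing are assumptions made for this example and do not describe how any particular load balancer is implemented.

```python
import itertools
import zlib


class RoundRobin:
    """Hand out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self, client_ip=None):
        return next(self._cycle)


class LeastConnections:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self, client_ip=None):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1  # caller must call release() when done
        return backend

    def release(self, backend):
        self.active[backend] -= 1


class IpHash:
    """Map a client IP to a fixed backend (simple modulo hashing)."""

    def __init__(self, backends):
        self.backends = list(backends)

    def pick(self, client_ip):
        # crc32 is stable across processes, unlike Python's built-in hash().
        digest = zlib.crc32(client_ip.encode("utf-8"))
        return self.backends[digest % len(self.backends)]


pool = ["app-1", "app-2", "app-3"]

rr = RoundRobin(pool)
print([rr.pick() for _ in range(4)])  # app-1, app-2, app-3, app-1

lc = LeastConnections(pool)
first, second = lc.pick(), lc.pick()
lc.release(first)
print(lc.pick())  # app-1 again: it is back to zero connections

ih = IpHash(pool)
print(ih.pick("203.0.113.7") == ih.pick("203.0.113.7"))  # True
```

Note that plain modulo hashing remaps most clients whenever the pool size changes, which is why consistent hashing is usually preferred when stickiness has to survive scaling events.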

Two more facts worth a sentence: load balancers are themselves made highly available (redundant pairs, multiple zones) so the LB is not a single point of failure, and an LB is not an API gateway — it distributes load, while a gateway adds auth, rate limiting, and transformation (the two are often layered together).

Performance envelope

Load balancer characteristics — the numbers to quote.

Dimension	L4 vs L7	Why it matters
Throughput	L4: very high (millions conn/s); L7: high	L4 is lighter — no payload parsing
Routing	L4: IP/port; L7: HTTP path, host, header	L7 enables path-based microservice routing
Persistent conns	L4 ideal; L7 supports them with configuration	WebSockets/gRPC streams favor L4
TLS termination	L7 (and modern L4) terminates TLS	Offload crypto from backends
Health checks	Both — remove unhealthy backends	Failure routing is automatic
Algorithm	Round-robin / least-conn / IP-hash	Match to request variance + stickiness needs

Capabilities in interviews

Horizontal scaling

Add identical stateless servers behind one endpoint and the LB spreads load across them.

The foundation of scaling out. Instead of a bigger box, you run N identical app servers behind the LB and grow N with traffic:

clients → LB → [app-1, app-2, … app-N]   (autoscaling group)

Because the servers are stateless (session/state lives in Redis or a DB, not on the box), any server handles any request, so the LB can route freely and an autoscaler can add/remove instances based on CPU or request rate. This is what "the service is horizontally scaled" means in a design.
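To make "add/remove instances based on CPU" concrete, here is one possible sizing rule, sketched in Python. It assumes a proportional target-tracking policy (scale the pool so average CPU returns to a target); the function name, the 60% target, and the min/max bounds are illustrative assumptions, not values from any specific autoscaler.

```python
import math


def desired_replicas(current, cpu_pct, target_pct=60, min_replicas=2, max_replicas=20):
    """Proportional scaling: size the pool so average CPU lands near target_pct.

    cpu_pct and target_pct are integer percentages (for example 90 means 90%).
    """
    raw = math.ceil(current * cpu_pct / target_pct)
    # Clamp so the pool never shrinks below the HA minimum or grows unbounded.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 90))  # 6: running hot, scale out
print(desired_replicas(4, 30))  # 2: mostly idle, scale in
print(desired_replicas(4, 10))  # 2: clamped at the minimum
```

Keeping the minimum at two or more means scale-in never leaves the tier on a single server.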

Choose this variant when

Any stateless service tier behind one endpoint
Autoscaling on load
Growing capacity without bigger machines

High availability & failover

Health-check the pool and route around failed servers and whole zones automatically.

The LB continuously probes each backend and removes failing ones from rotation, so a crashed server or an instance running a bad deploy is drained without user impact:

health check /healthz every 5s → mark unhealthy → stop routing → traffic absorbed by the rest

Spread backends across multiple availability zones and the LB routes around a whole zone outage. The LB itself is run as a redundant, multi-zone service so it is not a single point of failure. This is the mechanism behind "no single point of failure" in most designs.
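The probe, mark, stop-routing flow above can be sketched as a small state machine. This is a minimal Python illustration under stated assumptions: the `BackendHealth` class, the threshold values (3 failures to remove, 2 successes to restore), and the `routable` helper are hypothetical, and the actual HTTP probe of `/healthz` is replaced by a boolean passed in by the caller.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks one backend's state from a stream of probe results."""

    unhealthy_threshold: int = 3  # consecutive failures before removal (assumed)
    healthy_threshold: int = 2    # consecutive successes before re-adding (assumed)
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed in one probe result; return the resulting healthy flag."""
        if probe_ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


def routable(pool: dict) -> list:
    """Backends the balancer may still send traffic to."""
    return [name for name, state in pool.items() if state.healthy]


pool = {"app-1": BackendHealth(), "app-2": BackendHealth()}

# One failed probe is a blip: app-1 stays in rotation.
pool["app-1"].record(False)
print(routable(pool))  # ['app-1', 'app-2']

# Three consecutive failures: app-1 is removed.
pool["app-1"].record(False)
pool["app-1"].record(False)
print(routable(pool))  # ['app-2']

# Two consecutive successes: app-1 returns.
pool["app-1"].record(True)
pool["app-1"].record(True)
print(routable(pool))  # ['app-1', 'app-2']
```

Requiring several consecutive results in each direction is what keeps a single transient failure from flapping a healthy server out of the pool.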

Choose this variant when

Surviving instance and zone failures
Zero-downtime deploys (drain + replace)
Any availability SLA above a single box

L7 routing & TLS termination

Route by path/host to different services and terminate TLS at the edge of your fleet.

An L7 load balancer reads HTTP, so one endpoint can fan out to many services and offload crypto:

/api/*   → API service
/img/*   → media service
/ws/*    → realtime service
TLS terminated at the LB; plain HTTP to backends inside the VPC

Path/host routing lets a single public endpoint front a microservice backend, and terminating TLS at the LB frees backends from the handshake cost while centralizing certificate management. This is the everyday shape for a web/API tier.
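The routing table above amounts to a prefix match on the request path. A minimal Python sketch follows; the service names, the `DEFAULT_SERVICE` fallback, and the longest-prefix-wins rule are assumptions for illustration, since real L7 balancers each have their own rule syntax and precedence.

```python
# (prefix, service) pairs mirroring the routing table above.
ROUTES = [
    ("/api/", "api-service"),
    ("/img/", "media-service"),
    ("/ws/", "realtime-service"),
]
DEFAULT_SERVICE = "web-service"  # hypothetical fallback for unmatched paths


def route(path: str) -> str:
    """Return the service for a path, preferring the longest matching prefix."""
    best_prefix, best_service = "", DEFAULT_SERVICE
    for prefix, service in ROUTES:
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_service = prefix, service
    return best_service


print(route("/api/users/42"))   # api-service
print(route("/img/logo.png"))   # media-service
print(route("/ws/chat"))        # realtime-service
print(route("/index.html"))     # web-service
```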

Choose this variant when

Microservice routing behind one endpoint
Centralized TLS / certificate management
Header/path-based routing, blue-green, canaries

Sticky sessions & persistent connections

Pin a client to a backend for session affinity or to hold a long-lived connection.

Some workloads need a client to stay on one backend. Sticky sessions (cookie or IP-hash) keep a user on the server holding their in-memory session; persistent connections (WebSockets, gRPC streams, SSE) must stay on the backend that holds the open socket:

WebSocket: L4 LB + consistent-hash on user_id → reconnect lands on the same gateway

For persistent connections an L4 LB is the natural fit, often with consistent hashing so reconnects return to the owning node. Prefer externalizing session state (so you do not need stickiness) — but when you hold connections, the LB has to support it.
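To show why consistent hashing suits reconnects, here is a small hash-ring sketch in Python. The `HashRing` class, the gateway names, the choice of SHA-256, and the 100 virtual nodes per gateway are all assumptions for this example. The property worth noticing is that removing one gateway only moves the users that gateway owned.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("no nodes in ring")
        # First point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_left(self._ring, (_hash(key),)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["gw-1", "gw-2", "gw-3"])
before = {f"user-{i}": ring.lookup(f"user-{i}") for i in range(1000)}

# A reconnect for the same user_id lands on the same gateway.
print(ring.lookup("user-7") == before["user-7"])  # True

# Losing gw-3 only moves the users that were on gw-3.
ring.remove("gw-3")
moved = [u for u, owner in before.items() if ring.lookup(u) != owner]
print(all(before[u] == "gw-3" for u in moved))  # True
```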

Choose this variant when

WebSocket / gRPC-stream / SSE fleets
Legacy in-memory sessions needing affinity
Cache-locality routing by key

Operating knobs

L4 vs L7

L4 forwards TCP/UDP by IP/port — fastest, protocol-agnostic, best for persistent connections and raw throughput. L7 parses HTTP — path/host routing, TLS termination, cookie stickiness, retries, header rules — at slightly higher cost. Rule of thumb: persistent connections or maximum throughput → L4; flexible HTTP routing and TLS → L7. Many stacks use both (L4 at the edge, L7 per service).

Balancing algorithm

Round-robin for uniform requests; least-connections when request durations vary (avoids piling onto a backend stuck on slow requests); least-response-time to factor latency; IP/consistent-hash for session affinity or cache locality. Default to round-robin or least-connections and reach for hashing only when you need stickiness.

Health checks & draining

Tune health-check path, interval, and unhealthy/healthy thresholds so failures are caught fast without flapping on a transient blip. Connection draining lets in-flight requests finish before a server is removed (for deploys/scale-in), enabling zero-downtime rollouts.

Statelessness & session strategy

The LB works cleanly only if backends are stateless — store session/state in Redis or a DB so any server can serve any request. If you must keep in-memory session, use sticky sessions, accepting that losing a server loses those sessions. Externalizing state is almost always the better design.
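A short Python sketch of what externalized session state buys you. `SessionStore` here is an in-process stand-in for Redis or a database, not a real client, and `AppServer` with its `add_to_cart` method is a hypothetical handler invented for the example.

```python
import copy


class SessionStore:
    """In-process stand-in for an external store such as Redis or a database."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return copy.deepcopy(self._data.get(session_id, {}))

    def put(self, session_id, session):
        self._data[session_id] = copy.deepcopy(session)


class AppServer:
    """Stateless server: every request loads and saves session externally."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared and external; nothing is kept on the server

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        session.setdefault("cart", []).append(item)
        self.store.put(session_id, session)
        return f"{self.name} cart={session['cart']}"


store = SessionStore()
app1, app2 = AppServer("app-1", store), AppServer("app-2", store)

# The LB may send consecutive requests from one user to different servers.
print(app1.add_to_cart("sess-abc", "book"))  # app-1 cart=['book']
print(app2.add_to_cart("sess-abc", "pen"))   # app-2 cart=['book', 'pen']

# If app-1 disappears, the session survives because it never lived there.
del app1
print(store.get("sess-abc"))  # {'cart': ['book', 'pen']}
```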

Versus the alternatives

Load balancer vs adjacent components.

Dimension	Load balancer	API gateway	CDN
Primary job	Distribute load across servers	Auth, rate limit, route, transform	Cache content near users
Layer	L4 (transport) or L7 (HTTP)	L7 (application)	Edge (HTTP)
Adds logic?	Minimal — routing + health	Cross-cutting request logic	Caching + edge compute
Position	In front of a server pool	In front of services (often after LB)	In front of everything, globally
Caches?	No	Sometimes (responses)	Yes — that is its job

Failure modes & gotchas

Stateful backends behind the LB

If servers keep session or other state in local memory, a user's requests must always hit the same box — defeating free distribution and losing the session when that box dies. Make backends stateless (state in Redis/DB) so any server serves any request; use sticky sessions only as a last resort.

The load balancer as a single point of failure

A single LB instance is a SPOF — if it dies, everything behind it is unreachable. Run the LB as a redundant, multi-AZ service (cloud LBs do this for you; self-managed needs an HA pair + failover). The thing that provides availability must itself be available.

Round-robin onto uneven request loads (Advanced)

Round-robin assumes requests cost the same; with mixed cheap/expensive requests it can pile slow work onto a backend while others idle. Use least-connections (or least-response-time) when request durations vary, so busy backends receive fewer new requests.
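A toy simulation makes the pile-up visible. This Python sketch assumes two backends, one request arriving per tick, and a deliberately adversarial workload that alternates slow (10-tick) and fast (1-tick) requests; the `simulate` function and all of these numbers are invented for illustration.

```python
def simulate(policy, durations, n_backends=2):
    """One request arrives per tick; return the peak in-flight count on any backend."""
    finish_times = [[] for _ in range(n_backends)]  # finish tick of each open request
    peak = 0
    for tick, duration in enumerate(durations):
        # Drop requests that have completed by this tick.
        for b in range(n_backends):
            finish_times[b] = [t for t in finish_times[b] if t > tick]
        in_flight = [len(f) for f in finish_times]
        if policy == "round_robin":
            target = tick % n_backends
        else:  # "least_connections"
            target = in_flight.index(min(in_flight))
        finish_times[target].append(tick + duration)
        peak = max(peak, len(finish_times[target]))
    return peak


# Alternating slow and fast requests: round-robin sends every slow one to backend 0.
workload = [10, 1] * 10

rr_peak = simulate("round_robin", workload)
lc_peak = simulate("least_connections", workload)
print(rr_peak, lc_peak)  # 5 3 for this workload
```

With this input, round-robin stacks five slow requests on one backend while the other is nearly idle; least-connections keeps the peak at three by splitting the slow work across both.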

Per-message round-robin in front of persistent connections (Advanced)

WebSocket/gRPC-stream traffic pinned to one socket cannot be round-robined per message — each message must reach the backend holding the connection. Use an L4 LB (and consistent hashing for reconnects) and route messages to the owning node via a separate fan-out tier, not the LB.

Health checks too lax or too aggressive (Advanced)

Too lax and the LB keeps sending traffic to a half-dead server; too aggressive and it flaps healthy servers out on a transient blip, shrinking capacity. Tune the path, interval, and thresholds, and use a real readiness endpoint that reflects the server's actual ability to serve.

In production

Google

Maglev — software load balancers serving Google's traffic

Google fronts its public services with Maglev, a software network load balancer running on commodity machines rather than expensive specialized hardware. A pool of Maglev machines shares a service IP via ECMP routing, and each one uses consistent hashing to map a connection to a backend, so even as Maglev machines or backends come and go, existing connections stay pinned to the same backend (the property persistent connections need). Google reports that a single Maglev machine can saturate a 10 Gbps link with small packets.

The lessons map directly to the interview: load balancing is the layer that turns "a fleet of identical backends" into "one scalable, available endpoint"; consistent hashing gives connection stability without per-connection state; and you make the balancer itself horizontally scalable and redundant (a pool sharing the IP) so it is never a single point of failure.

Takeaway: Make the load-balancer tier itself a scalable, redundant pool (shared IP + consistent hashing) so it routes around failure and is never the single point of failure.

AWS

Elastic Load Balancing — scaling from one box to millions of requests

AWS's Elastic Load Balancing is the managed embodiment of this component, and the ALB (L7) vs NLB (L4) split is the exact decision from this page. The Application Load Balancer does HTTP path/host routing, TLS termination, and sticky sessions — the default for web/API tiers. The Network Load Balancer operates at L4 (TCP/UDP), handles millions of requests per second with ultra-low latency, and is the choice for persistent connections (WebSockets, gRPC) and raw throughput.

Both are themselves distributed, multi-AZ services with health checks and connection draining built in — so a backend instance or an entire availability zone can fail and the LB simply stops routing to it, with no single LB instance to lose. The takeaway: in a design you draw the LB as one box, but mentally it's a redundant multi-AZ service, and you choose L7 vs L4 by whether you need HTTP routing or persistent-connection throughput.

Takeaway: Pick L7 (ALB) for HTTP routing + TLS, L4 (NLB) for persistent connections + raw throughput — and rely on the LB being a redundant multi-AZ service with health checks, not one box.

Good vs bad answer

Interviewer probe

“Your API runs on one server that is maxing out. How do you scale it and keep it available?”

Weak answer

"Upgrade to a bigger server with more CPU and memory so it can handle the load, and take regular backups in case it goes down."

Strong answer

"Scale out, not up: put a load balancer in front and run multiple identical, stateless app servers behind it — session and state go in Redis/the DB so any server can serve any request. The LB gives clients one stable endpoint, spreads requests across the pool (least-connections, since API request costs vary), and health-checks each backend so a dead instance is drained automatically. I'd spread the servers across multiple availability zones so the LB routes around a whole zone outage, and run the LB itself as a redundant multi-AZ service so it isn't a single point of failure. An autoscaling group adds and removes servers based on load. This is an L7 LB so it can terminate TLS and route /api paths, with connection draining for zero-downtime deploys. A single bigger server just raises the ceiling and stays a single point of failure — horizontal scaling behind an LB gives me both elastic capacity and availability."

Why it wins: Chooses scale-out with a stateless tier, picks the algorithm and layer with reasons, makes the design multi-AZ and the LB itself HA, adds autoscaling and zero-downtime draining, and explains why scaling up is the wrong lever.

Interview playbook

1–2 min: usually a quick, confident component; go deeper on persistent-connection or HA questions

When it comes up

Any tier that needs to handle more than one server's worth of traffic
"How does this scale / stay available?" — the LB is the first answer
WebSocket / persistent-connection fleets (L4 + stickiness)
Microservice routing or TLS termination behind one endpoint (L7)

Order of reveal

1. One endpoint, stateless pool. A load balancer fronts a pool of identical stateless servers; clients see one endpoint and I scale by adding servers.
2. Health checks for availability. It health-checks the pool and drains failed nodes, and I spread servers across AZs so a zone can fail.
3. Pick the layer. L7 for HTTP path/host routing and TLS; L4 for raw throughput or persistent connections like WebSockets.
4. Pick the algorithm. Least-connections when request costs vary; round-robin for uniform; consistent-hash for stickiness.
5. Make the LB itself HA. Run the LB redundant across zones so the thing providing availability is not a single point of failure.

Signature phrases

“Scale out behind a load balancer, not up.” — The core horizontal-scaling instinct.
“Backends are stateless, so any server serves any request.” — The precondition that makes load balancing work.
“Persistent connections → L4; flexible HTTP routing → L7.” — The crisp layer decision.
“Health checks plus multi-AZ means a node or zone can fail invisibly.” — Ties the LB to availability.

Likely follow-ups

“L4 or L7: how do you choose?”

By what I need to route on and whether connections are persistent. L7 if I want HTTP-aware features — route by path or host to different services, terminate TLS, sticky sessions by cookie, retries, header-based canaries — which is the default for a web/API tier. L4 if I need maximum throughput with minimal overhead, or I am carrying persistent connections like WebSockets or gRPC streams where I want the connection pinned to one backend and the LB just forwarding TCP. Many architectures use both: an L4 LB at the edge for raw ingress and L7 routing per service behind it.

“How does the load balancer not become a single point of failure?”

You run it redundantly. Managed cloud load balancers (ALB/NLB, Google's) are themselves distributed, multi-AZ services with a stable DNS/anycast front, so there is no single instance to lose. A self-managed balancer (NGINX/HAProxy) runs as an active-passive or active-active pair across zones with health-checked failover (e.g. a floating IP / keepalived), with DNS or anycast in front for the public entrypoint. The principle: the component that provides availability has to be at least as available as what it fronts, so you never deploy exactly one of it.

“A WebSocket service sits behind the LB. What changes?”

A persistent connection pins a client to one backend for the connection's life, so you cannot round-robin per message. I use an L4 load balancer (or session affinity) to place the connection, and consistent-hash on user_id so a reconnect lands on the same gateway. Per-message delivery to a recipient is not the LB's job — a separate fan-out tier looks up which gateway holds that user's connection (via a presence registry) and routes the message there. And because rolling deploys drop connections, clients reconnect with exponential backoff and jitter. The LB places connections; it does not move messages between them.
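The client-side reconnect policy mentioned above can be sketched in a few lines of Python. This uses the "full jitter" variant (a uniformly random delay up to an exponentially growing ceiling); the function name and the `base` and `cap` values are assumptions picked for the example.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random):
    """Full-jitter exponential backoff.

    The delay for attempt n is uniform in [0, min(cap, base * 2**n)] seconds.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Seeded generator so the example is repeatable; real clients should not seed.
for n, delay in enumerate(backoff_delays(6, rng=random.Random(42))):
    print(f"attempt {n}: wait {delay:.2f}s before reconnecting")
```

The jitter matters as much as the backoff: without it, every client dropped by the same rolling deploy would reconnect at the same instants and hit the remaining gateways in synchronized waves.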

Worked example

Setup. A single API server is maxing out under growing traffic. Scale it to handle 10× the load and survive an instance — or a whole availability zone — failing, with zero-downtime deploys.

The move. Scale out, not up: put a load balancer in front of a pool of identical stateless app servers. Session and state live in Redis / the database, not on the box, so any server can serve any request — which is the precondition that lets the LB route freely and an autoscaler add/remove instances. Clients see one stable endpoint; the LB spreads requests across the pool.

Layer + algorithm. This is an HTTP API, so an L7 load balancer — it terminates TLS (offloading crypto from the backends) and can route /api paths to the right service. For the algorithm I pick least-connections rather than round-robin, because API request costs vary and least-connections steers new requests away from a backend stuck on slow ones.

Availability. I spread the servers across multiple availability zones and configure health checks (a real /healthz readiness probe), so the LB drains a dead instance automatically and routes around a whole AZ outage. Critically, the LB itself is run redundant across zones (managed ELB, or an active-active HAProxy pair) so the thing providing availability isn't a single point of failure.

Elasticity + deploys. An autoscaling group grows the pool on CPU/request-rate. Connection draining lets in-flight requests finish before an instance is removed, enabling zero-downtime rolling deploys.
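Connection draining can be pictured as two rules: a draining backend gets no new requests, and it leaves the pool only once its in-flight count reaches zero. The Python sketch below is a minimal model of that idea; the `Backend` and `Pool` classes and their method names are hypothetical, and a real balancer would also enforce a drain timeout.

```python
class Backend:
    def __init__(self, name):
        self.name = name
        self.in_flight = 0
        self.draining = False


class Pool:
    def __init__(self, backends):
        self.backends = list(backends)

    def pick(self):
        """Route a new request to the least-loaded backend that is not draining."""
        candidates = [b for b in self.backends if not b.draining]
        if not candidates:
            raise RuntimeError("no backends accepting traffic")
        backend = min(candidates, key=lambda b: b.in_flight)
        backend.in_flight += 1
        return backend

    def finish(self, backend):
        backend.in_flight -= 1
        self._reap()

    def drain(self, backend):
        """Stop sending new requests; existing ones are allowed to complete."""
        backend.draining = True
        self._reap()

    def _reap(self):
        # Remove draining backends only once their last request has completed.
        self.backends = [
            b for b in self.backends if not (b.draining and b.in_flight == 0)
        ]


a, b = Backend("app-1"), Backend("app-2")
pool = Pool([a, b])

busy = pool.pick()   # lands on app-1
pool.drain(a)        # deploy starts: app-1 stops taking new work
print(pool.pick().name)                   # app-2
print([x.name for x in pool.backends])    # ['app-1', 'app-2'], still finishing
pool.finish(busy)    # in-flight request completes
print([x.name for x in pool.backends])    # ['app-2'], safe to replace app-1
```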

What breaks. The classic mistake is stateful backends — in-memory sessions force sticky routing and lose state when a box dies; externalizing state to Redis fixes it. For WebSockets I'd switch to an L4 LB with consistent hashing so a reconnect lands on the gateway holding the socket.

The result. Elastic horizontal capacity behind one endpoint, automatic failure routing across instances and zones, zero-downtime deploys, and an LB tier that is itself highly available — the foundation of every scalable, available service.

Cheat sheet

•Load balancer = one stable endpoint over a pool of stateless servers, with health checks.
•Enables horizontal scaling (add servers) and availability (route around failed nodes/zones).
•L4 = TCP/UDP by IP:port — fast, persistent connections (WebSockets). L7 = HTTP path/host, TLS, stickiness.
•Algorithms: round-robin (uniform), least-connections (varied cost), consistent-hash (stickiness/locality).
•Backends must be stateless — session/state in Redis or DB, not local memory.
•Health checks + multi-AZ backends = a node or whole zone can fail invisibly.
•Run the LB itself redundantly so it is not a single point of failure.
•An LB distributes load; an API gateway adds auth/rate-limit/transform — often layered together.

Drills

Why must servers behind a load balancer be stateless?

Because the load balancer is free to send any request to any backend. If a server keeps session or other state in local memory, then a user's later requests must return to that exact server (forcing sticky sessions and defeating even distribution), and if that server dies the state is lost. Keeping servers stateless — with session/state in Redis or a database — lets the LB route freely, lets an autoscaler add/remove instances at will, and makes a server failure a non-event. Statelessness is the precondition that makes horizontal scaling clean.

Interviewer: "Round-robin or least-connections for an API with mixed cheap and expensive endpoints?"

Least-connections. Round-robin assumes every request costs roughly the same, so it keeps handing new requests to a backend even if that backend is stuck processing several slow/expensive requests while others sit idle — load gets uneven. Least-connections routes each new request to the backend currently handling the fewest, which naturally steers traffic away from busy servers and toward idle ones when request durations vary. Least-response-time goes further by factoring in observed latency. Round-robin is fine only when requests are uniform.

Where does an API gateway fit relative to the load balancer?

They do different jobs and are often layered. The load balancer distributes traffic across server instances and handles health/failover — it is about where a request goes. The API gateway is an L7 entrypoint that applies cross-cutting request logic — authentication, rate limiting, routing to the right microservice, request/response transformation. A common arrangement is clients → load balancer → API gateway → services, or a managed gateway that has load balancing built in. In an interview, draw the LB for scaling/availability and the gateway when you need auth/rate-limit/routing; do not conflate them.
