Load Balancer
The component that spreads incoming traffic across a pool of servers, removes unhealthy ones, and gives clients one stable endpoint — the foundation of horizontal scaling and availability.
Also worth naming: AWS Elastic Load Balancing (ALB / NLB) · NGINX · HAProxy · Envoy · Google Cloud Load Balancing
A load balancer is what turns "one server" into "a fleet you can grow, shrink, and lose nodes from without anyone noticing." In most designs you draw one box at the front and reason about L4 vs L7 and the balancing algorithm.
What it is
A load balancer sits in front of a pool of servers and distributes incoming requests across them, so no single machine is overwhelmed and the system can scale horizontally. Clients connect to one stable endpoint (a DNS name / virtual IP); the load balancer decides which backend actually serves each request, continuously health-checks the pool, and stops routing to any server that fails — so a dead node is invisible to users.
That single role unlocks the two pillars of scalable systems. Scalability: you add capacity by adding identical stateless servers behind the LB rather than buying a bigger machine, and the LB spreads load across them. Availability: because the LB routes only to healthy nodes and you run several, any one server (or a whole availability zone) can fail without downtime — the LB drains it and the rest absorb the traffic.
The two decisions that matter in an interview are the layer and the algorithm. A layer-4 LB forwards TCP/UDP by IP and port (fast, protocol-agnostic, great for persistent connections like WebSockets); a layer-7 LB understands HTTP, so it can route by path/host, terminate TLS, and apply per-request logic. The algorithm (round-robin, least-connections, IP-hash) decides which backend gets each request. You will almost always have a load balancer in a design; the substance is choosing these and knowing that backends should be stateless so any server can handle any request.
When to reach for it
Reach for this when…
- You run more than one server and need to spread traffic across them
- You want horizontal scaling — add/remove identical servers behind one endpoint
- You need availability — route around failed nodes and whole zones automatically
- You need TLS termination, path/host routing, or sticky persistent connections
Not really this pattern when…
- A truly single-instance service with no scaling or HA needs (rare in interviews)
- You actually need API-gateway features — auth, rate limiting, request transformation (that is an API gateway, often in front of or fused with the LB)
- Pure global geographic routing — that is DNS/anycast (often paired with, not replacing, the LB)
How it works
Three ideas cover almost every interview use:
1. It gives you one endpoint over a changing fleet, with health checks. Clients see a single stable address; behind it the LB tracks a pool of backends, health-checks them continuously, and routes only to healthy ones. You scale by registering more servers and recover from failure by the LB draining dead ones — all invisible to clients. For this to work the backends must be stateless, so any server can serve any request.
Clients hit one stable endpoint; the load balancer distributes requests across a pool of identical, stateless servers and removes any that fail health checks. Add or remove servers behind it without clients noticing.
2. L4 vs L7 — the layer decides what it can route on. A layer-4 LB operates on TCP/UDP: it forwards by IP and port without reading the payload, so it is fast, cheap, and protocol-agnostic — the right choice for raw throughput and persistent connections (WebSockets, gRPC streams) where you want a connection pinned to one backend. A layer-7 LB parses HTTP, so it can route by path or host, terminate TLS, do sticky sessions by cookie, retry failed requests, and apply per-request rules — at slightly higher cost. Rule of thumb: persistent connections → L4; flexible HTTP routing → L7.
A layer-4 load balancer forwards TCP/UDP by IP and port — fast and protocol-agnostic, ideal for persistent connections. A layer-7 load balancer understands HTTP, so it can route by path or host, terminate TLS, and apply per-request logic.
3. The algorithm decides who gets each request. Round-robin (even rotation) is the simple default; least-connections sends to the least-busy backend (better when request durations vary); least-response-time factors in latency; IP-hash / consistent-hash pins a client (or key) to a backend for cache locality or session affinity. Most systems are fine with round-robin or least-connections; you reach for hashing when you need stickiness.
Two more facts worth a sentence: load balancers are themselves made highly available (redundant pairs, multiple zones) so the LB is not a single point of failure, and an LB is not an API gateway — it distributes load, while a gateway adds auth, rate limiting, and transformation (the two are often layered together).
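The "one endpoint over a changing, health-checked pool" idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular load balancer is implemented: the `Pool` class, its method names, and the `app-N` backend names are all hypothetical, and round-robin is used only because it is the simplest algorithm mentioned above.

```python
class Pool:
    """One stable endpoint in front of a changing set of backends (hypothetical sketch)."""

    def __init__(self):
        self._healthy = {}  # backend name -> did it pass its last health check?
        self._turn = 0      # monotonically increasing round-robin counter

    def register(self, name):
        # Scaling out: a new server joins the pool and is assumed healthy.
        self._healthy[name] = True

    def mark(self, name, healthy):
        # Called by whatever runs the health checks.
        self._healthy[name] = healthy

    def pick(self):
        """Round-robin over the backends that are currently healthy."""
        live = [name for name, ok in self._healthy.items() if ok]
        if not live:
            raise RuntimeError("no healthy backends")
        choice = live[self._turn % len(live)]
        self._turn += 1
        return choice


pool = Pool()
for name in ("app-1", "app-2", "app-3"):
    pool.register(name)

before = [pool.pick() for _ in range(3)]  # rotates through all three servers
pool.mark("app-2", healthy=False)         # failed health check: stop routing to it
after = [pool.pick() for _ in range(4)]   # only app-1 and app-3 receive traffic
```

Callers only ever talk to `pool.pick()`, which is the code-level analogue of clients seeing a single stable address while the set of servers behind it changes.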
Performance envelope
Load balancer characteristics — the numbers to quote.
| Dimension | L4 vs L7 | Why it matters |
|---|---|---|
| Throughput | L4: very high (millions conn/s); L7: high | L4 is lighter — no payload parsing |
| Routing | L4: IP/port; L7: HTTP path, host, header | L7 enables path-based microservice routing |
| Persistent connections | L4 ideal; L7 also supports them with configuration | WebSockets/gRPC streams favour L4 |
| TLS termination | L7 (and modern L4) terminates TLS | Offload crypto from backends |
| Health checks | Both — remove unhealthy backends | Failure routing is automatic |
| Algorithm | Round-robin / least-conn / IP-hash | Match to request variance + stickiness needs |
Capabilities in interviews
Horizontal scaling
Add identical stateless servers behind one endpoint and the LB spreads load across them.
The foundation of scaling out. Instead of a bigger box, you run N identical app servers behind the LB and grow N with traffic:
clients → LB → [app-1, app-2, … app-N] (autoscaling group)
Because the servers are stateless (session/state lives in Redis or a DB, not on the box), any server handles any request, so the LB can route freely and an autoscaler can add/remove instances based on CPU or request rate. This is what "the service is horizontally scaled" means in a design.
Choose this variant when
- Any stateless service tier behind one endpoint
- Autoscaling on load
- Growing capacity without bigger machines
High availability & failover
Health-check the pool and route around failed servers and whole zones automatically.
The LB continuously probes each backend and removes failing ones from rotation, so a crashed server or a bad deploy instance is drained without user impact:
health check /healthz every 5s → mark unhealthy → stop routing → traffic absorbed by the rest
Spread backends across multiple availability zones and the LB routes around a whole zone outage. The LB itself is run as a redundant, multi-zone service so it is not a single point of failure. This is the mechanism behind "no single point of failure" in most designs.
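The probe-then-mark-unhealthy flow can be sketched as a small state tracker. This assumes the balancer flips a backend's state only after several consecutive probe results, which is a common arrangement but not something the text above specifies; the `HealthTracker` class and the threshold values (3 failures, 2 successes) are illustrative assumptions.

```python
class HealthTracker:
    """Flip a backend's state only after N consecutive disagreeing probes (sketch)."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold  # assumed value
        self.healthy_threshold = healthy_threshold      # assumed value
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok):
        """Feed in one probe result; return whether the backend is in rotation."""
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state: reset the streak
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
# A single failed probe followed by a success does not remove the backend.
blip = [tracker.record(ok) for ok in (False, True)]
# Three failures in a row do.
outage = [tracker.record(ok) for ok in (False, False, False)]
```

Requiring a streak trades detection speed for stability: with a 5-second interval and a threshold of 3, a dead server keeps receiving traffic for roughly 15 seconds, but a one-off slow probe does not shrink the pool.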
Choose this variant when
- Surviving instance and zone failures
- Zero-downtime deploys (drain + replace)
- Any availability SLA above a single box
L7 routing & TLS termination
Route by path/host to different services and terminate TLS at the edge of your fleet.
An L7 load balancer reads HTTP, so one endpoint can fan out to many services and offload crypto:
/api/* → API service
/img/* → media service
/ws/* → realtime service
TLS terminated at the LB; plain HTTP to backends inside the VPC
Path/host routing lets a single public endpoint front a microservice backend, and terminating TLS at the LB frees backends from the handshake cost while centralizing certificate management. This is the everyday shape for a web/API tier.
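The path rules above amount to a prefix lookup. The sketch below uses longest-prefix matching, which is one reasonable way to resolve overlapping rules and is an assumption here; the pool labels and the `web-service` fallback are hypothetical names standing in for the services listed above.

```python
# Path prefix -> backend pool. Labels are hypothetical stand-ins.
ROUTES = {
    "/api/": "api-service",
    "/img/": "media-service",
    "/ws/": "realtime-service",
}


def route(path, routes=ROUTES, default="web-service"):
    """Return the backend pool for a request path using longest-prefix match."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        return default  # assumed catch-all pool for unmatched paths
    return routes[max(matches, key=len)]


api_pool = route("/api/users/42")
media_pool = route("/img/logo.png")
fallback_pool = route("/about")
```

Only an L7 balancer can make this decision, because the path is inside the HTTP request and an L4 balancer never reads it.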
Choose this variant when
- Microservice routing behind one endpoint
- Centralized TLS / certificate management
- Header/path-based routing, blue-green, canaries
Sticky sessions & persistent connections
Pin a client to a backend for session affinity or to hold a long-lived connection.
Some workloads need a client to stay on one backend. Sticky sessions (cookie or IP-hash) keep a user on the server holding their in-memory session; persistent connections (WebSockets, gRPC streams, SSE) must stay on the backend that holds the open socket:
WebSocket: L4 LB + consistent-hash on user_id → reconnect lands on the same gateway
For persistent connections an L4 LB is the natural fit, often with consistent hashing so reconnects return to the owning node. Prefer externalizing session state (so you do not need stickiness), but when you hold connections, the LB has to support it.
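Consistent hashing on a key such as `user_id` can be sketched with a hash ring. This is a generic textbook ring with virtual nodes, offered as an illustration rather than the scheme any specific balancer uses; the `HashRing` class, the `gw-N` gateway names, and the choice of 100 virtual nodes are assumptions.

```python
import bisect
import hashlib


def _hash(key):
    # SHA-256 is stable across runs, unlike the built-in hash() for strings.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Map keys to nodes so that removing a node only moves that node's keys."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key):
        """Return the node at the first ring point clockwise from the key's hash."""
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]


users = [f"user-{i}" for i in range(1000)]
full = HashRing(["gw-1", "gw-2", "gw-3"])
reduced = HashRing(["gw-1", "gw-2"])  # gw-3 has been removed

# Users that were not on gw-3 keep their gateway after gw-3 leaves.
unmoved = all(
    reduced.owner(u) == full.owner(u) for u in users if full.owner(u) != "gw-3"
)
```

The property checked at the end is what makes reconnects land on the same gateway: only the keys owned by the departed node are reassigned, whereas a plain `hash(key) % N` scheme would reshuffle most keys when N changes.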
Choose this variant when
- WebSocket / gRPC-stream / SSE fleets
- Legacy in-memory sessions needing affinity
- Cache-locality routing by key
Operating knobs
L4 vs L7
L4 forwards TCP/UDP by IP/port — fastest, protocol-agnostic, best for persistent connections and raw throughput. L7 parses HTTP — path/host routing, TLS termination, cookie stickiness, retries, header rules — at slightly higher cost. Rule of thumb: persistent connections or maximum throughput → L4; flexible HTTP routing and TLS → L7. Many stacks use both (L4 at the edge, L7 per service).
Balancing algorithm
Round-robin for uniform requests; least-connections when request durations vary (avoids piling onto a backend stuck on slow requests); least-response-time to factor latency; IP/consistent-hash for session affinity or cache locality. Default to round-robin or least-connections and reach for hashing only when you need stickiness.
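The difference between the two default algorithms is easiest to see side by side. The functions below are a minimal sketch; the names `round_robin` and `least_connections` and the connection counts are illustrative, and a real balancer would update the counts as connections open and close.

```python
from itertools import cycle


def round_robin(backends):
    """Return an iterator that rotates evenly, ignoring how busy each backend is."""
    return cycle(backends)


def least_connections(active):
    """Pick the backend with the fewest in-flight requests.

    `active` maps backend name -> current open connection count.
    Ties go to the first backend in insertion order.
    """
    return min(active, key=active.get)


# app-1 is stuck on several slow requests; app-2 is idle.
active = {"app-1": 4, "app-2": 0, "app-3": 1}

rotation = round_robin(list(active))
rr_choice = next(rotation)               # rotation starts at app-1 regardless of load
lc_choice = least_connections(active)    # steers the new request to idle app-2
```

Round-robin needs no knowledge of backend state, which is why it is the cheap default; least-connections needs the balancer to track open connections per backend.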
Health checks & draining
Tune health-check path, interval, and unhealthy/healthy thresholds so failures are caught fast without flapping on a transient blip. Connection draining lets in-flight requests finish before a server is removed (for deploys/scale-in), enabling zero-downtime rollouts.
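Connection draining can be modelled as a backend that stops accepting new requests but stays up until its in-flight count reaches zero. The `DrainingBackend` class below is a hypothetical sketch of that bookkeeping; it leaves out the drain timeout that production systems typically add so that a stuck request cannot block removal forever.

```python
class DrainingBackend:
    """Track in-flight requests so a server can be removed without cutting them off."""

    def __init__(self, name):
        self.name = name
        self.in_flight = 0
        self.draining = False

    def accepts_new(self):
        # The balancer consults this before routing a new request here.
        return not self.draining

    def start_request(self):
        if self.draining:
            raise RuntimeError(f"{self.name} is draining")
        self.in_flight += 1

    def finish_request(self):
        self.in_flight -= 1

    def begin_drain(self):
        # Deploy or scale-in has begun: no new work, existing work continues.
        self.draining = True

    def safe_to_remove(self):
        return self.draining and self.in_flight == 0


backend = DrainingBackend("app-1")
backend.start_request()
backend.start_request()
backend.begin_drain()
ready_early = backend.safe_to_remove()   # two requests still running
backend.finish_request()
backend.finish_request()
ready_late = backend.safe_to_remove()    # nothing in flight any more
```

A rolling deploy repeats this per instance: begin the drain, wait until removal is safe, replace the instance, and register the new one.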
Statelessness & session strategy
The LB works cleanly only if backends are stateless — store session/state in Redis or a DB so any server can serve any request. If you must keep in-memory session, use sticky sessions, accepting that losing a server loses those sessions. Externalizing state is almost always the better design.
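Externalized session state can be illustrated with two app servers sharing one store. In this sketch `SessionStore` is an in-memory stand-in for Redis or a database, and `AppServer`, `add_to_cart`, and the session layout are hypothetical names chosen for the example.

```python
class SessionStore:
    """Stand-in for Redis or a database: one store shared by every app server."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """A stateless server: it keeps no session data of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
app_1 = AppServer("app-1", store)
app_2 = AppServer("app-2", store)

app_1.add_to_cart("session-abc", "book")       # first request lands on app-1
cart = app_2.add_to_cart("session-abc", "pen") # next request lands on app-2
```

Because the second server sees the item added through the first, the balancer is free to send each request anywhere, and losing `app-1` loses no session data.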
Versus the alternatives
Load balancer vs adjacent components.
| Dimension | Load balancer | API gateway | CDN |
|---|---|---|---|
| Primary job | Distribute load across servers | Auth, rate limit, route, transform | Cache content near users |
| Layer | L4 (transport) or L7 (HTTP) | L7 (application) | Edge (HTTP) |
| Adds logic? | Minimal — routing + health | Cross-cutting request logic | Caching + edge compute |
| Position | In front of a server pool | In front of services (often after LB) | In front of everything, globally |
| Caches? | No | Sometimes (responses) | Yes — that is its job |
Failure modes & gotchas
If servers keep session or other state in local memory, a user's requests must always hit the same box — defeating free distribution and losing the session when that box dies. Make backends stateless (state in Redis/DB) so any server serves any request; use sticky sessions only as a last resort.
A single LB instance is a SPOF — if it dies, everything behind it is unreachable. Run the LB as a redundant, multi-AZ service (cloud LBs do this for you; self-managed needs an HA pair + failover). The thing that provides availability must itself be available.
Round-robin assumes requests cost the same; with mixed cheap/expensive requests it can pile slow work onto a backend while others idle. Use least-connections (or least-response-time) when request durations vary, so busy backends receive fewer new requests.
WebSocket/gRPC-stream traffic pinned to one socket cannot be round-robined per message — each message must reach the backend holding the connection. Use an L4 LB (and consistent hashing for reconnects) and route messages to the owning node via a separate fan-out tier, not the LB.
Health checks that are too lax keep sending traffic to a half-dead server; checks that are too aggressive flap healthy servers out on a transient blip, shrinking capacity. Tune the path, interval, and thresholds, and use a real readiness endpoint that reflects the server's actual ability to serve.
In production
Maglev — software load balancers serving Google's traffic
Google fronts its public services with Maglev, a software network load balancer running on commodity machines rather than expensive specialized hardware. A pool of Maglev machines shares a service IP via ECMP routing, and each one uses consistent hashing to map a connection to a backend — so even as Maglev machines or backends come and go, existing connections stay pinned to the same backend (the property persistent connections need). A single Maglev machine saturates a 10 Gbps link.
The lessons map directly to the interview: load balancing is the layer that turns "a fleet of identical backends" into "one scalable, available endpoint"; consistent hashing gives connection stability without per-connection state; and you make the balancer itself horizontally scalable and redundant (a pool sharing the IP) so it is never a single point of failure.
AWS
Elastic Load Balancing — scaling from one box to millions of requests
AWS's Elastic Load Balancing is the managed embodiment of this component, and the ALB (L7) vs NLB (L4) split is the exact decision from this page. The Application Load Balancer does HTTP path/host routing, TLS termination, and sticky sessions — the default for web/API tiers. The Network Load Balancer operates at L4 (TCP/UDP), handles millions of requests per second with ultra-low latency, and is the choice for persistent connections (WebSockets, gRPC) and raw throughput.
Both are themselves distributed, multi-AZ services with health checks and connection draining built in — so a backend instance or an entire availability zone can fail and the LB simply stops routing to it, with no single LB instance to lose. The takeaway: in a design you draw the LB as one box, but mentally it's a redundant multi-AZ service, and you choose L7 vs L4 by whether you need HTTP routing or persistent-connection throughput.
Good vs bad answer
Interviewer probe
“Your API runs on one server that is maxing out. How do you scale it and keep it available?”
Weak answer
"Upgrade to a bigger server with more CPU and memory so it can handle the load, and take regular backups in case it goes down."
Strong answer
"Scale out, not up: put a load balancer in front and run multiple identical, stateless app servers behind it — session and state go in Redis/the DB so any server can serve any request. The LB gives clients one stable endpoint, spreads requests across the pool (least-connections, since API request costs vary), and health-checks each backend so a dead instance is drained automatically. I'd spread the servers across multiple availability zones so the LB routes around a whole zone outage, and run the LB itself as a redundant multi-AZ service so it isn't a single point of failure. An autoscaling group adds and removes servers based on load. This is an L7 LB so it can terminate TLS and route /api paths, with connection draining for zero-downtime deploys. A single bigger server just raises the ceiling and stays a single point of failure — horizontal scaling behind an LB gives me both elastic capacity and availability."
Why it wins: Chooses scale-out with a stateless tier, picks the algorithm and layer with reasons, makes the design multi-AZ and the LB itself HA, adds autoscaling and zero-downtime draining, and explains why scaling up is the wrong lever.
Interview playbook
When it comes up
- Any tier that needs to handle more than one server's worth of traffic
- "How does this scale / stay available?" — the LB is the first answer
- WebSocket / persistent-connection fleets (L4 + stickiness)
- Microservice routing or TLS termination behind one endpoint (L7)
Order of reveal
1. One endpoint, stateless pool. A load balancer fronts a pool of identical stateless servers; clients see one endpoint and I scale by adding servers.
2. Health checks for availability. It health-checks the pool and drains failed nodes, and I spread servers across AZs so a zone can fail.
3. Pick the layer. L7 for HTTP path/host routing and TLS; L4 for raw throughput or persistent connections like WebSockets.
4. Pick the algorithm. Least-connections when request costs vary; round-robin for uniform; consistent-hash for stickiness.
5. Make the LB itself HA. Run the LB redundant across zones so the thing providing availability is not a single point of failure.
Signature phrases
- “Scale out behind a load balancer, not up.” — The core horizontal-scaling instinct.
- “Backends are stateless, so any server serves any request.” — The precondition that makes load balancing work.
- “Persistent connections → L4; flexible HTTP routing → L7.” — The crisp layer decision.
- “Health checks plus multi-AZ means a node or zone can fail invisibly.” — Ties the LB to availability.
Likely follow-ups
“L4 or L7: how do you choose?”
By what I need to route on and whether connections are persistent. L7 if I want HTTP-aware features — route by path or host to different services, terminate TLS, sticky sessions by cookie, retries, header-based canaries — which is the default for a web/API tier. L4 if I need maximum throughput with minimal overhead, or I am carrying persistent connections like WebSockets or gRPC streams where I want the connection pinned to one backend and the LB just forwarding TCP. Many architectures use both: an L4 LB at the edge for raw ingress and L7 routing per service behind it.
“How does the load balancer not become a single point of failure?”
You run it redundantly. Managed cloud load balancers (ALB/NLB, Google's) are themselves distributed, multi-AZ services with a stable DNS/anycast front, so there is no single instance to lose. Self-managed (NGINX/HAProxy) you run as an active-passive or active-active pair across zones with health-checked failover (e.g. a floating IP / keepalived), and DNS or anycast in front for the public entrypoint. The principle: the component that provides availability has to be at least as available as what it fronts, so you never deploy exactly one of it.
“A WebSocket service sits behind the LB. What changes?”
A persistent connection pins a client to one backend for the connection's life, so you cannot round-robin per message. I use an L4 load balancer (or session affinity) to place the connection, and consistent-hash on user_id so a reconnect lands on the same gateway. Per-message delivery to a recipient is not the LB's job — a separate fan-out tier looks up which gateway holds that user's connection (via a presence registry) and routes the message there. And because rolling deploys drop connections, clients reconnect with exponential backoff and jitter. The LB places connections; it does not move messages between them.
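The reconnect behaviour mentioned above, exponential backoff with jitter, can be sketched as a delay schedule. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the function name and the `base` and `cap` values are assumptions for illustration.

```python
import random


def reconnect_delays(attempts, base=0.5, cap=30.0, rng=random):
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)] seconds."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


# A seeded generator makes the schedule reproducible for testing.
delays = reconnect_delays(8, rng=random.Random(0))
ceilings = [min(30.0, 0.5 * 2 ** n) for n in range(8)]  # 0.5, 1, 2, ... capped at 30
```

The randomness is the point: when a rolling deploy drops thousands of connections at once, jitter spreads the reconnects out instead of having every client retry at the same instant.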
Worked example
Setup. A single API server is maxing out under growing traffic. Scale it to handle 10× the load and survive an instance — or a whole availability zone — failing, with zero-downtime deploys.
The move. Scale out, not up: put a load balancer in front of a pool of identical stateless app servers. Session and state live in Redis / the database, not on the box, so any server can serve any request — which is the precondition that lets the LB route freely and an autoscaler add/remove instances. Clients see one stable endpoint; the LB spreads requests across the pool.
Layer + algorithm. This is an HTTP API, so an L7 load balancer — it terminates TLS (offloading crypto from the backends) and can route /api paths to the right service. For the algorithm I pick least-connections rather than round-robin, because API request costs vary and least-connections steers new requests away from a backend stuck on slow ones.
Availability. I spread the servers across multiple availability zones and configure health checks (a real /healthz readiness probe), so the LB drains a dead instance automatically and routes around a whole AZ outage. Critically, the LB itself is run redundant across zones (managed ELB, or an active-active HAProxy pair) so the thing providing availability isn't a single point of failure.
Elasticity + deploys. An autoscaling group grows the pool on CPU/request-rate. Connection draining lets in-flight requests finish before an instance is removed, enabling zero-downtime rolling deploys.
What breaks. The classic mistake is stateful backends — in-memory sessions force sticky routing and lose state when a box dies; externalizing state to Redis fixes it. For WebSockets I'd switch to an L4 LB with consistent hashing so a reconnect lands on the gateway holding the socket.
The result. Elastic horizontal capacity behind one endpoint, automatic failure routing across instances and zones, zero-downtime deploys, and an LB tier that is itself highly available — the foundation of every scalable, available service.
Cheat sheet
- Load balancer = one stable endpoint over a pool of stateless servers, with health checks.
- Enables horizontal scaling (add servers) and availability (route around failed nodes/zones).
- L4 = TCP/UDP by IP:port: fast, suits persistent connections (WebSockets). L7 = HTTP path/host, TLS, stickiness.
- Algorithms: round-robin (uniform), least-connections (varied cost), consistent-hash (stickiness/locality).
- Backends must be stateless: session/state in Redis or DB, not local memory.
- Health checks + multi-AZ backends = a node or whole zone can fail invisibly.
- Run the LB itself redundantly so it is not a single point of failure.
- An LB distributes load; an API gateway adds auth/rate-limit/transform. The two are often layered together.
Drills
Why must servers behind a load balancer be stateless?
Because the load balancer is free to send any request to any backend. If a server keeps session or other state in local memory, then a user's later requests must return to that exact server (forcing sticky sessions and defeating even distribution), and if that server dies the state is lost. Keeping servers stateless — with session/state in Redis or a database — lets the LB route freely, lets an autoscaler add/remove instances at will, and makes a server failure a non-event. Statelessness is the precondition that makes horizontal scaling clean.
Interviewer: "round-robin or least-connections for an API with mixed cheap and expensive endpoints?"
Least-connections. Round-robin assumes every request costs roughly the same, so it keeps handing new requests to a backend even if that backend is stuck processing several slow/expensive requests while others sit idle — load gets uneven. Least-connections routes each new request to the backend currently handling the fewest, which naturally steers traffic away from busy servers and toward idle ones when request durations vary. Least-response-time goes further by factoring in observed latency. Round-robin is fine only when requests are uniform.
Where does an API gateway fit relative to the load balancer?
They do different jobs and are often layered. The load balancer distributes traffic across server instances and handles health/failover — it is about where a request goes. The API gateway is an L7 entrypoint that applies cross-cutting request logic — authentication, rate limiting, routing to the right microservice, request/response transformation. A common arrangement is clients → load balancer → API gateway → services, or a managed gateway that has load balancing built in. In an interview, draw the LB for scaling/availability and the gateway when you need auth/rate-limit/routing; do not conflate them.