PostgreSQL
The sensible default database for almost every product-design interview — ACID, rich SQL, JSONB, full-text and geo extensions, and far more scale on one node than candidates assume.
Also worth naming: Amazon RDS / Aurora for PostgreSQL · Google Cloud SQL · Supabase · Neon · Citus (sharded)
The strongest move with Postgres is refusing to over-engineer: a single well-tuned instance handles tens of thousands of writes per second and terabytes of data, so "I'll start with Postgres and add replicas / partitioning when the numbers demand it" is a senior answer, not a naive one.
What it is
PostgreSQL is a relational, ACID-compliant database. Data lives in tables of typed rows; you query it with SQL, including joins, common table expressions, and window functions; and transactions are atomic, consistent, isolated, and durable by default. When two operations must both happen or neither (create a user and their first post, move money between accounts), a transaction guarantees it.
Its reputation as "boring" is the point — it is predictable, battle-tested, and broadly useful. But modern Postgres is also quietly versatile: a JSONB column gives you schemaless documents inside a relational table; GIN indexes power full-text search; the PostGIS extension does geospatial; INSERT ... ON CONFLICT gives clean idempotent upserts; and it scales reads with replicas and big tables with native partitioning.
The interview framing: Postgres is the default you justify by its properties, not by a SQL-vs-NoSQL debate. "I'll use Postgres because its ACID guarantees let me keep the ledger consistent" is a strong opener. You only move off it when a specific requirement — write volume beyond one primary, key-access at massive scale with minimal ops, or a workload that is genuinely not relational — forces the change.
When to reach for it
Reach for this when…
- The default for product-design interviews — user records, orders, relationships, anything transactional
- You need ACID guarantees (payments, bookings, inventory, anything where partial writes corrupt state)
- The data is relational — entities with relationships you will query across with joins
- You want one store that can also do JSON, full-text search, and geo without bolting on a new system
Not really this pattern when…
- Sustained write throughput exceeds what one primary can take and you do not want to shard manually (consider DynamoDB / Cassandra)
- You need seamless horizontal scale with minimal ops and a strictly key-based access pattern (DynamoDB)
- You need sub-millisecond hot-path reads or coordination primitives (put Redis in front)
- You need a high-throughput durable event log with replay (Kafka, not a DB table)
How it works
Four mechanisms explain how Postgres behaves:
1. WAL — durability comes from the log, not the table. Every change is first written to the write-ahead log and fsynced; only then is it considered committed. The table pages are updated in memory (the buffer cache) and flushed lazily. This is why a crash recovers cleanly (replay the WAL) and why the WAL is also the replication feed.
A commit is durable once it is in the write-ahead log (WAL), not when the table page is updated. The WAL is also what streaming replicas tail, so it underpins both durability and replication.
2. MVCC — readers and writers do not block each other. An UPDATE does not overwrite a row in place; it writes a new version and marks the old one dead. Each transaction sees a consistent snapshot based on transaction ids, so a long read never blocks a write and vice versa. The cost is bloat: dead row versions accumulate and VACUUM (usually autovacuum) must reclaim them. A long-running transaction holds back vacuum and is the classic source of table bloat.
An UPDATE writes a new row version rather than overwriting in place. Each version is stamped with the transaction id that created/deleted it, so every transaction sees a consistent snapshot. The cost is dead tuples that VACUUM must reclaim.
3. Indexes + the planner. Postgres has a cost-based query planner that chooses how to execute a query. Your job is to give it the right indexes: B-tree for equality and range (the default), GIN for full-text and JSONB containment, GiST for geospatial / ranges, BRIN for huge naturally-ordered tables (time-series). EXPLAIN ANALYZE shows what the planner actually did — a sequential scan on a large hot query is the smell to fix.
4. One primary, many replicas. All writes go to a single primary; reads scale out to streaming replicas that tail the WAL. Replication is asynchronous by default, so replicas lag by milliseconds to seconds — fine for most reads, but a "read-your-own-write" flow must either read from the primary or wait for the replica to catch up.
Reads fan out to async streaming replicas; all writes go to the single primary. This scales reads horizontally but not writes, and replicas can serve slightly stale data.
Performance envelope
PostgreSQL performance envelope — the numbers to quote.
| Dimension | Number | Why it matters |
|---|---|---|
| Write throughput | ~10K–50K TPS on one primary | Far more than candidates assume — do not propose sharding at 1K QPS |
| Storage per instance | Comfortable into the 10 TB+ range | Vertical scale goes a long way before partitioning or sharding |
| Read scaling | Linear with N read replicas | Reads scale horizontally; the primary does not |
| Write scaling | Single primary is the ceiling | When writes outgrow one node you partition or shard — the real limit |
| Connections | Hundreds before pain | Each connection is a process; front it with PgBouncer to pool thousands of clients |
| Replica lag | Milliseconds to seconds (async) | Replicas can serve stale reads — mind read-your-writes |
Capabilities in interviews
ACID transactions
Group multiple writes into one atomic, all-or-nothing unit — the core reason to pick Postgres.
When several changes must succeed or fail together, wrap them in a transaction:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- both or neither; a crash mid-way rolls backThis is the property you name explicitly in an interview: "I use Postgres here because the transfer has to be atomic — I can't have the debit succeed and the credit fail." For contended rows, SELECT ... FOR UPDATE takes a row lock so two transactions can't both spend the same balance; for optimistic concurrency, a version column checked on UPDATE avoids holding locks.
Choose this variant when
- Money movement, inventory decrement, booking a seat
- Any multi-row change that must not partially apply
- Enforcing invariants the application cannot safely check
Idempotent upserts & constraints
Unique constraints plus ON CONFLICT make "create-or-update" safe under retries.
Let the database enforce uniqueness and absorb duplicate writes:
INSERT INTO idempotency_keys (key, response)
VALUES ($1, $2)
ON CONFLICT (key) DO NOTHING; -- duplicate request → no-opA UNIQUE constraint is a real correctness boundary (two concurrent inserts can't both win), unlike an "index that happens to be unique" enforced in app code. This is the backbone of idempotency keys, "register or fetch existing user", and any retry-safe create — the DB does the dedup under every concurrency scenario.
Choose this variant when
- Idempotency-key storage for retry-safe writes
- Create-if-missing flows (users, accounts)
- Enforcing real uniqueness (usernames, short codes)
JSONB — schemaless when you need it
Store flexible documents inside a relational table, indexed and queryable.
A JSONB column gives you NoSQL-style flexibility without a second database:
ALTER TABLE events ADD COLUMN payload JSONB;
CREATE INDEX ON events USING GIN (payload); -- index containment
SELECT * FROM events WHERE payload @> '{"type":"signup"}';This is the answer to "but I have evolving / semi-structured data" — you keep the relational core (foreign keys, transactions, joins) and put the flexible part in JSONB, indexed with GIN. It undercuts the reflexive "I need NoSQL because my schema changes" claim that interviewers treat as a yellow flag.
Choose this variant when
- Evolving or sparse attributes on an otherwise relational entity
- Event payloads, settings blobs, third-party API responses
- Avoiding a separate document store for a small slice of flexible data
Full-text search (tsvector + GIN)
Built-in search good enough for millions of rows — no separate engine.
Postgres has real full-text search, with stemming and ranking:
CREATE INDEX ON articles USING GIN (to_tsvector('english', body));
SELECT id, ts_rank(to_tsvector('english', body), q) AS rank
FROM articles, to_tsquery('english', 'redis & cluster') q
WHERE to_tsvector('english', body) @@ q
ORDER BY rank DESC LIMIT 20;Up to roughly 10M documents this is genuinely sufficient and keeps search transactionally consistent with your data — a major simplification over running Elasticsearch. You graduate to a dedicated search engine only when scale, faceting, or custom relevance tuning exceed what this offers.
Choose this variant when
- Search over up to ~10M rows
- Keeping search consistent with the primary data
- Avoiding the ops cost of a separate search cluster
Advisory locks & LISTEN/NOTIFY
Lightweight coordination and pub/sub without standing up another system.
Two underused features. Advisory locks give you a named application-level lock backed by the DB:
SELECT pg_try_advisory_lock(42); -- e.g. "only one worker runs this job"LISTEN/NOTIFY is a simple publish/subscribe channel:
LISTEN job_ready; -- consumer
NOTIFY job_ready, 'job:991'; -- producer (often from a trigger)These let a modest system get leader-election-ish coordination and event nudges from the database it already runs — handy for "process each row once" workers — though for heavy coordination you still reach for a dedicated system (etcd) or a real broker (Kafka).
Choose this variant when
- Single-runner jobs without a separate lock service
- Nudging workers when rows change (outbox-lite)
- Small systems that want to avoid extra infrastructure
Partitioning & read replicas
Native table partitioning for huge tables; streaming replicas for read scale.
When a table gets large, declarative partitioning splits it by range/list/hash so queries touch only relevant partitions and old data is dropped cheaply:
CREATE TABLE events (id bigint, created_at timestamptz, ...)
PARTITION BY RANGE (created_at);
-- monthly partitions; DROP an old partition to expire data instantlyFor read scale, add streaming replicas and route read-only queries to them — reads scale horizontally while writes stay on the primary. The two together (partition the hot table, offload reads to replicas) take a single Postgres a very long way before you ever need to shard across primaries.
Choose this variant when
- Time-series / append-heavy tables with retention
- Read-heavy workloads that outgrow one node's read capacity
- Deferring sharding by scaling vertically + replicas first
Operating knobs
Indexing strategy
The single biggest performance lever. B-tree for equality/range, GIN for full-text and JSONB, GiST for geo/ranges, BRIN for huge time-ordered tables. Index the columns your hot queries filter and sort on; use EXPLAIN ANALYZE to confirm the planner uses them. Every index also taxes writes, so cap them to the ones that serve real queries.
Connection pooling
Each Postgres connection is a backend process, so a few hundred is the practical ceiling. A serverless or high-concurrency app must put PgBouncer (transaction-pooling mode) in front, multiplexing thousands of client connections onto a small pool. "Too many connections" is a classic outage that pooling prevents.
Replication: async vs sync
Async streaming replication (default) is fast but a failover can lose the last few unreplicated writes (non-zero RPO). Synchronous replication waits for a replica to confirm, giving RPO≈0 at the cost of write latency. Choose per the data: async for most, sync for a ledger you cannot lose.
Vacuum / autovacuum tuning
MVCC produces dead tuples that autovacuum reclaims. On high-write tables, under-tuned autovacuum lets bloat grow, degrading performance, and a long-running transaction blocks vacuum entirely. Keep transactions short and watch table/index bloat on write-heavy workloads.
Versus the alternatives
PostgreSQL vs the usual alternatives.
| Dimension | PostgreSQL | MySQL | DynamoDB |
|---|---|---|---|
| Data model | Relational + JSONB + extensions (PostGIS, FTS) | Relational | Key-value / document |
| Consistency | ACID, strong by default | ACID (InnoDB) | Tunable strong or eventual per read |
| Scaling | Vertical + read replicas; sharding is manual | Similar | Horizontal, automatic, fully managed |
| Queries | Rich SQL: joins, CTEs, window fns | Rich SQL | Key-access only; design tables per access pattern |
| Best for | Default for product design + relational data | Same niche, slightly leaner footprint | Predictable key-access at massive scale, ops-light |
Failure modes & gotchas
Because each connection is a process, a few hundred is the ceiling. A spike of clients (or a serverless fleet) opening connections directly exhausts them and the database starts refusing connections. Front Postgres with PgBouncer transaction pooling; treat raw connections as a scarce resource.
A query filtering on an unindexed column scans the whole table — fine at thousands of rows, fatal at millions. Add the right index and confirm with EXPLAIN ANALYZE. The classic interview tell is proposing a query without saying which index serves it.
A transaction left open for minutes holds back autovacuum, so dead tuples accumulate and the table bloats — queries slow down and disk grows. Keep transactions short, and never wrap slow external calls inside an open DB transaction.
Replicas scale reads, not writes — every write still funnels through one primary. When sustained write throughput exceeds one node, you must partition the workload or shard across primaries (Citus, app-level sharding), which is a real project. Recognise the ceiling before you hit it.
Async replicas lag, so a user who just wrote and immediately reads from a replica may not see their change. For those flows, route the read to the primary (or wait for the replica LSN), rather than assuming replicas are instantly consistent.
In production
Notion
Sharding Postgres only when one primary truly ran out
Notion ran its entire product on a single Postgres instance for years, and only sharded when the table holding every "block" (the unit of content in Notion) grew past what one primary could serve — into the hundreds of billions of rows. Their public write-up is a masterclass in not sharding prematurely: they pushed vertical scaling and connection pooling (PgBouncer) a long way first, and when they finally sharded, they partitioned by workspace id across many logical shards mapped onto a smaller number of physical Postgres instances.
The takeaway interviewers love is the discipline: a single Postgres handled a wildly successful product to enormous scale before sharding became necessary, and even then the relational model and ACID guarantees were worth keeping rather than fleeing to NoSQL.
Postgres at massive scale with logical sharding
Instagram built on Postgres and scaled it to hundreds of millions of users by logically sharding — thousands of logical shards spread across a smaller set of physical Postgres machines, with a globally-unique id scheme (a custom 64-bit id encoding timestamp + shard id + sequence) so every row knows which shard it belongs to. They leaned on Postgres features rather than abandoning them: partial and functional indexes to keep hot queries fast, and careful schema design to keep the working set in memory.
It is a canonical example that "Postgres can't scale" is a myth: the relational model, indexing flexibility, and a deliberate sharding scheme took it to social-network scale while preserving SQL.
Good vs bad answer
Interviewer probe
“You're designing a payments / wallet system. What database, and why?”
Weak answer
"I'd use a NoSQL database like DynamoDB because it scales better and payments need to be fast and highly available."
Strong answer
"PostgreSQL, specifically for its ACID transactions. A transfer debits one account and credits another, and those have to be atomic — a crash can't leave money debited but not credited. I wrap the two updates in a transaction with SELECT ... FOR UPDATE on the rows so two concurrent transfers can't double-spend a balance, and I store an idempotency key with a UNIQUE constraint so a retried request can't post the transfer twice. A single primary comfortably handles tens of thousands of TPS, which is plenty for most payment volumes; I scale reads (statements, balance checks) with replicas, and only consider sharding if write volume genuinely outgrows one node. I'd reach for NoSQL only if the access pattern were simple key-lookups at a scale Postgres couldn't take — but here the correctness guarantees are the whole point, so trading them away for theoretical scale would be backwards."
Why it wins: Leads with the specific property (ACID) the problem demands, names the concrete mechanisms (FOR UPDATE, unique-constraint idempotency), sizes the throughput honestly, and explicitly rejects the SQL-vs-NoSQL reflex in favour of matching guarantees to requirements.
Interview playbook
When it comes up
- The default datastore choice in almost any product-design round
- Anything transactional — payments, orders, bookings, inventory
- The interviewer asks "SQL or NoSQL?" (answer with properties, not dogma)
- When you need relational queries, joins, or strong uniqueness
Order of reveal
- 11. Justify by property, not by category. I pick Postgres because the workload needs ACID transactions and relational queries — not because "SQL is better than NoSQL".
- 22. Size it honestly. One primary handles tens of thousands of TPS and terabytes, so I start there and add complexity only when the numbers demand it.
- 33. Name the index. For each hot query I state the index that serves it — a B-tree here, a GIN index for this JSONB / full-text lookup.
- 44. Scale reads, then writes. Reads scale with replicas; writes stay on the primary, so I partition the hot table and only shard when write volume truly outgrows one node.
- 55. Guard the operational edges. Connection pooling with PgBouncer, short transactions to avoid vacuum bloat, and read-your-writes handled by reading the primary when needed.
Signature phrases
- “I'll use Postgres because its ACID guarantees keep this consistent.” — Justifies the choice by the property the problem needs.
- “A single primary handles tens of thousands of TPS — I won't shard prematurely.” — Signals you know the real capacity and avoid over-engineering.
- “Replicas scale reads; writes stay on the primary.” — Shows you know exactly where the scaling ceiling is.
- “A unique constraint is a correctness boundary, not just an index.” — Demonstrates idempotency / concurrency awareness.
Likely follow-ups
?“When would you actually move off Postgres?”Reveal
When a specific requirement forces it: sustained write throughput beyond what one primary can take and you don't want to run manual sharding (DynamoDB/Cassandra); a strictly key-access pattern at massive scale where you want managed horizontal scaling and minimal ops (DynamoDB); or sub-millisecond hot-path reads / coordination (Redis in front). I move for a named reason, not because a table got big — partitioning and replicas handle "big" for a long time.
?“Your read replicas are lagging and users don't see their own writes. What do you do?”Reveal
For read-your-writes flows, route that specific read to the primary (or pin the session to the primary for a short window after a write). Alternatively, wait until the replica has applied the write's WAL position (LSN) before reading from it. I would not blanket-route everything to the primary — that defeats the replicas — just the flows where a user reads immediately after writing.
?“How do you handle 50,000 concurrent client connections?”Reveal
Not as 50,000 Postgres connections — each is a backend process and the practical ceiling is in the hundreds. I put PgBouncer in transaction-pooling mode in front, so thousands of clients multiplex onto a small pool of real connections that are only held for the duration of each transaction. The app talks to PgBouncer; PgBouncer rations the scarce real connections.
Worked example
Setup. Design the data layer for a ticket-booking system (think a smaller Ticketmaster): users browse events, then buy a specific seat. The hard requirement is that two people can never buy the same seat — and the interviewer is watching for how you guarantee it.
The move. This is an ACID problem, so Postgres is the answer and I say why out loud: I need a transactional guarantee that a seat goes to exactly one buyer. The core tables are events, seats(event_id, seat_no, status, held_by, held_until), and bookings. The purchase is one transaction:
BEGIN;
SELECT status FROM seats
WHERE event_id = $1 AND seat_no = $2
FOR UPDATE; -- row lock: serializes contenders
-- if already sold, abort
UPDATE seats SET status='sold', held_by=$user WHERE ...;
INSERT INTO bookings (...);
COMMIT;SELECT ... FOR UPDATE takes a row-level lock so two concurrent buyers of the same seat are serialized — the second blocks, then sees sold and aborts. The booking is idempotent on a client request key (a UNIQUE constraint on idempotency_keys) so a retried "buy" can't double-insert.
Capacity check. Even a popular on-sale is a few thousand purchases/sec at peak — a single Postgres primary (tens of thousands of TPS) handles the writes without sharding. The pressure is on reads: tens of millions of fans refreshing the seat map. So I scale reads with replicas and a Redis cache for the seat-availability view, and keep the write path — the actual purchase — on the single primary where the lock and the ACID guarantee live.
What breaks. The row lock is also the contention point: for a single hot seat, contenders queue. That's correct (one winner) but I keep the transaction tiny — lock, check, update, commit, no slow calls inside it — so locks are held for microseconds. I'd also watch connection count: a flood of buyers means I front Postgres with PgBouncer so thousands of clients multiplex onto a small pool.
The result. Provably no double-sells (the DB enforces it under every concurrency scenario), a write path that fits comfortably on one primary, reads scaled out to replicas + cache, and idempotent purchases — all from properties Postgres gives you for free, which is exactly why you name it for this problem.
Cheat sheet
- •Default for product design. Justify by property (ACID, relational), not "SQL vs NoSQL".
- •One primary ≈ 10K–50K TPS and 10 TB+. Don't shard at 1K QPS.
- •WAL = durability + replication feed. Commit is durable when it's fsynced to the WAL.
- •MVCC: updates write new versions; VACUUM reclaims dead ones. Keep transactions short.
- •Indexes: B-tree (default), GIN (FTS/JSONB), GiST (geo), BRIN (huge time-ordered). Verify with EXPLAIN ANALYZE.
- •Scale reads with replicas; writes stay on the primary. Partition big tables; shard last.
- •Each connection is a process → use PgBouncer. Async replicas lag → mind read-your-writes.
- •JSONB + GIN for flexible data; tsvector + GIN for search; ON CONFLICT for idempotent upserts.
Drills
Why is a commit durable once it is in the WAL, before the table page is even updated?Reveal
Because recovery replays the WAL. The write-ahead log records the change and is fsynced to disk at commit; the actual table pages are updated in the buffer cache and flushed lazily. If the server crashes before those pages hit disk, startup replays the WAL to reconstruct them. So durability is a property of the log, which is also why the WAL is what streaming replicas consume.
Interviewer: "your table has 200M rows and a query on `email` is slow. Why, and fix it?"Reveal
No usable index on email, so the planner does a sequential scan of all 200M rows. Add a B-tree index on email (or a unique index if emails are unique), and confirm with EXPLAIN ANALYZE that the plan switches from Seq Scan to an Index Scan. If the column is high-cardinality and frequently filtered, this turns a multi-second scan into a sub-millisecond lookup.
You're told "Postgres can't scale, use NoSQL". How do you respond?Reveal
Push back with numbers. A single primary handles tens of thousands of TPS and terabytes; reads scale with replicas and big tables scale with partitioning. That covers the vast majority of products. Postgres's real ceiling is write throughput on one primary — when sustained writes exceed that, you partition or shard. So "NoSQL for scale" is only right when the workload is genuinely key-access at a write volume beyond one node; otherwise it trades away ACID and joins for scale you don't need yet.
What it is