Apache Cassandra
A masterless, linearly-scalable wide-column store built for enormous write throughput and multi-region availability — when you have a write firehose and known partition-key queries, it scales by simply adding nodes.
Also worth naming: ScyllaDB (C++ rewrite, faster) · DataStax Enterprise · Amazon Keyspaces (managed Cassandra API)
Cassandra has no leader and an append-only storage engine, which is exactly why it swallows writes other databases choke on. The price is rigidity: you design tables per query and give up joins, ad-hoc filters, and easy strong consistency.
What it is
Cassandra is a distributed wide-column database designed around two goals: massive write throughput and no single point of failure. It achieves both by being masterless — every node is an equal peer, there is no leader to bottleneck or fail — and by using an LSM-tree storage engine where writes are append-only.
Data is partitioned across a token ring: the partition key is hashed to a token, and each partition is replicated to N consecutive nodes (the replication factor). Any node can act as the coordinator for a request and route it to the replicas. Because there is no leader, Cassandra stays available for writes even during node failures and network partitions — it sits firmly on the AP side of CAP, with tunable consistency so you can dial in stronger guarantees per query when you need them.
The trade-off is that Cassandra is query-first, not entity-first: like DynamoDB, you design each table to serve a specific access pattern keyed by the partition key, and you deliberately denormalise. No joins, no ad-hoc WHERE on non-key columns without penalty, and read-modify-write / transactions are awkward. In an interview, Cassandra is the answer for write-heavy, time-series, multi-region, high-availability workloads with known queries — and the wrong answer when you need flexible relational queries or strong transactional consistency.
When to reach for it
Reach for this when…
- Write-heavy workloads at scale — time-series, event logs, sensor/IoT data, messaging history, feeds
- You need high availability and multi-region writes with no single point of failure
- Predictable, partition-key-based queries you can design tables around
- Linear scale by adding commodity nodes, self-hosted (where a managed DynamoDB is not an option)
Not really this pattern when…
- You need joins, ad-hoc queries, or aggregations (that is SQL / a warehouse)
- You need strong multi-row ACID transactions (Postgres, Spanner)
- Read-heavy with simple lookups and you want zero ops — DynamoDB is managed; Redis for hot reads
- Small scale where running and tuning a Cassandra cluster is overkill
How it works
Three mechanisms explain Cassandra's behaviour:
1. Masterless ring + replication = availability. Every node owns a range of the token ring and is a peer. A client connects to any node (the coordinator), which hashes the partition key and forwards to the N replicas. With no leader, there is nothing whose failure stops writes — Cassandra keeps accepting writes through node and even region failures, which is its core promise.
There is no leader. A coordinator (any node the client hits) hashes the partition key onto a token ring and routes to the replicas. With replication factor 3, each partition lives on three consecutive nodes, so any node can fail without data loss.
2. Tunable consistency via quorums. You set how many replicas must respond per operation: ONE, QUORUM, ALL, LOCAL_QUORUM (within a datacenter), etc. The classic rule is W + R > N for strong consistency — with replication factor 3, QUORUM writes and reads (2 + 2 > 3) guarantee a read sees the latest committed write. Drop to ONE for maximum speed and availability at the cost of possibly stale reads.
You choose how many replicas must acknowledge a write (W) and respond to a read (R). With N=3, QUORUM means 2 — and because 2 + 2 > 3, a quorum read always overlaps the latest quorum write. Lower the levels for speed and availability; raise them for consistency.
3. LSM-tree write path = write throughput. A write appends to the commit log (for durability) and updates an in-memory memtable — both sequential, no random disk I/O, no read-before-write. Memtables flush to immutable SSTables; background compaction merges them. This append-only design is why Cassandra ingests writes a B-tree database would choke on — but it is also the source of its pain points: tombstones (deletes are markers, not removals) and read amplification across SSTables.
A write appends to the commit log (durability) and updates an in-memory memtable — both fast, sequential operations. The memtable periodically flushes to an immutable SSTable on disk; background compaction merges SSTables. This append-only design is why Cassandra absorbs huge write volume.
Data model: a table has a partition key (which node owns the data) plus optional clustering keys (the sort order within a partition). You design one table per query — "messages by conversation, newest first" is its own table, keyed and clustered to make that exact read a single fast partition scan.
Performance envelope
Cassandra performance envelope — the numbers to quote.
| Dimension | Number | Why it matters |
|---|---|---|
| Write throughput | ~10K–50K+ writes/sec per node, linear | Append-only LSM; add nodes for more — the headline strength |
| Scaling | Linear with node count | Double the nodes ≈ double the capacity, no resharding event |
| Availability | No SPOF; survives node & region loss | Masterless + replication; AP under partition |
| Consistency | Tunable per query (ONE → QUORUM → ALL) | W + R > N for strong; trade for latency/availability |
| Read cost | Can touch multiple SSTables + tombstones | Why range deletes and wide partitions hurt reads |
| Multi-region | Native multi-DC replication | LOCAL_QUORUM keeps writes fast in-region |
Capabilities in interviews
Time-series & event storage
Append huge volumes of timestamped data, queried as ranges within a partition.
Cassandra's append-only engine is ideal for time-series: metrics, sensor readings, audit logs, click events. Partition by the entity, cluster by time:
CREATE TABLE readings (
device_id text, ts timestamp, value double,
PRIMARY KEY ((device_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
-- "last hour for this device" = one partition range readWatch partition size: bucket by day/week (PRIMARY KEY ((device_id, day), ts)) so no single partition grows unbounded. This is the canonical shape for IoT and metrics ingestion at write rates that overwhelm OLTP databases.
Choose this variant when
- Sensor / IoT / metrics ingestion
- Append-only event or audit logs
- Any high-write timestamped data
Messaging & feed history
Store enormous volumes of messages or feed items keyed by conversation or user.
Chat and feed backends are write-heavy and naturally partition by conversation or user, with time ordering — a perfect Cassandra fit (famously, Discord runs messages on it):
PRIMARY KEY ((channel_id, bucket), message_id) -- bucketed by time windowReads are "latest N in this channel," a single clustered partition read. Bucketing keeps partitions bounded for high-traffic channels. The write firehose of millions of messages/sec is exactly what the LSM engine is built for.
Choose this variant when
- Chat / messaging history at scale
- Per-user or per-channel feeds
- Write rates that exceed a single relational primary
Multi-region, always-on writes
Replicate across datacenters with local-quorum writes for availability and low latency.
Cassandra's multi-DC replication makes it a strong pick when writes must continue through a regional outage and stay fast locally:
LOCAL_QUORUM -- acknowledge within the local datacenterEach region writes with LOCAL_QUORUM (fast, no cross-region wait) and asynchronously replicates to the others. The masterless design means there is no failover event — a region can disappear and the rest keep serving. The trade is eventual cross-region consistency, resolved by last-write-wins timestamps.
Choose this variant when
- Global apps needing always-on writes
- Data-residency / multi-DC requirements
- No-single-point-of-failure mandates
Operating knobs
Partition key & partition size
The partition key decides data placement and which queries are cheap. Keep partitions bounded — a partition that grows without limit (a hot channel, a high-frequency device) causes slow reads and compaction pain. Bucket by a time window or hash to cap size. Wide-partition mistakes are the most common Cassandra failure.
Consistency level
Pick per query. LOCAL_QUORUM is the workhorse for multi-DC (strong within a region, fast). ONE maximises availability and speed where staleness is fine. QUORUM/ALL raise consistency at latency and availability cost. The W + R > N rule tells you when reads are guaranteed to see the latest write.
Replication factor & strategy
RF 3 per datacenter is the standard — survives one node loss per DC at QUORUM. NetworkTopologyStrategy places replicas across racks/DCs for fault isolation. Higher RF means more durability and read availability at higher storage and write cost.
Compaction strategy
How SSTables merge shapes read performance and space. Size-tiered suits write-heavy/append workloads; leveled gives predictable read latency for read-heavy/update workloads at higher write amplification; time-window is ideal for time-series with TTL. Matching compaction to the workload is a real operational lever.
Versus the alternatives
Cassandra vs the alternatives.
| Dimension | Cassandra | DynamoDB | PostgreSQL |
|---|---|---|---|
| Topology | Masterless ring, you operate it | Managed, serverless | Single primary + replicas |
| Best at | Write-heavy, multi-region, HA | Managed key-access at scale | Relational + transactions |
| Consistency | Tunable (AP by default) | Eventual or strong (per read) | Strong / ACID |
| Queries | Per-partition, table-per-query | Key-based, pre-designed | Ad-hoc SQL, joins |
| Ops burden | High (cluster tuning, compaction) | None | Moderate |
Failure modes & gotchas
A partition key that keeps accumulating rows (a popular channel, a busy sensor) creates a huge partition that is slow to read and painful to compact. Bound partitions by bucketing on a time window or a hash suffix in the key. This is the #1 Cassandra modelling mistake.
Deletes write tombstones (markers), not removals; they linger until compaction past the GC grace period. Heavy delete/update or queue-like patterns accumulate tombstones that make reads scan and skip thousands of dead rows. Avoid using Cassandra as a queue; model so you rarely delete.
There are no joins, and filtering on non-key columns requires ALLOW FILTERING — a full scan that does not scale. If you need a query your tables were not designed for, you add another denormalised table or a secondary system; you do not bolt an ad-hoc query onto Cassandra.
Cassandra is bad at read-then-write logic. Lightweight transactions (IF NOT EXISTS, compare-and-set) use Paxos and are far slower than normal writes — fine occasionally, a bottleneck if central. If your workload is transactional, Cassandra is the wrong tool.
Modelling normalised tables and joining in the app, or expecting strong consistency by default, fights the engine. Cassandra rewards query-first denormalised modelling and tunable (often eventual) consistency. If that mismatch is uncomfortable, the workload probably wants Postgres.
In production
Discord
Trillions of messages — from Cassandra to ScyllaDB
Discord's message store is the canonical Cassandra case study. They moved messages onto Cassandra to handle a relentless write firehose, modeling the data exactly as the interview answer prescribes: partition by (channel_id, time_bucket), cluster by message id, bucket to keep partitions bounded. By 2017 they stored billions of messages; by the time they migrated the cluster, it held trillions.
Their write-ups are also a masterclass in the failure modes: they hit hot partitions on busy channels (fixed with bucketing), and severe tombstone problems when messages were deleted, which caused latency spikes as reads scanned dead rows. They eventually migrated from Cassandra (JVM) to ScyllaDB (a C++ reimplementation of the same model) for lower tail latency and fewer nodes — but the data model and trade-offs are identical, which is why "Discord on Cassandra" is the reference answer for write-heavy message storage.
Apple / Netflix
Cassandra at planetary scale for always-on workloads
Some of the largest Cassandra fleets in existence run at Apple — publicly cited at well over 100,000 nodes storing hundreds of petabytes — and at Netflix, which runs thousands of nodes across regions for viewing data, user state, and operational metrics. Both chose Cassandra for the same reasons: massive write throughput, linear scale by adding nodes, and no single point of failure with multi-region, always-on availability.
Netflix's engineering posts emphasize the masterless, multi-datacenter design — writing with LOCAL_QUORUM for fast in-region consistency while replicating across AWS regions, so a regional outage never stops writes. The takeaway both reinforce: when availability and write scale are the hard requirements and queries are predictable and key-based, Cassandra scales horizontally further than almost anything else — at the cost of operating the cluster.
Good vs bad answer
Interviewer probe
“A chat app stores billions of messages and ingests 1M writes/sec globally. What database backs the message history?”
Weak answer
"A sharded relational database — partition the messages table by conversation across many MySQL shards, and add read replicas for the reads."
Strong answer
"Cassandra. The workload is a write firehose — 1M writes/sec, append-mostly, global — which is exactly what its masterless, LSM-tree design is built for; you scale by adding nodes linearly instead of operating a manual shard map. Table is keyed PRIMARY KEY ((channel_id, day_bucket), message_id): partition by channel plus a day bucket so hot channels don't create unbounded partitions, clustered by message id so 'latest N messages' is one partition read. Writes use LOCAL_QUORUM so they're fast and strongly consistent within a region while replicating across regions asynchronously — no failover event if a region drops. I'd keep deletes rare to avoid tombstone build-up, and pair it with a cache or search index for non-key access like full-text. Hand-sharding MySQL gives me the same partitioning but with a fragile manual shard map, a primary per shard that bottlenecks writes, and a painful resharding event when I grow — Cassandra makes node addition routine."
Why it wins: Matches the engine to a write-heavy global workload, designs a bucketed key to avoid wide partitions, picks LOCAL_QUORUM with the consistency reasoning, anticipates tombstones and non-key access, and contrasts against manual sharding's real operational cost.
Interview playbook
When it comes up
- Write-heavy at scale — messaging history, time-series, IoT, event logs, feeds
- Multi-region / always-on-writes / no-single-point-of-failure requirements
- The interviewer pushes past what one relational primary can write
- Known partition-key queries you can design tables around
Order of reveal
- 11. Justify by write pattern + HA. This is write-heavy and needs no single point of failure, so I use Cassandra — masterless, append-only, linear scale.
- 22. Design the key per query. Partition by the entity, cluster by time, and bucket the partition so it stays bounded.
- 33. Pick the consistency level. LOCAL_QUORUM for strong-in-region multi-DC writes; I cite W + R > N when I need guaranteed-fresh reads.
- 44. Call the pain points. No joins, deletes make tombstones, and wide partitions hurt — I model to avoid all three.
- 55. Pair for what it cannot do. Full-text or ad-hoc queries go to a search index or warehouse fed from Cassandra, not ALLOW FILTERING.
Signature phrases
- “Masterless and append-only — no leader to bottleneck, no failover event.” — Captures the availability + write-throughput story.
- “Query-first, table-per-access-pattern, deliberately denormalised.” — Shows you model Cassandra correctly, not like SQL.
- “Bucket the partition key so partitions stay bounded.” — Pre-empts the #1 modelling mistake.
- “W + R > N for strong reads; LOCAL_QUORUM keeps multi-DC writes fast.” — Demonstrates tunable-consistency command.
Likely follow-ups
?“How does Cassandra stay available when a node or whole region fails?”Reveal
There is no leader, so no election or failover. The partition key is replicated to N peer nodes (RF 3 per DC); if one replica is down, the coordinator still satisfies the request from the others as long as the chosen consistency level can be met — LOCAL_QUORUM needs 2 of 3, so one node down is fine. Across regions, each DC has its own replicas and writes with LOCAL_QUORUM locally, replicating asynchronously, so an entire region can vanish and the rest keep serving. That AP posture is the whole point; the cost is eventual cross-region consistency, reconciled by last-write-wins timestamps.
?“Why are deletes a problem?”Reveal
Because the LSM engine is append-only, a delete cannot remove a row in place — it writes a tombstone marking the row deleted. The real data and the tombstone coexist until compaction reclaims them after the GC grace period (default 10 days, to let the deletion propagate to all replicas). Workloads with heavy deletes or updates — especially queue-like patterns — pile up tombstones that reads must scan and skip, tanking read latency. The fix is to avoid delete-heavy designs: use TTLs for expiry, time-window compaction for time-series, and never model a work queue on Cassandra.
?“When would you pick DynamoDB over Cassandra?”Reveal
When I want the same scaling and key-based model but without operating the cluster. DynamoDB is fully managed — no nodes, compaction tuning, repair, or capacity ops — so for a team on AWS that does not want to run Cassandra, it is the lower-effort choice for write-heavy key-access at scale. I pick Cassandra when I need to self-host (no managed option, on-prem, multi-cloud), want fine control over consistency and compaction, or need a specific multi-DC topology. They solve a similar problem; the deciding factor is usually operational ownership.
Worked example
Setup. Design the message-storage layer for a global chat app: 1M messages/sec written worldwide, billions stored, "load the latest 50 messages in a channel" must be fast, and writes must keep flowing even if a whole region goes down.
The move. This is a write firehose with known per-channel reads and an HA requirement — squarely Cassandra. The table is keyed PRIMARY KEY ((channel_id, day_bucket), message_id): partition by channel plus a day bucket so a busy channel doesn't create an unbounded partition, and cluster by message_id so "latest 50" is a single ordered partition read with ORDER BY message_id DESC LIMIT 50. The append-only LSM engine swallows the 1M/sec writes by adding nodes linearly — no single primary to bottleneck.
Consistency + multi-region. Replication factor 3 per datacenter; writes use `LOCAL_QUORUM` (2 of 3 in the local DC) so they're strongly consistent in-region and fast, replicating asynchronously to other regions. Because the ring is masterless, there is no failover event — a region can vanish and the others keep accepting writes. I cite W + R > N when I need a guaranteed-fresh read.
Bounding partitions. The day-bucket in the key is the critical modeling decision: without it, a channel with millions of messages becomes a multi-GB partition that is slow to read and painful to compact. With it, each partition is one day of one channel — bounded and cheap.
What breaks. Tombstones. If I let users delete messages heavily, deletes write tombstones (not removals) that pile up and slow reads until compaction past the GC grace period. I keep deletes rare, use TTLs for ephemeral data, and pick time-window compaction for this time-series shape. The second gap is non-key access (full-text search across messages) — that is not Cassandra's job, so I feed a search index (Elasticsearch) from the write path rather than reaching for ALLOW FILTERING.
The result. 1M writes/sec absorbed by adding commodity nodes, billions of messages stored, "latest N in a channel" served as one fast partition read, and writes that survive a full regional outage with no failover — the workload Cassandra was built for.
Cheat sheet
- •Masterless wide-column store: no leader, append-only LSM, linear scale by adding nodes.
- •Built for write-heavy + high-availability + multi-region. AP by default, tunable consistency.
- •Model query-first: partition key = placement, clustering key = order. One table per access pattern.
- •Bound partitions — bucket by time/hash so none grows unbounded (the #1 mistake).
- •Consistency per query: ONE → LOCAL_QUORUM → QUORUM → ALL. W + R > N for strong reads.
- •RF 3 per DC standard; LOCAL_QUORUM for fast strong multi-DC writes.
- •Deletes = tombstones (not removals) → avoid delete-heavy / queue patterns. Use TTL.
- •No joins, no ad-hoc filters (ALLOW FILTERING = scan). Pair with search/warehouse for those.
Drills
Why does Cassandra handle writes that overwhelm a relational database?Reveal
Its LSM-tree write path is append-only: a write hits the commit log (sequential) and an in-memory memtable, with no read-before-write and no random disk I/O, then memtables flush to immutable SSTables in the background. A B-tree relational engine does in-place updates with random I/O and page splits, which become the bottleneck under heavy write load. Combined with the masterless ring (no single primary to funnel writes through), Cassandra scales writes linearly by adding nodes.
Interviewer: "your Cassandra reads got slow over time. Why might that be?"Reveal
Most likely wide partitions or tombstone build-up. If a partition key accumulates rows without bound, reads scan an enormous partition — fix by bucketing the key. If the workload deletes or updates heavily, tombstones accumulate and reads waste time scanning and skipping dead rows until compaction reclaims them — fix by avoiding delete-heavy patterns, using TTLs, and choosing time-window compaction. Read amplification across many un-compacted SSTables is a third cause, addressed by tuning the compaction strategy.
You set RF=3 and write with consistency ONE. What did you trade away?Reveal
Strong consistency. Writing with ONE returns after a single replica acknowledges, so it is fast and highly available — but a subsequent read at ONE might hit a replica that has not yet received the write, returning stale data. To guarantee a read sees the latest write you need W + R > N: with N=3 that means QUORUM (2) on both. ONE/ONE maximises speed and availability; you accept eventual consistency and rely on read-repair and anti-entropy to converge replicas over time.
What it is