Technology·Databases

Apache Cassandra

A masterless, linearly-scalable wide-column store built for enormous write throughput and multi-region availability — when you have a write firehose and known partition-key queries, it scales by simply adding nodes.

Also worth naming: ScyllaDB (C++ rewrite, faster) · DataStax Enterprise · Amazon Keyspaces (managed Cassandra API)

~25 min read·15 sections

Cassandra has no leader and an append-only storage engine, which is exactly why it swallows writes other databases choke on. The price is rigidity: you design tables per query and give up joins, ad-hoc filters, and easy strong consistency.

What it is

Cassandra is a distributed wide-column database designed around two goals: massive write throughput and no single point of failure. It achieves both by being masterless — every node is an equal peer, there is no leader to bottleneck or fail — and by using an LSM-tree storage engine where writes are append-only.

Data is partitioned across a token ring: the partition key is hashed to a token, and each partition is replicated to N consecutive nodes (the replication factor). Any node can act as the coordinator for a request and route it to the replicas. Because there is no leader, Cassandra stays available for writes even during node failures and network partitions — it sits firmly on the AP side of CAP, with tunable consistency so you can dial in stronger guarantees per query when you need them.

The trade-off is that Cassandra is query-first, not entity-first: like DynamoDB, you design each table to serve a specific access pattern keyed by the partition key, and you deliberately denormalise. No joins, no ad-hoc WHERE on non-key columns without penalty, and read-modify-write / transactions are awkward. In an interview, Cassandra is the answer for write-heavy, time-series, multi-region, high-availability workloads with known queries — and the wrong answer when you need flexible relational queries or strong transactional consistency.

When to reach for it

Reach for this when…

Write-heavy workloads at scale — time-series, event logs, sensor/IoT data, messaging history, feeds
You need high availability and multi-region writes with no single point of failure
Predictable, partition-key-based queries you can design tables around
Linear scale by adding commodity nodes, self-hosted (where a managed DynamoDB is not an option)

Not really this pattern when…

You need joins, ad-hoc queries, or aggregations (that is SQL / a warehouse)
You need strong multi-row ACID transactions (Postgres, Spanner)
Read-heavy with simple lookups and you want zero ops — DynamoDB is managed; Redis for hot reads
Small scale where running and tuning a Cassandra cluster is overkill

How it works

Three mechanisms explain Cassandra's behaviour:

1. Masterless ring + replication = availability. Every node owns a range of the token ring and is a peer. A client connects to any node (the coordinator), which hashes the partition key and forwards to the N replicas. With no leader, there is nothing whose failure stops writes — Cassandra keeps accepting writes through node and even region failures, which is its core promise.

Architecture diagram· A masterless ring — every node is a peer, data is replicated to N nodes

There is no leader. A coordinator (any node the client hits) hashes the partition key onto a token ring and routes to the replicas. With replication factor 3, each partition lives on three consecutive nodes, so any node can fail without data loss.

2. Tunable consistency via quorums. You set how many replicas must respond per operation: ONE, QUORUM, ALL, LOCAL_QUORUM (within a datacenter), etc. The classic rule is W + R > N for strong consistency — with replication factor 3, QUORUM writes and reads (2 + 2 > 3) guarantee a read sees the latest committed write. Drop to ONE for maximum speed and availability at the cost of possibly stale reads.

Architecture diagram· Tunable consistency: W + R > N gives strong reads

You choose how many replicas must acknowledge a write (W) and respond to a read (R). With N=3, QUORUM means 2 — and because 2 + 2 > 3, a quorum read always overlaps the latest quorum write. Lower the levels for speed and availability; raise them for consistency.

3. LSM-tree write path = write throughput. A write appends to the commit log (for durability) and updates an in-memory memtable — both sequential, no random disk I/O, no read-before-write. Memtables flush to immutable SSTables; background compaction merges them. This append-only design is why Cassandra ingests writes a B-tree database would choke on — but it is also the source of its pain points: tombstones (deletes are markers, not removals) and read amplification across SSTables.

Architecture diagram· LSM write path: commit log + memtable, flushed to immutable SSTables

A write appends to the commit log (durability) and updates an in-memory memtable — both fast, sequential operations. The memtable periodically flushes to an immutable SSTable on disk; background compaction merges SSTables. This append-only design is why Cassandra absorbs huge write volume.

Data model: a table has a partition key (which node owns the data) plus optional clustering keys (the sort order within a partition). You design one table per query — "messages by conversation, newest first" is its own table, keyed and clustered to make that exact read a single fast partition scan.

Performance envelope

Cassandra performance envelope — the numbers to quote.

Dimension	Number	Why it matters
Write throughput	~10K–50K+ writes/sec per node, linear	Append-only LSM; add nodes for more — the headline strength
Scaling	Linear with node count	Double the nodes ≈ double the capacity, no resharding event
Availability	No SPOF; survives node & region loss	Masterless + replication; AP under partition
Consistency	Tunable per query (ONE → QUORUM → ALL)	W + R > N for strong; trade for latency/availability
Read cost	Can touch multiple SSTables + tombstones	Why range deletes and wide partitions hurt reads
Multi-region	Native multi-DC replication	LOCAL_QUORUM keeps writes fast in-region

Capabilities in interviews

Time-series & event storage

Append huge volumes of timestamped data, queried as ranges within a partition.

Cassandra's append-only engine is ideal for time-series: metrics, sensor readings, audit logs, click events. Partition by the entity, cluster by time:

text

CREATE TABLE readings (
  device_id text, ts timestamp, value double,
  PRIMARY KEY ((device_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
-- "last hour for this device" = one partition range read

Watch partition size: bucket by day/week (PRIMARY KEY ((device_id, day), ts)) so no single partition grows unbounded. This is the canonical shape for IoT and metrics ingestion at write rates that overwhelm OLTP databases.

Choose this variant when

Sensor / IoT / metrics ingestion
Append-only event or audit logs
Any high-write timestamped data

Messaging & feed history

Store enormous volumes of messages or feed items keyed by conversation or user.

Chat and feed backends are write-heavy and naturally partition by conversation or user, with time ordering — a perfect Cassandra fit (famously, Discord runs messages on it):

text

PRIMARY KEY ((channel_id, bucket), message_id)  -- bucketed by time window

Reads are "latest N in this channel," a single clustered partition read. Bucketing keeps partitions bounded for high-traffic channels. The write firehose of millions of messages/sec is exactly what the LSM engine is built for.

Choose this variant when

Chat / messaging history at scale
Per-user or per-channel feeds
Write rates that exceed a single relational primary

Multi-region, always-on writes

Replicate across datacenters with local-quorum writes for availability and low latency.

Cassandra's multi-DC replication makes it a strong pick when writes must continue through a regional outage and stay fast locally:

text

LOCAL_QUORUM  -- acknowledge within the local datacenter

Each region writes with LOCAL_QUORUM (fast, no cross-region wait) and asynchronously replicates to the others. The masterless design means there is no failover event — a region can disappear and the rest keep serving. The trade is eventual cross-region consistency, resolved by last-write-wins timestamps.

Choose this variant when

Global apps needing always-on writes
Data-residency / multi-DC requirements
No-single-point-of-failure mandates

Operating knobs

Partition key & partition size

The partition key decides data placement and which queries are cheap. Keep partitions bounded — a partition that grows without limit (a hot channel, a high-frequency device) causes slow reads and compaction pain. Bucket by a time window or hash to cap size. Wide-partition mistakes are the most common Cassandra failure.

Consistency level

Pick per query. LOCAL_QUORUM is the workhorse for multi-DC (strong within a region, fast). ONE maximises availability and speed where staleness is fine. QUORUM/ALL raise consistency at latency and availability cost. The W + R > N rule tells you when reads are guaranteed to see the latest write.

Replication factor & strategy

RF 3 per datacenter is the standard — survives one node loss per DC at QUORUM. NetworkTopologyStrategy places replicas across racks/DCs for fault isolation. Higher RF means more durability and read availability at higher storage and write cost.

Compaction strategy

How SSTables merge shapes read performance and space. Size-tiered suits write-heavy/append workloads; leveled gives predictable read latency for read-heavy/update workloads at higher write amplification; time-window is ideal for time-series with TTL. Matching compaction to the workload is a real operational lever.

Versus the alternatives

Cassandra vs the alternatives.

Dimension	Cassandra	DynamoDB	PostgreSQL
Topology	Masterless ring, you operate it	Managed, serverless	Single primary + replicas
Best at	Write-heavy, multi-region, HA	Managed key-access at scale	Relational + transactions
Consistency	Tunable (AP by default)	Eventual or strong (per read)	Strong / ACID
Queries	Per-partition, table-per-query	Key-based, pre-designed	Ad-hoc SQL, joins
Ops burden	High (cluster tuning, compaction)	None	Moderate

Failure modes & gotchas

Unbounded / wide partitions

A partition key that keeps accumulating rows (a popular channel, a busy sensor) creates a huge partition that is slow to read and painful to compact. Bound partitions by bucketing on a time window or a hash suffix in the key. This is the #1 Cassandra modelling mistake.

Tombstone build-up from deletesAdvanced

Deletes write tombstones (markers), not removals; they linger until compaction past the GC grace period. Heavy delete/update or queue-like patterns accumulate tombstones that make reads scan and skip thousands of dead rows. Avoid using Cassandra as a queue; model so you rarely delete.

Expecting joins or ad-hoc queries

There are no joins, and filtering on non-key columns requires ALLOW FILTERING — a full scan that does not scale. If you need a query your tables were not designed for, you add another denormalised table or a secondary system; you do not bolt an ad-hoc query onto Cassandra.

Read-modify-write and "lightweight transactions"Advanced

Cassandra is bad at read-then-write logic. Lightweight transactions (IF NOT EXISTS, compare-and-set) use Paxos and are far slower than normal writes — fine occasionally, a bottleneck if central. If your workload is transactional, Cassandra is the wrong tool.

Treating it like a relational DB

Modelling normalised tables and joining in the app, or expecting strong consistency by default, fights the engine. Cassandra rewards query-first denormalised modelling and tunable (often eventual) consistency. If that mismatch is uncomfortable, the workload probably wants Postgres.

In production

Discord

Trillions of messages — from Cassandra to ScyllaDB

Discord's message store is the canonical Cassandra case study. They moved messages onto Cassandra to handle a relentless write firehose, modeling the data exactly as the interview answer prescribes: partition by (channel_id, time_bucket), cluster by message id, bucket to keep partitions bounded. By 2017 they stored billions of messages; by the time they migrated the cluster, it held trillions.

Their write-ups are also a masterclass in the failure modes: they hit hot partitions on busy channels (fixed with bucketing), and severe tombstone problems when messages were deleted, which caused latency spikes as reads scanned dead rows. They eventually migrated from Cassandra (JVM) to ScyllaDB (a C++ reimplementation of the same model) for lower tail latency and fewer nodes — but the data model and trade-offs are identical, which is why "Discord on Cassandra" is the reference answer for write-heavy message storage.

Takeaway: Bucket partitions by time to keep them bounded, and design to avoid deletes — wide partitions and tombstones are the two failure modes that bite write-heavy Cassandra workloads.

Apple / Netflix

Cassandra at planetary scale for always-on workloads

Some of the largest Cassandra fleets in existence run at Apple — publicly cited at well over 100,000 nodes storing hundreds of petabytes — and at Netflix, which runs thousands of nodes across regions for viewing data, user state, and operational metrics. Both chose Cassandra for the same reasons: massive write throughput, linear scale by adding nodes, and no single point of failure with multi-region, always-on availability.

Netflix's engineering posts emphasize the masterless, multi-datacenter design — writing with LOCAL_QUORUM for fast in-region consistency while replicating across AWS regions, so a regional outage never stops writes. The takeaway both reinforce: when availability and write scale are the hard requirements and queries are predictable and key-based, Cassandra scales horizontally further than almost anything else — at the cost of operating the cluster.

Takeaway: When write throughput + multi-region availability are the hard requirements, Cassandra scales linearly to enormous fleets — LOCAL_QUORUM keeps in-region writes fast while surviving regional loss.

Good vs bad answer

Interviewer probe

“A chat app stores billions of messages and ingests 1M writes/sec globally. What database backs the message history?”

Weak answer

"A sharded relational database — partition the messages table by conversation across many MySQL shards, and add read replicas for the reads."

Strong answer

"Cassandra. The workload is a write firehose — 1M writes/sec, append-mostly, global — which is exactly what its masterless, LSM-tree design is built for; you scale by adding nodes linearly instead of operating a manual shard map. Table is keyed PRIMARY KEY ((channel_id, day_bucket), message_id): partition by channel plus a day bucket so hot channels don't create unbounded partitions, clustered by message id so 'latest N messages' is one partition read. Writes use LOCAL_QUORUM so they're fast and strongly consistent within a region while replicating across regions asynchronously — no failover event if a region drops. I'd keep deletes rare to avoid tombstone build-up, and pair it with a cache or search index for non-key access like full-text. Hand-sharding MySQL gives me the same partitioning but with a fragile manual shard map, a primary per shard that bottlenecks writes, and a painful resharding event when I grow — Cassandra makes node addition routine."

Why it wins: Matches the engine to a write-heavy global workload, designs a bucketed key to avoid wide partitions, picks LOCAL_QUORUM with the consistency reasoning, anticipates tombstones and non-key access, and contrasts against manual sharding's real operational cost.

Interview playbook

Interview playbook2–4 min when a write-heavy or multi-region datastore is central

When it comes up

Write-heavy at scale — messaging history, time-series, IoT, event logs, feeds
Multi-region / always-on-writes / no-single-point-of-failure requirements
The interviewer pushes past what one relational primary can write
Known partition-key queries you can design tables around

Order of reveal

1
1. Justify by write pattern + HA. This is write-heavy and needs no single point of failure, so I use Cassandra — masterless, append-only, linear scale.
2
2. Design the key per query. Partition by the entity, cluster by time, and bucket the partition so it stays bounded.
3
3. Pick the consistency level. LOCAL_QUORUM for strong-in-region multi-DC writes; I cite W + R > N when I need guaranteed-fresh reads.
4
4. Call the pain points. No joins, deletes make tombstones, and wide partitions hurt — I model to avoid all three.
5
5. Pair for what it cannot do. Full-text or ad-hoc queries go to a search index or warehouse fed from Cassandra, not ALLOW FILTERING.

Signature phrases

“Masterless and append-only — no leader to bottleneck, no failover event.”

“Query-first, table-per-access-pattern, deliberately denormalised.”

“Bucket the partition key so partitions stay bounded.”

“W + R > N for strong reads; LOCAL_QUORUM keeps multi-DC writes fast.”

“Masterless and append-only — no leader to bottleneck, no failover event.” — Captures the availability + write-throughput story.
“Query-first, table-per-access-pattern, deliberately denormalised.” — Shows you model Cassandra correctly, not like SQL.
“Bucket the partition key so partitions stay bounded.” — Pre-empts the #1 modelling mistake.
“W + R > N for strong reads; LOCAL_QUORUM keeps multi-DC writes fast.” — Demonstrates tunable-consistency command.

Likely follow-ups

?“How does Cassandra stay available when a node or whole region fails?”Reveal

There is no leader, so no election or failover. The partition key is replicated to N peer nodes (RF 3 per DC); if one replica is down, the coordinator still satisfies the request from the others as long as the chosen consistency level can be met — LOCAL_QUORUM needs 2 of 3, so one node down is fine. Across regions, each DC has its own replicas and writes with LOCAL_QUORUM locally, replicating asynchronously, so an entire region can vanish and the rest keep serving. That AP posture is the whole point; the cost is eventual cross-region consistency, reconciled by last-write-wins timestamps.

?“Why are deletes a problem?”Reveal

Because the LSM engine is append-only, a delete cannot remove a row in place — it writes a tombstone marking the row deleted. The real data and the tombstone coexist until compaction reclaims them after the GC grace period (default 10 days, to let the deletion propagate to all replicas). Workloads with heavy deletes or updates — especially queue-like patterns — pile up tombstones that reads must scan and skip, tanking read latency. The fix is to avoid delete-heavy designs: use TTLs for expiry, time-window compaction for time-series, and never model a work queue on Cassandra.

?“When would you pick DynamoDB over Cassandra?”Reveal

When I want the same scaling and key-based model but without operating the cluster. DynamoDB is fully managed — no nodes, compaction tuning, repair, or capacity ops — so for a team on AWS that does not want to run Cassandra, it is the lower-effort choice for write-heavy key-access at scale. I pick Cassandra when I need to self-host (no managed option, on-prem, multi-cloud), want fine control over consistency and compaction, or need a specific multi-DC topology. They solve a similar problem; the deciding factor is usually operational ownership.

Worked example

Setup. Design the message-storage layer for a global chat app: 1M messages/sec written worldwide, billions stored, "load the latest 50 messages in a channel" must be fast, and writes must keep flowing even if a whole region goes down.

The move. This is a write firehose with known per-channel reads and an HA requirement — squarely Cassandra. The table is keyed PRIMARY KEY ((channel_id, day_bucket), message_id): partition by channel plus a day bucket so a busy channel doesn't create an unbounded partition, and cluster by message_id so "latest 50" is a single ordered partition read with ORDER BY message_id DESC LIMIT 50. The append-only LSM engine swallows the 1M/sec writes by adding nodes linearly — no single primary to bottleneck.

Consistency + multi-region. Replication factor 3 per datacenter; writes use `LOCAL_QUORUM` (2 of 3 in the local DC) so they're strongly consistent in-region and fast, replicating asynchronously to other regions. Because the ring is masterless, there is no failover event — a region can vanish and the others keep accepting writes. I cite W + R > N when I need a guaranteed-fresh read.

Bounding partitions. The day-bucket in the key is the critical modeling decision: without it, a channel with millions of messages becomes a multi-GB partition that is slow to read and painful to compact. With it, each partition is one day of one channel — bounded and cheap.

What breaks. Tombstones. If I let users delete messages heavily, deletes write tombstones (not removals) that pile up and slow reads until compaction past the GC grace period. I keep deletes rare, use TTLs for ephemeral data, and pick time-window compaction for this time-series shape. The second gap is non-key access (full-text search across messages) — that is not Cassandra's job, so I feed a search index (Elasticsearch) from the write path rather than reaching for ALLOW FILTERING.

The result. 1M writes/sec absorbed by adding commodity nodes, billions of messages stored, "latest N in a channel" served as one fast partition read, and writes that survive a full regional outage with no failover — the workload Cassandra was built for.

Cheat sheet

•Masterless wide-column store: no leader, append-only LSM, linear scale by adding nodes.
•Built for write-heavy + high-availability + multi-region. AP by default, tunable consistency.
•Model query-first: partition key = placement, clustering key = order. One table per access pattern.
•Bound partitions — bucket by time/hash so none grows unbounded (the #1 mistake).
•Consistency per query: ONE → LOCAL_QUORUM → QUORUM → ALL. W + R > N for strong reads.
•RF 3 per DC standard; LOCAL_QUORUM for fast strong multi-DC writes.
•Deletes = tombstones (not removals) → avoid delete-heavy / queue patterns. Use TTL.
•No joins, no ad-hoc filters (ALLOW FILTERING = scan). Pair with search/warehouse for those.

Drills

Why does Cassandra handle writes that overwhelm a relational database?Reveal

Its LSM-tree write path is append-only: a write hits the commit log (sequential) and an in-memory memtable, with no read-before-write and no random disk I/O, then memtables flush to immutable SSTables in the background. A B-tree relational engine does in-place updates with random I/O and page splits, which become the bottleneck under heavy write load. Combined with the masterless ring (no single primary to funnel writes through), Cassandra scales writes linearly by adding nodes.

Interviewer: "your Cassandra reads got slow over time. Why might that be?"Reveal

Most likely wide partitions or tombstone build-up. If a partition key accumulates rows without bound, reads scan an enormous partition — fix by bucketing the key. If the workload deletes or updates heavily, tombstones accumulate and reads waste time scanning and skipping dead rows until compaction reclaims them — fix by avoiding delete-heavy patterns, using TTLs, and choosing time-window compaction. Read amplification across many un-compacted SSTables is a third cause, addressed by tuning the compaction strategy.

You set RF=3 and write with consistency ONE. What did you trade away?Reveal

Strong consistency. Writing with ONE returns after a single replica acknowledges, so it is fast and highly available — but a subsequent read at ONE might hit a replica that has not yet received the write, returning stale data. To guarantee a read sees the latest write you need W + R > N: with N=3 that means QUORUM (2) on both. ONE/ONE maximises speed and availability; you accept eventual consistency and rely on read-repair and anti-entropy to converge replicas over time.

What it is

Apache Cassandra

Also worth naming: ScyllaDB (C++ rewrite, faster) · DataStax Enterprise · Amazon Keyspaces (managed Cassandra API)

~25 min read·15 sections

Dimension

Number

Why it matters

Write throughput

~10K–50K+ writes/sec per node, linear

Append-only LSM; add nodes for more — the headline strength

Scaling

Linear with node count

Double the nodes ≈ double the capacity, no resharding event

Availability

No SPOF; survives node & region loss

Masterless + replication; AP under partition

Consistency

Tunable per query (ONE → QUORUM → ALL)

W + R > N for strong; trade for latency/availability

Read cost

Can touch multiple SSTables + tombstones

Why range deletes and wide partitions hurt reads

Multi-region

Native multi-DC replication

LOCAL_QUORUM keeps writes fast in-region

Dimension

Cassandra

DynamoDB

PostgreSQL

Topology

Masterless ring, you operate it

Managed, serverless

Single primary + replicas

Best at

Write-heavy, multi-region, HA

Managed key-access at scale

Relational + transactions

Consistency

Tunable (AP by default)

Eventual or strong (per read)

Strong / ACID

Queries

Per-partition, table-per-query

Key-based, pre-designed

Ad-hoc SQL, joins

Ops burden

High (cluster tuning, compaction)

None

Moderate