Advanced · Deep dive

Active-active multi-region

Global writes, conflict resolution, data residency, failover, and blast-radius trade-offs.

~15 min read

Active-active is not just active-passive with two active sides; it is a whole conflict-resolution design. In practice you pick partitioning, last-write-wins, or CRDTs. Anything else is a custom application-level merge that you have to design and own yourself.

Read this if your last attempt…

You said "we'll go multi-region" without a conflict-resolution plan
You don't know the difference between active-passive and active-active
You can't explain what CRDTs solve
You haven't considered cross-region write latency

The concept

Start by distinguishing the three postures:

Single region: one region serves everything. Other regions (if any) are DR-only, rarely exercised, and often broken when needed. Many production systems run this way.
Active-passive — primary region takes all writes; secondary has a replica. On failover, secondary is promoted. RTO minutes to hours; RPO seconds to minutes of replication lag. Simpler to reason about; failover is an event.
Active-active — both regions take writes simultaneously. Cross-region replication + conflict resolution is mandatory. Lower RTO but much more complex.

Architecture diagram: active-active with partitioning by user

Users routed to home region. Home writes stay local. Replication to other regions is async for reads.

Posture trade-off.

Posture	RTO	Cost	Complexity
Single region	Region outage = site outage	1×	Lowest
Active-passive	Minutes (failover)	1.5–2× (warm standby)	Moderate — practise the failover
Active-active (partitioned)	Seconds (drain + reroute)	2× + replication	High — ops muscle required
Active-active (LWW conflict)	Seconds	2×+	Very high + silent data loss risk

How interviewers grade this

You state the posture (single / active-passive / active-active) and the RTO/RPO.
If active-active, you name the conflict-resolution strategy.
You keep writes in-region; cross-region is async.
You name the routing mechanism (GeoDNS or anycast).
You cost the added operational complexity honestly.

Variants

Active-active partitioned by user

Each user has a home region; all their writes go there.

The WhatsApp / Slack / most-large-scale pattern. Conflicts are impossible because only one region ever writes a given record. Cross-region reads happen, but they are served from asynchronously replicated copies and may be slightly stale. Migration between regions is an explicit operation.

Pros

+No write conflicts by construction
+Low latency for in-region writes
+Bounded blast radius per region

Cons

−Cross-region access is slower or eventually consistent
−Home-region assignment is a product decision
−Migrating a user between regions is non-trivial

Choose this variant when

User-scoped data dominates (messaging, social, most SaaS)
Large scale requires independent regional scaling

Active-passive

One region primary; one region warm standby.

Simpler than active-active. Primary takes writes; standby replicates asynchronously. Failover is a runbook event (minutes). RPO is replication lag (typically seconds). Good default for "need DR" without the ops bill.

Pros

+Simple reasoning; no conflicts
+Standard disaster-recovery story
+Half the complexity of AA

Cons

−Some downtime during failover
−Standby capacity is idle (partially)
−Failover must be practised or it won't work

Choose this variant when

DR is the primary motivation
Single-region latency is acceptable
Team isn't ready for AA ops
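
To make the RPO point concrete, here is a minimal sketch of an active-passive pair with asynchronous replication. The class names, the region names, and the idea of modelling lag as "records not yet shipped" are all illustrative assumptions, not a description of any particular database.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str  # hypothetical region name
    log: list[str] = field(default_factory=list)  # ordered committed writes


@dataclass
class ActivePassivePair:
    primary: Region
    standby: Region
    replicated_upto: int = 0  # how many primary records the standby already has

    def write(self, record: str) -> None:
        # Async replication: the primary acknowledges on its own.
        self.primary.log.append(record)

    def replicate(self, max_records: int) -> None:
        # Ship at most max_records of the backlog; the remainder is the lag.
        end = min(self.replicated_upto + max_records, len(self.primary.log))
        self.standby.log.extend(self.primary.log[self.replicated_upto:end])
        self.replicated_upto = end


def fail_over(pair: ActivePassivePair) -> tuple[Region, list[str]]:
    """Promote the standby and report the writes that never reached it."""
    lost = pair.primary.log[pair.replicated_upto:]
    return pair.standby, lost


pair = ActivePassivePair(Region("eu-west"), Region("us-east"))
for i in range(5):
    pair.write(f"order-{i}")
pair.replicate(max_records=3)          # two records are still in flight
new_primary, lost = fail_over(pair)    # lost holds the RPO window
```

The writes in `lost` were acknowledged to clients but never reached the standby, which is exactly what the RPO measures.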

CRDT-based active-active

Both regions write; merges are mathematically conflict-free for specific data types.

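Counters, sets, and maps with LWW values can be CRDTs. Collaborative editing uses richer CRDTs or CRDT-inspired designs (Automerge is a CRDT library; Figma describes its multiplayer approach as CRDT-inspired). CRDTs are hard to apply to general-purpose data, and most business data doesn't fit cleanly. Redis Enterprise and Riak support them.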

Pros

+No write conflicts for CRDT-shaped data
+Partition-tolerant
+Elegant when it fits

Cons

−Only specific data shapes work
−Operational complexity
−Tombstones accumulate (GC is non-trivial)

Choose this variant when

Counters, sets, collaborative docs
Workloads where the data shape is naturally CRDT-y
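
As an illustration of why counters merge without conflicts, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and region names are hypothetical; production systems add persistence, decrement support (PN-counters), and garbage collection.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one slot per region, merged with element-wise max."""
    region: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a G-Counter can only grow")
        # Each region only ever touches its own slot.
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Max per slot is commutative, associative, and idempotent,
        # so replicas converge regardless of delivery order or duplicates.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


eu = GCounter("eu")
us = GCounter("us")
eu.increment(3)      # concurrent writes in two regions
us.increment(2)
eu.merge(us)
us.merge(eu)
eu.merge(us)         # re-delivering the same state changes nothing
```

Both replicas end up at the same total because neither region ever overwrites the other's slot. That property is what "CRDT-shaped" means, and it is also why arbitrary business rows usually do not qualify.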

Worked example

Design: messaging app with 100M users, global audience.

Posture: active-active, partitioned by user_id → home region.

Routing:

GeoDNS returns nearest front-door to user.
Front-door looks up user_id → home_region in a cached table; forwards to home region.
Cross-region redirects happen <1% of the time (user travelling); acceptable.
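
The routing steps above can be sketched roughly as follows. The table contents, region names, and the rule that an unknown user is homed where they first land are assumptions made for the example; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteDecision:
    target_region: str   # region that must handle the write
    cross_region: bool   # True when the front-door has to forward


def route(user_id: str, front_door_region: str,
          home_table: dict[str, str]) -> RouteDecision:
    # Assumption: a user with no entry is homed in the region they land in.
    home = home_table.get(user_id, front_door_region)
    return RouteDecision(target_region=home,
                         cross_region=(home != front_door_region))


# Hypothetical cached user_id -> home_region table.
home_table = {"alice": "eu-west", "bob": "us-east"}

at_home = route("alice", "eu-west", home_table)      # served locally
travelling = route("alice", "us-east", home_table)   # forwarded to eu-west
```

Only the `travelling` case pays the cross-region round trip, which is why keeping that share small matters.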

Write path: user's message goes to their home region's primary DB. Acknowledged locally (~5 ms).

Replication: home region async-replicates to other regions (Kafka cross-region). Other regions use the replicated copy for read-only (if a recipient in another region reads the sender's message). Replication lag typically 100–500 ms.

Failover: if a region fails, its users' home region is temporarily re-pointed to a peer region via a control-plane update. Reads continue from the peer's replicated copy; writes are buffered and flushed when the home region comes back. Failover is a runbook event, practised quarterly.
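
A rough sketch of that failover behaviour, under the assumption that the control plane holds a serving map and an in-memory buffer per home region. All names here are hypothetical, and a real system would need the buffer to be durable.

```python
from collections import defaultdict, deque


class ControlPlane:
    def __init__(self, regions: list[str]) -> None:
        # home region -> region currently serving its users
        self.serving = {r: r for r in regions}
        # home region -> writes waiting for the home region to return
        self.buffers: dict[str, deque[str]] = defaultdict(deque)
        # home region -> committed writes (stand-in for the primary DB)
        self.stores: dict[str, list[str]] = {r: [] for r in regions}

    def fail(self, region: str, peer: str) -> None:
        # The control-plane update: re-point the failed region's users.
        self.serving[region] = peer

    def write(self, home: str, record: str) -> str:
        if self.serving[home] == home:
            self.stores[home].append(record)
            return "committed"
        self.buffers[home].append(record)   # held until home recovers
        return "buffered"

    def recover(self, region: str) -> int:
        # Point users back home and flush the buffer in arrival order.
        self.serving[region] = region
        flushed = 0
        while self.buffers[region]:
            self.stores[region].append(self.buffers[region].popleft())
            flushed += 1
        return flushed


cp = ControlPlane(["eu-west", "us-east"])
before = cp.write("eu-west", "m1")
cp.fail("eu-west", peer="us-east")
during = cp.write("eu-west", "m2")
flushed = cp.recover("eu-west")
```

Because the home region remains the only writer of its users' data even after the flush, the no-conflict property of the partitioned design is preserved.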

Conflict resolution: not needed — only one region writes a given user's data.

Cost: ~3× single-region cost (3 regions, each with enough capacity to absorb a peer's traffic).

Key honesty: this is a significant operational commitment. Team needs 24/7 SRE, runbooks, cross-region observability, failover drills. Not a decision taken lightly.

Good vs bad answer

Interviewer probe

“Design this for 3-region active-active.”

Weak answer

"Deploy to 3 regions, use a multi-region DB, done."

Strong answer

"Partition users by home region — each user's writes go to one region only, so there are no write conflicts by construction. GeoDNS routes requests to the nearest front-door; front-door proxies cross-region if the user isn't at home (~1% of requests). Async replication to peer regions gives read fan-out. For DR, a failed region's users fail over to a peer via a control-plane update — tested quarterly. Alternative for shared data: CRDTs if the shape fits, last-write-wins only for low-value data because it silently loses writes. The honest cost: 3× infra + significant SRE muscle. If the business doesn't need regional failover, active-passive at half the complexity is the right call."

Why it wins: States partitioning strategy, routing, failover, alternatives, and the real cost — ops as much as infra.

Interview playbook: 3–4 min when global reach or regional failover is a stated requirement

When it comes up

Global user base and "serve everyone with low latency"
A requirement to survive a full region outage
Data-residency / sovereignty constraints (EU data stays in EU)
The interviewer says "make it multi-region"

Order of reveal

1. State the posture. Single region, active-passive, or active-active — and the RTO/RPO each implies. I do not say "multi-region" without picking one.
2. Conflict strategy for AA. Active-active needs conflict resolution. Partitioning users by home region removes conflicts by construction — my default.
3. Keep writes local. A user’s writes go to their home region; cross-region replication is async for read fan-out.
4. Routing. GeoDNS for nearest-region routing in most cases; anycast / Global Accelerator when failover must be near-instant.
5. Cost honesty. Active-active is roughly 2–3× infra plus real SRE muscle — if the business only needs DR, active-passive at half the complexity is the right call.

Signature phrases

“Active-active is a conflict-resolution design, not just two active sides.” — Shows you know where the real complexity lives.
“Partition users by home region and write conflicts vanish by construction.” — Names the cleanest active-active pattern.
“Last-write-wins silently loses data — never for important writes.” — Flags the trap interviewers probe on.
“Cross-region RTT is 80–100 ms; I design writes to stay local.” — Demonstrates you respect the speed of light.

Likely follow-ups

“Your cloud DB advertises multi-master active-active. Are conflicts solved?”

No. The DB handles replication and low-level conflict detection (timestamps, vector clocks), but not your business policy for what happens when two regions update the same row. You still choose: partition so it cannot happen, last-write-wins (accepting silent loss), CRDTs (if the data shape fits), or an application merge. The DB’s default resolution is almost never what the product actually wants.
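
To show what "conflict detection" means and where it stops, here is a minimal vector-clock comparison. It is a generic sketch, not the mechanism of any specific database; the region keys are hypothetical.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Compare two vector clocks keyed by region name."""
    regions = a.keys() | b.keys()
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in regions)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in regions)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened before b: b safely supersedes a
    if b_le_a:
        return "after"        # b happened before a
    return "concurrent"       # neither saw the other: a genuine conflict


ordered = compare({"eu": 1}, {"eu": 2})
conflict = compare({"eu": 2, "us": 1}, {"eu": 1, "us": 2})
```

The `"concurrent"` result only tells you that a conflict exists. Which version to keep, or how to merge them, is the policy decision the database cannot make for you.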

“Why not just use last-write-wins for everything?”

Because clocks drift and the loss is silent. Region A stamps its write 12:00:00.100 and region B stamps a different write 12:00:00.099. A wins by timestamp, even if B's write actually happened later on a clock that runs slow, and B's write vanishes with no error. The user watches their last edit disappear. That is acceptable for a cache or ephemeral counter, never for messages, orders, or anything auditable. Those need partitioning or CRDTs.
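
The failure mode can be reproduced in a few lines. This is a minimal sketch of a last-write-wins merge; the 5 ms clock skew, the field names, and the tie-break on region name are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stamped:
    value: str
    timestamp_ms: int   # wall-clock time as reported by the writing region
    region: str


def lww_merge(a: Stamped, b: Stamped) -> Stamped:
    # Highest timestamp wins; region name breaks ties deterministically.
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))


# Real order of events: A writes first, B writes 4 ms later.
# Assumption: region B's clock runs 5 ms slow, so its stamp looks older.
first = Stamped("draft v1", timestamp_ms=100, region="A")
later = Stamped("draft v2", timestamp_ms=99, region="B")

winner = lww_merge(first, later)
```

`winner` is the older edit. The merge raises no error and writes no log line, and the user's most recent change is gone.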

“Your active-passive RTO is 10 minutes. How much data can you lose?”

That is the RPO, which is your replication lag, separate from the RTO. With async replication you can lose seconds to tens of seconds of in-flight writes; semi-sync narrows it to milliseconds; only synchronous cross-region replication gives RPO zero — at a steep latency cost. RTO (10 min) is time-to-switch-over; RPO is how-much-was-in-flight. Both are non-zero unless you pay for synchronous replication.
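
A quick back-of-the-envelope calculation helps keep the two numbers apart. The write rate and lag below are made-up inputs, and the formula assumes a steady write rate, which real traffic rarely has.

```python
def writes_at_risk(write_rate_per_s: float, replication_lag_s: float) -> float:
    """Approximate acknowledged writes not yet replicated at failure time (RPO)."""
    return write_rate_per_s * replication_lag_s


def writes_rejected(write_rate_per_s: float, rto_s: float) -> float:
    """Approximate writes that cannot be accepted while failing over (RTO)."""
    return write_rate_per_s * rto_s


# Hypothetical workload: 2,000 writes/s, 5 s async lag, 10 min RTO.
lost = writes_at_risk(2_000, 5)            # already acknowledged, then lost
rejected = writes_rejected(2_000, 600)     # never accepted in the first place
```

The rejected writes are visible failures that clients can retry. The lost writes were acknowledged and then disappeared, so the much smaller RPO number is often the more damaging of the two.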

Common mistakes

Active-active with LWW for important data

Two regions write the same row; the loser's write is silently dropped. Fine for ephemeral cache; unacceptable for payments, messages, anything you'd miss. Use partitioning or CRDTs.

Untested failover

Active-passive DR that's never been exercised doesn't work — configs drift, replication breaks, credentials expire. Fail over quarterly, in production, as a practice.

Synchronous cross-region calls on the hot path

A 100 ms trans-Atlantic RTT on every login kills the user experience. Design to keep writes and primary reads in-region.

Assuming the cloud DB solves it (Advanced)

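Managed multi-region databases handle replication; they do NOT decide conflict-resolution policy for your business semantics. DynamoDB Global Tables, for example, resolves concurrent writes to the same item with last-writer-wins by default, and Aurora Global Database sidesteps conflicts by having a single writer region. You still have to decide whether that behaviour is what the product needs.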

Practice drills

Your cloud DB advertises "multi-master active-active". Are your conflict problems solved?

No. The DB solves replication and low-level conflict detection (timestamps, vector clocks). Your business still has to decide what happens when two regions update the same row. The product needs a policy — partition, LWW, merge, whatever — and the DB's default is almost never what you want.

Interviewer: "Why not just LWW for all active-active writes?"

Because clocks lie and losses are silent. Region A stamps its commit 12:00:00.100; region B stamps a commit with slightly different data 12:00:00.099, even though it actually happened later on a slow clock. A "wins" and B is silently dropped. The user sees their last edit disappear. For caches and unimportant state LWW is fine. For messages, orders, anything auditable: no.

Active-passive RTO is 10 minutes. How much data might you lose?

RPO — replication lag. Typically seconds to tens of seconds if async; milliseconds if semi-sync; zero if sync (but sync across regions is slow). The 10-min RTO is time-to-switch; RPO is how-much-in-flight-data-was-lost. Often these are confused — both are non-zero unless you pay for sync cross-region replication.

Cheat sheet

•Default: single region. Add regions only when DR or latency requires it.
•Active-passive is the simpler DR posture. Test failover.
•Active-active partitioned by user is the "no-conflict" AA pattern.
•LWW silently loses data. Don't use for important writes.
•CRDTs when the data shape fits (counters, sets, collab docs).
•Cross-region RTT is 80–100+ ms. Design writes to stay local.
•Operational cost exceeds infra cost. Assess team readiness honestly.
