Weak on reliability
Every senior round pressure-tests "what fails". This path builds a per-component failure vocabulary and the HA playbook to match.
For: Engineers whose feedback cites "hand-wavy on failure" or "no DR story"
After this path
Name a failure mode and a mitigation for each component, plus an availability target and a topology that meets it.
- 1 Skill
Failure mode analysis
What fails, blast radius, graceful degradation, retries, circuit breakers.
Why this, here: The framework. "What if this dies?" per component.
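To make "circuit breaker" and "graceful degradation" concrete, here is a minimal sketch, not a production implementation. The class name `CircuitBreaker`, its parameters (`max_failures`, `reset_after`, `clock`) and the `fallback` callable are all assumptions made for illustration. It shows the usual three states: closed (calls pass through), open (fail fast to the fallback), and half-open (one trial call after a cool-down).

```python
import time


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures,
    then allows one trial call once the cool-down has elapsed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the behaviour is testable
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()   # open: fail fast and degrade gracefully
            # cool-down elapsed: half-open, let this one call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In an interview answer the code matters less than the sentence it backs up: "when the recommendations service dies, the breaker opens and the page renders with cached results instead of timing out."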
- 2 Skill
Replication & durability
Leader/follower, sync vs async replication, write quorum, RPO/RTO.
Why this, here: Quorum math — what actually survives N failures.
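The quorum math is short enough to write down. The sketch below assumes a majority-quorum system with `n` replicas. The function names are illustrative and not taken from any library.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return n - majority(n)


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    return r + w > n


for n in (3, 4, 5):
    print(n, "replicas: majority", majority(n), "tolerates", tolerated_failures(n))
```

Note that 4 replicas tolerate the same single failure as 3, which is why odd cluster sizes are the usual choice.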
- 3 Deep dive
Consensus and leader election
Raft-style coordination, leases, fencing tokens, quorum, and exactly-one leadership.
Why this, here: When exactly-one-of-N must do the thing. Raft / etcd, not DIY.
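Fencing tokens are easier to explain with a toy model. The sketch below is an in-memory illustration only: `LockService` stands in for a real coordination service (such as etcd) that hands out monotonically increasing tokens, and `FencedStore` stands in for a storage layer that checks them. Both class names are hypothetical.

```python
class LockService:
    """Hypothetical stand-in for a coordination service: each grant of the
    lease returns a strictly larger token than the one before."""

    def __init__(self):
        self._token = 0

    def acquire(self) -> int:
        self._token += 1
        return self._token


class FencedStore:
    """Hypothetical storage layer that rejects writes carrying a stale token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value) -> bool:
        if token < self.highest_token:
            return False            # a newer leader has written: fence this one off
        self.highest_token = token
        self.value = value
        return True


locks, store = LockService(), FencedStore()
token_a = locks.acquire()           # A becomes leader, then stalls (e.g. a long GC pause)
token_b = locks.acquire()           # A's lease expires; B becomes leader
store.write(token_b, "from B")      # accepted
store.write(token_a, "from A")      # rejected: A still believes it is leader
```

The lease alone does not prevent the stale write, because A cannot know its lease expired while it was paused. The check at the storage layer is what enforces exactly-one.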
Checkpoint
Rehearsal: for a system that needs exactly-one-of-N to do a scheduled job, which primitive do you reach for and why? If your answer is “I’ll elect a leader somehow”, re-read — the name is the point.
- 4 Pattern
High Availability
Redundancy + graceful degradation + operational discipline. You don't buy 99.99% — you earn it.
Why this, here: Redundancy + degradation + discipline — the three pillars.
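A quick way to see why 99.99% has to be earned is to run the arithmetic. The sketch below assumes component failures are independent, which real systems often violate (shared deploys, shared dependencies), so treat the parallel figure as an upper bound. Function names are illustrative.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600; ignores leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def in_series(*parts: float) -> float:
    """Availability when every component must be up (a request path)."""
    return prod(parts)


def in_parallel(*replicas: float) -> float:
    """Availability when any one replica suffices, assuming independent failures."""
    return 1.0 - prod(1.0 - a for a in replicas)


print(round(downtime_minutes_per_year(0.9999), 2))   # about 52.56 minutes a year
print(round(in_series(0.999, 0.999, 0.999), 6))      # three 99.9% hops in a row
print(round(in_parallel(0.999, 0.999), 6))           # two 99.9% replicas side by side
```

Three 99.9% dependencies in series land below 99.9% overall, so adding hops to the request path spends availability, and redundancy is how you buy it back.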
- 5 Pattern
Multi-region active-passive / active-active
Geographic distribution for latency, DR, and compliance. Active-passive is operationally sane; active-active is a conflict-resolution project.
Why this, here: Active-passive vs active-active. RTO/RPO framing.
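One way to frame RTO and RPO for an active-passive setup is as a sum of steps you can name. The sketch below is a back-of-envelope model under stated assumptions: asynchronous replication to the standby, failover made of detect, promote, and reroute phases, and made-up example numbers. The `FailoverPlan` class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverPlan:
    detect_s: float           # time for health checks to declare the region dead
    promote_s: float          # time to promote the standby to primary
    reroute_s: float          # time for traffic to shift (for example, a DNS TTL)
    replication_lag_s: float  # async replication lag at the moment of failure

    def rto_s(self) -> float:
        """Recovery time: how long users see the outage."""
        return self.detect_s + self.promote_s + self.reroute_s

    def rpo_s(self) -> float:
        """Recovery point: writes not yet replicated are lost."""
        return self.replication_lag_s


# Illustrative numbers only, not measurements.
plan = FailoverPlan(detect_s=30, promote_s=60, reroute_s=60, replication_lag_s=5)
print("RTO seconds:", plan.rto_s())
print("RPO seconds:", plan.rpo_s())
```

Naming each term is what turns "we fail over" into a DR story: an interviewer can then ask which term dominates and how you would shrink it.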
Checkpoint
Pick one: active-active or active-passive for a global URL shortener. Name the RTO, the RPO, and the one thing that breaks first when a region dies.
- 6 Skill
Observability & operations
Metrics, logs, traces, SLOs, and alerting on symptoms, not causes.
Why this, here: You can't fix what you can't see. SLOs, error budgets, the four golden signals.
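Error budgets are just the SLO read backwards. The sketch below assumes a request-based SLO measured over a rolling window of 30 days. The function names and the example error ratio are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


def days_until_exhausted(observed_error_ratio: float, slo: float,
                         window_days: int = 30) -> float:
    """Days until the budget is gone if the current error ratio persists."""
    rate = burn_rate(observed_error_ratio, slo)
    return float("inf") if rate == 0 else window_days / rate


print(round(error_budget_minutes(0.999), 1))        # budget for a 99.9% SLO
print(round(burn_rate(0.01, 0.999), 1))             # 1% of requests failing
print(round(days_until_exhausted(0.01, 0.999), 1))  # time left at that rate
```

Alerting on burn rate is one way to alert on symptoms: it pages when users are being hurt fast enough to matter, regardless of which component caused it.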