Design a Distributed Job Scheduler - SystemRound

Hard · System design · 45 min · 5 stages · Pro

Asked at: Amazon, Google, Uber, Airbnb, Stripe, Cloudflare, Datadog

Design a distributed job scheduler (cron-as-a-service). Users submit one-time and recurring jobs that must fire close to their scheduled time and execute via a worker fleet — without being lost, without silently double-running, and surviving worker and scheduler crashes. The hard parts are finding which jobs are due efficiently at scale and guaranteeing once-ish execution despite failures and round-time spikes when millions of jobs are all scheduled for the top of the hour. Scale: 100M+ jobs, ~10K submissions/sec.

Best after a few full reps. Expect follow-up questions, edge cases, and deeper trade-off discussion.

What this problem tests

Submit & Schedule Spec · Cancel / Status · Idempotent Submit · Core Components · Scheduler → Queue → Workers · Exactly-One Owner

Round shape: 5 stages

Time budget: 45 min

Feedback loop: Grade anytime

Guided practice · Primary loop

Workspace-first, hints visible, stage retry available. The cheap, repeatable loop — build the answer shape before you take it under pressure.

Stage-by-stage workspace instead of a blank page.
Grade one stage or the whole answer whenever you want.
Compare your reasoning against reference criteria and model answers.

Solve once, compare against the checklist, then come back to the weak stage instead of starting over.

Mock interview · Pressure test

Strict timer, hints hidden, debrief deferred to the end. Use this once you can already structure a clean answer and want to pressure-test pacing and pushback.

Best once the answer shape is already in your head.
Pressure-test pacing, pushback handling, and communication.
Use diagnosis after the interview for exact misses and next study steps.

Best after one structured rep · timed · focused on pacing and communication.

Requirements

This is the framing pass. A strong answer quickly defines what the system must do, what quality bar it has to hit, and the numbers that will justify the rest of the design.

First 5 min of the round

What must exist

Functional Requirements

1. Submit a one-time job (run once at a future timestamp)

2. Submit a recurring cron job that reschedules its next occurrence after firing

3. Trigger each job at its scheduled time and execute it via workers

4. Retry failed executions with backoff; dead-letter after a limit

5. Cancel/update a pending job

6. Below the line (out of scope): DAG workflows, arbitrary in-process code, priority tiers
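
Requirement 4 (retry with backoff, dead-letter after a limit) can be sketched as a small decision function. This is an illustration under stated assumptions, not the reference answer: the names `MAX_ATTEMPTS`, `BASE_DELAY_S`, `MAX_DELAY_S`, `backoff_delay`, and `next_action` are hypothetical, and the policy values are placeholders. It uses exponential backoff with full jitter, which is one common choice among several.

```python
import random
from typing import Tuple

# Hypothetical policy values; a real system would tune these per job type.
MAX_ATTEMPTS = 5
BASE_DELAY_S = 2.0
MAX_DELAY_S = 300.0


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Delay before retrying after failure number `attempt` (1-based).

    The ceiling doubles with each attempt and is capped at MAX_DELAY_S;
    the actual delay is drawn uniformly below the ceiling (full jitter).
    """
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))
    return rng.uniform(0, ceiling)


def next_action(failed_attempts: int, rng: random.Random) -> Tuple[str, float]:
    """After a failure, either schedule a retry or send the run to the dead-letter queue."""
    if failed_attempts >= MAX_ATTEMPTS:
        return ("dead_letter", 0.0)
    return ("retry", backoff_delay(failed_attempts, rng))
```

Randomizing the delay keeps a batch of jobs that failed together from retrying in lockstep against the same struggling dependency.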

What good looks like

Non-Functional Requirements

1. At-least-once execution: a job must never be silently lost

2. Exactly-once is impractical end to end, so handlers must be idempotent

3. Fire within a few seconds of the scheduled time (p99 < ~5s late)

4. A single scheduled run has exactly one owner at a time: no concurrent double-run
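
One way to picture the "exactly one owner" requirement is a lease that a worker must claim before executing a run. The sketch below is an in-memory stand-in: `LeaseTable`, `try_claim`, and the run-id format are hypothetical names, and the `threading.Lock` plays the role of whatever atomic conditional write the real store provides.

```python
import threading
from typing import Dict, Tuple


class LeaseTable:
    """In-memory stand-in for a store that supports atomic conditional writes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # run_id -> (owner, lease_expiry_epoch_seconds)
        self._leases: Dict[str, Tuple[str, float]] = {}

    def try_claim(self, run_id: str, worker: str, now: float, ttl: float) -> bool:
        """Claim run_id only if it is unowned or its previous lease has expired."""
        with self._lock:
            current = self._leases.get(run_id)
            if current is not None and current[1] > now:
                return False  # another worker still holds a live lease
            self._leases[run_id] = (worker, now + ttl)
            return True
```

Lease expiry is what lets a crashed worker's run be picked up again, but it also means a slow (not dead) worker can overlap with its replacement. That residual risk is why the requirements pair single ownership with idempotent handlers.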

Numbers to anchor the design

Scale Estimation

1. 100M+ scheduled jobs; ~10K submissions/sec

2. Due rate: tens of thousands/sec on average, but with a huge spike at round times (top of the hour, midnight), i.e. a thundering herd

3. Storage: 100M × ~1 KB ≈ 100 GB (modest); the due index is the hot path, not raw storage
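
The storage estimate can be checked with a few lines of arithmetic. The only assumption beyond the numbers above is treating 1 KB as 1,000 bytes; replication and index overhead are not included.

```python
JOBS = 100_000_000       # 100M scheduled jobs, from the estimate above
BYTES_PER_JOB = 1_000    # ~1 KB per job record (decimal KB assumed)

total_bytes = JOBS * BYTES_PER_JOB
total_gb = total_bytes / 1e9

print(f"{total_gb:.0f} GB")  # prints "100 GB"
```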

How the round unfolds

Each stage has a distinct job. Treat them like separate deliverables instead of one giant answer, and the round becomes much easier to navigate.

4 design stages · 40 pts after framing

🔌

Stage 2 · ~5 min · 10 pts

API Design

Define the contract clearly: the endpoints, auth boundary, error semantics, and the one or two decisions that matter most.

What you should produce

Define the interface. How does a user submit a job (with its schedule), cancel it, and check status? What stops a retried submit from creating dupl...

Strong answers cover

Submit & Schedule Spec · Cancel / Status · Idempotent Submit
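
The three API topics above (submit, cancel/status, idempotent submit) can be illustrated with a toy in-memory service. Everything here is hypothetical: the class `JobApi`, its method names, the `tenant` and `idempotency_key` parameters, and the status strings are assumptions made for the sketch, not a prescribed contract. The point it shows is that a client-supplied idempotency key lets a retried submit return the original job instead of creating a duplicate.

```python
import uuid
from typing import Dict, Tuple


class JobApi:
    """Toy in-memory API; a real service would persist both maps durably."""

    def __init__(self) -> None:
        self._jobs: Dict[str, dict] = {}
        # (tenant, idempotency_key) -> job_id
        self._keys: Dict[Tuple[str, str], str] = {}

    def submit(self, tenant: str, idempotency_key: str, schedule: str, target: str) -> str:
        """Create a job, or return the existing job id for a repeated key."""
        existing = self._keys.get((tenant, idempotency_key))
        if existing is not None:
            return existing  # retried request: no duplicate job is created
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {"schedule": schedule, "target": target, "status": "pending"}
        self._keys[(tenant, idempotency_key)] = job_id
        return job_id

    def cancel(self, job_id: str) -> bool:
        """Cancel a job only while it is still pending."""
        job = self._jobs.get(job_id)
        if job is None or job["status"] != "pending":
            return False
        job["status"] = "cancelled"
        return True

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]["status"]
```

In a durable implementation the key lookup and the job insert would need to happen atomically (for example via a unique constraint), otherwise two concurrent retries could both pass the check.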

🏗️

Stage 3 · ~10 min · 10 pts

High-Level Architecture

Lay out the main components and trace the write path, read path, and any async path cleanly.

What you should walk through

Walk me through the architecture.

Strong answers cover

Core Components · Scheduler → Queue → Workers · Exactly-One Owner

💾

Stage 4 · ~10 min · 10 pts

Data Model & Storage

Pick the store, show the schema or key model, and explain why that storage choice fits the access pattern.

What you should lock in

Get concrete about storage.

Strong answers cover

Store & Schema · Due Query (time index) · Recurring Jobs
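
To make the "due query (time index)" idea concrete, here is a minimal sketch using SQLite from the Python standard library. The table name, column names, status values, and the `due_jobs` helper are all hypothetical, and SQLite is only a convenient stand-in; the page does not say which store the reference answer uses. The sketch shows an index ordered by next run time, so that finding due jobs is a bounded range read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jobs (
        job_id      TEXT PRIMARY KEY,
        next_run_at INTEGER NOT NULL,   -- epoch seconds of the next occurrence
        status      TEXT NOT NULL,      -- e.g. 'pending' or 'claimed'
        payload     TEXT
    )
    """
)
# Index ordered by time so the due scan is a range read, not a full table scan.
conn.execute("CREATE INDEX idx_due ON jobs (status, next_run_at)")


def due_jobs(conn: sqlite3.Connection, now: int, limit: int = 100) -> list:
    """Return up to `limit` pending job ids whose next_run_at has passed."""
    rows = conn.execute(
        "SELECT job_id FROM jobs "
        "WHERE status = 'pending' AND next_run_at <= ? "
        "ORDER BY next_run_at LIMIT ?",
        (now, limit),
    ).fetchall()
    return [r[0] for r in rows]
```

At 100M+ jobs a single index like this would likely need to be partitioned (for example by time bucket or by shard), which is a scaling question rather than a schema one.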

📈

Stage 5 · ~15 min · 10 pts

Scaling & Deep Dive

Name the first bottleneck, failure modes, and the trade-offs that keep the system fast and reliable under pressure.

What you should pressure-test

Now the deep dive. Two crux topics: the round-time thundering herd, and guaranteeing once-ish execution despite crashes. Then cover scheduler failo...

Strong answers cover

Thundering Herd · Once-ish Execution · Failover, Retries & Clocks
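
One commonly used mitigation for the round-time thundering herd is to spread jobs across a short window with a stable per-job offset. The sketch below is an assumption about how that could look, not the page's reference answer; `jitter_offset_s`, `spread_fire_time`, and the 5-second default window are hypothetical, with the window chosen to stay inside the "few seconds late" target from the requirements.

```python
import hashlib


def jitter_offset_s(job_id: str, window_s: int) -> int:
    """Stable per-job offset in [0, window_s), derived from a hash of the job id."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_s


def spread_fire_time(scheduled_at: int, job_id: str, window_s: int = 5) -> int:
    """Shift a scheduled time by the job's stable offset to flatten round-time spikes."""
    return scheduled_at + jitter_offset_s(job_id, window_s)
```

Because the offset is derived from the job id rather than drawn at random, the same job always lands at the same point in the window, which keeps recurring jobs predictable. The trade-off is that the jitter spends part of the lateness budget, so it is usually combined with other measures such as enqueueing slightly ahead of time.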

What a strong first rep looks like

Scope clearly

Translate the prompt into concrete requirements, scale, and trade-offs before drawing architecture.

Stay stage-specific

Give APIs in the API stage, data models in the storage stage, and failure modes in scaling. Don't blur them together.

Iterate fast

Grade early, compare to the reference reasoning criteria, fix the biggest misses, and re-submit the weak stage instead of starting over.