System Design Question

Design a Distributed Job Scheduler

Functional Requirements:

  • Users can schedule jobs to run at a specific time (one-time jobs).
  • Users can schedule recurring jobs (e.g., using cron expressions).
  • Users can view the status of their scheduled jobs (pending, running, completed, failed).
  • Users can cancel a scheduled job before it starts.
  • The system should support basic job dependencies (e.g., job B runs after job A completes).
  • Each job should carry a payload that describes the work to run (e.g., a command to execute, a script, or an HTTP callback).
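
The functional requirements above can be read as a small data model plus a few state rules. The following is a minimal sketch under stated assumptions, not a reference design: the names `Job`, `JobStatus`, `cancel`, and `is_runnable` are hypothetical, the `CANCELLED` state is an addition that the status list above does not mention, and the in-memory `jobs` dictionary stands in for whatever store a real design would use.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, List, Optional


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"  # assumption: not in the listed statuses


@dataclass
class Job:
    job_id: str
    owner: str                      # the authenticated user who scheduled the job
    run_at: datetime                # next (or only) scheduled run time
    payload: dict                   # e.g. a command, a script, or an HTTP callback
    cron: Optional[str] = None      # set only for recurring jobs
    depends_on: List[str] = field(default_factory=list)  # upstream job ids
    status: JobStatus = JobStatus.PENDING


def cancel(job: Job) -> bool:
    """Cancel a job only if it has not started; return True on success."""
    if job.status is not JobStatus.PENDING:
        return False
    job.status = JobStatus.CANCELLED
    return True


def is_runnable(job: Job, jobs: Dict[str, Job], now: datetime) -> bool:
    """A job may start when it is pending, due, and all dependencies completed."""
    if job.status is not JobStatus.PENDING or job.run_at > now:
        return False
    return all(jobs[dep].status is JobStatus.COMPLETED for dep in job.depends_on)
```

Keeping the dependency check inside `is_runnable` means "job B runs after job A completes" needs no separate trigger mechanism in this sketch: B simply stays pending until A reaches `COMPLETED`.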

Non-Functional Requirements:

  • The system should be highly available and tolerate failures of individual nodes.
  • The scheduler should be horizontally scalable to handle increasing job volume.
  • Job execution should be reliable, with retries on transient failures.
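  • The system should guarantee at-least-once execution: every scheduled job runs at least once, which means a job may occasionally run more than once and job handlers should tolerate duplicates.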
  • Job scheduling and execution should have low latency (jobs run close to their scheduled time).
  • The system should persist job data for audit and debugging purposes.
  • Basic security: only authenticated users can schedule jobs, and users can manage only their own jobs.
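
Two of the reliability requirements above, retries on transient failures and at-least-once execution, interact: a retried or redelivered job can run twice. The sketch below illustrates one common way to handle that, assuming exponential backoff for retries and an execution id for deduplication. `TransientError`, `run_with_retries`, and `IdempotentExecutor` are hypothetical names, and the in-memory dictionary is only a stand-in for a durable store.

```python
import time
from typing import Any, Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def run_with_retries(
    task: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run task, retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted; the caller marks the job as failed
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... base_delay


class IdempotentExecutor:
    """Skips work already completed for a given execution id."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}  # assumption: durable in a real system

    def execute(self, execution_id: str, task: Callable[[], Any]) -> Any:
        if execution_id in self._results:
            return self._results[execution_id]  # duplicate delivery: no re-run
        result = task()
        self._results[execution_id] = result
        return result
```

Injecting `sleep` keeps the backoff logic testable without real delays. Only `TransientError` is retried here, so permanent failures surface immediately instead of consuming retry attempts.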
