Job Priorities and Pre-emption

Job priorities let teams enqueue development workloads while allowing mission-critical, high-priority workloads to preempt lower-priority ones and take over their resources. This page explains how to use job priorities with Kueue and SkyPilot.

This tutorial requires that you install:

  • Trainy skypilot: pip install "trainy-skypilot-nightly[kubernetes]"

  • kubectl

Example: Using SkyPilot with Kueue Priorities

Assuming your cluster administrator has provisioned GPU instances and given you quota within your cluster, you can request GPUs by specifying the following labels:

  • Workload queue: kueue.x-k8s.io/queue-name: user-queue

  • Workload priority: kueue.x-k8s.io/priority-class: low-priority
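The priority class names used here are not built in. They refer to WorkloadPriorityClass objects that a cluster administrator creates in Kueue. As a rough sketch of what those definitions might look like, assuming the kueue.x-k8s.io/v1beta1 API (the file name and the numeric values are hypothetical; a higher value means a higher priority):

```yaml
# priorities.yaml (hypothetical, created by the cluster admin)
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: low-priority # referenced by the kueue.x-k8s.io/priority-class label
value: 100 # assumed value, lower than high-priority
description: "development workloads that may be preempted"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 10000 # assumed value, higher than low-priority
description: "mission critical workloads"
```

If the names on your cluster differ, use whichever queue and priority class names your administrator gives you.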

Let’s define a request for a single instance with four T4 GPUs (T4:4).

# low.yaml
resources:
    accelerators: T4:4
    labels:
        kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
        kueue.x-k8s.io/priority-class: low-priority # specify low priority workload

run: |
    echo "hi i'm a low priority job"
    sleep 1000000

Now you can launch the request:

# launch a low priority task
$ sky launch -y -d -c low low.yaml


# list workloads in kueue
$ kubectl get workloads
NAME                 QUEUE        RESERVED IN     ADMITTED   FINISHED   AGE
low-3ce1             user-queue                                         5m

While this workload is running, we can enqueue another, higher-priority task. If preempting lower-priority jobs would free enough room in the cluster for the higher-priority workload, Kueue will delete the lower-priority workloads and launch the higher-priority one instead.

# high.yaml
resources:
    accelerators: T4:4
    labels:
        kueue.x-k8s.io/queue-name: user-queue # this is the same as the queue above
        kueue.x-k8s.io/priority-class: high-priority # specify high-priority workload

run: |
    echo "hi i'm a high priority job"
    sleep 1000000

Now launch this request, this time with high priority:

# launch a high priority task
$ sky launch -y -d -c high high.yaml


# list workloads in kueue
$ kubectl get workloads
NAME                 QUEUE        RESERVED IN     ADMITTED   FINISHED   AGE
high-3ce1            user-queue                                         2m
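Whether the high priority workload actually evicts the low priority one depends on the preemption policy of the ClusterQueue behind user-queue, which your administrator configures. A minimal sketch of such a policy follows, assuming the kueue.x-k8s.io/v1beta1 API. The file name, queue name, flavor name, and quota are all hypothetical and will differ on your cluster:

```yaml
# cluster-queue.yaml (hypothetical, created by the cluster admin)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue # assumed name; user-queue would point at this queue
spec:
  namespaceSelector: {} # accept workloads from all namespaces
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor # assumed ResourceFlavor that must already exist
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 4 # assumed quota: room for one T4:4 task at a time
  preemption:
    withinClusterQueue: LowerPriority # allow evicting lower priority workloads
```

Under these assumptions the quota covers only one four-GPU task, so the low and high tasks cannot both be admitted and Kueue evicts low to make room for high.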

Tip

Pre-empted tasks are not requeued by default if you use sky launch. To have your jobs retried, we recommend using sky jobs launch instead, so that when a task is pre-empted, the SkyPilot job controller automatically resubmits it to Kueue without manual intervention.
