Quickstart#

This section is for ML engineers who need to run jobs on their k8s cluster. They will need a k8s admin to assign them a namespace and a queue for submitting jobs, as well as a kubeconfig file to place at ~/.kube/config to grant access to the cluster. Using SkyPilot is the easiest way to submit jobs. We maintain a fork of the original project to support Kueue for multi-node workloads, and we are currently working to get these changes upstreamed here
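
As a quick sanity check (a minimal sketch, assuming your admin's kubeconfig is already at ~/.kube/config and kubectl is installed), you can confirm that you can reach the cluster:

# verify the kubeconfig points at the intended cluster
$ kubectl config current-context

# verify you can talk to the cluster API
$ kubectl get nodes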

In managed Konduktor, we provision one queue by default

  • user-queue

with two priority classes

  • low-priority

  • high-priority
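
If you want to confirm the queue assigned to you is visible from your namespace, a hedged sketch (assuming the Kueue CRDs are installed on the cluster; my-namespace is a placeholder for the namespace your admin assigned):

# list Kueue local queues in your namespace
$ kubectl get localqueues -n my-namespace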

Setup#

# install trainy-skypilot
$ conda create -n konduktor python=3.10
$ conda activate konduktor
$ pip install "trainy-skypilot-nightly[kubernetes]"

# if training on Trainy managed platform
$ pip install trainy-policy-nightly

# check that k8s credentials work
$ sky check
Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled

and create the following in ~/.sky/config.yaml

# for training on Trainy managed platform
admin_policy: trainy.policy.GKEPolicy

kubernetes:
    autoscaler: gke  # for training on Trainy managed platform
    provision_timeout: 600 # how long to wait for job to be scheduled, set to -1 to allow waiting indefinitely, necessary for managed jobs
    remote_identity: SERVICE_ACCOUNT

Hello Konduktor#

To create a development environment, let’s first define our resource request as task.yaml

resources:
    image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
    accelerators: H100-MEGA-80GB:8
    cpus: 192+
    memory: 1000+
    cloud: kubernetes
    # (optional) use resource queues if cluster admin set them
    labels:
        kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
        kueue.x-k8s.io/priority-class: low-priority # either low-priority or high-priority; defaults to low-priority if omitted
        max-run-duration-seconds: "3000" # required to run

The kueue.x-k8s.io labels and the max-run-duration-seconds label are required to run if your cluster admin has created resource queues. To issue this request, run:

# create a request
$ sky launch -c dev task.yaml

# login to dev container
$ ssh dev

# list running clusters
$ sky status

# tear down the cluster once you are done using it
$ sky down dev
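
Before tearing the cluster down, you can also run one-off commands on it with sky exec instead of SSHing in (a small sketch; nvidia-smi is just an example command, and this assumes the dev cluster from above is still up):

# run a command on the dev cluster
$ sky exec dev 'nvidia-smi'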

Distributed Jobs#

To scale the job across multiple nodes, we just change task.yaml to specify num_nodes. We define the script each node runs using the setup and run sections.

resources:
    image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
    accelerators: H100-MEGA-80GB:8
    cpus: 192+
    memory: 1000+
    cloud: kubernetes
    # (optional) use resource queues if cluster admin set them
    labels:
        kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
        kueue.x-k8s.io/priority-class: high-priority # this will preempt low-priority jobs
        max-run-duration-seconds: "3000" # required to run

num_nodes: 2

setup: |
    git clone https://github.com/roanakb/pytorch-distributed-resnet
    cd pytorch-distributed-resnet
    mkdir -p data  && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    tar -xvzf cifar-10-python.tar.gz

run: |
    cd pytorch-distributed-resnet
    num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
    master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
    python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 20

and run with

# create a job that runs in the background
$ sky jobs launch -d -c distributed --detach-run task.yaml

# show the status of all existing jobs
$ sky jobs queue

# cancel a running or pending job
$ sky jobs cancel <JOB_ID>

This creates a managed job that runs in the background to completion.
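
Note that torch.distributed.launch is deprecated in newer PyTorch releases in favor of torchrun. As a minimal sketch (assuming the image's PyTorch ships torchrun and that resnet_ddp.py reads the LOCAL_RANK environment variable rather than a --local_rank argument), the run section could instead be written as:

run: |
    cd pytorch-distributed-resnet
    num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
    master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
    # torchrun sets LOCAL_RANK as an environment variable instead of
    # passing a --local_rank argument to the script
    torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 20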

For a more thorough explanation of all of SkyPilot's capabilities, please refer to the documentation and examples. Below is a series of links explaining some of SkyPilot's commonly used capabilities relevant to running batch/training jobs.

Warning

Using the managed jobs controller via sky jobs launch currently requires cloud access with object storage. Using the managed job controller with only Kubernetes credentials is still a work in progress.

SkyPilot Reference#