# Quickstart
This section is for ML engineers who need to run jobs on their Kubernetes cluster. A Kubernetes admin will need to assign them a namespace and queue for submitting jobs, as well as provide a kubeconfig file to place at `~/.kube/config` to grant access to the cluster. Using SkyPilot is the easiest way to submit jobs. We maintain a fork of the original project to support Kueue for multi-node workloads, and we are currently working to get this upstreamed here.
In managed Konduktor, we provision one queue by default, `user-queue`, with two priority classes: `low-priority` and `high-priority`.
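Under the hood, the queue and priority classes correspond to Kueue objects that the cluster admin provisions. A hypothetical sketch of what these might look like (the ClusterQueue name, namespace, and priority values below are placeholders; your cluster's actual setup may differ):

```yaml
# Hypothetical sketch of admin-provisioned Kueue objects; names match the
# defaults above, but the namespace and priority values are placeholders.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: my-namespace   # the namespace assigned to you by your admin
spec:
  clusterQueue: cluster-queue   # placeholder ClusterQueue name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: low-priority
value: 100    # higher value = higher priority
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 1000
```

Jobs submitted with `high-priority` can preempt queued or running `low-priority` workloads in the same ClusterQueue.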
## Setup
```shell
# install trainy-skypilot
$ conda create -n konduktor python=3.10
$ conda activate konduktor
$ pip install "trainy-skypilot-nightly[kubernetes]"

# if training on the Trainy managed platform
$ pip install trainy-policy-nightly

# check that your Kubernetes credentials work
$ sky check
Checking credentials to enable clouds for SkyPilot.
  Kubernetes: enabled
```
Then create the following in `~/.sky/config.yaml`:
```yaml
# for training on the Trainy managed platform
admin_policy: trainy.policy.GKEPolicy

kubernetes:
  autoscaler: gke         # for training on the Trainy managed platform
  provision_timeout: 600  # how long to wait for a job to be scheduled; set to -1
                          # to wait indefinitely, necessary for managed jobs
  remote_identity: SERVICE_ACCOUNT
```
## Hello Konduktor
To create a development environment, let's first define our resource request as `task.yaml`:
```yaml
resources:
  image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
  accelerators: H100-MEGA-80GB:8
  cpus: 192+
  memory: 1000+
  cloud: kubernetes
  # (optional) use resource queues if your cluster admin set them up
  labels:
    kueue.x-k8s.io/queue-name: user-queue        # assigned by your admin
    kueue.x-k8s.io/priority-class: low-priority  # either low-priority or high-priority;
                                                 # defaults to low-priority if omitted
    max-run-duration-seconds: "3000"             # required to run
```
The `kueue.x-k8s.io` and `max-run-duration-seconds` labels are required to run if your cluster admin created resource queues. To issue this request, run:
```shell
# create a request
$ sky launch -c dev task.yaml

# log in to the dev container
$ ssh dev

# list running clusters
$ sky status

# tear down the cluster once you are done using it
$ sky down dev
```
## Distributed Jobs
To scale the job up across multiple nodes, we just change `task.yaml` to specify `num_nodes`. We define the script for each node to run using the `setup` and `run` sections.
```yaml
resources:
  image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
  accelerators: H100-MEGA-80GB:8
  cpus: 192+
  memory: 1000+
  cloud: kubernetes
  # (optional) use resource queues if your cluster admin set them up
  labels:
    kueue.x-k8s.io/queue-name: user-queue         # assigned by your admin
    kueue.x-k8s.io/priority-class: high-priority  # this will preempt low-priority jobs
    max-run-duration-seconds: "3000"              # required to run

num_nodes: 2

setup: |
  git clone https://github.com/roanakb/pytorch-distributed-resnet
  cd pytorch-distributed-resnet
  mkdir -p data && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  tar -xvzf cifar-10-python.tar.gz

run: |
  cd pytorch-distributed-resnet
  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 20
```
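The `run` section derives the node count and master address from `SKYPILOT_NODE_IPS`, a newline-separated list of node IPs that SkyPilot injects into each node's environment. The shell logic above is equivalent to this small Python sketch (illustrative only, not part of the repo):

```python
def parse_node_ips(node_ips: str) -> tuple[int, str]:
    """Replicate the shell logic: `wc -l` gives the node count,
    `head -n1` gives the master address (the first node's IP)."""
    ips = [line for line in node_ips.strip().splitlines() if line]
    return len(ips), ips[0]

# Example: the value SkyPilot would set for a two-node job
num_nodes, master_addr = parse_node_ips("10.0.0.1\n10.0.0.2")
print(num_nodes, master_addr)  # 2 10.0.0.1
```

Every node runs the same `run` script; only `SKYPILOT_NODE_RANK` differs, which is how `torch.distributed.launch` tells the nodes apart.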
Then launch it with:
```shell
# create a job that runs in the background
$ sky jobs launch -d -c distributed --detach-run task.yaml

# show the status of all existing jobs
$ sky jobs queue

# cancel a running or pending job
$ sky jobs cancel <JOB_ID>
```
This creates a managed job that runs in the background to completion. For a more thorough explanation of all of SkyPilot's capabilities, please refer to the documentation and examples. Below are links explaining some of the commonly used SkyPilot capabilities relevant for running batch/training jobs.
> **Warning:** Using the managed jobs controller via `sky jobs launch` currently requires cloud access with object storage. Using the managed jobs controller with only Kubernetes credentials is still a work in progress.