Quickstart
This section is for ML engineers who need to run jobs on a Kubernetes (k8s) cluster. Before starting, ask your k8s admin to assign you a namespace and a queue for submitting jobs, and to provide a k8s config file to place at ~/.kube/config for access to the cluster. SkyPilot is the easiest way to submit jobs. We maintain a fork of the original project that adds Kueue support for multi-node workloads, and we are working to get this change upstreamed.
In managed Konduktor, we provision one queue by default, user-queue, with two priority classes: low-priority and high-priority.
Setup
# install trainy-skypilot in a fresh conda environment
$ conda create -n konduktor python=3.10
$ conda activate konduktor
$ pip install "trainy-skypilot-nightly[kubernetes]"
# check that k8s credentials work
$ sky check
Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled
Then create ~/.sky/config.yaml with the following contents:
kubernetes:
  remote_identity: SERVICE_ACCOUNT
  provision_timeout: -1
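One way to create this file from a shell is a heredoc. The following is a minimal sketch: write_sky_config is a hypothetical helper (not part of SkyPilot) that writes the configuration shown above to a path you pass in, defaulting to ~/.sky/config.yaml, and refuses to overwrite a file that already exists.

```shell
# Hypothetical helper, not part of SkyPilot: writes the config shown above.
# Usage: write_sky_config [target_path]
write_sky_config() {
  target="${1:-$HOME/.sky/config.yaml}"

  # Leave any existing config alone so local settings are not lost.
  if [ -e "$target" ]; then
    echo "refusing to overwrite existing $target" >&2
    return 1
  fi

  mkdir -p "$(dirname "$target")"

  # Quoted 'EOF' keeps the heredoc body literal (no variable expansion).
  cat > "$target" <<'EOF'
kubernetes:
  remote_identity: SERVICE_ACCOUNT
  provision_timeout: -1
EOF
  echo "wrote $target"
}
```

Calling write_sky_config with no arguments targets ~/.sky/config.yaml. If that file already exists, merge the kubernetes section into it by hand instead.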
Hello Konduktor
To create a development environment, first define the resource request in task.yaml:
resources:
  image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
  accelerators: T4:4
  cpus: 8+
  memory: 8+
  cloud: kubernetes
  # (optional) use resource queues if cluster admin set them
  labels:
    kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
    kueue.x-k8s.io/priority-class: low-priority
The kueue.x-k8s.io labels are required if your cluster admin created resource queues. Without them, the job will not run.
To issue this request run:
# create a request
$ sky launch -c dev task.yaml
# login to dev container
$ ssh dev
# list running clusters
$ sky status
# tear down the cluster once you are done using it
$ sky down dev
Distributed Jobs
To scale the job across multiple nodes, change task.yaml to specify num_nodes. The setup and run sections define the script that each node runs.
resources:
  image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
  accelerators: T4:4
  cpus: 8+
  memory: 8+
  cloud: kubernetes
  # (optional) use resource queues if cluster admin set them
  labels:
    kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
    kueue.x-k8s.io/priority-class: high-priority # this will preempt low-priority jobs

num_nodes: 2

setup: |
  git clone https://github.com/roanakb/pytorch-distributed-resnet
  cd pytorch-distributed-resnet
  mkdir -p data && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  tar -xvzf cifar-10-python.tar.gz
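run: |
  # the cd in setup does not carry over to run, so re-enter the cloned repo
  cd pytorch-distributed-resnet

  # SKYPILOT_NODE_IPS holds one IP per line; the first line is the head node
  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 20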
Then launch the job with:
# create a job that runs in the background
$ sky jobs launch -d -c distributed --detach-run task.yaml
# show the status of all existing jobs
$ sky jobs queue
# cancel a running or pending job
$ sky jobs cancel <JOB_ID>
This will create a managed job that will run in the background to completion.
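To see how the run section of the distributed task derives its launcher arguments, the sketch below simulates the SkyPilot environment variables locally. The IP addresses, rank, and GPU count are made-up values for illustration only; on a cluster, SkyPilot sets these variables on each node. It uses $(...) in place of the backticks in task.yaml, which behaves the same here.

```shell
# Simulated values (assumptions for illustration); SkyPilot exports the real
# ones on every node of the job.
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2"
SKYPILOT_NODE_RANK=0
SKYPILOT_NUM_GPUS_PER_NODE=4

# One IP per line, so the line count equals the number of nodes.
num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

# The first IP is the head node, which every node uses as the rendezvous address.
master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1)

# $((...)) normalizes any padding that wc adds on some platforms.
echo "nnodes=$((num_nodes)) node_rank=$SKYPILOT_NODE_RANK master_addr=$master_addr nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE"
# prints: nnodes=2 node_rank=0 master_addr=10.0.0.1 nproc_per_node=4
```

Every node computes the same num_nodes and master_addr; only SKYPILOT_NODE_RANK differs between nodes, which is what lets a single run script serve the whole job.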
For a more thorough explanation of SkyPilot’s capabilities, refer to its documentation and examples. Below is a series of links covering commonly used SkyPilot capabilities for running batch and training jobs.