Getting Started

To get access to your Trainy-managed Konduktor clusters, work with your account manager to add every device that requires cluster access as a client configuration in ~/.sky/config.yaml. Trainy provides isolated access to clusters via Tailscale, which you will need to install on your development machine. Once Tailscale is installed, you can list and connect to the clusters you have access to with:

# list clusters
$ tailscale status
100.85.126.7  awesomecorp-laptop   awesomecorp-laptop.taila1933c.ts.net macOS  -
100.95.60.42  awesomecorp-gke1     tagged-devices linux  idle, tx 39656 rx 1038824
100.90.169.2  awesomecorp-gke2     tagged-devices linux  -

# configure connection to a cluster
$ tailscale configure kubeconfig awesomecorp-gke1

# check that k8s credentials work
$ sky check
Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled

Once you are connected, you can start running jobs <usage/quickstart.html> on your cluster.
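
If you have access to several clusters, the listing step can be scripted. The following is a minimal sketch, not part of the Trainy tooling: it assumes that clusters show up in `tailscale status` with `tagged-devices` in the owner column, as in the sample output above, and that columns are whitespace-separated. The function name `list_clusters` is hypothetical.

```python
# Sample text copied from the `tailscale status` output shown above.
SAMPLE_STATUS = """\
100.85.126.7  awesomecorp-laptop   awesomecorp-laptop.taila1933c.ts.net macOS  -
100.95.60.42  awesomecorp-gke1     tagged-devices linux  idle, tx 39656 rx 1038824
100.90.169.2  awesomecorp-gke2     tagged-devices linux  -
"""


def list_clusters(status_text: str) -> list[str]:
    """Return hostnames of peers whose owner column is 'tagged-devices'.

    Assumption: the columns are IP, hostname, owner, OS, status, and
    clusters (rather than personal devices) are owned by 'tagged-devices'.
    """
    clusters = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "tagged-devices":
            clusters.append(fields[1])
    return clusters


print(list_clusters(SAMPLE_STATUS))
```

To run this against live data, pass in the captured output of `tailscale status` (for example via `subprocess.run(["tailscale", "status"], capture_output=True, text=True).stdout`) instead of the sample text.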

Node Specifications

Trainy-managed Konduktor comes with clusters preconfigured and validated with the right drivers and software for running GPU workloads over high-performance networking, so you can start training at scale without having to configure, autoscale, or upgrade GPU infrastructure yourself. Autoscaling support by cloud:

  • GCP (a3-ultragpu), H100-80GB-MEGA:8, 1.6Tbps, 192vCPUs, 1TB RAM, 2TB disk - Available

  • AWS on-demand/spot support, H100:8 3.2Tbps - In progress 🚧

  • Azure on-demand/spot support, H100:8 3.2Tbps - In progress 🚧

On our autoscaling clusters, we currently support only H100:8 or H100-80GB-MEGA:8 instances, which can be requested as follows:

num_nodes: 2 # scale up number of nodes

resources:
    image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3 # specify your image
    accelerators: H100-80GB-MEGA:8 # specify the right gpu type
    cpus: 192+ # 192 CPUs
    memory: 1000+ # 1TB of RAM
    cloud: kubernetes
    labels:
        kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
        kueue.x-k8s.io/priority-class: low-priority
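
Because only two accelerator types are supported, a small pre-flight check before submitting can catch a mistyped request early. The sketch below is illustrative only: the task is represented as a plain Python dict mirroring the YAML above, the helper name `total_gpus` is hypothetical, and treating a missing `num_nodes` as 1 is an assumption.

```python
# Accelerator requests the text above lists as supported on autoscaling clusters.
SUPPORTED_ACCELERATORS = {"H100:8", "H100-80GB-MEGA:8"}

# A dict mirroring the task YAML above (labels omitted for brevity).
TASK = {
    "num_nodes": 2,
    "resources": {
        "accelerators": "H100-80GB-MEGA:8",
        "cpus": "192+",
        "memory": "1000+",
        "cloud": "kubernetes",
    },
}


def total_gpus(task: dict) -> int:
    """Validate the accelerator request and return the total GPU count."""
    accelerators = task.get("resources", {}).get("accelerators")
    if accelerators not in SUPPORTED_ACCELERATORS:
        raise ValueError(f"unsupported accelerators: {accelerators!r}")
    # The request has the form "<type>:<count per node>".
    per_node = int(accelerators.rsplit(":", 1)[1])
    # Assumption: a task without num_nodes runs on a single node.
    return task.get("num_nodes", 1) * per_node


print(total_gpus(TASK))  # 2 nodes x 8 GPUs = 16
```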

Warning

Trainy instances are ephemeral and will be autoscaled down after 10 minutes of idling. If you are running stateful applications such as model training, be sure to instrument your application to regularly checkpoint its state and back it up to object storage (S3, GCS, Azure Blob, Cloudflare R2, etc.).
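
One way to make a training loop survive a scale-down is to save a checkpoint periodically and resume from it on restart. The sketch below is a minimal illustration using only the Python standard library. It assumes `ckpt_path` points at a location that is mounted from or synced to your object storage; the upload itself is not shown. The names `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical, and a real job would store model and optimizer state rather than just a step counter.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state to a temp file, then rename it over the target.

    The rename is atomic on local POSIX filesystems; that may not hold
    on every object-storage mount.
    """
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(total_steps: int, ckpt_path: str, every: int = 100) -> dict:
    """Run (or resume) training, checkpointing every `every` steps."""
    state = load_checkpoint(ckpt_path)
    for step in range(state["step"], total_steps):
        # ... one training step would go here ...
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(state, ckpt_path)
    save_checkpoint(state, ckpt_path)  # final checkpoint
    return state
```

If the job is interrupted and relaunched with the same `ckpt_path`, `train` picks up from the last saved step instead of starting from zero.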