Welcome to Konduktor’s documentation!

Batch Jobs and Cluster Management for GPUs on Kubernetes

Konduktor is a platform for running ML batch jobs and managing GPU clusters. This documentation is intended for:

  • ML engineers and researchers who want to launch training jobs on Konduktor, whether managed by Trainy or self-hosted

  • GPU cluster administrators who want to self-host Konduktor

If you are interested in our managed offering, contact us at support@trainy.ai.

Key Features#

  • 🚀 Easy scale-out with job queueing and multi-node scheduling

# request 100 nodes for a job
$ sky launch -c dev task.yaml --num-nodes 100

  • ☁ Multi-cloud access

# select the target cluster with --region
$ sky launch -c dev task.yaml --region gke-cluster

  • Custom container support

# task.yaml
resources:
   image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3

run: |
   python train.py
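
As a rough illustration of how the features above might combine, the sketch below puts a node count and a custom container image in one task file. The `name`, `num_nodes`, and `accelerators` fields follow SkyPilot's task YAML conventions and are assumed, not confirmed, to be accepted by Konduktor; the task name and GPU type are hypothetical placeholders.

```yaml
# task.yaml (illustrative sketch, not a verified Konduktor configuration)
name: train-example        # hypothetical task name

num_nodes: 2               # assumed equivalent to passing --num-nodes 2

resources:
  accelerators: H100:8     # hypothetical GPU type and per-node count
  image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3

run: |
  python train.py
```

If these fields are supported, the file would be submitted the same way as the earlier examples, for instance with `sky launch -c dev task.yaml`.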

Managed Features and Roadmap#

  • On-prem/reserved support - Available

  • GCP on-demand/spot support - Available

  • AWS on-demand/spot support - In progress 🚧

  • Azure on-demand/spot support - In progress 🚧

  • Multi-cluster submission - In progress 🚧

Documentation#

Self-hosted Cluster Administration