Architecture

Konduktor was built with the following objectives in mind.

  1. ML Engineers who can already train on multiple GPUs should be able to scale across nodes with little to no code changes and bring their favorite frameworks (PyTorch, Lightning, HuggingFace, DeepSpeed, etc.)

  2. Support multi-tenancy and resource sharing via quotas

  3. Observability and auto-healing to gracefully handle GPU/hardware errors

These objectives led us to build on Kubernetes and to integrate the following tools.

  1. SkyPilot - supports easy scale-out across nodes with declarative resource requests

    resources:
        accelerators: H100:8
    
    num_nodes: 100
    
    run: |
        num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
        master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
        python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
        --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
        --master_port=8008 resnet_ddp.py --num_epochs 20
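
    With the task above saved to a YAML file, scaling to 100 nodes is a single command. A minimal sketch; the cluster name "mycluster" and the file name task.yaml are placeholders:

        # Launch the multi-node job defined above on a cluster named "mycluster"
        sky launch -c mycluster task.yaml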
    
  2. Kueue - declarative resource quotas, sharing, and job preemption via workload queues and priorities

    • ML Engineers only have to specify which queues they want to submit to
      resources:
          accelerators: H100:8
          labels:
              kueue.x-k8s.io/queue-name: user-queue # must match a LocalQueue, e.g. "user-queue" defined below
              kueue.x-k8s.io/priority-class: high-priority # specify high-priority workload
      
      num_nodes: 100
      
      run: |
          num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
          master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
          python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
          --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
          --master_port=8008 resnet_ddp.py --num_epochs 20
      
    • Cluster administrators can set GPU quotas by team via resource flavors and queues.
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ResourceFlavor
      metadata:
        name: "default-flavor"
      ---
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ClusterQueue
      metadata:
        name: "cluster-queue"
      spec:
        preemption:
          reclaimWithinCohort: Any
          borrowWithinCohort:
            policy: LowerPriority
            maxPriorityThreshold: 100
          withinClusterQueue: LowerPriority
        namespaceSelector: {} # match all namespaces
        resourceGroups:
        - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
          flavors:
          - name: "default-flavor"
            resources:
            - name: "cpu"
              nominalQuota: 10000
            - name: "memory"
              nominalQuota: 10000Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 8 # REPLACE THIS
      ---
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: LocalQueue
      metadata:
        name: "user-queue"
      spec:
        clusterQueue: "cluster-queue"
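
      To put the quotas into effect and confirm the queues exist, the manifests can be applied with kubectl. A minimal sketch; quotas.yaml is a placeholder file name:

          # Apply the ResourceFlavor, ClusterQueue, and LocalQueue defined above
          kubectl apply -f quotas.yaml

          # Confirm the queues were created and watch workloads get admitted against the quota
          kubectl get clusterqueue cluster-queue
          kubectl get localqueue user-queue
          kubectl get workloads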
      
  3. Prometheus, Grafana, OpenTelemetry, Loki: exporters for both GPU metrics and kernel logs, plus metrics and logging backends for longer-term storage and querying. Health controllers run in a loop to detect and cordon faulty nodes, preventing jobs from repeatedly landing on bad hardware. This keeps jobs running continuously while giving cluster admins immediate feedback, from one location, on where and what is causing an alert.
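
    Because faulty nodes are cordoned rather than removed, standard kubectl is enough to see what the health controllers have flagged. A sketch using only built-in kubectl behavior; no Konduktor-specific commands are assumed:

        # List nodes currently cordoned (marked unschedulable) by the health controllers
        kubectl get nodes --field-selector spec.unschedulable=true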