Installation#
This section is for Kubernetes admins deploying the necessary resources onto their cluster for the first time. The Konduktor stack consists of the following components:
DCGM Exporter - GPU health metrics export and node lifecycle management
kube-prometheus-stack - Prometheus & Grafana stack for observability
Loki - Log aggregation endpoint
OpenTelemetry - Log publishing
Kueue - Workload scheduling and resource quotas/sharing
For a more thorough explanation of the Konduktor stack, see Architecture.
Prerequisites#
Before starting, make sure that you have:
A Kubernetes cluster (1.28+)
kubectl configured to access your cluster
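A quick sanity check for both:
# server version should report 1.28+
$ kubectl version
# nodes should be reachable and Ready
$ kubectl get nodes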
Observability#
DCGM Installation#
Installing the DCGM exporter is best handled using NVIDIA’s gpu-operator. To install, you can run:
# add nvidia repo
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
# create the `gpu-operator` namespace
$ kubectl create namespace gpu-operator
# set which metrics to export from DCGM
$ wget https://raw.githubusercontent.com/Trainy-ai/konduktor/main/files/dcgm-metrics.csv
$ vim dcgm-metrics.csv
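# dcgm-metrics.csv uses dcgm-exporter's collector format, one counter per line:
# "<DCGM field>, <prometheus type>, <help string>", e.g.
# DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).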
$ kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv
# install gpu operator
$ helm install gpu-operator -n gpu-operator \
nvidia/gpu-operator $HELM_OPTIONS \
--set dcgmExporter.config.name=metrics-config \
--set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS \
--set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv
# wait for pods to come up
$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS        AGE
gpu-feature-discovery-b96rm                                       1/1     Running     0               9d
gpu-operator-1716420454-node-feature-discovery-master-5bf9ts44g   1/1     Running     0               4d10h
gpu-operator-1716420454-node-feature-discovery-worker-jlr26       1/1     Running     5 (4d10h ago)   14d
gpu-operator-647d5bddf8-p6px2                                     1/1     Running     0               4d10h
nvidia-container-toolkit-daemonset-tpncr                          1/1     Running     0               14d
nvidia-cuda-validator-mmb4h                                       0/1     Completed   0               9d
nvidia-dcgm-exporter-m7544                                        1/1     Running     0               9d
nvidia-device-plugin-daemonset-lc5lx                              1/1     Running     0               14d
nvidia-driver-daemonset-fvx9z                                     1/1     Running     0               9d
nvidia-operator-validator-62dhx                                   1/1     Running     0               14d
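To spot-check that GPU metrics are actually being exported, you can port-forward the exporter and curl it directly; the service name and port below follow the gpu-operator defaults, so adjust them if your install differs.
# forward the DCGM exporter locally
$ kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
# in another terminal, look for one of the counters listed in dcgm-metrics.csv
$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP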
Warning
This guide currently works for on-prem bare metal deployments.
We are still validating how to deploy nvidia-dcgm-exporter
on managed k8s solutions like AWS’s EKS and Google’s GKE. Stay tuned for updates!
Prometheus-Grafana Stack#
To set up the monitoring stack, we maintain our own default Helm values to get started with.
# get default values for Helm chart
$ wget https://raw.githubusercontent.com/Trainy-ai/konduktor/main/manifests/kube-prometheus-stack.values
# add prometheus-community repo
$ helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
# install prometheus stack
$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--create-namespace \
--namespace prometheus \
--values kube-prometheus-stack.values
# check prometheus stack is up
$ kubectl get pods -n prometheus
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0         2/2     Running   0          53s
kube-prometheus-stack-grafana-79f9ccf77-wccpt             3/3     Running   0          56s
kube-prometheus-stack-kube-state-metrics-b7b54458-klcb4   1/1     Running   0          56s
kube-prometheus-stack-operator-74774b4dbd-bdzsr           1/1     Running   0          56s
kube-prometheus-stack-prometheus-node-exporter-74245      1/1     Running   0          57s
kube-prometheus-stack-prometheus-node-exporter-8t5ct      1/1     Running   0          56s
kube-prometheus-stack-prometheus-node-exporter-bp8cb      1/1     Running   0          57s
kube-prometheus-stack-prometheus-node-exporter-ttj5b      1/1     Running   0          56s
kube-prometheus-stack-prometheus-node-exporter-z8rzn      1/1     Running   0          57s
prometheus-kube-prometheus-stack-prometheus-0             2/2     Running   0          53s
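To sanity-check the deployment, you can port-forward Grafana and log in; the service and secret names below follow from the release name used above, and the admin password is whatever kube-prometheus-stack.values sets.
# expose Grafana at http://localhost:3000
$ kubectl port-forward -n prometheus svc/kube-prometheus-stack-grafana 3000:80
# read back the generated admin password
$ kubectl get secret -n prometheus kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d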
OpenTelemetry-Loki Logging Stack#
To set up a monolithic Loki stack with exported node/pod logs, we include default values for installing the stack via Helm. We also deploy a DaemonSet to stream dmesg logs from each node.
# get Helm chart values
$ wget https://raw.githubusercontent.com/Trainy-ai/konduktor/main/manifests/loki.values
$ wget https://raw.githubusercontent.com/Trainy-ai/konduktor/main/manifests/otel.values
$ helm repo add grafana https://grafana.github.io/helm-charts
$ helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
$ helm repo update
$ helm install loki grafana/loki \
--create-namespace \
--namespace=loki \
--values loki.values
$ helm install otel-collector open-telemetry/opentelemetry-collector \
--create-namespace \
--namespace=otel-collector \
--values otel.values
$ kubectl apply -f https://raw.githubusercontent.com/Trainy-ai/konduktor/main/konduktor/manifests/dmesg_daemonset.yaml
$ kubectl get pods -n loki
NAME                            READY   STATUS    RESTARTS   AGE
loki-0                          1/1     Running   0          35m
loki-canary-26rw2               1/1     Running   0          35m
loki-chunks-cache-0             2/2     Running   0          35m
loki-gateway-68fd56bfbd-ltnqd   1/1     Running   0          35m
loki-results-cache-0            2/2     Running   0          35m
$ kubectl get pods -n otel-collector
NAME                                                 READY   STATUS    RESTARTS   AGE
otel-collector-opentelemetry-collector-agent-2qbh2   1/1     Running   0          31m
$ kubectl get pods -n dmesg-logging
NAME          READY   STATUS    RESTARTS   AGE
dmesg-2x225   1/1     Running   0          5m52s
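As an end-to-end check that logs are flowing, you can tail the dmesg streamer (assuming the DaemonSet is named dmesg, per the pods above) and query Loki through its gateway, which the chart exposes on port 80 by default.
# tail the dmesg streamer on one node
$ kubectl logs -n dmesg-logging daemonset/dmesg --tail=5
# expose the Loki gateway locally
$ kubectl port-forward -n loki svc/loki-gateway 3100:80
# in another terminal, list the labels Loki has indexed;
# a healthy install returns JSON with "status": "success"
$ curl -s http://localhost:3100/loki/api/v1/labels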
Scheduling & Resource Quotas (Optional)#
For job queueing and resource sharing cluster-wide, you can install Kueue and set resource quotas and queues.
Kueue#
To deploy Kueue components, we provide a default manifest that enables gang scheduling in addition to other options for telemetry.
# deploy kueue resources
$ VERSION=v0.6.2
$ kubectl apply --server-side -f https://raw.githubusercontent.com/Trainy-ai/konduktor/main/manifests/manifests.yaml
$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/prometheus.yaml
$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/visibility-api.yaml
# check kueue-system up
$ kubectl get pods -n kueue-system
NAME                                        READY   STATUS    RESTARTS   AGE
kueue-controller-manager-6f4db9964d-rc6jk   2/2     Running   0          4d
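If the visibility API applied above registered correctly, the apiserver should now serve its group (group name per the Kueue release used here):
# confirm the visibility API group is available
$ kubectl get --raw /apis/visibility.kueue.x-k8s.io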
Resource Quotas#
Resource quotas are defined via ClusterQueues, while LocalQueues are assigned to a namespace and point to a ClusterQueue. We provide a default set of resource definitions to get started with.
# get default resource definitions
$ wget https://raw.githubusercontent.com/Trainy-ai/konduktor/main/manifests/single-clusterqueue-setup.yaml
Within `single-clusterqueue-setup.yaml`, be sure to replace `<num-GPUs-in-cluster>` with the total number of GPUs in your cluster.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  preemption:
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: LowerPriority
      maxPriorityThreshold: 100
    withinClusterQueue: LowerPriority
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "smarter-devices/fuse"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 10000
      - name: "memory"
        nominalQuota: 10000Gi
      - name: "nvidia.com/gpu"
        nominalQuota: <num-GPUs-in-cluster> # REPLACE THIS
      # this is a skypilot specific resource
      - name: "smarter-devices/fuse"
        nominalQuota: 10000000
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: low-priority
value: 100 # Higher value means higher priority
description: "Low priority experiments"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 1000
description: "High priority production workloads"
We can create these resources with:
# create a ClusterQueue and LocalQueue, `cluster-queue` and `user-queue` respectively
$ kubectl apply -f single-clusterqueue-setup.yaml
$ kubectl get queues
NAME         CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
user-queue   cluster-queue   0                   0
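For reference, workloads opt into this queue and a priority class through Kueue's labels. Below is a minimal sketch of a Job (name and image are hypothetical) that would be admitted by `user-queue`:
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test                 # hypothetical
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: high-priority
spec:
  suspend: true                        # Kueue unsuspends the Job once quota is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # hypothetical image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
Once applied, the Job appears under the queue's PENDING/ADMITTED workload counts shown above.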