Node Health Controller#

Konduktor ships with a controller that listens to node logs for GPU-related errors. Errors from NCCL, CUDA, or the GPUs themselves often point to persistent hardware faults. To prevent workloads from landing on bad nodes, the controller taints them so the kube-scheduler doesn’t place any new pods on them. Currently we listen for these NVIDIA errors:

  • Xid errors

  • SXid errors

And we listen for errors from:

  • dmesg

  • Pod/container logs
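
For example, an Xid error surfaced by the NVIDIA driver in dmesg looks like the line below (the same message is used in the taint test later on this page):

NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.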

Controller Launch#

The controller runs as a loop and can be launched from any machine with access to the Kubernetes API server. On each iteration, the control loop:

  • Queries the logging backend (Loki) for GPU-related errors

  • If an error is found, taints the affected node with trainy.konduktor.ai/faulty=true:NoSchedule via the Kubernetes API
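
For reference, the taint applied by the controller is equivalent to running the following by hand (the node name is a placeholder):

# manually apply the same taint the controller sets
$ kubectl taint nodes <node-name> trainy.konduktor.ai/faulty=true:NoSchedule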

In-cluster Controller#

The controller can be shipped as a deployment that runs within the cluster. To deploy it:

# create the controller deployment
$ kubectl apply -f https://raw.githubusercontent.com/Trainy-ai/konduktor/main/konduktor/manifests/controller_deployment.yaml

# tail the logs of the deployment
$ kubectl logs -f deployment/konduktor-controller-deployment -n konduktor
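
You can confirm the controller pod came up using standard kubectl (the namespace matches the deployment above):

# verify the controller pod is running
$ kubectl get pods -n konduktor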

Remote Controller#

Alternatively, we can run the controller locally from any machine with access to the cluster's API server:

# install konduktor package
$ pip install konduktor-nightly

# get local access to the loki service
$ kubectl port-forward svc/loki -n loki 3100:3100 &

# run the controller locally
$ LOG_ENDPOINT='http://localhost:3100' python -m konduktor.controller.launch
[I 07-09 04:51:21 parse.py:24] using POD_LOG_TYPE = skypilot
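
If the controller cannot reach Loki, first confirm the port-forward is working; Loki exposes a standard readiness endpoint you can query (assuming the port-forward above):

# check that Loki is reachable through the port-forward
$ curl http://localhost:3100/ready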

Controller Node Taint Test (Optional)#

We can mock a GPU error by writing to dmesg directly through the dmesg-logging DaemonSet, whose pods run as privileged containers. In a separate terminal, while the controller is running:

# get the name of a dmesg-logging pod
$ kubectl get pods -n dmesg-logging
NAME          READY   STATUS    RESTARTS   AGE
dmesg-2x225   1/1     Running   0          10h

$ kubectl exec -it -n dmesg-logging dmesg-2x225 -- bash
$ echo "NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus." > /dev/kmsg

After which you should see the following in your controller logs:

[I 07-09 05:37:45 parse.py:128] node `gke-a3-cluster-gpu-pool-2d164072-zz64` has dmesg error: [538441.007373] NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[W 07-09 05:37:45 kube_client.py:27] incluster config failed to load, attempting to use kubeconfig.
[I 07-09 05:37:45 kube_client.py:31] KUBECONFIG loaded
[I 07-09 05:37:45 node.py:98] Node gke-a3-cluster-gpu-pool-2d164072-zz64 tainted.

# in a separate terminal you can verify
$ kubectl describe node gke-a3-cluster-gpu-pool-2d164072-zz64 | grep trainy
trainy.konduktor.ai/faulty=true:NoSchedule

You can remove all of the faulty taints in the cluster with konduktor reset:

$ konduktor reset
[W 07-09 05:38:14 kube_client.py:27] incluster config failed to load, attempting to use kubeconfig.
[I 07-09 05:38:14 kube_client.py:31] KUBECONFIG loaded
[I 07-09 05:38:15 node.py:64] Node gke-a3-cluster-gpu-pool-2d164072-zz64 taint removed.
[I 07-09 05:38:15 node.py:64] Node gke-a3-cluster-default-pool-60f92594-0nm7 taint removed.
[I 07-09 05:38:15 node.py:64] Node gke-a3-cluster-default-pool-60f92594-rfg8 taint removed.
[I 07-09 05:38:15 node.py:64] Node gke-a3-cluster-default-pool-60f92594-xvvx taint removed.
[I 07-09 05:38:16 node.py:64] Node gke-a3-cluster-t4-nodepool-528edcef-fl02 taint removed.
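
If you only want to clear a single node, the standard kubectl taint syntax also works (the trailing - removes the taint; the node name is a placeholder):

# remove the faulty taint from one node manually
$ kubectl taint nodes <node-name> trainy.konduktor.ai/faulty=true:NoSchedule-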

Features and Roadmap#

  • dmesg error detection - Available

  • In-cluster deployment of controller - Available

  • Pod log error detection - Available

  • Health Checks (Taint Removal) - In progress 🚧

  • Node Resolution Hooks (Reboot, Power Cycle) - In progress 🚧