Observability#
Interactive Grafana Dashboard Demo
Access Grafana#
A local Grafana instance is deployed as part of the observability stack. The dashboard shows an overview of the available GPUs, pending/active workloads, and overall cluster utilization.
We can use kubectl port-forward to access the Grafana service from our laptop. For example:
$ kubectl port-forward -n prometheus svc/kube-prometheus-stack-grafana 3000:80
With the port-forward running, we can open http://localhost:3000/ in a browser window, which will prompt for a username and password.
The default username is admin, and the password is the one set in kube-prometheus-stack.values during Installation.
Administrators should secure this endpoint and change the default login credentials.
Afterwards, navigate to Dashboards -> Konduktor to access our provided dashboard.
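If the login page does not load or the password is unknown, a few shell helpers can make the checks repeatable. This is a minimal sketch, not part of the provided tooling. It assumes the Grafana admin secret has the same name as the service and stores the password under the key admin-password, and that Grafana is served over plain HTTP on the forwarded port. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions (not stated in the docs above): the admin secret is named like
# the Grafana service and holds the password under the key "admin-password";
# Grafana is reachable over plain HTTP on the forwarded local port.
NAMESPACE="${NAMESPACE:-prometheus}"
GRAFANA_SVC="${GRAFANA_SVC:-kube-prometheus-stack-grafana}"
LOCAL_PORT="${LOCAL_PORT:-3000}"

# Build the local URL exposed by `kubectl port-forward`; $1 is an optional path.
grafana_url() {
  printf 'http://localhost:%s%s\n' "$LOCAL_PORT" "${1:-/}"
}

# Print the admin password stored in the (assumed) Grafana secret.
grafana_admin_password() {
  kubectl get secret -n "$NAMESPACE" "$GRAFANA_SVC" \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}

# Succeed only if Grafana answers on its /api/health endpoint.
grafana_healthy() {
  curl --silent --fail "$(grafana_url /api/health)" > /dev/null
}
```

With the port-forward running in another terminal, `grafana_healthy && echo "Grafana is up"` confirms the tunnel works, and `grafana_admin_password` prints the current password if the secret matches the assumptions above.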
Metrics Dashboard#
Our metrics dashboard is included in the kube-prometheus-stack installation using the JSON definition from the repo under grafana/default_grafana_dashboard.json.
An interactive sample dashboard can be found here.
To track cluster GPU utilization, useful metrics include:
GPU utilization
GPU memory usage
GPU SM efficiency
Multi-node workload performance benefits from tracking:
NVLink bandwidth
InfiniBand throughput (only for InfiniBand-networked setups)
For clusters with job queueing enabled, we include:
Jobs pending/active and number of GPUs requested
Number of GPUs allocated vs free
Node-level stats include:
Disk usage
CPU utilization
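The same metrics can be queried outside Grafana through the Prometheus HTTP API. The sketch below is illustrative only. It assumes the GPU metrics come from NVIDIA's DCGM exporter (metric DCGM_FI_DEV_GPU_UTIL with a Hostname label) and that Prometheus has been port-forwarded to localhost:9090. The metric source, the Prometheus service name, and the function names are assumptions, not confirmed by this page.

```shell
#!/usr/bin/env bash
# Assumptions: Prometheus is reachable at PROM_URL (e.g. via a port-forward),
# and GPU metrics are exported by NVIDIA's DCGM exporter.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build a PromQL expression for average GPU utilization grouped by a label.
# $1 is the grouping label; "Hostname" is the assumed DCGM exporter default.
gpu_util_query() {
  printf 'avg by (%s) (DCGM_FI_DEV_GPU_UTIL)\n' "${1:-Hostname}"
}

# Run an instant query against the Prometheus HTTP API and print the JSON reply.
prom_query() {
  curl --silent --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

For example, after forwarding the Prometheus service (assumed here to be svc/kube-prometheus-stack-prometheus on port 9090), `prom_query "$(gpu_util_query)"` returns per-node average GPU utilization as JSON.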
Reading Logs#
The installation includes a Loki logging backend and datasource.
Our default dashboard includes a panel listing error logs from pods in the default namespace, as well as a panel for (S)Xid errors collected by following dmesg on each node.
You can also perform arbitrary LogQL queries by visiting the Explore tab.
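As a starting point for the Explore tab, the sketch below builds a LogQL query for error lines in a namespace and optionally sends it to Loki's HTTP API. It assumes log streams carry a namespace label (common with Kubernetes log collectors, but not confirmed here) and that Loki is reachable at localhost:3100. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions: log streams carry a "namespace" label, and Loki is reachable
# at LOKI_URL (e.g. through a port-forward to the Loki service).
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Build a LogQL query matching lines that contain "error" (case-insensitive)
# in the given namespace; $1 defaults to "default".
error_logs_query() {
  printf '{namespace="%s"} |~ "(?i)error"\n' "${1:-default}"
}

# Send a LogQL query to Loki's query_range endpoint; $2 caps the result count.
loki_query_range() {
  curl --silent --get "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}"
}
```

The output of `error_logs_query` can be pasted directly into the Explore tab with the Loki datasource selected, or passed to the API with `loki_query_range "$(error_logs_query)"`.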