Observability#

dashboard

Interactive Grafana Dashboard Demo

Access Grafana#

A local Grafana instance is deployed as part of the observability stack. The dashboard shows an overview of the available GPUs, pending/active workloads, and overall cluster utilization.

We can use kubectl port-forward to access the Grafana service from our laptop:

$ kubectl port-forward -n prometheus svc/kube-prometheus-stack-grafana 3000:80

With the port-forward above running, we can open http://localhost:3000/ in a browser, where Grafana will prompt for a login. The default username is admin, and the password is the one set by kube-prometheus-stack.values in Installation. Administrators should secure this endpoint and change the default login credentials.
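Before opening the browser, it can help to confirm that the forwarded port actually reaches Grafana. The sketch below assumes the port-forward is still running in another terminal and that curl is installed locally. The `GRAFANA_URL` variable and the `grafana_health` helper are names introduced here for illustration; `/api/health` is Grafana's unauthenticated health endpoint.

```shell
# Assumption: the kubectl port-forward shown above is running in another terminal.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# Ask Grafana's health endpoint for its status; it needs no login and
# returns a small JSON document when the server is up.
grafana_health() {
  curl --silent --fail --max-time 5 "${GRAFANA_URL}/api/health"
}

if grafana_health; then
  echo
  echo "Grafana is reachable at ${GRAFANA_URL}"
else
  echo "Grafana is not reachable at ${GRAFANA_URL}; is the port-forward running?" >&2
fi
```

If the check fails, the usual causes are a port-forward that has exited or a different local port than 3000.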

Afterwards, navigate to Dashboards -> Konduktor to access our provided dashboard.

Metrics Dashboard#

Our metrics dashboard is included in the kube-prometheus-stack installation using the JSON definition from the repo under grafana/default_grafana_dashboard.json. An interactive sample dashboard can be found here.

Useful metrics for tracking cluster GPU utilization include:

  • GPU utilization

  • GPU memory usage

  • GPU SM efficiency
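The same GPU metrics can also be queried outside Grafana through the Prometheus HTTP API. The sketch below rests on several assumptions that are not stated in this guide: that Prometheus has been port-forwarded to localhost:9090 (the service name in the comment follows the chart's usual naming and may differ in your install), and that GPU metrics come from NVIDIA's dcgm-exporter using its field names. Whether each field is exported depends on the exporter's collector configuration. `PROM_URL` and `promql` are illustrative names.

```shell
# Assumption: Prometheus is port-forwarded locally, for example with
#   kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090
# (service name assumed, not taken from this guide).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query against the Prometheus HTTP API.
promql() {
  curl --silent --fail --max-time 5 --get \
    --data-urlencode "query=$1" \
    "${PROM_URL}/api/v1/query"
}

# Assumed dcgm-exporter field names; adjust to whatever your exporter publishes.
GPU_UTIL_QUERY='avg(DCGM_FI_DEV_GPU_UTIL)'     # mean GPU utilization across GPUs
GPU_MEM_QUERY='sum(DCGM_FI_DEV_FB_USED)'       # total framebuffer memory in use
SM_ACTIVE_QUERY='avg(DCGM_FI_PROF_SM_ACTIVE)'  # mean fraction of time SMs were active

for q in "$GPU_UTIL_QUERY" "$GPU_MEM_QUERY" "$SM_ACTIVE_QUERY"; do
  echo "query: $q"
  promql "$q" || echo "  (no response from ${PROM_URL})" >&2
  echo
done
```

Each successful call prints the JSON response from Prometheus; an empty result list usually means the metric name is not being scraped under that name.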

Multinode workload performance benefits from tracking:

  • NVLink bandwidth

  • InfiniBand throughput (only for InfiniBand-networked setups)

For clusters with job queueing enabled, we include:

  • Jobs pending/active and number of GPUs requested

  • Number of GPUs allocated vs free

Node-level stats include:

  • Disk usage

  • CPU utilization

Reading Logs#

Included in the installation is a Loki logging backend and datasource.

Our default dashboard includes a panel listing error logs from pods in the default namespace, as well as (S)Xid errors collected by following dmesg on each node. You can also perform arbitrary LogQL queries by visiting the Explore tab.
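As a starting point for LogQL, the sketch below selects lines containing "error" from pods in the default namespace. The query string can be pasted into the Explore tab as-is, or sent to Loki's HTTP API as shown. It assumes Loki is reachable at `LOKI_URL` (for example through a kubectl port-forward to the Loki service in your install) and that log streams carry a `namespace` label, which depends on how logs are shipped. `LOKI_URL`, `LOGQL`, and `loki_query` are illustrative names.

```shell
# Assumption: Loki's HTTP API is reachable here; 3100 is Loki's default HTTP port.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Stream selector plus a line filter: logs from the default namespace containing "error".
# The `namespace` label name is an assumption about how logs are labeled.
LOGQL='{namespace="default"} |= "error"'

# Fetch up to 20 matching lines; with no start/end given, Loki searches the last hour.
loki_query() {
  curl --silent --fail --max-time 5 --get \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=20" \
    "${LOKI_URL}/loki/api/v1/query_range"
}

loki_query "$LOGQL" || echo "no response from ${LOKI_URL}" >&2
```

Swapping the line filter, for instance to `|= "Xid"`, narrows the same query to other messages of interest.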
