Dashboard: Plutono

This is a standard system metrics dashboard based on Plutono, an open-source fork of Grafana.

The plots depict the usual metrics AI developers want to see: GPU metrics (utilization, SM activity, VRAM usage), as well as DRAM, CPU, network, and disk metrics.

All metrics are shown as time series with an adjustable time window. The right column shows them per machine in the cluster, averaged within each machine across CPU cores and across GPUs. The left column shows the global average for the cluster across all machines.
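The two levels of averaging can be illustrated with a minimal sketch. It is not taken from the dashboard's actual queries: the function names, the machine names, and the sample readings are hypothetical, and it assumes the cluster value is the mean of the per-machine means (which matches a mean over all devices only when every machine has the same number of devices).

```python
from statistics import mean

def per_machine_average(samples):
    """Average the per-device readings within each machine.

    samples: dict mapping a machine name to a list of readings,
    one per device (e.g. one per GPU). Hypothetical input shape.
    """
    return {machine: mean(values) for machine, values in samples.items()}

def cluster_average(samples):
    """Average the per-machine averages across the whole cluster.

    Assumption: each machine contributes equally, regardless of
    how many devices it has.
    """
    return mean(per_machine_average(samples).values())

# Hypothetical GPU utilization readings (percent), two GPUs per machine.
gpu_util = {
    "node-a": [90.0, 70.0],
    "node-b": [50.0, 30.0],
}

print(per_machine_average(gpu_util))  # {'node-a': 80.0, 'node-b': 40.0}
print(cluster_average(gpu_util))      # 60.0
```

In this sketch, `per_machine_average` corresponds to what the per-machine column plots at a single point in time, and `cluster_average` to the cluster-wide column.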

_images/plutono1.png _images/plutono2.png _images/plutono3.png