Get training job metrics
Get the metrics for a training job.
Authorizations
Send Authorization: Bearer <api_key>. The legacy Authorization: Api-Key <api_key> scheme is also accepted.
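As a minimal sketch of the two header schemes described above, the helper below builds the Authorization header for either one. The function name build_auth_headers and its legacy flag are illustrative and not part of the API; only the two header formats come from this page.

```python
def build_auth_headers(api_key: str, legacy: bool = False) -> dict:
    """Return request headers for the metrics endpoint.

    build_auth_headers is a hypothetical helper, not part of any SDK.
    legacy=False -> "Authorization: Bearer <api_key>"  (preferred scheme)
    legacy=True  -> "Authorization: Api-Key <api_key>" (legacy scheme, still accepted)
    """
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    scheme = "Api-Key" if legacy else "Bearer"
    return {"Authorization": f"{scheme} {api_key}"}
```

The returned dictionary can be passed as the headers argument of whichever HTTP client you use.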
Query Parameters
End of the time range to fetch metrics for, as a Unix epoch timestamp in milliseconds.
Start of the time range to fetch metrics for, as a Unix epoch timestamp in milliseconds.
Resolution of the returned series, in seconds. When omitted, a step is derived from the time range so large windows return fewer points.
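The sketch below shows one way to assemble these query parameters from timezone-aware datetimes. The parameter names start_epoch_millis, end_epoch_millis, and step_seconds are placeholders assumed for illustration; this page does not show the real names, so check the endpoint schema before using them. The resolution is left out of the query when not given, so the server derives a step from the time range.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    # Integer division on timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def build_metrics_query(start: datetime, end: datetime,
                        step_seconds: Optional[int] = None) -> dict:
    """Build the query-string parameters for a metrics request.

    All three parameter names below are hypothetical placeholders.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    params = {
        "start_epoch_millis": to_epoch_millis(start),  # assumed name
        "end_epoch_millis": to_epoch_millis(end),      # assumed name
    }
    if step_seconds is not None:
        if step_seconds <= 0:
            raise ValueError("step_seconds must be positive")
        params["step_seconds"] = step_seconds          # assumed name
    return params
```

For example, build_metrics_query(now - timedelta(hours=1), now) requests the last hour and lets the server choose the resolution.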
Response
The response to a training job metrics request. For each metric, the outer list is a time series with one entry per point in time.
A map of GPU rank to GPU memory usage for the training job. For multinode jobs, this is the memory usage of the leader unless specified otherwise.
A map of GPU rank to fractional GPU utilization. For multinode jobs, this is the GPU utilization of the leader unless specified otherwise.
The CPU usage measured in cores. For multinode jobs, this is the CPU usage of the leader unless specified otherwise.
The CPU memory usage for the training job. For multinode jobs, this is the CPU memory usage of the leader unless specified otherwise.
The ephemeral storage usage for the training job. For multinode jobs, this is the ephemeral storage usage of the leader unless specified otherwise.
The training job.
The storage usage for the read-write cache.
The metrics for each node in the training job.
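To illustrate how a per-rank time series might be consumed, the sketch below averages GPU utilization for each rank across time. It assumes the series has already been reduced to a time-ordered list of rank-to-value maps; that shape and the function name are assumptions for illustration, since this page does not show the exact response schema.

```python
from collections import defaultdict


def mean_by_rank(samples: list) -> dict:
    """Average a per-rank metric over time.

    samples is assumed (hypothetically) to be a time-ordered list in which
    each entry maps a GPU rank to a value, e.g. fractional utilization.
    Ranks missing from some entries are averaged over the entries that
    do contain them.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sample in samples:
        for rank, value in sample.items():
            totals[rank] += value
            counts[rank] += 1
    return {rank: totals[rank] / counts[rank] for rank in totals}
```

The same helper applies to the per-rank memory series, since it has the same assumed shape.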