
GPU Saturation Signals

OpenCost derives GPU saturation signals from dcgm-exporter metrics, following the USE method:

  • Utilization — how busy the GPU was (already exposed as gpuAllocation.gpuUsageAverage from DCGM_FI_PROF_GR_ENGINE_ACTIVE).
  • Saturation — work that was queued, rejected, or slowed because the GPU could not service demand. That is what the signals below report.

Every signal is an independent primitive. OpenCost deliberately does not compute a composite saturation score; consumers combine the primitives as they see fit.

Absence semantics

A missing field always means the source metric was unavailable: there is no dcgm-exporter in the cluster, the DCGM field is not in the exporter's configuration, or the GPU lacks DCP (profiling) support. OpenCost never emits a zero in place of missing data, so 0 can be trusted to mean "observed, and not saturated".
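A consumer can preserve this distinction by mapping a missing key to `None` instead of defaulting it to zero. The following is a minimal sketch, assuming the allocation has already been decoded into a Python dict shaped like the Allocation API response; the helper name `read_signal` is hypothetical.

```python
from typing import Any, Optional


def read_signal(allocation: dict[str, Any], name: str) -> Optional[float]:
    """Return a saturation signal, or None when the source metric was unavailable."""
    saturation = (allocation.get("gpuAllocation") or {}).get("saturation") or {}
    value = saturation.get(name)
    # Avoid `value or 0.0` here: it would turn "not observed" into
    # "observed, and not saturated".
    return None if value is None else float(value)


alloc = {"gpuAllocation": {"saturation": {"xidErrorCount": 0}}}
print(read_signal(alloc, "xidErrorCount"))  # observed zero
print(read_signal(alloc, "smActiveAvg"))    # field absent
```

Keeping `None` separate from `0.0` lets dashboards and alerts show "no data" rather than a falsely healthy GPU.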

Allocation API

Saturation appears on the Allocation API under gpuAllocation.saturation, per container:

{
  "gpuAllocation": {
    "gpuDevice": "nvidia0",
    "gpuModel": "Tesla T4",
    "gpuUUID": "GPU-...",
    "saturation": {
      "throttleViolationRatios": { "power": 0.12, "thermal": 0.01 },
      "throttleReasonRatios": { "sw_power_cap": 0.15 },
      "memoryUsedRatioAvg": 0.81,
      "memoryUsedRatioMax": 0.97,
      "memoryPressureRatio": 0.25,
      "xidErrorCount": 0,
      "dramActiveAvg": 0.62,
      "smActiveAvg": 0.55,
      "smOccupancyAvg": 0.31,
      "pcieTxBytesAvg": 1.2e9,
      "pcieRxBytesAvg": 2.0e9
    }
  }
}

The saturation block is controlled by GPU_SATURATION_METRICS_ENABLED (default true).

Signal reference

Ratios are fractions of the queried window in [0, 1] unless noted.

| Field | DCGM source | In default dcgm-exporter config? | Needs DCP? | Meaning |
|---|---|---|---|---|
| throttleViolationRatios (power, thermal, sync_boost, board_limit) | DCGM_FI_DEV_POWER_VIOLATION, DCGM_FI_DEV_THERMAL_VIOLATION, DCGM_FI_DEV_SYNC_BOOST_VIOLATION, DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Yes | No | Fraction of the window the GPU spent throttled for the reason, from cumulative microsecond violation counters. The strongest direct saturation signal available by default. |
| throttleReasonRatios (sw_power_cap, hw_slowdown, sync_boost, sw_thermal, hw_thermal, hw_power_brake) | DCGM_FI_DEV_CLOCK_THROTTLE_REASONS (renamed DCGM_FI_DEV_CLOCKS_EVENT_REASONS in DCGM 3.3+; OpenCost queries both) | No (must be enabled) | No | Fraction of samples in which each saturation-relevant bit of the NVML throttle-reasons bitmask was set. Richer reason breakdown than the violation counters (hardware slowdown, power brake). Idle/configured-clock bits are excluded by design. Reported for the whole physical GPU even under MIG or time-slicing. |
| memoryUsedRatioAvg, memoryUsedRatioMax | DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE | Yes | No | Framebuffer occupancy, used / (used + free). Sustained values near 1.0 mean new allocations are likely to fail or force eviction. |
| memoryPressureRatio | DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE | Yes | No | Fraction of the window in which occupancy was at or above the configured threshold (GPU_MEMORY_SATURATION_THRESHOLD, default 0.9). |
| xidErrorCount | DCGM_FI_DEV_XID_ERRORS | Yes | No | XID error events observed in the window, a rejected-work signal. The DCGM field reports the last XID code, so consecutive identical errors are undercounted. |
| dramActiveAvg, dramActiveMax | DCGM_FI_PROF_DRAM_ACTIVE | Yes | Yes | Ratio of cycles the device memory interface was active. Near-ceiling values with low smOccupancyAvg indicate a memory-bandwidth-bound workload. |
| smActiveAvg | DCGM_FI_PROF_SM_ACTIVE | No (must be enabled) | Yes | Ratio of cycles in which an SM had at least one warp assigned, averaged over all SMs. |
| smOccupancyAvg | DCGM_FI_PROF_SM_OCCUPANCY | No (must be enabled) | Yes | Ratio of resident warps to the SM maximum. Together with smActiveAvg and dramActiveAvg, lets consumers distinguish compute-bound vs bandwidth-bound vs latency-bound saturation. |
| pcieTxBytesAvg, pcieRxBytesAvg | DCGM_FI_PROF_PCIE_TX_BYTES, DCGM_FI_PROF_PCIE_RX_BYTES | Yes | Yes | Average PCIe throughput in bytes/sec. DCGM does not export link capacity, so these are raw rates; deriving a capacity ratio is future work. |
| nvlinkTxBytesAvg, nvlinkRxBytesAvg | DCGM_FI_PROF_NVLINK_TX_BYTES, DCGM_FI_PROF_NVLINK_RX_BYTES | No (must be enabled) | Yes | Average NVLink throughput in bytes/sec; same capacity caveat as PCIe. |

"Needs DCP" means the field comes from DCGM's profiling module (DCP), which requires Volta or newer GPUs and is unavailable in some virtualized environments. "Must be enabled" means the field is either commented out in the dcgm-exporter default configuration file or absent from it, and must be uncommented or added for the signal to appear.
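To make the two throttle rows concrete: a violation ratio is the increase of a cumulative microsecond counter divided by the window length, and a reason ratio is the share of bitmask samples with a given bit set. The sketch below is illustrative and is not OpenCost's query logic. The function names are hypothetical, counter resets are ignored, and the bit values are the NVML clocks-throttle-reason constants.

```python
# Saturation-relevant NVML throttle-reason bits. Idle and configured-clock
# bits (for example 0x1 and 0x2) are intentionally left out.
THROTTLE_BITS = {
    "sw_power_cap": 0x04,
    "hw_slowdown": 0x08,
    "sync_boost": 0x10,
    "sw_thermal": 0x20,
    "hw_thermal": 0x40,
    "hw_power_brake": 0x80,
}


def violation_ratio(counter_start_us: float, counter_end_us: float,
                    window_seconds: float) -> float:
    """Fraction of the window spent throttled, from a cumulative microsecond counter."""
    delta_us = max(counter_end_us - counter_start_us, 0.0)
    return min(delta_us / (window_seconds * 1_000_000), 1.0)


def reason_ratios(samples: list[int]) -> dict[str, float]:
    """Fraction of bitmask samples in which each saturation-relevant bit was set."""
    if not samples:
        return {}  # no samples: report absence, not zeros
    return {
        name: sum(1 for s in samples if s & bit) / len(samples)
        for name, bit in THROTTLE_BITS.items()
    }


# 36 s of power violation inside a 300 s window.
print(violation_ratio(1_000_000, 37_000_000, 300))
# Four samples: clear, sw_power_cap twice, idle bit only.
print(reason_ratios([0x0, 0x4, 0x4, 0x1]))
```

The idle-only sample counts toward the denominator but sets no reported bit, which mirrors the exclusion of idle/configured-clock bits described in the table.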

Interpretation guidance (USE)

  • High utilization with zero throttle/pressure ratios: the GPU is busy but keeping up — buying a faster GPU may not help.
  • throttleViolationRatios.power or sw_power_cap sustained above zero: the GPU is power-limited; demand exceeds what the power envelope can service.
  • memoryUsedRatioMax near 1.0 or memoryPressureRatio > 0 alongside xidErrorCount > 0: work is likely being rejected (OOM-style failures).
  • dramActiveAvg near ceiling with low smOccupancyAvg: bandwidth-bound. dramActiveAvg near ceiling with high smOccupancyAvg: genuinely compute-saturated.
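The guidance above can be sketched as a small rule function over one saturation block. This is only an illustration of how a consumer might combine the primitives: the function name `interpret`, the finding labels, and every threshold (`near_ceiling`, `low_occ`, `high_occ`) are assumptions, not OpenCost defaults.

```python
from typing import Optional


def interpret(sat: dict, utilization: Optional[float] = None,
              near_ceiling: float = 0.9, low_occ: float = 0.3,
              high_occ: float = 0.6) -> list[str]:
    """Apply the guidance to one saturation block; missing fields are never treated as zero."""
    findings = []
    violations = sat.get("throttleViolationRatios") or {}
    reasons = sat.get("throttleReasonRatios") or {}
    pressure = sat.get("memoryPressureRatio")

    # Busy but keeping up: requires observed zeros, not absent fields.
    throttles = list(violations.values()) + list(reasons.values())
    if (utilization is not None and utilization >= near_ceiling
            and throttles and all(v == 0 for v in throttles) and pressure == 0):
        findings.append("busy but keeping up")

    # Power-limited: either power signal above zero.
    if violations.get("power", 0) > 0 or reasons.get("sw_power_cap", 0) > 0:
        findings.append("power-limited")

    # Likely rejected work: memory at the ceiling together with XID errors.
    mem_max = sat.get("memoryUsedRatioMax")
    memory_tight = (mem_max is not None and mem_max >= near_ceiling) or (pressure or 0) > 0
    if memory_tight and sat.get("xidErrorCount", 0) > 0:
        findings.append("work likely rejected")

    # Bandwidth-bound vs compute-saturated needs both profiling signals.
    dram, occ = sat.get("dramActiveAvg"), sat.get("smOccupancyAvg")
    if dram is not None and occ is not None and dram >= near_ceiling:
        if occ <= low_occ:
            findings.append("bandwidth-bound")
        elif occ >= high_occ:
            findings.append("compute-saturated")
    return findings


sample = {
    "throttleViolationRatios": {"power": 0.12, "thermal": 0.01},
    "throttleReasonRatios": {"sw_power_cap": 0.15},
    "memoryUsedRatioMax": 0.97,
    "memoryPressureRatio": 0.25,
    "xidErrorCount": 0,
    "dramActiveAvg": 0.62,
    "smOccupancyAvg": 0.31,
}
print(interpret(sample))
```

With the sample values from the Allocation API example, only the power rule fires: memory is tight but no XID errors were observed, and dramActiveAvg is well below the assumed ceiling.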

Attribution caveat

Device-level signals (throttling, framebuffer, XID) are attributed to containers via the pod labels dcgm-exporter attaches. For exclusively assigned GPUs this is exact. For time-sliced or MPS-shared GPUs, every sharing container sees the same device-level saturation; the signal tells you the device was saturated, not which tenant caused it. MIG slices are reported as distinct devices (dcgm-exporter GPU_I_ID / GPU_I_PROFILE labels), except the throttle-reasons bitmask, which is physical-GPU-scoped.
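Because shared devices report identical device-level values to every tenant, a consumer may want to find shared GPUs before reading a saturation signal as a per-container fact. The sketch below groups containers by `gpuUUID`. The input shape (a dict keyed by container name) and the helper name `shared_devices` are hypothetical, and it assumes `gpuUUID` identifies the device a container ran on.

```python
from collections import defaultdict


def shared_devices(allocations: dict[str, dict]) -> dict[str, list[str]]:
    """Map each GPU UUID seen by more than one container to those containers."""
    by_uuid = defaultdict(list)
    for container, alloc in allocations.items():
        uuid = (alloc.get("gpuAllocation") or {}).get("gpuUUID")
        if uuid:
            by_uuid[uuid].append(container)
    return {uuid: sorted(names) for uuid, names in by_uuid.items() if len(names) > 1}


# Hypothetical allocations: two containers time-slice one GPU, one has its own.
allocations = {
    "trainer": {"gpuAllocation": {"gpuUUID": "GPU-aaa"}},
    "notebook": {"gpuAllocation": {"gpuUUID": "GPU-aaa"}},
    "inference": {"gpuAllocation": {"gpuUUID": "GPU-bbb"}},
    "web": {},
}
print(shared_devices(allocations))
```

For any UUID returned here, the saturation values say that the device was saturated and nothing about which container drove it.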

Scheduler-level saturation metrics

DCGM cannot see work that never reached a GPU. OpenCost additionally emits cluster-scoped gauges on /metrics from the Kubernetes scheduler's view, one series per observed GPU resource name (nvidia.com/gpu, nvidia.com/gpu.shared, nvidia.com/mig-*):

| Metric | Meaning |
|---|---|
| cluster_gpu_pending_pod_count{resource} | Pods in Pending phase requesting the resource. |
| cluster_gpu_pending_request_total{resource} | GPU units requested by those pending pods. |
| cluster_gpu_requested_allocatable_ratio{resource} | Units requested by all non-terminated pods divided by allocatable units. Values near or above 1 indicate scheduler-level saturation, including exhaustion of time-sliced/MPS replicas. Only emitted when allocatable capacity exists. |

These can be disabled individually through the standard metrics configuration (the same mechanism as node_gpu_count et al.).
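The arithmetic behind the three gauges can be sketched as follows. This is not OpenCost's implementation: the simplified pod records (`phase`, `gpu_requests`) and the function name `scheduler_saturation` are hypothetical stand-ins for what would be read from the Kubernetes API.

```python
from collections import defaultdict


def scheduler_saturation(pods: list[dict], allocatable: dict[str, float]) -> dict[str, dict]:
    """Compute per-resource pending counts, pending units, and requested/allocatable ratio."""
    pending_pods = defaultdict(int)
    pending_units = defaultdict(float)
    requested = defaultdict(float)
    for pod in pods:
        if pod["phase"] in ("Succeeded", "Failed"):
            continue  # terminated pods no longer hold requests
        for resource, units in pod["gpu_requests"].items():
            requested[resource] += units
            if pod["phase"] == "Pending":
                pending_pods[resource] += 1
                pending_units[resource] += units

    result = {}
    for resource in sorted(set(requested) | set(allocatable)):
        gauges = {
            "cluster_gpu_pending_pod_count": pending_pods[resource],
            "cluster_gpu_pending_request_total": pending_units[resource],
        }
        capacity = allocatable.get(resource, 0)
        if capacity > 0:  # the ratio is only emitted when allocatable capacity exists
            gauges["cluster_gpu_requested_allocatable_ratio"] = requested[resource] / capacity
        result[resource] = gauges
    return result


pods = [
    {"phase": "Running", "gpu_requests": {"nvidia.com/gpu": 2}},
    {"phase": "Pending", "gpu_requests": {"nvidia.com/gpu": 1}},
    {"phase": "Succeeded", "gpu_requests": {"nvidia.com/gpu": 4}},
    {"phase": "Pending", "gpu_requests": {"nvidia.com/mig-1g.5gb": 1}},
]
for resource, gauges in scheduler_saturation(pods, {"nvidia.com/gpu": 2}).items():
    print(resource, gauges)
```

In this example nvidia.com/gpu has three units requested against two allocatable, so the ratio exceeds 1, while the MIG resource has a pending pod but no allocatable capacity and therefore no ratio.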

Configuration

| Variable | Default | Description |
|---|---|---|
| GPU_SATURATION_METRICS_ENABLED | true | Query and apply GPU saturation signals in the Allocation pipeline. |
| GPU_MEMORY_SATURATION_THRESHOLD | 0.9 | Framebuffer occupancy ratio above which the GPU counts as memory-pressured. Values outside (0, 1] fall back to the default. |
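The fallback rule for GPU_MEMORY_SATURATION_THRESHOLD can be illustrated with a small parser. This is a sketch, not OpenCost's code: the function name is hypothetical, and treating an unset or unparseable value like an out-of-range one is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 0.9


def memory_saturation_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the threshold, falling back to the default outside the interval (0, 1]."""
    env = os.environ if env is None else env
    raw = env.get("GPU_MEMORY_SATURATION_THRESHOLD")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Assumption: unset or unparseable values also use the default.
        return DEFAULT_THRESHOLD
    # Zero, negatives, values above 1, and NaN all fail this comparison.
    return value if 0.0 < value <= 1.0 else DEFAULT_THRESHOLD


print(memory_saturation_threshold({"GPU_MEMORY_SATURATION_THRESHOLD": "0.8"}))
print(memory_saturation_threshold({"GPU_MEMORY_SATURATION_THRESHOLD": "1.5"}))
```

Note that 1 is inside the valid interval, while 0 is not.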

Data source differences

  • Prometheus source: the throttle-bitmask and memory-pressure signals use PromQL subqueries at the configured query resolution.
  • Collector source: a metric synthesizer joins FB_USED and FB_FREE per scrape into a per-sample occupancy ratio, which the occupancy and pressure aggregations consume. The collector source supports all signals.
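The per-scrape join and the aggregations that consume it can be sketched like this. The function names and the timestamp-keyed input dicts are hypothetical, and the pressure ratio here is computed as a fraction of samples, which approximates a fraction of the window when scrapes are evenly spaced.

```python
def occupancy_samples(fb_used: dict[int, float], fb_free: dict[int, float]) -> list[float]:
    """Join FB_USED and FB_FREE by scrape timestamp into used / (used + free)."""
    ratios = []
    for ts in sorted(fb_used.keys() & fb_free.keys()):
        total = fb_used[ts] + fb_free[ts]
        if total > 0:
            ratios.append(fb_used[ts] / total)
    return ratios


def memory_signals(ratios: list[float], threshold: float = 0.9) -> dict[str, float]:
    """Aggregate per-sample occupancy into the three memory signals."""
    if not ratios:
        return {}  # nothing observed: omit the fields rather than emit zeros
    return {
        "memoryUsedRatioAvg": sum(ratios) / len(ratios),
        "memoryUsedRatioMax": max(ratios),
        "memoryPressureRatio": sum(1 for r in ratios if r >= threshold) / len(ratios),
    }


# Scrape timestamp (seconds) -> MiB; the scrape at t=120 lacks FB_FREE and is skipped.
used = {0: 8000, 30: 15000, 60: 15500, 90: 12000, 120: 100}
free = {0: 8000, 30: 1000, 60: 500, 90: 4000}
print(memory_signals(occupancy_samples(used, free)))
```

Two of the four joined samples sit at or above the 0.9 threshold, so the pressure ratio is 0.5 even though the average occupancy stays below the threshold.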

Allocation half: DRA

Telemetry reports what devices did; Dynamic Resource Allocation (resource.k8s.io/v1, Kubernetes 1.34+) reports what was requested, allocated, and reserved. The kubemodel carries both:

  • ResourceSlices — driver-advertised device capacity per node/pool, including driver-published attributes and capacity quantities.
  • ResourceClaims — device requests (class, count), scheduler allocations (driver/pool/device), and the pods that reserved them (reservedForPodUids). A reserved-but-idle device appears here even though it never shows in DCGM usage.
  • Hydration joins the halves: each allocated device's UUID is resolved from its slice attributes (uuid, or driver-qualified */uuid), matching DCGMDevice.UUID so claims link to telemetry directly.

Claims and slices are cluster state, not time series: the model carries the state observed at hydration time. Clusters without the DRA API (or without RBAC) hydrate nothing — absence, not zeros.
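The hydration join described above can be sketched with plain dicts. The record shapes, the function names, and the `gpu.example.com` driver are hypothetical simplifications of the ResourceSlice and ResourceClaim objects; only the UUID resolution rule (plain `uuid`, else a driver-qualified `*/uuid`) comes from the description.

```python
from typing import Optional


def device_uuid(attributes: dict[str, str]) -> Optional[str]:
    """Resolve a device UUID from slice attributes: `uuid`, else any `*/uuid` key."""
    if "uuid" in attributes:
        return attributes["uuid"]
    for key, value in attributes.items():
        if key.endswith("/uuid"):
            return value
    return None


def link_claims(claims: list[dict], slices: list[dict], telemetry: dict[str, dict]) -> list[dict]:
    """Link each allocated device to its telemetry record through the slice UUID."""
    uuids = {}
    for s in slices:
        for device in s["devices"]:
            key = (s["driver"], s["pool"], device["name"])
            uuids[key] = device_uuid(device.get("attributes", {}))

    links = []
    for claim in claims:
        for a in claim["allocations"]:
            uuid = uuids.get((a["driver"], a["pool"], a["device"]))
            links.append({
                "claim": claim["name"],
                "device": a["device"],
                "uuid": uuid,
                "reservedForPodUids": claim.get("reservedForPodUids", []),
                # None means reserved but never seen in telemetry.
                "telemetry": telemetry.get(uuid) if uuid else None,
            })
    return links


slices = [{"driver": "gpu.example.com", "pool": "node-a", "devices": [
    {"name": "gpu-0", "attributes": {"gpu.example.com/uuid": "GPU-aaa"}},
    {"name": "gpu-1", "attributes": {"uuid": "GPU-bbb"}},
]}]
claims = [
    {"name": "train", "reservedForPodUids": ["pod-1"],
     "allocations": [{"driver": "gpu.example.com", "pool": "node-a", "device": "gpu-0"}]},
    {"name": "idle", "reservedForPodUids": ["pod-2"],
     "allocations": [{"driver": "gpu.example.com", "pool": "node-a", "device": "gpu-1"}]},
]
telemetry = {"GPU-aaa": {"gpuUsageAverage": 0.7}}
for link in link_claims(claims, slices, telemetry):
    print(link["claim"], link["uuid"], link["telemetry"])
```

The second claim resolves to a UUID but has no telemetry record, which is the reserved-but-idle case: visible through DRA, invisible in DCGM usage.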

RBAC: the OpenCost service account needs list/watch on resourceclaims and resourceslices in the resource.k8s.io API group. Adding these rules to the Helm chart is a follow-up.

Future work

  • Non-NVIDIA GPUs (AMD ROCm SMI exporter, Intel XPU manager) — the signal taxonomy is vendor-neutral, the queries are not.
  • PCIe/NVLink capacity ratios, once link capacity can be derived per model.
  • Device-level saturation in the kubemodel pipeline. It is modeled on the vendor-specific device type (DCGMDevice.Saturation), behind the vendor-neutral DeviceInfo/DevicePerformance/DeviceSaturation interfaces, but is not yet populated. It will be wired into the DCGM hydration path alongside the existing device usage collection.