GPU Saturation Signals

OpenCost derives GPU saturation signals from dcgm-exporter metrics, following the USE method:

Utilization — how busy the GPU was (already exposed as gpuAllocation.gpuUsageAverage from DCGM_FI_PROF_GR_ENGINE_ACTIVE).
Saturation — work that was queued, rejected, or slowed because the GPU could not service demand. That is what the signals below report.

Every signal is an independent primitive. OpenCost deliberately does not compute a composite saturation score; consumers combine the primitives as they see fit.

Absence semantics

A missing field always means the source metric was unavailable: dcgm-exporter is not running in the cluster, the DCGM field is not enabled in the exporter's configuration, or the GPU lacks DCP (profiling) support. OpenCost never emits a zero in place of missing data, so 0 can be trusted to mean "observed, and not saturated".
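A consumer can preserve this distinction by treating an absent key differently from a numeric zero. The sketch below is illustrative only: `read_signal` is a hypothetical helper, not part of OpenCost, and the sample dictionary is made up.

```python
from typing import Any, Mapping, Optional


def read_signal(saturation: Mapping[str, Any], field: str) -> Optional[float]:
    """Return the signal value, or None when the field is absent.

    None means "source metric unavailable"; 0.0 means "observed, not saturated".
    """
    if field not in saturation:
        return None
    return float(saturation[field])


# Hypothetical saturation block: one zero, one non-zero, one missing field.
saturation = {"memoryPressureRatio": 0, "smActiveAvg": 0.55}

for name in ("memoryPressureRatio", "smActiveAvg", "dramActiveAvg"):
    value = read_signal(saturation, name)
    if value is None:
        print(f"{name}: unavailable")
    else:
        print(f"{name}: {value:.2f}")
```

Defaulting a missing field to 0 (for example with `saturation.get(name, 0)`) would erase exactly the information the absence carries.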

Allocation API

Saturation appears on the Allocation API under gpuAllocation.saturation, per container:

{
  "gpuAllocation": {
    "gpuDevice": "nvidia0",
    "gpuModel": "Tesla T4",
    "gpuUUID": "GPU-...",
    "saturation": {
      "throttleViolationRatios": { "power": 0.12, "thermal": 0.01 },
      "throttleReasonRatios": { "sw_power_cap": 0.15 },
      "memoryUsedRatioAvg": 0.81,
      "memoryUsedRatioMax": 0.97,
      "memoryPressureRatio": 0.25,
      "xidErrorCount": 0,
      "dramActiveAvg": 0.62,
      "smActiveAvg": 0.55,
      "smOccupancyAvg": 0.31,
      "pcieTxBytesAvg": 1.2e9,
      "pcieRxBytesAvg": 2.0e9
    }
  }
}

The saturation block is controlled by GPU_SATURATION_METRICS_ENABLED (default true).
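As a rough illustration of consuming this structure, the sketch below parses a trimmed copy of the fragment above and picks the dominant throttle reason. The helper names `saturation_of` and `worst_throttle` are hypothetical, and the sketch assumes the allocation object has already been extracted from the API response.

```python
import json

# Trimmed copy of the documented fragment; values are the example values.
ALLOCATION_JSON = """
{
  "gpuAllocation": {
    "gpuDevice": "nvidia0",
    "gpuModel": "Tesla T4",
    "saturation": {
      "throttleViolationRatios": {"power": 0.12, "thermal": 0.01},
      "memoryUsedRatioMax": 0.97,
      "xidErrorCount": 0
    }
  }
}
"""


def saturation_of(allocation: dict) -> dict:
    """Return the saturation block, or {} when the allocation has none."""
    gpu = allocation.get("gpuAllocation") or {}
    return gpu.get("saturation") or {}


def worst_throttle(saturation: dict):
    """Return (reason, ratio) for the largest violation ratio, or None."""
    ratios = saturation.get("throttleViolationRatios") or {}
    if not ratios:
        return None
    reason = max(ratios, key=ratios.get)
    return reason, ratios[reason]


allocation = json.loads(ALLOCATION_JSON)
print(worst_throttle(saturation_of(allocation)))
```

Both helpers return an empty result rather than zeros when the block is missing, which keeps the absence semantics intact for downstream code.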

Signal reference

Ratios are fractions of the queried window in [0, 1] unless noted.

Field	DCGM source	In default dcgm-exporter config?	Needs DCP?	Meaning
`throttleViolationRatios` (`power`, `thermal`, `sync_boost`, `board_limit`)	`DCGM_FI_DEV_POWER_VIOLATION`, `DCGM_FI_DEV_THERMAL_VIOLATION`, `DCGM_FI_DEV_SYNC_BOOST_VIOLATION`, `DCGM_FI_DEV_BOARD_LIMIT_VIOLATION`	Yes	No	Fraction of the window the GPU spent throttled for the reason, from cumulative microsecond violation counters. The strongest direct saturation signal available by default.
`throttleReasonRatios` (`sw_power_cap`, `hw_slowdown`, `sync_boost`, `sw_thermal`, `hw_thermal`, `hw_power_brake`)	`DCGM_FI_DEV_CLOCK_THROTTLE_REASONS` (renamed `DCGM_FI_DEV_CLOCKS_EVENT_REASONS` in DCGM 3.3+; OpenCost queries both)	No — must be enabled	No	Fraction of samples in which each saturation-relevant bit of the NVML throttle-reasons bitmask was set. Richer reason breakdown than the violation counters (hardware slowdown, power brake). Idle/configured-clock bits are excluded by design. Reported for the whole physical GPU even under MIG or time-slicing.
`memoryUsedRatioAvg`, `memoryUsedRatioMax`	`DCGM_FI_DEV_FB_USED`, `DCGM_FI_DEV_FB_FREE`	Yes	No	Framebuffer occupancy `used / (used + free)`. Sustained values near 1.0 mean new allocations are likely to fail or force eviction.
`memoryPressureRatio`	same	Yes	No	Fraction of the window occupancy was at or above the configured threshold (`GPU_MEMORY_SATURATION_THRESHOLD`, default `0.9`).
`xidErrorCount`	`DCGM_FI_DEV_XID_ERRORS`	Yes	No	XID error events observed in the window, a rejected-work signal. The DCGM field reports the last XID code, so consecutive identical errors are undercounted.
`dramActiveAvg`, `dramActiveMax`	`DCGM_FI_PROF_DRAM_ACTIVE`	Yes	Yes	Ratio of cycles the device memory interface was active. Near-ceiling values with low `smOccupancyAvg` indicate a memory-bandwidth-bound workload.
`smActiveAvg`	`DCGM_FI_PROF_SM_ACTIVE`	No — must be enabled	Yes	Ratio of cycles in which an SM had at least one warp assigned, averaged over all SMs.
`smOccupancyAvg`	`DCGM_FI_PROF_SM_OCCUPANCY`	No — must be enabled	Yes	Ratio of resident warps to the SM maximum. Together with `smActiveAvg` and `dramActiveAvg`, lets consumers distinguish compute-bound vs bandwidth-bound vs latency-bound saturation.
`pcieTxBytesAvg`, `pcieRxBytesAvg`	`DCGM_FI_PROF_PCIE_TX_BYTES`, `DCGM_FI_PROF_PCIE_RX_BYTES`	Yes	Yes	Average PCIe throughput in bytes/sec. DCGM does not export link capacity, so these are raw rates; deriving a capacity ratio is future work.
`nvlinkTxBytesAvg`, `nvlinkRxBytesAvg`	`DCGM_FI_PROF_NVLINK_TX_BYTES`, `DCGM_FI_PROF_NVLINK_RX_BYTES`	No — must be enabled	Yes	Average NVLink throughput in bytes/sec; same capacity caveat as PCIe.
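The two throttle rows can be made concrete with a small sketch. It is not OpenCost's implementation: the function names are hypothetical, the counter-reset clamp is an assumption, and reporting every reason (including 0.0) whenever samples exist is one possible reading of the absence semantics. The bit values are those of the NVML clocks-throttle-reasons bitmask.

```python
# NVML throttle-reason bits relevant to saturation. The idle (0x01) and
# configured-clock (0x02, 0x100) bits are left out on purpose.
REASON_BITS = {
    "sw_power_cap": 0x04,
    "hw_slowdown": 0x08,
    "sync_boost": 0x10,
    "sw_thermal": 0x20,
    "hw_thermal": 0x40,
    "hw_power_brake": 0x80,
}


def violation_ratio(counter_start_us, counter_end_us, window_seconds):
    """Fraction of the window spent throttled, from a cumulative
    microsecond violation counter sampled at both ends of the window."""
    if window_seconds <= 0:
        raise ValueError("window must be positive")
    # Assumption: a counter that went backwards (reset) contributes nothing.
    delta_us = max(counter_end_us - counter_start_us, 0)
    return min(delta_us / (window_seconds * 1_000_000), 1.0)


def reason_ratios(bitmask_samples):
    """Fraction of samples in which each saturation-relevant bit was set."""
    if not bitmask_samples:
        return {}  # no samples observed: report nothing rather than zeros
    total = len(bitmask_samples)
    return {
        name: sum(1 for sample in bitmask_samples if sample & bit) / total
        for name, bit in REASON_BITS.items()
    }


# 36 s of power violation inside a 5-minute window.
print(violation_ratio(1_000_000, 37_000_000, 300))
# Four samples: idle-free, power cap, power cap + HW slowdown, GPU idle.
print(reason_ratios([0x0, 0x4, 0x4 | 0x8, 0x1]))
```

In the second call the idle-only sample (`0x1`) counts toward the denominator but sets none of the tracked bits, which is how excluding the idle bit plays out in practice.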

"Needs DCP" means the field comes from DCGM's profiling module (DCP), which requires Volta or newer GPUs and is unavailable in some virtualized environments. "Must be enabled" means the field exists in the dcgm-exporter default configuration file but is commented out (or absent), and must be uncommented/added for the signal to appear.

Interpretation guidance (USE)

High utilization with zero throttle/pressure ratios: the GPU is busy but keeping up — buying a faster GPU may not help.
throttleViolationRatios.power or sw_power_cap sustained above zero: the GPU is power-limited; demand exceeds what the power envelope can service.
memoryUsedRatioMax near 1.0 or memoryPressureRatio > 0 alongside xidErrorCount > 0: work is likely being rejected (OOM-style failures).
dramActiveAvg near ceiling with low smOccupancyAvg: bandwidth-bound; with high smOccupancyAvg: genuinely compute-saturated.
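The guidance above can be expressed as a small rule set. This is only a sketch: the thresholds (`mem_near_full`, `dram_ceiling`, `occ_low`, `occ_high`) and the label strings are hypothetical choices, and the "compute-saturated" rule assumes the high-occupancy case also requires `dramActiveAvg` near its ceiling. The first rule needs utilization, which is not part of the saturation block, so it is omitted.

```python
def classify(sat, mem_near_full=0.95, dram_ceiling=0.8, occ_low=0.3, occ_high=0.6):
    """Map a saturation block to a list of coarse findings."""
    findings = []

    violations = sat.get("throttleViolationRatios") or {}
    reasons = sat.get("throttleReasonRatios") or {}
    if violations.get("power", 0) > 0 or reasons.get("sw_power_cap", 0) > 0:
        findings.append("power-limited")

    memory_full = (
        sat.get("memoryUsedRatioMax", 0) >= mem_near_full
        or sat.get("memoryPressureRatio", 0) > 0
    )
    if memory_full and sat.get("xidErrorCount", 0) > 0:
        findings.append("work-likely-rejected")

    # The bound-type rules need both signals; skip them if either is absent.
    dram = sat.get("dramActiveAvg")
    occupancy = sat.get("smOccupancyAvg")
    if dram is not None and occupancy is not None and dram >= dram_ceiling:
        if occupancy <= occ_low:
            findings.append("bandwidth-bound")
        elif occupancy >= occ_high:
            findings.append("compute-saturated")

    return findings


example = {
    "throttleViolationRatios": {"power": 0.12, "thermal": 0.01},
    "memoryUsedRatioMax": 0.97,
    "xidErrorCount": 0,
    "dramActiveAvg": 0.62,
    "smOccupancyAvg": 0.31,
}
print(classify(example))
```

With the example values from the Allocation API section, only the power rule fires: memory is nearly full but no XID errors were observed, and DRAM activity is below the assumed ceiling.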

Attribution caveat

Device-level signals (throttling, framebuffer, XID) are attributed to containers via the pod labels dcgm-exporter attaches. For exclusively assigned GPUs this is exact. For time-sliced or MPS-shared GPUs, every sharing container sees the same device-level saturation; the signal tells you the device was saturated, not which tenant caused it. MIG slices are reported as distinct devices (dcgm-exporter GPU_I_ID / GPU_I_PROFILE labels), except the throttle-reasons bitmask, which is physical-GPU-scoped.

Scheduler-level saturation metrics

DCGM cannot see work that never reached a GPU. OpenCost additionally emits cluster-scoped gauges on /metrics from the Kubernetes scheduler's view, one series per observed GPU resource name (nvidia.com/gpu, nvidia.com/gpu.shared, nvidia.com/mig-*):

Metric	Meaning
`cluster_gpu_pending_pod_count{resource}`	Pods in `Pending` phase requesting the resource.
`cluster_gpu_pending_request_total{resource}`	GPU units requested by those pending pods.
`cluster_gpu_requested_allocatable_ratio{resource}`	Units requested by all non-terminated pods divided by allocatable units. Values near or above 1 indicate scheduler-level saturation, including exhaustion of time-sliced/MPS replicas. Only emitted when allocatable capacity exists.

These can be disabled individually through the standard metrics configuration (the same mechanism as node_gpu_count et al.).
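To make the three definitions concrete, the sketch below derives them from a simplified view of pod and node state. The input shapes and the function name are hypothetical, "terminated" is assumed to mean the `Succeeded` or `Failed` phase, and pending pods are assumed to count toward the requested total.

```python
from collections import defaultdict


def scheduler_saturation(pods, allocatable):
    """Compute per-resource pending pod count, pending units, and the
    requested/allocatable ratio.

    pods: iterable of {"phase": str, "requests": {resource: units}}
    allocatable: {resource: units} summed over nodes
    """
    pending_pods = defaultdict(int)
    pending_units = defaultdict(float)
    requested = defaultdict(float)

    for pod in pods:
        phase = pod["phase"]
        if phase in ("Succeeded", "Failed"):
            continue  # terminated pods no longer hold a request
        for resource, units in pod["requests"].items():
            requested[resource] += units
            if phase == "Pending":
                pending_pods[resource] += 1
                pending_units[resource] += units

    # The ratio is only produced where allocatable capacity exists.
    ratios = {
        resource: requested[resource] / capacity
        for resource, capacity in allocatable.items()
        if capacity > 0
    }
    return dict(pending_pods), dict(pending_units), ratios


pods = [
    {"phase": "Running", "requests": {"nvidia.com/gpu": 2}},
    {"phase": "Running", "requests": {"nvidia.com/gpu": 1}},
    {"phase": "Pending", "requests": {"nvidia.com/gpu": 2}},
    {"phase": "Succeeded", "requests": {"nvidia.com/gpu": 4}},
]
print(scheduler_saturation(pods, {"nvidia.com/gpu": 4}))
```

Here five units are requested against four allocatable, so the ratio exceeds 1 and one pod remains pending.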

Configuration

Variable	Default	Description
`GPU_SATURATION_METRICS_ENABLED`	`true`	Query and apply GPU saturation signals in the Allocation pipeline.
`GPU_MEMORY_SATURATION_THRESHOLD`	`0.9`	Framebuffer occupancy ratio at or above which the GPU counts as memory-pressured. Values outside `(0, 1]` fall back to the default.
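The fallback rule for the threshold might look like the following sketch. The function name is hypothetical, and treating an unparseable value the same as an out-of-range one is an assumption rather than documented behavior.

```python
import os

DEFAULT_THRESHOLD = 0.9


def memory_threshold(env=os.environ):
    """Read GPU_MEMORY_SATURATION_THRESHOLD, falling back to the default
    when the value is unset, unparseable, or outside (0, 1]."""
    raw = env.get("GPU_MEMORY_SATURATION_THRESHOLD")
    if raw is None:
        return DEFAULT_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD  # assumption: garbage behaves like out-of-range
    # NaN fails both comparisons and therefore also falls back.
    return value if 0 < value <= 1 else DEFAULT_THRESHOLD


for raw in ("0.8", "1", "0", "1.5", "abc"):
    print(raw, "->", memory_threshold({"GPU_MEMORY_SATURATION_THRESHOLD": raw}))
```

Note that `1` is accepted (the interval is closed on the right) while `0` is rejected.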

Data source differences

Prometheus source: the throttle-bitmask and memory-pressure signals use PromQL subqueries at the configured query resolution.
Collector source: framebuffer occupancy is joined from FB_USED and FB_FREE per scrape by a metric synthesizer, producing a per-sample ratio that the occupancy and pressure aggregations consume. All signals are supported.
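The per-scrape join described for the collector source can be sketched as below. The function names and the timestamp-keyed input shape are hypothetical, and approximating "fraction of the window" by "fraction of samples" assumes evenly spaced scrapes.

```python
def join_occupancy(fb_used, fb_free):
    """Join two {timestamp: value} series into {timestamp: used / (used + free)}.

    Scrapes missing either side are dropped instead of being filled with zeros.
    """
    joined = {}
    for ts, used in fb_used.items():
        free = fb_free.get(ts)
        if free is None or used + free <= 0:
            continue
        joined[ts] = used / (used + free)
    return joined


def memory_signals(occupancy, threshold=0.9):
    """Aggregate per-sample occupancy into the three memory signals."""
    ratios = list(occupancy.values())
    if not ratios:
        return {}  # nothing observed: emit no fields
    return {
        "memoryUsedRatioAvg": sum(ratios) / len(ratios),
        "memoryUsedRatioMax": max(ratios),
        # Samples at or above the threshold count as memory-pressured.
        "memoryPressureRatio": sum(1 for r in ratios if r >= threshold) / len(ratios),
    }


# Made-up samples; the scrape at t=45 has no FB_FREE value and is skipped.
fb_used = {0: 8000, 15: 14400, 30: 15520, 45: 12000}
fb_free = {0: 8000, 15: 1600, 30: 480}
print(memory_signals(join_occupancy(fb_used, fb_free)))
```

Two of the three joined samples sit at or above 0.9, so the pressure ratio comes out at two thirds.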

Allocation half: DRA

Telemetry reports what devices did; Dynamic Resource Allocation (resource.k8s.io/v1, Kubernetes 1.34+) reports what was requested, allocated, and reserved. The kubemodel carries both:

ResourceSlices — driver-advertised device capacity per node/pool, including driver-published attributes and capacity quantities.
ResourceClaims — device requests (class, count), scheduler allocations (driver/pool/device), and the pods that reserved them (reservedForPodUids). A reserved-but-idle device appears here even though it never shows in DCGM usage.
Hydration joins the halves: each allocated device's UUID is resolved from its slice attributes (uuid, or driver-qualified */uuid), matching DCGMDevice.UUID so claims link to telemetry directly.

Claims and slices are cluster state, not time series: the model carries the state observed at hydration time. Clusters without the DRA API (or without RBAC) hydrate nothing — absence, not zeros.
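The UUID join can be illustrated with a simplified sketch. It does not use the real kubemodel types: attributes are flattened to plain strings, the driver name `gpu.example.com` and the UUIDs are made up, and both function names are hypothetical.

```python
def resolve_uuid(attributes):
    """Return the device UUID from slice attributes, or None.

    Prefers the bare "uuid" key, then any driver-qualified "<domain>/uuid" key.
    """
    if "uuid" in attributes:
        return attributes["uuid"]
    for key in sorted(attributes):
        if key.endswith("/uuid"):
            return attributes[key]
    return None


def link_claims_to_telemetry(allocated_devices, slice_attributes, dcgm_by_uuid):
    """Map each allocated (driver, pool, device) to its telemetry record.

    A value of None marks a device that was allocated but has no matching
    telemetry, for example a reserved-but-idle device.
    """
    links = {}
    for key in allocated_devices:
        uuid = resolve_uuid(slice_attributes.get(key, {}))
        links[key] = dcgm_by_uuid.get(uuid) if uuid else None
    return links


driver, pool = "gpu.example.com", "pool-a"
slice_attributes = {
    (driver, pool, "gpu-0"): {"uuid": "GPU-aaa"},
    (driver, pool, "gpu-1"): {"gpu.example.com/uuid": "GPU-bbb"},
    (driver, pool, "gpu-2"): {"model": "example"},
}
dcgm_by_uuid = {"GPU-aaa": {"gpuUsageAverage": 0.7}}

links = link_claims_to_telemetry(list(slice_attributes), slice_attributes, dcgm_by_uuid)
for key, record in links.items():
    print(key[2], "->", record)
```

Only `gpu-0` links to telemetry; `gpu-1` resolves a UUID that has no matching record, and `gpu-2` publishes no UUID attribute at all.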

RBAC: the OpenCost service account needs list/watch on resourceclaims and resourceslices in the resource.k8s.io API group (the Helm chart will be updated in a follow-up).

Future work

Non-NVIDIA GPUs (AMD ROCm SMI exporter, Intel XPU manager) — the signal taxonomy is vendor-neutral, the queries are not.
PCIe/NVLink capacity ratios, once link capacity can be derived per model.
Device-level saturation in the kubemodel pipeline is modeled on the vendor-specific device type (DCGMDevice.Saturation) behind the vendor-neutral DeviceInfo/DevicePerformance/DeviceSaturation interfaces, but not yet populated; it will be wired into the DCGM hydration path alongside the existing device usage collection.
