# GPU Saturation Signals

OpenCost derives GPU **saturation** signals from
[dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) metrics, following
the [USE method](https://www.brendangregg.com/usemethod.html):

- **Utilization** — how busy the GPU was (already exposed as
  `gpuAllocation.gpuUsageAverage` from `DCGM_FI_PROF_GR_ENGINE_ACTIVE`).
- **Saturation** — work that was queued, rejected, or slowed because the GPU
  could not service demand. That is what the signals below report.

Every signal is an independent primitive. OpenCost deliberately does **not**
compute a composite saturation score; consumers combine the primitives as
they see fit.

## Absence semantics

A missing field always means *the source metric was unavailable*:
dcgm-exporter is not running in the cluster, the DCGM field is not in the
exporter's configuration, or the GPU lacks DCP (profiling) support. OpenCost
never emits a zero in place of missing data, so `0` can be trusted to mean
"observed, and not saturated".
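
A consumer can preserve this distinction by keeping "absent" and "zero" as separate states. The following is a minimal Python sketch, assuming the `saturation` object has already been parsed from an Allocation API response; the helper name `read_signal` is illustrative and not part of OpenCost.

```python
from typing import Optional

def read_signal(saturation: dict, field: str) -> Optional[float]:
    """Return the signal value, or None when the source metric was unavailable."""
    value = saturation.get(field)
    # Avoid `value or 0.0` here: it would turn "missing" into "not saturated".
    return None if value is None else float(value)

saturation = {"memoryPressureRatio": 0.0, "xidErrorCount": 0}

observed = read_signal(saturation, "memoryPressureRatio")  # observed, not saturated
missing = read_signal(saturation, "smOccupancyAvg")        # metric unavailable
```

Here `observed` is `0.0` while `missing` is `None`, so downstream code can skip or flag unavailable signals instead of treating them as healthy.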

## Allocation API

Saturation appears on the Allocation API under
`gpuAllocation.saturation`, per container:

```json
{
  "gpuAllocation": {
    "gpuDevice": "nvidia0",
    "gpuModel": "Tesla T4",
    "gpuUUID": "GPU-...",
    "saturation": {
      "throttleViolationRatios": { "power": 0.12, "thermal": 0.01 },
      "throttleReasonRatios": { "sw_power_cap": 0.15 },
      "memoryUsedRatioAvg": 0.81,
      "memoryUsedRatioMax": 0.97,
      "memoryPressureRatio": 0.25,
      "xidErrorCount": 0,
      "dramActiveAvg": 0.62,
      "smActiveAvg": 0.55,
      "smOccupancyAvg": 0.31,
      "pcieTxBytesAvg": 1.2e9,
      "pcieRxBytesAvg": 2.0e9
    }
  }
}
```

The `saturation` object is controlled by `GPU_SATURATION_METRICS_ENABLED`
(default `true`).

## Signal reference

Ratio fields are fractions in `[0, 1]`, computed over the queried window,
unless noted.

| Field | DCGM source | In default dcgm-exporter config? | Needs DCP? | Meaning |
|---|---|---|---|---|
| `throttleViolationRatios` (`power`, `thermal`, `sync_boost`, `board_limit`) | `DCGM_FI_DEV_POWER_VIOLATION`, `DCGM_FI_DEV_THERMAL_VIOLATION`, `DCGM_FI_DEV_SYNC_BOOST_VIOLATION`, `DCGM_FI_DEV_BOARD_LIMIT_VIOLATION` | Yes | No | Fraction of the window the GPU spent throttled for the reason, from cumulative microsecond violation counters. The strongest direct saturation signal available by default. |
| `throttleReasonRatios` (`sw_power_cap`, `hw_slowdown`, `sync_boost`, `sw_thermal`, `hw_thermal`, `hw_power_brake`) | `DCGM_FI_DEV_CLOCK_THROTTLE_REASONS` (renamed `DCGM_FI_DEV_CLOCKS_EVENT_REASONS` in DCGM 3.3+; OpenCost queries both) | **No — must be enabled** | No | Fraction of samples in which each saturation-relevant bit of the NVML throttle-reasons bitmask was set. Richer reason breakdown than the violation counters (hardware slowdown, power brake). Idle/configured-clock bits are excluded by design. Reported for the whole physical GPU even under MIG or time-slicing. |
| `memoryUsedRatioAvg`, `memoryUsedRatioMax` | `DCGM_FI_DEV_FB_USED`, `DCGM_FI_DEV_FB_FREE` | Yes | No | Framebuffer occupancy `used / (used + free)`. Sustained values near 1.0 mean new allocations are likely to fail or force eviction. |
| `memoryPressureRatio` | same | Yes | No | Fraction of the window in which occupancy was at or above the configured threshold (`GPU_MEMORY_SATURATION_THRESHOLD`, default `0.9`). |
| `xidErrorCount` | `DCGM_FI_DEV_XID_ERRORS` | Yes | No | XID error events observed in the window, a rejected-work signal. The DCGM field reports the *last* XID code, so consecutive identical errors are undercounted. |
| `dramActiveAvg`, `dramActiveMax` | `DCGM_FI_PROF_DRAM_ACTIVE` | Yes | **Yes** | Ratio of cycles the device memory interface was active. Near-ceiling values with low `smOccupancyAvg` indicate a memory-bandwidth-bound workload. |
| `smActiveAvg` | `DCGM_FI_PROF_SM_ACTIVE` | **No — must be enabled** | **Yes** | Ratio of cycles in which an SM had at least one warp assigned, averaged over all SMs. |
| `smOccupancyAvg` | `DCGM_FI_PROF_SM_OCCUPANCY` | **No — must be enabled** | **Yes** | Ratio of resident warps to the SM maximum. Together with `smActiveAvg` and `dramActiveAvg`, lets consumers distinguish compute-bound vs bandwidth-bound vs latency-bound saturation. |
| `pcieTxBytesAvg`, `pcieRxBytesAvg` | `DCGM_FI_PROF_PCIE_TX_BYTES`, `DCGM_FI_PROF_PCIE_RX_BYTES` | Yes | **Yes** | Average PCIe throughput in bytes/sec. DCGM does not export link capacity, so these are raw rates; deriving a capacity ratio is future work. |
| `nvlinkTxBytesAvg`, `nvlinkRxBytesAvg` | `DCGM_FI_PROF_NVLINK_TX_BYTES`, `DCGM_FI_PROF_NVLINK_RX_BYTES` | **No — must be enabled** | **Yes** | Average NVLink throughput in bytes/sec; same capacity caveat as PCIe. |

"Needs DCP" means the field comes from DCGM's profiling module (DCP), which
requires Volta or newer GPUs and is unavailable in some virtualized
environments. "Must be enabled" means the field is commented out in, or
absent from, the dcgm-exporter default configuration file and must be
uncommented or added for the signal to appear.
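
The two throttle signals are computed differently: one from the increase of a cumulative counter, the other from per-sample bits. The Python sketch below illustrates both calculations. It is not OpenCost's implementation: the function names and sample data are hypothetical, counter resets are ignored, and the bit values are assumed to match NVML's clocks-throttle-reason constants.

```python
# Saturation-relevant bits of the NVML throttle-reasons bitmask.
# GPU idle (0x1), applications clocks setting (0x2) and display clock
# setting (0x100) are deliberately left out.
THROTTLE_BITS = {
    "sw_power_cap": 0x04,
    "hw_slowdown": 0x08,
    "sync_boost": 0x10,
    "sw_thermal": 0x20,
    "hw_thermal": 0x40,
    "hw_power_brake": 0x80,
}

def violation_ratio(start_us: float, end_us: float, window_seconds: float) -> float:
    """Fraction of the window spent throttled, from a cumulative microsecond counter."""
    throttled_us = max(end_us - start_us, 0.0)
    return min(throttled_us / (window_seconds * 1_000_000), 1.0)

def reason_ratios(samples: list[int]) -> dict[str, float]:
    """Fraction of bitmask samples in which each saturation-relevant bit was set."""
    if not samples:
        return {}  # no samples: report absence, not zeros
    return {
        name: sum(1 for s in samples if s & bit) / len(samples)
        for name, bit in THROTTLE_BITS.items()
    }

# The counter grew by 72 s worth of microseconds over a 10-minute window.
power = violation_ratio(5_000_000, 77_000_000, 600)

# Four samples: idle only, power-capped, power-capped plus HW slowdown, clean.
ratios = reason_ratios([0x01, 0x04, 0x0C, 0x00])
```

With this input, `power` is 0.12, `ratios["sw_power_cap"]` is 0.5, and `ratios["hw_slowdown"]` is 0.25. The idle-only sample contributes to the denominator but to none of the reported reasons.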

### Interpretation guidance (USE)

- High utilization with **zero** throttle/pressure ratios: the GPU is busy
  but keeping up — buying a faster GPU may not help.
- `throttleViolationRatios.power` or `sw_power_cap` sustained above zero:
  the GPU is power-limited; demand exceeds what the power envelope can
  service.
- `memoryUsedRatioMax` near 1.0 or `memoryPressureRatio` > 0 alongside
  `xidErrorCount` > 0: work is likely being rejected (OOM-style failures).
- `dramActiveAvg` near ceiling with low `smOccupancyAvg`: bandwidth-bound;
  with high `smOccupancyAvg`: genuinely compute-saturated.
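
These rules can be encoded directly on the consumer side. The Python sketch below is one possible reading: the function name, the returned labels, and the numeric cut-offs for "near ceiling", "low", and "high" are illustrative assumptions, since the guidance above does not define thresholds.

```python
NEAR_CEILING = 0.9    # assumed cut-off for "near ceiling" / "near 1.0"
LOW_OCCUPANCY = 0.4   # assumed cut-off for "low" smOccupancyAvg
HIGH_OCCUPANCY = 0.7  # assumed cut-off for "high" smOccupancyAvg

def interpret(sat: dict) -> list[str]:
    """Map one `saturation` object to coarse findings. Missing fields never match."""
    findings = []

    violations = sat.get("throttleViolationRatios") or {}
    reasons = sat.get("throttleReasonRatios") or {}
    if (violations.get("power") or 0) > 0 or (reasons.get("sw_power_cap") or 0) > 0:
        findings.append("power-limited")

    mem_full = (sat.get("memoryUsedRatioMax") or 0) >= NEAR_CEILING
    mem_pressure = (sat.get("memoryPressureRatio") or 0) > 0
    if (mem_full or mem_pressure) and (sat.get("xidErrorCount") or 0) > 0:
        findings.append("work-likely-rejected")

    dram, occupancy = sat.get("dramActiveAvg"), sat.get("smOccupancyAvg")
    if dram is not None and occupancy is not None and dram >= NEAR_CEILING:
        if occupancy <= LOW_OCCUPANCY:
            findings.append("bandwidth-bound")
        elif occupancy >= HIGH_OCCUPANCY:
            findings.append("compute-saturated")

    return findings

example = {
    "throttleViolationRatios": {"power": 0.12, "thermal": 0.01},
    "memoryUsedRatioMax": 0.97,
    "memoryPressureRatio": 0.25,
    "xidErrorCount": 0,
    "dramActiveAvg": 0.62,
    "smOccupancyAvg": 0.31,
}
findings = interpret(example)
```

For the example payload only the power rule fires: memory is nearly full, but no XID errors were observed, and DRAM activity is well below the assumed ceiling.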

### Attribution caveat

Device-level signals (throttling, framebuffer, XID) are attributed to
containers via the pod labels dcgm-exporter attaches. For exclusively
assigned GPUs this is exact. For time-sliced or MPS-shared GPUs, every
sharing container sees the same device-level saturation; the signal tells
you the *device* was saturated, not which tenant caused it. MIG slices are
reported as distinct devices (dcgm-exporter `GPU_I_ID` / `GPU_I_PROFILE`
labels), except the throttle-reasons bitmask, which is physical-GPU-scoped.

## Scheduler-level saturation metrics

DCGM cannot see work that never reached a GPU. OpenCost additionally emits
cluster-scoped gauges on `/metrics` from the Kubernetes scheduler's view,
one series per observed GPU resource name (`nvidia.com/gpu`,
`nvidia.com/gpu.shared`, `nvidia.com/mig-*`):

| Metric | Meaning |
|---|---|
| `cluster_gpu_pending_pod_count{resource}` | Pods in `Pending` phase requesting the resource. |
| `cluster_gpu_pending_request_total{resource}` | GPU units requested by those pending pods. |
| `cluster_gpu_requested_allocatable_ratio{resource}` | Units requested by all non-terminated pods divided by allocatable units. Values near or above 1 indicate scheduler-level saturation, including exhaustion of time-sliced/MPS replicas. Only emitted when allocatable capacity exists. |

These can be disabled individually through the standard metrics
configuration (the same mechanism as `node_gpu_count` et al.).
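
To make the three gauge definitions concrete, the Python sketch below computes them for a single resource name from a simplified in-memory view of pods. The `Pod` type, its fields, and the function name are hypothetical stand-ins, not OpenCost or Kubernetes client types.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pod:
    phase: str        # "Pending", "Running", "Succeeded", or "Failed"
    gpu_request: int  # units of the resource requested by the pod

def scheduler_saturation(pods: list[Pod], allocatable: int) -> dict[str, Optional[float]]:
    """Compute the three scheduler-level values for one GPU resource name."""
    pending = [p for p in pods if p.phase == "Pending" and p.gpu_request > 0]
    non_terminated = [p for p in pods if p.phase not in ("Succeeded", "Failed")]
    requested = sum(p.gpu_request for p in non_terminated)
    return {
        "pending_pod_count": len(pending),
        "pending_request_total": sum(p.gpu_request for p in pending),
        # No ratio without allocatable capacity: absent, not zero.
        "requested_allocatable_ratio": requested / allocatable if allocatable > 0 else None,
    }

pods = [Pod("Running", 4), Pod("Running", 3), Pod("Pending", 2), Pod("Succeeded", 8)]
result = scheduler_saturation(pods, allocatable=8)
```

In this example 9 units are requested against 8 allocatable, so the ratio is 1.125 and one pod requesting 2 units stays pending. The completed pod does not count toward the ratio.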

## Configuration

| Variable | Default | Description |
|---|---|---|
| `GPU_SATURATION_METRICS_ENABLED` | `true` | Query and apply GPU saturation signals in the Allocation pipeline. |
| `GPU_MEMORY_SATURATION_THRESHOLD` | `0.9` | Framebuffer occupancy ratio at or above which the GPU counts as memory-pressured. Values outside `(0, 1]` fall back to the default. |
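
The fallback rule for `GPU_MEMORY_SATURATION_THRESHOLD` can be pictured as follows. This Python sketch only illustrates the documented behavior; the function name is hypothetical, and treating an unparsable value the same as an out-of-range one is an assumption.

```python
import os

DEFAULT_THRESHOLD = 0.9

def memory_saturation_threshold(env=None) -> float:
    """Read the threshold, falling back to the default outside (0, 1]."""
    env = os.environ if env is None else env
    raw = env.get("GPU_MEMORY_SATURATION_THRESHOLD")
    if raw is None:
        return DEFAULT_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD  # assumption: unparsable behaves like out-of-range
    # The interval is open at 0 and closed at 1.
    return value if 0 < value <= 1 else DEFAULT_THRESHOLD

configured = memory_saturation_threshold({"GPU_MEMORY_SATURATION_THRESHOLD": "0.8"})
rejected = memory_saturation_threshold({"GPU_MEMORY_SATURATION_THRESHOLD": "1.5"})
```

Here `configured` is `0.8`, while `rejected` falls back to `0.9` because `1.5` lies outside `(0, 1]`.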

## Data source differences

- **Prometheus source**: the throttle-bitmask and memory-pressure signals
  use PromQL subqueries at the configured query resolution.
- **Collector source**: framebuffer occupancy is joined from FB_USED and
  FB_FREE per scrape by a metric synthesizer, producing a per-sample ratio
  that the occupancy and pressure aggregations consume. All signals are
  supported.
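
The occupancy and pressure aggregations reduce to a few lines once a per-sample ratio exists. The Python sketch below assumes evenly spaced scrapes, so that the fraction of samples approximates the fraction of the window; the function name and sample values are illustrative, and neither the actual synthesizer nor the PromQL subqueries are reproduced here.

```python
def memory_signals(fb_used: list[float], fb_free: list[float],
                   threshold: float = 0.9) -> dict[str, float]:
    """Join FB_USED and FB_FREE per scrape, then aggregate the occupancy ratio."""
    ratios = [
        used / (used + free)
        for used, free in zip(fb_used, fb_free)
        if used + free > 0  # skip scrapes with no usable framebuffer reading
    ]
    if not ratios:
        return {}  # no samples: report absence, not zeros
    return {
        "memoryUsedRatioAvg": sum(ratios) / len(ratios),
        "memoryUsedRatioMax": max(ratios),
        "memoryPressureRatio": sum(1 for r in ratios if r >= threshold) / len(ratios),
    }

# Four scrapes of a 16000 MiB framebuffer.
used = [8000.0, 12000.0, 14800.0, 15520.0]
free = [8000.0, 4000.0, 1200.0, 480.0]
signals = memory_signals(used, free)
```

The per-scrape ratios are 0.5, 0.75, 0.925, and 0.97, so two of the four samples are at or above the default threshold and `memoryPressureRatio` is 0.5.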

## Allocation half: DRA

Telemetry reports what devices *did*; Dynamic Resource Allocation (DRA,
`resource.k8s.io/v1`, Kubernetes 1.34+) reports what was *requested,
allocated, and reserved*. The kubemodel carries both:

- `ResourceSlices` — driver-advertised device capacity per node/pool,
  including driver-published attributes and capacity quantities.
- `ResourceClaims` — device requests (class, count), scheduler allocations
  (driver/pool/device), and the pods that reserved them
  (`reservedForPodUids`). A reserved-but-idle device appears here even
  though it never shows in DCGM usage.
- Hydration joins the halves: each allocated device's UUID is resolved from
  its slice attributes (`uuid`, or driver-qualified `*/uuid`), matching
  `DCGMDevice.UUID` so claims link to telemetry directly.
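
The UUID lookup in the last bullet can be sketched as below. The attribute map is flattened to plain strings, which simplifies the typed attribute values of `resource.k8s.io`; the function name, the `driver.example.com` qualifier, and the sample data are hypothetical, and preferring the bare `uuid` key over a qualified one is an assumption.

```python
from typing import Optional

def resolve_device_uuid(attributes: dict[str, str]) -> Optional[str]:
    """Find a device UUID under `uuid` or a driver-qualified `*/uuid` attribute."""
    if "uuid" in attributes:
        return attributes["uuid"]
    # Sorted for a deterministic pick if several qualified names are present.
    for name in sorted(attributes):
        if name.endswith("/uuid"):
            return attributes[name]
    return None  # no UUID published: the claim cannot be linked to telemetry

# Telemetry keyed by device UUID, as DCGM would report it (sample data).
telemetry_by_uuid = {"GPU-1234": {"smActiveAvg": 0.55}}

slice_attributes = {"driver.example.com/uuid": "GPU-1234", "productName": "Tesla T4"}
uuid = resolve_device_uuid(slice_attributes)
linked = telemetry_by_uuid.get(uuid)
```

A device whose slice publishes no UUID attribute resolves to `None` and stays unlinked, which is consistent with the absence semantics used elsewhere in this document.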

Claims and slices are cluster state, not time series: the model carries the
state observed at hydration time. Clusters without the DRA API (or without
RBAC) hydrate nothing — absence, not zeros.

**RBAC**: the OpenCost service account needs `list`/`watch` on
`resourceclaims` and `resourceslices` in the `resource.k8s.io` API group.
Adding these rules to the Helm chart is a follow-up.

## Future work

- Non-NVIDIA GPUs (AMD ROCm SMI exporter, Intel XPU manager) — the signal
  taxonomy is vendor-neutral, the queries are not.
- PCIe/NVLink capacity ratios, once link capacity can be derived per model.
- Device-level saturation in the kubemodel pipeline. It is already modeled
  on the vendor-specific device type (`DCGMDevice.Saturation`) behind the
  vendor-neutral `DeviceInfo`/`DevicePerformance`/`DeviceSaturation`
  interfaces, but it is not yet populated; it will be wired into the DCGM
  hydration path alongside the existing device usage collection.