fix(gpu): add Tegra/Jetson GPU support#625

Draft
elezar wants to merge 4 commits into main from fix/tegra-gpu-support

elezar (Member) commented Mar 26, 2026

Summary

Adds GPU support for NVIDIA Tegra/Jetson platforms by bind-mounting the
host-files configuration directory, updating the device plugin image, and
preserving CDI-injected GIDs across privilege drop.

Related Issue

Part of #398 (CDI injection). Depends on #568 (Tegra system support). Should be merged after #495 and #503.

Upstream PRs:

Changes

  • Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d (read-only) into the gateway container when present, so that the NVIDIA runtime inside k3s applies the same host-file injection config as the host. This is required for CDI spec generation on Jetson/Tegra.
  • Pin k8s-device-plugin to an image that supports host-files bind-mounts and generates additionalGids in the CDI spec (GID 44 / video, required for /dev/nvmap access on Tegra)
  • Preserve CDI-injected supplemental GIDs across initgroups() during privilege drop, so exec'd processes retain access to GPU devices
  • Fall back to /usr/sbin/nvidia-smi in the GPU e2e test for Tegra systems where nvidia-smi is not on the default PATH

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

elezar added 4 commits March 26, 2026 14:44
Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d
(read-only) into the gateway container when it exists, so the nvidia
runtime running inside k3s can apply the same host-file injection
config as on the host — required for Jetson/Tegra platforms.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
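
For reviewers, the "mount only when it exists" behavior described in this commit can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the helper name `host_files_bind_mounts`, the Docker-style `source:target:options` string, and the choice to mount at the same path inside the container are all assumptions.

```python
import os

# Path taken from the commit message above.
HOST_FILES_DIR = "/etc/nvidia-container-runtime/host-files-for-container.d"


def host_files_bind_mounts(path=HOST_FILES_DIR, exists=os.path.isdir):
    """Return read-only bind-mount specs for `path`, or [] if it is absent.

    Hypothetical helper. `exists` is injectable so the check can be
    exercised on machines that are not Tegra hosts.
    """
    if not exists(path):
        # Non-Tegra hosts usually lack this directory, so add no mount.
        return []
    # Assumption: the directory is mounted at the same path in the container.
    return [f"{path}:{path}:ro"]
```

On a host without the directory the function returns an empty list, so the gateway container starts with no extra mounts.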
Use ghcr.io/nvidia/k8s-device-plugin:2ab68c16 which includes support for
mounting /etc/nvidia-container-runtime/host-files-for-container.d into the
device plugin pod, required for correct CDI spec generation on Tegra-based
systems.

Also included is an nvcdi API bump that ensures that additional GIDs are
included in the generated CDI spec.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
initgroups(3) replaces all supplemental groups with the user's entries
from /etc/group, discarding GIDs injected by the container runtime via
CDI (e.g. GID 44/video needed for /dev/nvmap on Tegra). Snapshot the
container-level GIDs before initgroups runs and merge them back
afterwards, excluding GID 0 (root) to avoid privilege retention.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
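
The snapshot-and-merge step described in this commit can be illustrated with a small sketch. It uses Python's standard `os` wrappers rather than the project's own code, and the names `merge_supplemental_gids` and `drop_privileges` are hypothetical. Running `drop_privileges` requires the privileges to change groups, so only the pure merge function is easy to exercise in isolation.

```python
import os


def merge_supplemental_gids(snapshot, after_initgroups):
    """Merge container-injected GIDs back into the post-initgroups list.

    Keeps the order of `after_initgroups`, appends GIDs from `snapshot`
    that are missing, and never re-adds GID 0 (root).
    """
    merged = list(after_initgroups)
    seen = set(merged)
    for gid in snapshot:
        if gid == 0 or gid in seen:
            continue
        merged.append(gid)
        seen.add(gid)
    return merged


def drop_privileges(username, uid, gid):
    """Hypothetical privilege drop that preserves CDI-injected GIDs."""
    snapshot = os.getgroups()      # GIDs set by the container runtime
    os.initgroups(username, gid)   # replaces the supplemental groups
    os.setgroups(merge_supplemental_gids(snapshot, os.getgroups()))
    os.setgid(gid)
    os.setuid(uid)                 # last, since it removes the right to change groups
```

For example, `merge_supplemental_gids([0, 44, 1000], [1000, 27])` returns `[1000, 27, 44]`: GID 44 (video) is restored, while GID 0 is dropped.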
On Jetson/Tegra platforms nvidia-smi is installed at /usr/sbin/nvidia-smi
rather than /usr/bin/nvidia-smi and may not be on PATH inside the sandbox.
Fall back to the full path when the bare command is not found.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
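
The fallback described in this commit can be sketched as below. The e2e test itself may be written differently, and `resolve_nvidia_smi` is a hypothetical name. Only the `/usr/sbin/nvidia-smi` path comes from the commit message.

```python
import shutil


def resolve_nvidia_smi(which=shutil.which, fallback="/usr/sbin/nvidia-smi"):
    """Return the nvidia-smi binary to run.

    Prefers whatever is found on PATH. Otherwise falls back to the full
    path used on Jetson/Tegra. `which` is injectable for testing.
    """
    found = which("nvidia-smi")
    return found if found else fallback
```

The function does not check that the fallback path exists, so a missing binary still surfaces as a normal "command not found" failure when the test tries to run it.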
elezar self-assigned this Mar 26, 2026
elezar (Member, Author) commented Mar 26, 2026

cc @johnnynunez

@johnnynunez

LGTM
