so, i'm wondering what it looks like to modernize ublue-nvidia ...?
bsherman
bsherman•9mo ago
i'll thread it 😄 a few things have been brought up:
1) we install an SELinux policy... it provides an SELinux domain, used in the container label, to use nvidia stuff without fully disabling SELinux (at least for the container runtime) (example: podman run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda nvidia-smi) but... @akdev has comments which suggest possibly removing it... it comes from a repo which is old (2 years since last update) https://github.com/NVIDIA/dgx-selinux/tree/master/src/nvidia-container-selinux and even then, it was meant to be a reference for RHEL-based systems... not used verbatim as ublue does now...
2) also... the nvidia-container-toolkit RPM includes OCI hooks... but those are deprecated. should we be removing the old hook.json? and should we instead package a unit to generate /etc/cdi/nvidia.yaml?
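A unit for generating the CDI spec could look roughly like this (a sketch: the unit name and ordering are assumptions, but `nvidia-ctk cdi generate --output=...` is the toolkit's documented command for producing /etc/cdi/nvidia.yaml):

```ini
# /usr/lib/systemd/system/nvidia-cdi-generate.service (hypothetical name)
[Unit]
Description=Generate CDI specification for NVIDIA devices
# run after kernel modules are loaded so the devices exist
After=systemd-modules-load.service

[Service]
Type=oneshot
# nvidia-ctk ships in nvidia-container-toolkit-base (>= 1.14)
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

[Install]
WantedBy=multi-user.target
```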
akdev
akdev•9mo ago
I think so
bsherman
bsherman•9mo ago
i don't really know the consequences of the change... i think with the CDI way, we can use --device=nvidia.com/gpu=XYZ
akdev
akdev•9mo ago
yeah, and we can't use the -e NVIDIA_VISIBLE_DEVICES way. other than that, not sure
bsherman
bsherman•9mo ago
i kinda wonder if all this is too opinionated
akdev
akdev•9mo ago
wdym
bsherman
bsherman•9mo ago
maybe we remove the selinux policy because we can't really verify it's correct in the first place (seems unlikely). but how much auto-configuring a system for podman to run nvidia is too much? hmm... would be nice to have the newer container-toolkit... we are still on 1.13.5, and 1.14 is where CDI becomes "legit"
akdev
akdev•9mo ago
I think in the future the podman team will remove the default hooks-dir, so we'd have to set that instead if that happens. though tbh that seems far in the future, so we don't have to move to CDI now. and even then, setting the hooks-dir in distrobox assemble or whatever would be easy at that point too
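For reference, the hooks directory being discussed is configurable in podman's containers.conf; if the default ever goes away, it would be re-added with something like this (a sketch; `hooks_dir` under `[engine]` is a documented containers.conf key, the paths shown are the usual defaults):

```toml
# /etc/containers/containers.conf (sketch)
[engine]
# directories podman scans for OCI hook JSON files
hooks_dir = [
  "/usr/share/containers/oci/hooks.d",
  "/etc/containers/oci/hooks.d",
]
```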
bsherman
bsherman•9mo ago
yeah... currently our nvidia-container-toolkit is coming from nvidia's rhel9.0 repo so...
akdev
akdev•9mo ago
looks like they have a new one:
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
bsherman
bsherman•9mo ago
huh, and that actually points to the centos8 repos... yeah
akdev
akdev•9mo ago
it's not distro specific anymore seems like
bsherman
bsherman•9mo ago
i'm going to test that on the ucore side now since i'm set up there right now. booyah! new container-toolkit. i guess it pays to be hard headed if it leads to progress
Upgraded:
libnvidia-container-tools 1.13.5-1 -> 1.14.2-1
libnvidia-container1 1.13.5-1 -> 1.14.2-1
nvidia-container-toolkit 1.13.5-1 -> 1.14.2-1
nvidia-container-toolkit-base 1.13.5-1 -> 1.14.2-1
and... the oci-hook.json is gone 🙂 i think we have our answer on this part at least. the old SELinux policy still works... so... i guess i'm hesitant to remove it, even if the default way is for people to just label=disable. it doesn't hurt
$ podman run --security-opt label=type:nvidia_container_t --device=nvidia.com/gpu=0 --rm nvcr.io/nvidia/cuda nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1660 Ti (UUID: GPU-1e7e0c41-cecb-9064-171a-f2a22877957b)
akdev
akdev•9mo ago
fair enough, the main concern is that it may break in the future so if that happens I guess we can remove it
bsherman
bsherman•9mo ago
agreed. another interesting side-effect of the upgrade to container-toolkit 1.14: they don't provide a default config.toml anymore
akdev
akdev•9mo ago
What’s that file?
bsherman
bsherman•9mo ago
/etc/nvidia-container-runtime/config.toml
was where we were setting no-cgroups = true to enable rootless mode. i think they fixed it so both rootful and rootless work now
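For context, the old rootless workaround in that file looked like this (a sketch of the relevant section only; `no-cgroups` under `[nvidia-container-cli]` is the documented key for disabling cgroup management, which rootless podman needed pre-1.14):

```toml
# /etc/nvidia-container-runtime/config.toml (pre-1.14 workaround)
[nvidia-container-cli]
# don't let libnvidia-container manage cgroups itself;
# required for rootless podman, where it lacks the privileges
no-cgroups = true
```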
akdev
akdev•9mo ago
Oh that’s nice
bsherman
bsherman•9mo ago
[core@i5nvidia ~]$ sudo podman run --security-opt label=type:nvidia_container_t --device=nvidia.com/gpu=0 --rm nvcr.io/nvidia/cuda nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1660 Ti (UUID: GPU-1e7e0c41-cecb-9064-171a-f2a22877957b)
[core@i5nvidia ~]$ podman run --security-opt label=type:nvidia_container_t --device=nvidia.com/gpu=0 --rm nvcr.io/nvidia/cuda nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1660 Ti (UUID: GPU-1e7e0c41-cecb-9064-171a-f2a22877957b)
akdev
akdev•9mo ago
Would explain why that disappeared
bsherman
bsherman•9mo ago
very nice!
akdev
akdev•9mo ago
Try distrobox I guess, maybe it even works (but probably not)
bsherman
bsherman•9mo ago
hah, well, i don't have distrobox on ucore. i can layer it, i suppose. correction! i do have it. what am i trying? --nvidia on ubuntu or something to test nvidia-smi?
akdev
akdev•9mo ago
Use --additional-flags to add the nvidia gpu argument. See if the container will start
bsherman
bsherman•9mo ago
distrobox create --name fedora --additional-flags "--security-opt label=disable --device=nvidia.com/gpu=0"
Error: OCI runtime error: unable to start container "c528757f2a29cbabc3b997067dedd6cd34060896882a52bbf2b4bb8bb2c2fef5": crun: {"msg":"error executing hook /usr/bin/nvidia-ctk (exit code: 1)","level":"error","time":"2023-10-05T03:45:57.174488Z"}
akdev
akdev•9mo ago
Yup, same issue
bsherman
bsherman•9mo ago
so, distrobox needs an update
akdev
akdev•9mo ago
Yeah. It's probably a kind of complex update, actually
bsherman
bsherman•9mo ago
hmm... so... can we set a default device in containers.conf? 🙂 doesn't look like it
i'll argue the selinux repo is more active than it seems at first glance: 11 months ago https://github.com/NVIDIA/dgx-selinux/commit/b5ae74df51d9ec013b058d9b472df5ea231fc374 ~and a fix a couple months ago https://github.com/NVIDIA/dgx-selinux/pull/8~ not yet merged
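For reference, the key being asked about does exist: containers.conf has a `devices` list under `[containers]` that is attached to every container by default. Whether it accepts CDI names like nvidia.com/gpu=0 (rather than only /dev paths) depends on the podman version, so treat this as a hedged sketch:

```toml
# /etc/containers/containers.conf (sketch)
[containers]
# devices attached to every container by default;
# a CDI name here requires a podman new enough to resolve CDI specs
devices = [
  "nvidia.com/gpu=0",
]
```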