When encoding video with FFmpeg, NVENC does not work.

DC: US-NC-1, GPU: RTX 5090. I have also switched data centers to US-IL-1 in addition to US-NC-1, but the results remain the same. Command:
ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y
ffmpeg version 7.1.1 Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 13 (Ubuntu 13.3.0-6ubuntu2~24.04)
configuration: --disable-debug --disable-doc --disable-ffplay --enable-alsa --enable-cuda-llvm --enable-cuvid --enable-ffprobe --enable-gpl --enable-libaom --enable-libass --enable-libdav1d --enable-libfdk_aac --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libkvazaar --enable-liblc3 --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libplacebo --enable-librav1e --enable-librist --enable-libshaderc --enable-libsrt --enable-libsvtav1 --enable-libtheora --enable-libv4l2 --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpl --enable-libvpx --enable-libvvenc --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-nonfree --enable-nvdec --enable-nvenc --enable-opencl --enable-openssl --enable-stripping --enable-vaapi --enable-vdpau --enable-version3 --enable-vulkan
libavutil 59. 39.100 / 59. 39.100
libavcodec 61. 19.101 / 61. 19.101
libavformat 61. 7.100 / 61. 7.100
libavdevice 61. 3.100 / 61. 3.100
libavfilter 10. 4.100 / 10. 4.100
libswscale 8. 3.100 / 8. 3.100
libswresample 5. 3.100 / 5. 3.100
libpostproc 58. 3.100 / 58. 3.100
Input #0, lavfi, from 'testsrc=duration=5:size=1280x720:rate=30':
Duration: N/A, start: 0.000000, bitrate: N/A
Stream #0:0: Video: wrapped_avframe, rgb24, 1280x720 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 30 tbn
Stream mapping:
Stream #0:0 -> #0:0 (wrapped_avframe (native) -> h264 (h264_nvenc))
Press [q] to stop, [?] for help
[h264_nvenc @ 0x5b844ea0d440] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x5b844ea0d440] No capable devices found
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Error while opening encoder - maybe incorrect parameters such as bit_rate, rate, width or height.
[vf#0:0 @ 0x5b844ea2afc0] Error sending frames to consumers: Generic error in an external library
[vf#0:0 @ 0x5b844ea2afc0] Task finished with error code: -542398533 (Generic error in an external library)
[vf#0:0 @ 0x5b844ea2afc0] Terminating thread with return code -542398533 (Generic error in an external library)
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Could not open encoder before EOF
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Task finished with error code: -22 (Invalid argument)
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Terminating thread with return code -22 (Invalid argument)
[out#0/mp4 @ 0x5b844ea0f540] Nothing was written into output file, because at least one of its streams received no packets.
frame= 0 fps=0.0 q=0.0 Lsize= 0KiB time=N/A bitrate=N/A speed=N/A
Conversion failed!
This is the result I got running the same thing on my own RTX 4090; there are no issues with the container image or the command.
docker run --rm -it --gpus=all \
-v $(pwd):/config \
linuxserver/ffmpeg:7.1.1 \
-hwaccel cuda -hwaccel_device 0 -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y
...
encoder : Lavc61.19.101 h264_nvenc
Side data:
cpb: bitrate max/min/avg: 0/0/2000000 buffer size: 4000000 vbv_delay: N/A
[out#0/mp4 @ 0x619f0fd1afc0] video:196KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 1.333094%
frame= 150 fps=0.0 q=8.0 Lsize= 199KiB time=00:00:04.90 bitrate= 332.2kbits/s speed=27.8x
No description
No description
229 Replies
Dj
Dj2mo ago
NVENC and ffmpeg are very, very sensitive to your node's specific driver version. When you're deploying your pod, use the advanced filter to select CUDA 12.8 and 12.9.
KaSuTeRaMAX
KaSuTeRaMAXOP2mo ago
I have already set that filter. It clearly does not work with either 12.8 or 12.9. I built ffmpeg from scratch inside the container, but that made no difference. None of the ffmpeg versions I tried worked. Please tell me which data centers have an RTX 5090 where NVENC works. If it is an issue on my side, I will gladly take care of it.
Dj
Dj2mo ago
NVENC is a chip on the graphics card included on every NVIDIA device produced after 2006. Have you tried any other ffmpeg version (8.0+) or any of the mainline alternative builds? https://github.com/BtbN/FFmpeg-Builds/releases/tag/latest You'll want ffmpeg-master-latest-linux64-lgpl.tar.xz. You can extract it with tar xf ffmpeg-master-latest-linux64-lgpl.tar.xz
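If it helps, the whole check can be scripted in one go (a rough sketch; the direct-download URL is assumed to follow GitHub's usual releases/download pattern for that tag):
wget https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-lgpl.tar.xz
tar xf ffmpeg-master-latest-linux64-lgpl.tar.xz
# the archive unpacks into a directory of the same name with bin/ffmpeg inside
./ffmpeg-master-latest-linux64-lgpl/bin/ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y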
riverfog7
riverfog72mo ago
From what I know, the AI training chips (A100, H100) don't have NVENC, only NVDEC, but all consumer & workstation grade cards do
riverfog7
riverfog72mo ago
NVIDIA Developer
Video Encode and Decode GPU Support Matrix
Find the related video encoding and decoding support for all NVIDIA GPU products.
riverfog7
riverfog72mo ago
works with RTX 2000 Ada
No description
riverfog7
riverfog72mo ago
doesn't work with 5090
riverfog7
riverfog72mo ago
No description
riverfog7
riverfog72mo ago
same error with community cloud
riverfog7
riverfog72mo ago
OBS Forums
OBS not working with RTX 5090 (NV_ENC_ERR_INVALID_DEVICE)
Hey! I'm not entirely sure if this is going to be an isolated case or a bigger issue across other RTX 5090 Cards. I'll gladly help find a solution for this & hopefully will help find a fix for other 50 Cards users as well. I have spent over a week working around with OBS in an attempt to fix...
riverfog7
riverfog72mo ago
looks like a driver issue
KaSuTeRaMAX
KaSuTeRaMAXOP2mo ago
Yes. I have tried everything you pointed out and confirmed that none of it works. Of course, I am also using ffmpeg-master-latest-linux64-lgpl.tar.xz. No, I don't think it's a driver issue; this is a data center issue. On the RTX 4090, the same error occurs in US-IL-1, EUR-NO-1, and EUR-IS-2, but encoding works normally in EU-RO-1. Also, some RTX 5090s have been confirmed to work. However, it's no longer possible to get assigned to those pods.
riverfog7
riverfog72mo ago
That's strange. But I heard that runpod is using early driver versions, so I thought it was a driver issue.
KaSuTeRaMAX
KaSuTeRaMAXOP2mo ago
I will present evidence that supports my claim. This is a serverless endpoint that runs the command -f lavfi -i testsrc=duration=600:size=1920x1080:rate=60 -c:v h264_nvenc -preset p1 -b:v 10M -pix_fmt yuv420p -f null - -benchmark -stats using the linuxserver/ffmpeg:version-8.0-cli image on a serverless worker. The bad workers show the same error I reported, while the properly functioning workers correctly display encoding speed logs. Therefore, this cannot be concluded as a driver issue, since there are GPU servers that operate normally. The four attached screenshots are logs from the workers that function correctly. The fifth image shows a list of both healthy and unhealthy workers. (All unhealthy ones output the error I posted and then stopped processing.) And as an important detail, I have confirmed that there is no difference in the Driver Version between the bad workers and the healthy ones. Therefore, this is a data center issue.
No description
No description
No description
No description
No description
KaSuTeRaMAX
KaSuTeRaMAXOP2mo ago
I have presented evidence that this is a data center issue. Please investigate. I would appreciate it if you could escalate the ticket.
riverfog7
riverfog72mo ago
this is strange
NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9: works (EU-RO-1)
NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8: does not work (EUR-IS-2)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: does not work (EUR-IS-1)
NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9: does not work (EU-RO-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: works (EUR-IS-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: works (EUR-IS-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: does not work (EUR-IS-1)
So it doesn't depend on driver version? It seems like EU-RO-1 uses newer drivers, but it sometimes doesn't work there too.
KaSuTeRaMAX
KaSuTeRaMAXOP2mo ago
Yes. NVENC reacts sensitively to things like driver version, but even with exactly the same version, in exactly the same DC, and with exactly the same configuration, differences in behavior can be observed. As you know, EUR-IS-2 is 570.172.08, but there are workers operating normally on 570.172.08, and there are also bad workers on 570.172.08. The attached image shows information from a worker that operated normally.
No description
No description
No description
No description
riverfog7
riverfog72mo ago
Yeah, I can confirm. And I didn't know that the runpod dashboard shows the driver version.
Dj
Dj2mo ago
I really appreciate you looking into this. I'll have this escalated. Are you able to share the ids of these deployments or no? It's okay if not; I can get it all manually.
riverfog7
riverfog72mo ago
8sv5k7ublivjhq should be the serverless endpoint ID
KaSuTeRaMAX
KaSuTeRaMAXOP2mo ago
Thank you very much. Since I would like to track the situation, please escalate from this thread and issue a ticket. skwe5i0dbvkzeu is the serverless endpoint ID that I presented as evidence.
Poddy
Poddy2mo ago
@KaSuTeRaMAX
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #23406
neutel_
neutel_2mo ago
Any news on this issue?
KaSuTeRaMAX
KaSuTeRaMAXOP2mo ago
In the support ticket, they reported: "I've already escalated this case to our reliability team for deeper review." It seems they are currently investigating. I will share any new information here as soon as it becomes available. Runpod sincerely conducted additional investigation and provided support. I will share the details of the ticket:
----------------------------------------------------------------------------------------------------------
Following up on your request, we've reproduced the NVENC failure you reported across multiple regions and GPU types. After investigation, we've classified this as an upstream issue with FFmpeg/NVIDIA. Specifically, the problem appears to stem from how device indices are mapped inside containers (when /dev/nvidia* devices don't align with nvidia-smi indices). This behavior matches several active upstream reports:
https://trac.ffmpeg.org/ticket/11694
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1249
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1209
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1197
https://github.com/NVIDIA/k8s-device-plugin/issues/1282
Since the root cause lies upstream, a permanent fix will need to come from the FFmpeg/NVIDIA teams. That said, we'll continue to monitor developments closely and will keep you updated on any relevant progress or workarounds.
----------------------------------------------------------------------------------------------------------
flexgrip
flexgrip2w ago
I am seeing the same issue @KaSuTeRaMAX. Thank god I found this. I thought I was going crazy. Seems to be a random roll of the dice on whether a pod will work or not. That being said, I have not seen any of these errors on the serverless endpoints. I wonder if it is safe to depend on the serverless endpoints for large batches of requests or if this only affects the pods? I will read the issues mentioned in the runpod support response. Perhaps we can just ship our own binaries that have a fix. Seeing a lot of talk about the 570 driver being the culprit. When I launch RTX PRO 6000 pods I always get nvidia driver version >= 580 and have yet to see the issue on there. But that could be anecdotal. It seems there is nothing we can do to fix the problem itself. But I am going to try getting the /dev/nvidia# and passing that into ffmpeg with -hwaccel_device 0. Maybe that will work. Going to bed but will continue tinkering with it tomorrow.
riverfog7
riverfog72w ago
This might support that claim: only servers with driver version 570 didn't work.
flexgrip
flexgrip2w ago
Yeah, as I understand the issue, it's with the way multi-GPU servers start the docker container. Whatever fancy way runpod spins them up, they are probably doing something like device=1 if the first gpu is already being used in another container. As I found out last night, you can spin up the exact same image over and over on the same gpu type. Seems like unless you get gpu 0, you get this problem. The thing is, I have not seen this issue even once with RTX PRO 6000 machines. I'm gonna see right now if I can reproduce the issue, and then work around it by specifying which device to enumerate in ffmpeg. export CUDA_VISIBLE_DEVICES=0; ffmpeg ... seems to have some effect; at least I get a different error when I set different device ids. Just for fun, I tried symlinking /dev/nvidia1 to /dev/nvidia0 and it had no effect. I am kind of out of ideas on how to fix this. It feels like if you get a certain set of circumstances, you just can't work around it:
1. You select a gpu pod that isn't having all the gpus passed to it
2. Someone else is using gpu 0
3. You are on driver ~570 (although I haven't confirmed this)
Curious why I never saw this on the serverless endpoints. That's what I intend to use anyway, so if it works there, then no big deal. But I can't really finish my project if there's a random chance each serverless endpoint worker might fail too. Btw, I just launched a pod and got the same exact scenario where I got /dev/nvidia1, but since it is driver 580, it seems to work just fine.
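The brute-force version of that idea would be something like this (a rough, untested sketch: try each CUDA device index until NVENC opens):
for i in $(seq 0 $(( $(nvidia-smi -L | wc -l) - 1 ))); do
  # limit ffmpeg to one candidate GPU and run a 1-second NVENC smoke test
  if CUDA_VISIBLE_DEVICES=$i ffmpeg -hide_banner -f lavfi -i testsrc=duration=1:size=1280x720:rate=30 \
      -c:v h264_nvenc -f null - 2>&1 | grep -q "No capable devices found"; then
    echo "CUDA index $i: NVENC failed"
  else
    echo "CUDA index $i: NVENC works"
    export CUDA_VISIBLE_DEVICES=$i
    break
  fi
done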
riverfog7
riverfog72w ago
So the only workaround is using a different GPU? Or maybe you can set a higher CUDA version to avoid 570 drivers.
flexgrip
flexgrip2w ago
I guess we can try. But I didn't think they were related. Like if I use a container image that is 12.2, then I get into a pod using an RTX PRO 6000, it will tell me it's running cuda 13. I think that filter when you are making a pod is exactly that: just a filter, showing which gpus are compatible with the version of cuda you've selected. 13 isn't even selectable on there, and no matter what I choose, it seems to just have whatever the host system gives you when it is launching the container. Which makes sense.
KaSuTeRaMAX
KaSuTeRaMAXOP2w ago
>Curious why I never saw this on the serverless endpoints.
I just re-tested this issue on the serverless endpoints, and I can confirm that the same problem is still occurring as before.
flexgrip
flexgrip2w ago
Oh really 🤔 Maybe because I only really tested on RTX PRO 6000s. I've yet to see one of those GPUs have this issue, and all have been on driver >= 580.
riverfog7
riverfog72w ago
It's the host CUDA version. When you do nvidia-smi it will print out the CUDA version and the driver version. And maybe RTX PRO 6000s don't support lower driver versions, so they don't get the 570 driver.
flexgrip
flexgrip2w ago
Yeah. It seems to print out whatever version the host uses. So your base image can be like 12.2 and the host can be 13 as reported by nvidia-smi. But the filter selector when creating a pod seems to have no bearing on what version of cuda the host system uses. Like if I select 12.4 I will get 12.2 installed by my image and 13 according to nvidia-smi
riverfog7
riverfog72w ago
That's weird, I always got what I asked for. But if you set it to 12.8 it should give 12.8+ at least, and that should avoid 570 drivers.
flexgrip
flexgrip7d ago
There's a chance I don't know what I am talking about. But I am pretty confident that the cuda filter does absolutely nothing besides filter which GPUs are compatible with the version of cuda you need. Unrelated to that though, I just saw my first failure on the serverless endpoints because I got an RTX 5090 running driver 570. Weird. I just had an RTX PRO 6000 worker run one of my tasks and it was running driver 570.195.03. It worked, so I must have had gpu #0 I guess.
flexgrip
flexgrip7d ago
Great find!
riverfog7
riverfog77d ago
can you try setting cuda version just in case it works
flexgrip
flexgrip7d ago
Where at? In my image or in the selector when editing the pod/worker?
riverfog7
riverfog77d ago
here
riverfog7
riverfog77d ago
No description
No description
riverfog7
riverfog77d ago
Probably to cuda 12.8
flexgrip
flexgrip6d ago
That selector seems to somewhat help at choosing the host's cuda version. But there are times when I will choose a specific version and get back a different one. I just tested three pods. Two gave me 12.8 and the last one gave me 13
riverfog7
riverfog76d ago
I think it's 12.8+, so you are getting 13 too, since CUDA should be backwards compatible. And newer CUDA version == newer drivers, so this should help in getting newer drivers.
flexgrip
flexgrip6d ago
I think you were right that the selector works like that. I just noticed if I select 12.8 it tells me that the RTX PRO 6000 pods are unavailable, but if I select 12.9 they are available. So that case is closed. But related to this issue: if I get driver 570 there's a chance it doesn't work. But if I select 12.9 and get driver version 580 with the same gpu, it works. I guess it's possible that it's anecdotal and I've just been lucky to get gpu number 0. But I am pretty confident driver 580 doesn't have this issue.
riverfog7
riverfog76d ago
Then selecting 12.9 will make you avoid the 570 driver, because 12.9 is 580+. @Dj is this a bug? Either this is a bug, or the 12.8 selector giving CUDA 13 is a bug.
flexgrip
flexgrip6d ago
5090
🟥 Driver Version: 570.153.02 CUDA Version: 12.8
🟥 Driver Version: 575.57.08 CUDA Version: 12.9
RTX PRO 6000
🟥 Driver Version: 575.57.08 CUDA Version: 12.9 /dev/nvidia5
✅ Driver Version: 575.57.08 CUDA Version: 12.9 /dev/nvidia0
✅ Driver Version: 570.195.03 CUDA Version: 12.8 /dev/nvidia1 <-- ??
RTX PRO 6000 WK
✅ Driver Version: 575.51.03 CUDA Version: 12.9 /dev/nvidia2
RTX PRO 6000 WK (no cuda selection)
✅ Driver Version: 575.51.03 CUDA Version: 12.9 /dev/nvidia2
RTX PRO 6000 WK NORTH AMERICA (no cuda selection)
✅ Driver Version: 580.65.06 CUDA Version: 13.0 /dev/nvidia5
No clue how to get it to give me 580 with cuda 13. It seems random. I tried selecting nothing, which is what I've done in the past to get it to give me 580. It gave me 12.9. So maybe it's just random or a regional thing.
Boom. Selected North America 🇺🇸
Driver Version: 580.65.06 CUDA Version: 13.0
My serverless endpoint workers are all NA except one. But I don't think you can choose where your workers come from. Oh yes you can, in the advanced section. Perfect. So if we can get a list of what each region's pods are running for cuda, I could technically go to production with this. Maybe.
riverfog7
riverfog76d ago
So it's CUDA 12.x? I'm confused now.
flexgrip
flexgrip6d ago
I'm confused too 😂 My working theory is that driver 580 seems to not have this problem. However, it seems to not happen at all on the RTX PRO 6000 WK. But the only way I can get driver 580 has been to use that card. So 🤷
riverfog7
riverfog76d ago
How about CUDA 13? Does that give you driver 58X consistently?
flexgrip
flexgrip6d ago
Every time so far, yes. If I see 580, I see CUDA 13.
flexgrip
flexgrip6d ago
If you look at my table above, some of it is a friggin mystery still. Like that one time I got the 3rd nvidia card in my container, and it still worked
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
so if you pick cuda 13 you get nvidia driver >= 580
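easy to sanity-check from inside the pod, e.g.:
# print the driver major version and warn if it's older than 580 (the 580 threshold is just our working theory here)
DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
echo "driver: $DRIVER"
[ "${DRIVER%%.*}" -ge 580 ] || echo "driver major < 580, NVENC may hit the bug on this host"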
flexgrip
flexgrip6d ago
I wish I could pick CUDA 13. Maybe runpod can tell us which regions are running 580/13.0.
riverfog7
riverfog76d ago
yeah
flexgrip
flexgrip6d ago
But right now if you select north America and don’t pick any cuda version, I have gotten 580/13 100% of the time. But that’s like 15 locations so I definitely haven’t confirmed if all of them have it. So I will need runpod to confirm so I can filter my serverless endpoint to only select those. But again, I still don’t know how some of those tests I ran worked. They were not getting the “default” gpu 0 and they were not on 580 and they worked. What’s real? Is the sky blue? Are birds real?
riverfog7
riverfog76d ago
Oh wait, it works with the API @flexgrip
riverfog7
riverfog76d ago
No description
flexgrip
flexgrip6d ago
I was gonna check that. See if I can use the api to specify cuda 13
riverfog7
riverfog76d ago
they just didn't give that option in the UI
flexgrip
flexgrip6d ago
You are a beauty
riverfog7
riverfog76d ago
lol
flexgrip
flexgrip6d ago
So you just set it to cuda: 13 and it made that?
riverfog7
riverfog76d ago
this
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
works
flexgrip
flexgrip6d ago
Hell yeah
riverfog7
riverfog76d ago
And setting that to 14.0 fails, so they are doing some kind of validation even when the option is invalid. 13.1 also fails like this.
riverfog7
riverfog76d ago
No description
flexgrip
flexgrip6d ago
That was you trying to make a cuda 13 instance or an invalid version number?
riverfog7
riverfog76d ago
Invalid version number. Requesting cuda 13.1 or cuda 14 returns that error; cuda 13.0 (you need the .0) works fine, and pods spun up like that have cuda 13.0.
flexgrip
flexgrip6d ago
This is great. This only leaves some type of confirmation of what is causing this. Or not what is causing it, but what is causing it to work. I keep wondering… just because I haven’t seen the error on driver 580 doesn’t mean I won’t. I feel like I could just as easily say RTX PRO 6000 WK’s don’t have the issue either.
riverfog7
riverfog76d ago
yeah you have to test if it works in that driver version
flexgrip
flexgrip6d ago
I guess I could automate a test for this and just smash the API with it. I've been assuming that if I get a pod and see /dev/nvidia5, that means I've got the sixth GPU. But in a few examples above when I was testing earlier, I got some successes with /dev/nvidia1 and /dev/nvidia2, so maybe that device number doesn't mean the index? If so, how could we tell?
riverfog7
riverfog76d ago
GitHub
who creates /dev/nvidia0 · NVIDIA open-gpu-kernel-modules · Discu...
Thank you very much for your answer. The problem I encountered is: I can get the PCI device number of NVIDIA graphics card, such as (81:00.0). I want to use this device number to correspond to my l...
flexgrip
flexgrip6d ago
Hmm. I wonder if that device minor number is visible from inside the container
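Should be checkable from inside the container with something like this (a sketch; nvidia-smi -q normally reports a per-GPU "Minor Number", and the driver's proc files carry a "Device Minor" line too):
ls -l /dev/nvidia[0-9]*                                   # device nodes we were handed, with their major/minor numbers
nvidia-smi -q | grep -E "Minor Number|GPU UUID"           # minor number the driver assigned to each visible GPU
grep -H "Device Minor" /proc/driver/nvidia/gpus/*/information 2>/dev/null   # per-PCI-bus view, if /proc is exposed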
riverfog7
riverfog76d ago
I made a script.
ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y
Is this command right? And it looks like not many machines with 5090s that have CUDA 13 are available.
riverfog7
riverfog76d ago
Maybe I am wrong with the testing. If this command is right, this should be correct.
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
idk at this point. Note that 5090s with CUDA 13 are rare; I think there is 1 machine available.
flexgrip
flexgrip6d ago
That command looks right.
riverfog7
riverfog76d ago
I can't get more than 4 concurrently.
flexgrip
flexgrip6d ago
How are you checking for pass/fail? If ffmpeg returns anything but 0?
riverfog7
riverfog76d ago
this
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
Conversion failed! If that's in stderr, it's a failed run.
riverfog7
riverfog76d ago
this is the raw output
flexgrip
flexgrip6d ago
Gotcha. Yeah, I was grepping with grep -q "No capable devices found"
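The full check is roughly this (a sketch; it catches either marker and tags the result with the pod id):
LOG=$(ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y 2>&1)
if echo "$LOG" | grep -qE "No capable devices found|Conversion failed"; then
  echo "NVENC FAIL on ${RUNPOD_POD_ID:-unknown}"
else
  echo "NVENC PASS on ${RUNPOD_POD_ID:-unknown}"
fi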
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
Doesn't this look off? No "No capable devices found"... oh, there is.
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
Anyways, I would like to test with the RTX 6000 Blackwell but I'm broke 😢
flexgrip
flexgrip6d ago
I've been sitting here trying to patch ffmpeg.
flexgrip
flexgrip6d ago
No description
riverfog7
riverfog76d ago
can i ask why you need ffmpeg? at this point i think using other platforms to transcode is better lol
flexgrip
flexgrip6d ago
I'm trying to use nvenc to convert archives of videos to streamable formats. I guess I could try gstreamer instead
riverfog7
riverfog76d ago
non-blackwell cards fail too?
flexgrip
flexgrip6d ago
I dunno. I haven't tried them much because they're slower and don't have 9th gen NVENC. The newer Blackwell cards have 4 NVENC and seem to blow the L4 GPUs from Google Cloud Run out of the water. At least with this encoding stuff.
riverfog7
riverfog76d ago
Yeah ik. Community cloud seems to work better. RTX PRO 6000 Max-Qs were all working.
flexgrip
flexgrip6d ago
gstreamer seemed to work. I haven't thoroughly tested it, but I think it's slower than ffmpeg.
riverfog7
riverfog76d ago
Is it using cpu?
flexgrip
flexgrip6d ago
That’s what I’m trying to confirm. I ran an nvenc test on the gst-bad nvenc plugin and it seemed to spit out a video. Can’t tell if it falls back to cpu
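(For reference, the kind of test pipeline I mean is roughly this; a sketch using the nvcodec elements from gst-plugins-bad, untested here:)
gst-launch-1.0 -e videotestsrc num-buffers=300 ! video/x-raw,format=NV12,width=1280,height=720,framerate=30/1 ! nvh264enc ! h264parse ! mp4mux ! filesink location=/tmp/test_gst.mp4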
riverfog7
riverfog76d ago
Maybe try seeing if the nvidia card pulls more power when encoding, or if gstreamer uses more than 100% CPU while encoding.
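Or just watch the encoder counters directly in a second shell while it runs, e.g. (assuming the standard nvidia-smi fields):
nvidia-smi dmon -s u            # the enc column going above 0 means NVENC is actually being used
nvidia-smi --query-gpu=encoder.stats.sessionCount,encoder.stats.averageFps --format=csv -l 1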
flexgrip
flexgrip6d ago
I was deleting a couple pods and accidentally deleted the one I compiled gstreamer on. So now I gotta go through all that mess again.
riverfog7
riverfog76d ago
why not binary releases?
flexgrip
flexgrip6d ago
But I will. If gstreamer can do it, then ffmpeg clearly can be patched. I don't know if there are any precompiled binaries for the gstreamer nvenc plugins. I didn't really look though.
riverfog7
riverfog76d ago
Stack Overflow
How to install gstreamer nvcodec vs nvdec/nvenc plugins on Ubuntu 2...
Installed gstreamer and gstreamer-plugins-bad on ubuntu 20.04 via the apt repo. I also installed the Video_Codec SDK 11.0 from Nvidia. The gst-ispect command shows me nvenc and nvdec is installed ...
riverfog7
riverfog76d ago
It says you'll get them automatically. Try it on the official runpod pytorch template, which has Ubuntu 22 from what I know.
flexgrip
flexgrip6d ago
Oh cool
riverfog7
riverfog76d ago
hmm
flexgrip
flexgrip6d ago
The gst-plugins-bad with Ubuntu doesn’t have nvenc. I’ll have to compile it again in the morning.
riverfog7
riverfog76d ago
It won't work. Try this: gst-inspect-1.0 nvcodec. It does have nvcodec.
riverfog7
riverfog76d ago
No description
riverfog7
riverfog76d ago
but no nvh264enc?
flexgrip
flexgrip6d ago
Yep
riverfog7
riverfog76d ago
GitHub
env-setup/gst-nvidia-docker at master · jackersson/env-setup
Useful scripts, docker containers. Contribute to jackersson/env-setup development by creating an account on GitHub.
riverfog7
riverfog76d ago
looks like someone made a script
Dj
Dj6d ago
Woah, huge thread. Great to see you all working on it, but the 12.8 selector should not give you CUDA 13 machines. However, you're correct that a non-zero amount of our servers are on 13.0, but it's not an amount that I can guarantee will always be available. And naturally we'll do what we can from the backend^ Just a little awkward while we're running maintenance.
riverfog7
riverfog75d ago
All community cloud instances I have tested worked, @Dj. This is strange. All of them were RTX PRO 6000 instances; I didn't try 5090s yet.
Dj
Dj5d ago
Work or don't work? I think one of the facets of this is the host's operating system/kernel version. Let me take a look.
riverfog7
riverfog75d ago
Worked. All community cloud instances with RTX PRO worked. I wasn't able to spin up many because there were not many available.
flexgrip
flexgrip5d ago
Every RTX PRO 6000 WK I have tried worked no matter what version or driver or which enumerated gpu I got
riverfog7
riverfog75d ago
@Dj would you like the pod ids? This is secure cloud; the folder names are pod ids.
Dj
Dj5d ago
Interesting, on secure cloud all of these machines use Ubuntu 24.04.2 or 24.04.3. The unsecure cloud host uses Ubuntu 22.04.5. So probably not the operating system.
flexgrip
flexgrip5d ago
I’ll try the community cloud instances tonight.
Dj
Dj5d ago
There's not a lot and I can't guarantee the availability.
riverfog7
riverfog75d ago
Yeah, I got under 10 pods total.
Dj
Dj5d ago
I can tell you we have 1 machine on the community cloud with this GPU, and one machine physically cannot support more than 8 GPUs.
riverfog7
riverfog75d ago
and half of them didn't even work (probably pulling image)
Dj
Dj5d ago
Maybe it's not one machine, but it's 1 OS, and that usually indicates one machine. It's 2 :) @riverfog7 What cuda version was the one you were on? If you know, it could've only been 13.0 or 12.8.
riverfog7
riverfog75d ago
has 12.9
riverfog7
riverfog75d ago
No description
Dj
Dj5d ago
uh? nvcc --version
riverfog7
riverfog75d ago
is nvidia-smi inaccurate? I can't ssh back because the test is automated
Dj
Dj5d ago
I learned recently nvidia-smi will show the highest cuda version the driver supports
riverfog7
riverfog75d ago
pod is already terminated
Dj
Dj5d ago
o7 that's fine. Would you happen to have the prompt from the pod? root@12345678... Or does the output only give you the result of nvidia-smi?
riverfog7
riverfog75d ago
8mmj1nmc6r2ksh this is the podid
Dj
Dj5d ago
perfect
riverfog7
riverfog75d ago
i don't have the prompt
Dj
Dj5d ago
We have this machine listed as CUDA 12.9. Weird, when I queried for it, it showed as 12.8.
riverfog7
riverfog75d ago
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 8mmj1nmc6r2ksh
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 mgrd9lo1q1bptd
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 mx91o0m3l84i0c
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 uirv051063dg6j
drwxr-xr-x@ 6 riverfog7 staff 192 Oct 16 09:58 yh7yf9vtb4o2k1
drwxr-xr-x@ 6 riverfog7 staff 192 Oct 16 09:58 yl08tpmbn6esh9
these are the ones I tested
Dj
Dj5d ago
I just opened this, this is excellent actually
riverfog7
riverfog75d ago
I do have some 12.8 ones too
Dj
Dj5d ago
If you do manage to find a correlation let me know, not that you're obligated to, and I can very easily create (or run?) a script that simulates a bunch of different variables to pull details. I think we have this chalked down to these issues from the last time we got a report like this: https://trac.ffmpeg.org/ticket/11694 https://github.com/NVIDIA/nvidia-container-toolkit/issues/1249 https://github.com/NVIDIA/nvidia-container-toolkit/issues/1209 https://github.com/NVIDIA/nvidia-container-toolkit/issues/1197 https://github.com/NVIDIA/k8s-device-plugin/issues/1282
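Something like this is probably all that script needs to pull per worker (a sketch; the env var names are the standard NVIDIA container toolkit ones plus the RunPod pod id):
echo "pod=${RUNPOD_POD_ID:-unknown}"
echo "NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-unset} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
nvidia-smi --query-gpu=index,name,uuid,driver_version,pci.bus_id --format=csv,noheader
ls -l /dev/nvidia[0-9]*      # which device nodes (and minor numbers) the container actually received
ffmpeg -v error -f lavfi -i testsrc=duration=1:size=1280x720:rate=30 -c:v h264_nvenc -f null - && echo "NVENC OK" || echo "NVENC FAIL"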
riverfog7
riverfog75d ago
OP said gstreamer worked so it might be a bug in ffmpeg
Dj
Dj5d ago
But this issue was reopened, by another? customer just yesterday with the following reproduction and we rolled that up into this too. We discussed a few workarounds, but aren't happy with any of them as they all have their own issues. We know it's ffmpeg, we just don't know really about the details or why. https://trac.ffmpeg.org/ticket/11694
riverfog7
riverfog75d ago
only correlation is this?
No description
riverfog7
riverfog75d ago
Stupidly, blue is failure and red is success.
Dj
Dj5d ago
Does it help to know that after our maintenance the lowest driver version in the fleet will be 570.195.03?
riverfog7
riverfog75d ago
No description
flexgrip
flexgrip5d ago
That’s the driver I have the most issues with 😂 I wish I had escalated privs on one of these nodes so I could test a few things. For example I wonder if this could be fixed by running mknod with the major and minor from /dev/nvidiaX
MAJOR=$(stat -c '%t' /dev/nvidiaX)
MINOR=$(stat -c '%T' /dev/nvidiaX)

mknod /dev/nvidia0 c 0x$MAJOR 0x$MINOR
chmod 666 /dev/nvidia0
riverfog7
riverfog75d ago
these are community 5090s forgot to mention it
Dj
Dj5d ago
I don't have ssh access to the hosts but I do have a lot of other permissions. @riverfog7 I can message you a credit to continue your testing if you'd like.
riverfog7
riverfog75d ago
sure but its midterm season soon so i don't know if i can continue for long 😅
Dj
Dj5d ago
ah i understand
flexgrip
flexgrip5d ago
I still wonder why the 6000 WK has worked no matter what version I get. I guess we need to check the index in nvidia-smi to see if it's just luck. This is the kind of bug I used to love working on when I was at nvidia. Don't have access to those kinds of testing rigs anymore though.
riverfog7
riverfog75d ago
Are there any stats that would be nice to have when debugging?
riverfog7
riverfog75d ago
what is this?
No description
riverfog7
riverfog75d ago
display attached = True?
riverfog7
riverfog75d ago
No description
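(not sure what that dashboard field maps to; the closest standard nvidia-smi properties I know of are these:)
nvidia-smi --query-gpu=index,display_mode,display_active --format=csv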
flexgrip
flexgrip5d ago
Is that the thingy you need to set if you want to do stuff like vnc or X11 forwarding?
riverfog7
riverfog75d ago
i don't know about that part well
riverfog7
riverfog75d ago
This time it's 280 instances of 5090s, mixed between community and secure cloud.
flexgrip
flexgrip5d ago
Is your test easily runnable? I don't mind burning through credits testing other scenarios. Other GPUs, I should say.
riverfog7
riverfog75d ago
its just doing this
No description
riverfog7
riverfog75d ago
in a template
riverfog7
riverfog75d ago
No description
No description
No description
No description
riverfog7
riverfog75d ago
No description
No description
No description
riverfog7
riverfog75d ago
autogluon feature importance
No description
riverfog7
riverfog75d ago
It just identified every GPU by id. idk at this point; this is probably a software bug inside nvidia or ffmpeg.
flexgrip
flexgrip4d ago
Yeah, I read all the tickets and associated links this morning. Nothing seems to be reliable in the reproduction. Some people say you need to be GPU 0, others say the last gpu is working or an odd number in between. Some say the bug is a regression starting at driver 570. Others have reproduced it on 550 and lower. Nobody seems to be focused on fixing it. Some ffmpeg references say it's an issue with nvcodec itself. Just ran several simultaneous iterations of a quick encoding task on the PRO 6000 WK; not a single one failed. All gave me 580/13. First test on a PRO 6000, 575/12.9: fail. Had a failure on a serverless worker. I didn't catch it in time to see the logs, but the only difference is that it was not in NA.
flexgrip
flexgrip4d ago
No description
flexgrip
flexgrip4d ago
I just had an idea to try and do a health check on a serverless endpoint. My question is, once a worker gets my image and goes idle, does it already have this issue or not? Was hoping I could do a health check and if it fails, the worker terminates and a new one is created until I am left with nothing but workers without this issue. But when my health check fails, whatever is orchestrating the containers just restarts it instead of terminating and launching on a new worker. Wonder if I can fail with a different error code to get it to terminate?
riverfog7
riverfog74d ago
You can self-destruct with this in a pod:
runpodctl remove pod ${RUNPOD_POD_ID}
I'm not sure about serverless. They come with pod-scoped API keys afaik, so no need to configure credentials.
flexgrip
flexgrip4d ago
I’ll give it a shot. I don’t know if my concept is flawed though. Like, if a worker starts the container and doesn’t get the error in ffmpeg, does that mean when a request comes in hours later that it still won’t run into this bug? I guess I don’t know how the serverless workers are orchestrated. The question is, is the gpu already assigned when the worker goes idle? And if so, does it stay that way?
riverfog7
riverfog74d ago
@Dj is there an API to terminate serverless workers individually? It's possible in the serverless console thingy.
flexgrip
flexgrip4d ago
If so, this fixes everything for me. Just takes a little longer to deploy
riverfog7
riverfog74d ago
But you have to consider the possibility of getting the same host. You can see in my test that there are multiple overlapping gpu ids.
flexgrip
flexgrip4d ago
What do you mean?
riverfog7
riverfog74d ago
No description
flexgrip
flexgrip4d ago
I think I’m saying the same thing as you
riverfog7
riverfog74d ago
Test ffmpeg fails -> terminate worker -> runpod spins up the same worker. That can happen.
flexgrip
flexgrip4d ago
Ohh
riverfog7
riverfog74d ago
Probably not a problem if there are many GPUs, but you are using the RTX PRO 6000 and they have limited supply.
flexgrip
flexgrip4d ago
Well, when I manually terminate one I always get back a worker with a different id. But I don't know if that's unique or not. I'm using L40, L40S, RTX PRO 6000, and I think one other GPU.
riverfog7
riverfog74d ago
the id should be different
riverfog7
riverfog74d ago
No description
riverfog7
riverfog74d ago
The pod id is different here, but the GPU UUID is the same for some pods.
flexgrip
flexgrip4d ago
I guess I don’t know what a worker truly is. Is it a shared server? Is it a shared cluster? Etc. Because it could totally just be server rack that picks up requests from the queue and runs docker run -it … -gpus=5 (not really but you get the idea). That means it could work for one request then fail the next. Otherwise if it’s consistent, then I’m ok with this janky solution.
riverfog7
riverfog74d ago
My opinion is that a serverless worker is just a pod (worker id = pod id) that runs on the same infrastructure, hence it can share network volumes.
flexgrip
flexgrip4d ago
That’s what I think too. Pods + orchestration = serverless worker
riverfog7
riverfog74d ago
ECS in AWS terms but with an API Gateway and a queue and cloudfront
flexgrip
flexgrip4d ago
Yep. So if I can kill a worker during the initial health check, it may be a workable solution. Provided I don’t get the same one over and over 🤔
riverfog7
riverfog74d ago
So this point still stands. This is only a problem when there are like 10 GPUs available, 7 of them are not working, and you are trying to get 5 workers. And you go broke because you get billed for the ffmpeg health check time.
flexgrip
flexgrip4d ago
Oh I never thought about it billing me for the deploy time
riverfog7
riverfog74d ago
You should get billed for the health check because the container is already started.
flexgrip
flexgrip4d ago
That's a good point. I'll have to see where I can run this in the lifecycle. I put the health check right before the serverless handler and deployed, and noticed some of the instances kept initializing. So I assumed it was running and I just couldn't see the logs.
riverfog7
riverfog74d ago
I think it's failing the health check and the host just restarts it. Pods do the same thing when containers exit abnormally; the host restarts it until it works. Do you have anything in serverless console -> logs, instead of serverless console -> workers -> worker -> logs?
flexgrip
flexgrip4d ago
Nothing from the deploy. So it's either not running the health check, or none of the logs it generates during a deploy are in the logs I can see. The only thing I changed was adding the health check, and the only two outcomes I saw were workers initializing over and over or becoming ready and successfully handling requests.
riverfog7
riverfog74d ago
I just spun up a random serverless endpoint, and it looks like "Initializing" is pulling and extracting the image, and "Running" is the actual container running. So when a worker is created: initialize -> running (load model in memory, health check, etc.) -> idle (waits for a request).
flexgrip
flexgrip4d ago
I only ever get the running state when I send a request. I just get initializing -> idle
riverfog7
riverfog74d ago
oh I think i set this to get the worker count to go up
No description
riverfog7
riverfog74d ago
That makes sense. I think I'm wrong then. My question is: "Is anything happening after the container starts and before the serverless handler starts billed?"
riverfog7
riverfog74d ago
It's not really clear from this explanation.
No description
flexgrip
flexgrip4d ago
What I thought was happening is: deploy > workers are assigned and they all pull your image. Then request > container starts. But after deploy, does it run the container at all? If so, health check + terminate on fail would work. Otherwise it won't. I've never noticed it charging me for the deploy phases. Hmm. I don't think this will work now. There is no health check I can find in the docs for queue-based serverless. Ugh. I am spending too much time thinking about this each night. I gotta just implement my own queue and let the occasional failures retry. At most, terminate failed workers.
riverfog7
riverfog74d ago
Do Dockerfile health checks work?
flexgrip
flexgrip4d ago
I don’t think the worker is even running the container until you send a request to it. At least that’s my theory
riverfog7
riverfog74d ago
Hmm
flexgrip
flexgrip4d ago
I wonder how Google gets around this
riverfog7
riverfog74d ago
google?
flexgrip
flexgrip4d ago
With cloud run gpu instances. We’ve processed lots of video using those and never ran into this problem. They’re on L4 gpus
riverfog7
riverfog74d ago
idk about Cloud Run, but in AWS ECS containers run inside VMs, not like runpod (shared host).
flexgrip
flexgrip4d ago
Google Cloud Run is using docker, at least for their second gen runtimes. Maybe that's the answer: just run a docker container inside the docker container 😅
riverfog7
riverfog74d ago
Good news: serverless workers count as pods ig, so runpodctl remove pod <workerid> works
if eval ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y > /dev/null 2> /dev/null; then
echo "FFMPEG with NVENC encoding succeeded."
else
echo "FFMPEG with NVENC encoding failed."
runpodctl remove pod ${RUNPOD_POD_ID}
fi
should work if the env variable is correct and runpodctl is installed in the pod
flexgrip
flexgrip4d ago
Sorry. Wife aggro. I'll check this out. So with that scenario, the job will fail, but it will take the worker out with it.
Dj
Dj4d ago
Correct. To the API a serverless worker is a pod named after the endpoint id :')
flexgrip
flexgrip4d ago
And there's no way during a deploy to tap into any of the health checks?
riverfog7
riverfog74d ago
That should depend on the docker cmd running at worker initialization
flexgrip
flexgrip2d ago
Yeah, I don't think the container itself runs. I suspect the pod just fetches the image and makes sure everything is ready for when it receives a request. OK, new tactic... My API forms the request, sends it to the serverless endpoint, then receives the webhook back on success or fail. On fail, it retries. Meanwhile, on the worker, if we get the bad CUDA ffmpeg response about no capable devices, we terminate the worker. Provided there is enough delay, the retry will go to the next worker while the last one is being terminated. So the system should only waste a few seconds on retries and terminating before eventually hitting a successful worker. This also has the added side effect of constantly pruning bad workers. So far it's working as expected.
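The worker-side guard is basically the snippet from earlier, wired into the job (a sketch):
# quick NVENC probe before doing real work; on the bad-GPU-mapping error, fail the job
# (so the webhook retry lands on another worker) and take this worker out of the pool
if ffmpeg -v error -f lavfi -i testsrc=duration=1:size=1280x720:rate=30 -c:v h264_nvenc -f null - 2>&1 | grep -q "No capable devices found"; then
  runpodctl remove pod ${RUNPOD_POD_ID}
  exit 1
fi
# ...otherwise run the real encode here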
riverfog7
riverfog72d ago
Hmm
