When encoding video with ffmpeg, nvenc does not work.
DC:US-NC-1
GPU:RTX 5090
I have switched data centers to US-IL-1 in addition to US-NC-1, but the results remain the same.
cmd
-f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y
This is the result I got running it on my RTX 4090. There are no issues with the container image or the command.


NVENC and ffmpeg are very, very sensitive to your node's specific driver version. When you're deploying your pod, use the advanced filter to select CUDA 12.8 and 12.9.
I have already made that specification.
It was clear that it would not work with either 12.8 or 12.9.
I built ffmpeg from scratch inside the container, but that made no difference.
None of the ffmpeg versions I tried worked.
Please tell me which data centers have an RTX 5090 where NVENC works.
If it is my issue, I will gladly take care of it.
NVENC is a dedicated encoder block included on every NVIDIA graphics card since the Kepler generation (2012). Have you tried any other ffmpeg version (8.0+) or any of the mainline alternative builds?
https://github.com/BtbN/FFmpeg-Builds/releases/tag/latest
You'll want ffmpeg-master-latest-linux64-lgpl.tar.xz. You can extract this with
tar xf ffmpeg-master-latest-linux64-lgpl.tar.xz
From what I know the AI training chips (A100, H100) don't have NVENC
only NVDEC
but all consumer & workstation grade cards do
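For reference, a minimal sketch of grabbing that static build and checking it was compiled with NVENC support - the direct download URL is assumed from the BtbN release page linked above, and the extracted directory name may differ:
curl -L -o /tmp/ffmpeg.tar.xz https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-lgpl.tar.xz
tar xf /tmp/ffmpeg.tar.xz -C /tmp
# list the NVENC encoders this build knows about (h264_nvenc, hevc_nvenc, av1_nvenc)
/tmp/ffmpeg-master-latest-linux64-lgpl/bin/ffmpeg -hide_banner -encoders | grep nvenc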
NVIDIA Developer
Video Encode and Decode GPU Support Matrix
Find the related video encoding and decoding support for all NVIDIA GPU products.
works with RTX 2000 Ada

doesn't work with 5090

same error with community cloud
https://obsproject.com/forum/threads/obs-not-working-with-rtx-5090-nv_enc_err_invalid_device.184606/
OBS Forums
OBS not working with RTX 5090 (NV_ENC_ERR_INVALID_DEVICE)
Hey!
I'm not entirely sure if this is going to be an isolated case or a bigger issue across other RTX 5090 Cards.
I'll gladly help find a solution for this & hopefully will help find a fix for other 50 Cards users as well.
I have spent over a week working around with OBS in an attempt to fix...
looks like a driver issue
Yes. I have tried everything you pointed out and confirmed that none of it works.
Of course, I am also using ffmpeg-master-latest-linux64-lgpl.tar.xz.
No, I don’t think so. This is a data center issue.
On the RTX 4090, the same error occurs in US-IL-1, EUR-NO-1, and EUR-IS-2, but encoding works normally on EU-RO-1.
Also, some RTX 5090s have been confirmed to work. However, it’s no longer possible to get assigned to that pod.
That's strange
But I heard that RunPod is using early driver versions
So I thought it was a driver issue
I will present evidence that supports my claim.
I set up a serverless endpoint whose workers run the command
-f lavfi -i testsrc=duration=600:size=1920x1080:rate=60 -c:v h264_nvenc -preset p1 -b:v 10M -pix_fmt yuv420p -f null - -benchmark -stats
using the linuxserver/ffmpeg:version-8.0-cli image.
The bad workers show the same error I reported, while the properly functioning workers correctly display encoding speed logs.
Therefore, this cannot be concluded as a driver issue, since there are GPU servers that operate normally.
The four attached screenshots are logs from the workers that function correctly.
The fifth image shows a list of both healthy and unhealthy workers. (All unhealthy ones output the error I posted and then stopped processing.)
And as an important detail, I have confirmed that there is no difference in the Driver Version between the bad workers and the healthy ones.
Therefore, this is a data center issue.
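For anyone trying to reproduce this outside RunPod, roughly the same test can be run locally - a sketch assuming Docker with the NVIDIA container toolkit installed, and that the linuxserver/ffmpeg image uses ffmpeg as its entrypoint (which is why the commands above omit the ffmpeg prefix):
docker run --rm --gpus all linuxserver/ffmpeg:version-8.0-cli \
  -f lavfi -i testsrc=duration=600:size=1920x1080:rate=60 \
  -c:v h264_nvenc -preset p1 -b:v 10M -pix_fmt yuv420p \
  -f null - -benchmark -stats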




I have presented evidence that this is a data center issue. Please investigate. I would appreciate it if you could escalate the ticket.
this is strange
NVIDIA-SMI 575.57.08, Driver Version 575.57.08, CUDA Version 12.9: works (EU-RO-1)
NVIDIA-SMI 570.172.08, Driver Version 570.172.08, CUDA Version 12.8: doesn't work (EUR-IS-2)
NVIDIA-SMI 570.144, Driver Version 570.144, CUDA Version 12.8: doesn't work (EUR-IS-1)
NVIDIA-SMI 575.57.08, Driver Version 575.57.08, CUDA Version 12.9: doesn't work (EU-RO-1)
NVIDIA-SMI 570.144, Driver Version 570.144, CUDA Version 12.8: works (EUR-IS-1)
NVIDIA-SMI 570.144, Driver Version 570.144, CUDA Version 12.8: works (EUR-IS-1)
NVIDIA-SMI 570.144, Driver Version 570.144, CUDA Version 12.8: doesn't work (EUR-IS-1)
it doesn't depend on driver version?
seems like EU-RO-1 uses newer drivers
but it sometimes doesn't work there too
Yes. NVENC reacts sensitively to things like driver version, but even with exactly the same version, in exactly the same DC, and with exactly the same configuration, differences in behavior can be observed.
As you know, EUR-IS-2 is 570.172.08, but there are workers operating normally on 570.172.08, and there are also bad workers on 570.172.08.
The attached image shows information from a worker that operated normally.




Yeah I can confirm
And I didn't know that the RunPod dashboard shows the driver version
I really appreciate you looking into this. I'll have this escalated.
Are you able to share the ids of these deployments or no? It's okay if not I can get it all manually.
8sv5k7ublivjhq
should be the serverless endpoint ID
Thank you very much. Since I would like to track the situation, please escalate from this thread and issue a ticket.
skwe5i0dbvkzeu
This is the serverless endpoint ID that I presented as evidence.
@KaSuTeRaMAX
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #23406
Any news on this issue?
In the support ticket, they reported:
I’ve already escalated this case to our reliability team for deeper review.
It seems they are currently investigating.
I will share any new information here as soon as it becomes available.
RunPod conducted a thorough additional investigation and provided support.
I will share the details of the ticket.
----------------------------------------------------------------------------------------------------------
Following up on your request, we’ve reproduced the NVENC failure you reported across multiple regions and GPU types. After investigation, we’ve classified this as an upstream issue with FFmpeg/NVIDIA. Specifically, the problem appears to stem from how device indices are mapped inside containers (when /dev/nvidia* devices don’t align with nvidia-smi indices).
This behavior matches several active upstream reports:
https://trac.ffmpeg.org/ticket/11694
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1249
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1209
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1197
https://github.com/NVIDIA/k8s-device-plugin/issues/1282
Since the root cause lies upstream, a permanent fix will need to come from the FFmpeg/NVIDIA teams. That said, we’ll continue to monitor developments closely and will keep you updated on any relevant progress or workarounds.
----------------------------------------------------------------------------------------------------------
I am seeing the same issue @KaSuTeRaMAX. Thank god I found this. I thought I was going crazy.
Seems to be a random roll of the dice on whether a pod will work or not. That being said, I have not seen any of these errors on the serverless endpoints.
I wonder if it is safe to depend on the serverless endpoints for large batches of requests or if this only affects the pods?
I will read the issues mentioned in the runpod support response. Perhaps we can just ship our own binaries that have a fix.
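A quick way to check whether a given pod shows the device-index mismatch those issues describe (hedged - this just compares what the driver enumerates with what the container was handed):
# GPUs as the driver enumerates them
nvidia-smi --query-gpu=index,name,uuid --format=csv
# device nodes actually mapped into this container
ls /dev/nvidia[0-9]*
# in the broken case these don't line up, e.g. only /dev/nvidia1 is present
# while nvidia-smi reports a single GPU at index 0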
Seeing a lot of talk about the 570 driver being the culprit. When I launch RTX PRO 6000 pods I always get nvidia driver version >= 580 and have yet to see the issue on there. But that could be anecdotal.
It seems there is nothing we can do to fix the problem itself. But I am going to try getting the
/dev/nvidia# index and passing that into ffmpeg with ffmpeg -hwaccel_device 0
Maybe that will work. Going to bed but will continue tinkering with it tomorrow.
This might support that claim
only servers with driver version 570 didn't work
Yeah as I understand the issue, it's with the way multiple gpu servers start the docker container. Whatever fancy way runpod spins them up, they are probably doing something like
device=1 if the first gpu is already being used in another container.
As I found out last night, you can spin up the exact same image over and over on the same gpu type. Seems like unless you get gpu 0, you get this problem. The thing is, I have not seen this issue even once with RTX PRO 6000 machines.
I'm gonna see right now if I can reproduce the issue, and then work around it by specifying which device to enumerate in ffmpeg
export CUDA_VISIBLE_DEVICES=0; ffmpeg ... seems to have some effect. At least I get a different error when I set different device ids.
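The fuller version of what I'm trying, for reference - hedged, but -hwaccel_device and h264_nvenc's -gpu option both take a device index, and CUDA_VISIBLE_DEVICES changes which physical GPU index 0 maps to:
export CUDA_VISIBLE_DEVICES=0   # or 1, 2, ... to shuffle the mapping
ffmpeg -hwaccel cuda -hwaccel_device 0 \
  -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 \
  -c:v h264_nvenc -gpu 0 -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y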
Just for fun, I tried symlinking /dev/nvidia1 to /dev/nvidia0 and it had no effect.
I am kind of out of ideas on how to fix this. It feels like if you get a set of circumstances, you just can't work around it.
1. You select a gpu pod that isn't having all the gpus passed to it
2. Someone else is using gpu 0
3. You are on driver ~570 (although I havent confirmed this)
Curious why I never saw this on the serverless endpoints. That's what I intend to use anyway, so if it works there, then no big deal. But I can't really finish my project if there's a random chance each serverless endpoint worker might fail too.
Btw. I just launched a pod and got the same exact scenario where I got /dev/nvidia1, but since it is driver 580, it seems to work just fine
So the only workaround is using a different gpu?
Or maybe you can set higher cuda version
To avoid 570 drivers
I guess we can try. But I didn't think they were related. Like if I use a container image that is 12.2, then I get into a pod using an RTX PRO 6000, it will tell me its running cuda 13
I think that filter when you are making a pod is exactly that. Just a filter, showing which gpus are compatible with that version of cuda you've selected. 13 isn't even selectable on there and no matter what I choose, it seems to just have whatever the host system gives you when it is launching the container. Which makes sense.
>Curious why I never saw this on the serverless endpoints.
I just re-tested this issue on the serverless endpoints, and I can confirm that the same problem is still occurring as before.
Oh really 🤔
Maybe because I only really tested on RTX PRO 6000’s
I’ve yet to see one of those gpu’s have this issue. And all have been on driver >= 580
its the host cuda version
when you do nvidia-smi it will print out the cuda version and the driver version
and maybe RTX PRO 6000s don't support lower version of drivers so they don't get 570 driver version
Yeah. It seems to print out whatever version the host uses. So your base image can be like 12.2 and the host can be 13 as reported by nvidia-smi. But the filter selector when creating a pod seems to have no bearing on what version of cuda the host system uses. Like if I select 12.4 I will get 12.2 installed by my image and 13 according to nvidia-smi
that's weird
I always got what I asked for
but if you set it to 12.8 it should give 12.8+
at least
and that should avoid 570 drivers
There's a chance I don't know what I am talking about. But I am pretty confident that the cuda filter does absolutely nothing besides filter which gpu's are compatible with the version of cuda you need.
Unrelated to that though, I just saw my first failure on the serverless endpoints because I got an RTX 5090 running driver 570.
Weird. I just had an RTX PRO 6000 worker run one of my tasks and it was running driver
570.195.03. It worked
So I must have had gpu #0 I guess.
Maybe related to this
https://discord.com/channels/912829806415085598/1427414138576961640
Great find!
can you try setting cuda version
just in case it works
Where at? In my image or in the selector when editing the pod/worker?
here


Probably to cuda 12.8
That selector seems to somewhat help at choosing the host's cuda version. But there are times when I will choose a specific version and get back a different one.
I just tested three pods. Two gave me 12.8 and the last one gave me 13
I think its 12.8+
So you are getting 13 too
Since cuda should be backwards compatible
And newer cuda version == newer drivers
So this should help in getting newer drivers
I think you were right that the selector works like that. I just noticed if I select 12.8 it tells me that the RTX PRO 6000 pods are unavailable, but if I select 12.9 they are available.
So that case is closed
But related to this issue. If I get driver 570 there’s a chance it doesn’t work.
But if I select 12.9 and get driver version 580 with the same gpu, it works.
I guess it’s possible that it’s anecdotal and I’ve just been lucky to get gpu number 0. But I am pretty confident driver 580 doesn’t have this issue
Then selecting 12.9 will make you avoid 570 driver
Because 12.9 is 580+
@Dj is this a bug?
Either this is a bug or 12.8 selector giving cuda13 is a bug
5090
🟥 Driver Version: 570.153.02 CUDA Version: 12.8
🟥 Driver Version: 575.57.08 CUDA Version: 12.9
RTX PRO 6000
🟥 Driver Version: 575.57.08 CUDA Version: 12.9 /dev/nvidia5
✅ Driver Version: 575.57.08 CUDA Version: 12.9 /dev/nvidia0
✅ Driver Version: 570.195.03 CUDA Version: 12.8 /dev/nvidia1 <-- ??
RTX PRO 6000 WK
✅ Driver Version: 575.51.03 CUDA Version: 12.9 /dev/nvidia2
RTX PRO 6000 WK (no cuda selection)
✅ Driver Version: 575.51.03 CUDA Version: 12.9 /dev/nvidia2
RTX PRO 6000 WK NORTH AMERICA (no cuda selection)
✅ Driver Version: 580.65.06 CUDA Version: 13.0 /dev/nvidia5
No clue how to get it to give me 580 with cuda 13
It seems random
I tried selecting nothing, which is what I've done in the past to get it to give me 580. It gave me 12.9. So maybe it's just random or a regional thing.
Boom
Selected north america
🇺🇸
Driver Version: 580.65.06 CUDA Version: 13.0
My serverless endpoint workers are all NA except one. But I don't think you can choose where your workers come from.
Oh yes you can. In the advanced section.
Perfect. So if we can get a list of what each regions pods are running for cuda, I could technically go to production with this. maybe
Version 570.86.15(Linux)/572.13(Windows) :: NVIDIA Data Center GPU ...
Release notes for the Release 570 family of NVIDIA® Data Center GPU Drivers for Linux and Windows.
So its cuda 12.x?
Im confused now
I’m confused too 😂
My working theory is that driver 580 seems to not have this problem.
However, it seems to not happen at all on the RTX PRO 6000 WK
But the only way I can get driver 580 has been to use that card. So 🤷
How about Cuda 13?
Does that give you driver 58X consistently?
Every time so far, yes
If I see 580 I see cuda 13
https://docs.nvidia.com/deeplearning/cudnn/backend/latest/reference/support-matrix.html
i think this is the answer
If you look at my table above, some of it is a friggin mystery still.
Like that one time I got the 3rd nvidia card in my container, and it still worked

so if you pick cuda 13
you get nvidia driver >= 580
I wish I could pick cuda 13
Maybe runpod can tell us which regions are running 580/13.0
yeah
But right now if you select north America and don’t pick any cuda version, I have gotten 580/13 100% of the time. But that’s like 15 locations so I definitely haven’t confirmed if all of them have it.
So I will need runpod to confirm so I can filter my serverless endpoint to only select those.
But again, I still don’t know how some of those tests I ran worked. They were not getting the “default” gpu 0 and they were not on 580 and they worked.
What’s real? Is the sky blue? Are birds real?
oh
wait
it works with API
@flexgrip

I was gonna check that. See if I can use the api to specify cuda 13
they just didn't give that option in the UI
You are a beauty
lol
So you just set it to cuda: 13 and it made that?
this

works
Hell yeah
and setting that to 14.0 fails
so they are doing some kind of checking
even if the option is invalid
13.1 also fails
like this

That was you trying to make a cuda 13 instance or an invalid version number?
invalid version number
cuda 13.1 or cuda 14 requested returns that error
cuda 13.0 (you need the .0) works fine
and pods spun up like that have cuda 13.0
This is great. This only leaves some type of confirmation of what is causing this.
Or not what is causing it, but what is causing it to work. I keep wondering… just because I haven’t seen the error on driver 580 doesn’t mean I won’t. I feel like I could just as easily say RTX PRO 6000 WK’s don’t have the issue either.
yeah you have to test if it works in that driver version
I guess I could automate a test for this and just smash the api with it
I guess I’ve been assuming that if I get a pod and see
/dev/nvidia5 that that means I’ve got gpu index 6
But in a few examples above when I was testing earlier I got some successes with /dev/nvidia1 and /dev/nvidia2
so maybe that device number doesn't mean the index? If so, how could we tell?
GitHub
who creates /dev/nvidia0 · NVIDIA open-gpu-kernel-modules · Discu...
Thank you very much for your answer. The problem I encountered is: I can get the PCI device number of NVIDIA graphics card, such as (81:00.0). I want to use this device number to correspond to my l...
Hmm. I wonder if that device minor number is visible from inside the container
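It should be visible - hedged, but as far as I know both of these work from inside an ordinary container:
# the "195, 1" pair in the ls output is major 195, minor 1
ls -l /dev/nvidia[0-9]*
# the driver's per-GPU view of the same number
nvidia-smi -q | grep -i "minor number"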
I made a script
ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y
is this command right?
and it looks like not many machines with 5090s that have cuda 13 are available
maybe I am wrong with the testing
if this command is right this should be correct



idk at this point
note that 5090s with cuda 13 are rare
i think there is 1 machine available
That command looks right.
i can't get more than 4 concurrently
How are you checking for pass/fail?
If ffmpeg returns anything but 0?
this

Conversion failed!
if that's in stderr it's a failed run
this is the raw output
Gotcha. Yeah I was grepping for
grep -q "No capable devices found"
doesn't this look off
no "No capable devices found"
oh there is
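A slightly more robust pass/fail check than grepping for one string, since the exact error text varies across the reports above - this treats a non-zero ffmpeg exit code as the failure signal and only uses the grep for diagnostics:
LOG=/tmp/nvenc_test.log
if ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 \
     -c:v h264_nvenc -pix_fmt yuv420p -t 5 -f null - 2> "$LOG"; then
  echo "PASS $(nvidia-smi --query-gpu=driver_version,name --format=csv,noheader)"
else
  echo "FAIL (exit code $?)"
  grep -E "No capable devices found|Conversion failed" "$LOG"
fi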

anyways I would like to test with RTX6000 blackwell
but im broke
😢
I've been sitting here trying to patch ffmpeg.

can i ask why you need ffmpeg?
at this point i think using other platforms to transcode is better
lol
I'm trying to use nvenc to convert archives of videos to streamable formats.
I guess I could try gstreamer instead
non-blackwell cards fail too?
I dunno. I haven't tried them much because they're slower and don't have 9th gen nvenc
The newer blackwell have 4 nvenc and seem to blow the L4 gpu's from google cloud run out of the water. At least with this encoding stuff.
Yeah ik
Community cloud seems to work better
RTX PRO 6000 Max-Qs were all working
gstreamer seemed to work
I haven't thoroughly tested it but I think its slower than ffmpeg
Is it using cpu?
That’s what I’m trying to confirm. I ran an nvenc test on the gst-bad nvenc plugin and it seemed to spit out a video.
Can’t tell if it falls back to cpu
maybe try seeing if nvidia card pulls more power when encoding
or if gstreamer uses more than 100% cpu while encoding
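Another hedged way to check: nvidia-smi dmon has an enc utilization column, so a test pipeline should show a non-zero enc% if nvh264enc is really hitting the hardware encoder (the pipeline below is just a guess at a minimal test, not a tuned one):
# terminal 1: per-GPU utilization, the enc column is the hardware encoder
nvidia-smi dmon -s u
# terminal 2: minimal NVENC pipeline from gst-plugins-bad's nvcodec set
gst-launch-1.0 videotestsrc num-buffers=600 \
  ! video/x-raw,width=1920,height=1080,framerate=60/1 \
  ! nvh264enc ! h264parse ! mp4mux ! filesink location=/tmp/gst_test.mp4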
I was deleting a couple pods and accidentally deleted the one I compiled gstreamer on. So now I gotta go through all that mess again.
why not binary releases?
But I will. If gstreamer can do it, then ffmpeg clearly can be patched
I don’t know if there are any precompiled binaries for the gstreamer gst plugins for nvenc. I didn’t really look though
Stack Overflow
How to install gstreamer nvcodec vs nvdec/nvenc plugins on Ubuntu 2...
Installed gstreamer and gstreamer-plugins-bad on ubuntu 20.04 via the apt repo. I also installed the Video_Codec SDK 11.0 from Nvidia.
The gst-ispect command shows me nvenc and nvdec is installed ...
it says you'll get them automatically
try it on runpod pytorch official template
that has ubuntu 22 from what I know
Oh cool
hmm
The gst-plugins-bad with Ubuntu doesn’t have nvenc. I’ll have to compile it again in the morning.
it wont work
gst-inspect-1.0 nvcodec
try this
it does have nvcodec

but no nvh264enc?
Yep
GitHub
env-setup/gst-nvidia-docker at master · jackersson/env-setup
Useful scripts, docker containers. Contribute to jackersson/env-setup development by creating an account on GitHub.
looks like someone made a script
woah huge thread
Great to see you all working on it, but the 12.8 selector should not give you CUDA 13 machines. However, you're correct that a non-zero number of our servers are on 13.0 - but it's not a number I can guarantee will always be available.
And naturally we'll do what we can from the backend^ Just a little awkward while we're running maintenance.
All community cloud instances I have tested worked
@Dj This is strange
All instances with RTX PRO 6000
didn't try 5090s yet
Work or don't work?
I think one of the facets of this is the hosts operating system/kernel version. Let me take a look
worked
all community cloud instances with RTX PRO
I wasn't able to spin up many because there were not many available
Every RTX PRO 6000 WK I have tried worked no matter what version or driver or which enumerated gpu I got
@Dj would you like the pod ids?
this is secure cloud
the folder names are pod ids
Interesting, on secure cloud all of these machines use Ubuntu 24.04.2 or 24.04.3.
The unsecure cloud host uses Ubuntu 22.04.5.
So probably not the operating system.
I’ll try the community cloud instances tonight.
There's not a lot and I can't guarantee the availability.
yeah
i got total under 10 pods
I can tell you we have 1 machine on the community cloud with this GPU, and one machine physically cannot support more than 8 GPUs.
and half of them didn't even work (probably pulling image)
maybe it's not one machine, but it's 1 OS and that usually indicates one machine
it's 2 :)
@riverfog7 What cuda version was the one you were on?
If you know, it could've only been 13.0 or 12.8.
has 12.9

uh?
nvcc --version
is nvidia-smi inaccurate?
I can't ssh back because the test is automated
I learned recently nvidia-smi will show the highest cuda version the driver supports
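Right - they report different things. A hedged illustration of the difference: the nvidia-smi banner shows the highest CUDA version the host driver supports, while nvcc reports the toolkit version installed inside the image:
# driver side: version of the host driver and the max CUDA it supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 4   # the banner line includes "CUDA Version: X.Y"
# image side: the CUDA toolkit actually installed in the container, if any
nvcc --version | grep release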
pod is already terminated
o7 thats fine
Would you happen to have the prompt from the pod?
root@12345678...
Or does the output only give you the result of nvidia-smi?
8mmj1nmc6r2ksh
this is the podid
perfect
i don't have the prompt
We have this machine listed as cuda 12.9
Weird when I queried for it it showed as 12.8
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 8mmj1nmc6r2ksh
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 mgrd9lo1q1bptd
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 mx91o0m3l84i0c
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 uirv051063dg6j
drwxr-xr-x@ 6 riverfog7 staff 192 Oct 16 09:58 yh7yf9vtb4o2k1
drwxr-xr-x@ 6 riverfog7 staff 192 Oct 16 09:58 yl08tpmbn6esh9
these are the ones i tested
I just opened this, this is excellent actually
I do have some 12.8 ones too
If you do manage to find a correlation let me know, not that you're obligated to and I can very easily create (or run?) a script that simulates a bunch of different variables to pull details. I think we have this chalked down to these issues from the last time we got a report like this:
https://trac.ffmpeg.org/ticket/11694
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1249
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1209
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1197
https://github.com/NVIDIA/k8s-device-plugin/issues/1282
OP said gstreamer worked so it might be a bug in ffmpeg
But this issue was reopened, by another? customer just yesterday with the following reproduction and we rolled that up into this too. We discussed a few workarounds, but aren't happy with any of them as they all have their own issues.
We know it's ffmpeg, we just don't know really about the details or why.
https://trac.ffmpeg.org/ticket/11694
only correlation is this?

stupidly, blue is failure
red is success
Does it help to know that after our maintenance the lowest driver version in the fleet will be 570.195.03?

That’s the driver I have the most issues with 😂
I wish I had escalated privs on one of these nodes so I could test a few things. For example I wonder if this could be fixed by running mknod with the major and minor from /dev/nvidiaX
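What I have in mind is roughly this, as a sketch only - it needs root/CAP_MKNOD inside the container, which I don't have here, and the numbers below are just an example:
# e.g. "crw-rw-rw- 1 root root 195, 1 ... /dev/nvidia1" -> major 195, minor 1
ls -l /dev/nvidia[0-9]*
# recreate the same character device under the name the tooling expects
mknod -m 666 /dev/nvidia0 c 195 1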
these are community 5090s
forgot to mention it
I don't have ssh access to the hosts but I do have a lot of other permission.
@riverfog7 I can message you a credit to continue your testing if you'd like.
sure but its midterm season soon so i don't know if i can continue for long 😅
ah i understand
I still wonder why the 6000 WK has worked no matter what version I get. I guess we need to check the index in nvidia-smi to see if it’s just luck
This is the kind of bug I used to love working on when I was at nvidia. Don’t have access to those kind of testing rigs anymore though.
is there any stats that is nice to have when debugging
what is this?

display attached = True?

Is that the thingy you need to set if you want to do stuff like vnc or X11 forwarding?
i don't know about that part well
this time its 280 instances
of 5090s mixed between community and secure cloud
Is your test easily runnable? I don't mind burning through credits testing other scenarios
Other gpu’s I should say
its just doing this

in a template







autogluon feature importance

it just identified every gpu by id
idk at this point
this is probably a software bug inside nvidia or ffmpeg
Yeah I read all the tickets and associated links this morning. Nothing seems to be reliable in the reproduction. Some people say you need to be GPU 0, others say the last gpu is working or an odd number in between.
Some say the bug is a regression starting at driver 570. Others have reproduced it on 550 and lower.
Nobody seems to be focused on fixing it. Some ffmpeg references say its an issue with nvcodec itself.
Just ran several simultaneous iterations of a quick encoding task on the PRO 6000 WK. Not a single one failed. All gave me 580/13
First test on a PRO 6000, 575/12.9 fail
Had a failure on a serverless worker. I didn't catch it in time to see the logs. But the only difference is that it was not in NA.

I just had an idea to try and do a health check on a serverless endpoint. My question is, once a worker gets my image and goes idle, does it already have this issue or not?
Was hoping I could do a health check and if it fails, the worker terminates and a new one is created until I am left with nothing but workers without this issue.
But when my health check fails, whatever is orchestrating the containers just restarts it instead of terminating and launching on a new worker. Wonder if I can fail with a different error code to get it to terminate?
you can self destruct with this in a pod
runpodctl remove pod ${RUNPOD_POD_ID}
I'm not sure about serverless
they come with pod scoped api keys afaik
so no need to configure credentials
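So something like this inside the worker, hedged - runpodctl remove pod and RUNPOD_POD_ID come from the lines above, and whether serverless cleanly reschedules after the pod removes itself is the open question:
# quick NVENC probe; if it fails, take this worker out of the pool
if ! ffmpeg -f lavfi -i testsrc=duration=1:size=640x360:rate=30 \
       -c:v h264_nvenc -f null - 2>/tmp/nvenc_check.log; then
  echo "NVENC broken on this worker, removing it" >&2
  runpodctl remove pod "${RUNPOD_POD_ID}"
fi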
I’ll give it a shot.
I don’t know if my concept is flawed though. Like, if a worker starts the container and doesn’t get the error in ffmpeg, does that mean when a request comes in hours later that it still won’t run into this bug?
I guess I don’t know how the serverless workers are orchestrated.
The question is, is the gpu already assigned when the worker goes idle? And if so, does it stay that way?
@Dj is there an api to terminate serverless workers individually?
its possible in serverless console thingey
If so, this fixes everything for me. Just takes a little longer to deploy
but you have to consider the possibility of getting the same host
you can see in my test that there are multiple overlapping gpu ids
What do you mean?

I think I’m saying the same thing as you
test ffmpeg fails -> terminate worker -> runpod spins up same worker
can happen
Ohh
probably not a problem if there are many GPUs
but since you are using the RTX PRO 6000 and they have limited supply
Well when I manually terminate one I always get back a worker with a different id. But I don’t know if that’s unique or not
I’m using L40, L40S, RTX PRO 6000 and I think one other gpu
the id should be different

pod id is different here
but GPU uuid is same for some pods
I guess I don’t know what a worker truly is. Is it a shared server? Is it a shared cluster? Etc.
Because it could totally just be server rack that picks up requests from the queue and runs
docker run -it … -gpus=5 (not really but you get the idea). That means it could work for one request then fail the next.
Otherwise if it’s consistent, then I’m ok with this janky solution.My opinion is a serverless worker is just a pod
and worker id = pod id
and works with the same infrastructure, hence can share network volumes
That’s what I think too.
Pods + orchestration = serverless worker
ECS in AWS terms
but with an API Gateway
and a queue
and cloudfront
Yep.
So if I can kill a worker during the initial health check, it may be a workable solution. Provided I don’t get the same one over and over 🤔
so this point still stands
this is only a problem when there is like 10 GPUs available and 7 of them are not working
but you are trying to get 5 workers
and you go broke because you get billed for the ffmpeg health check time
Oh I never thought about it billing me for the deploy time
you should get billed for the health check
because the container is already started
That’s a good point. I’ll have to see where I can run this in the lifecycle
I put the health check right before the serverless handler and deployed and noticed some of the instances kept initializing. So I assumed it was running and I just couldn’t see the logs
i think its failing the health check and the host just restarts it
pods do the same thing when containers exit abnormally
host starts it until it works
do you have anything in serverless console-> logs instead of serverless console->workers->worker->logs
Nothing from the deploy. So it’s either not running the health check or none of the logs it generates during a deploy are in those logs I can see
The only thing I changed was adding the health check and the only two outcomes I saw were workers initializing over and over or becoming ready and successfully handling requests.
I just spun up a random serverless endpoint and it looks like "Initializing" is pulling images and extracting them
and "running" is the actual container running
so if worker is created
initialize -> running (load model in memory and health check, etc) -> idle (waits till request)
I only ever get the running state when I send a request. I just get initializing -> idle
oh I think i set this to get the worker count to go up

that makes sense, I think I'm wrong then
my question is "Is anything happening after the container starts and before serverless start billed?"
it's not really clear from this explanation

What I thought was happening is deploy > workers are assigned and they all pull your image
Then request > container starts
But after deploy, does it run the container at all.
If so, health check + terminate on fail would work. Otherwise it won’t
I’ve never noticed it charging me for the deploy phases
Hmm. I don’t think this will work now. There is no health check I can find in the docs for queue based serverless
Ugh. I am spending too much time thinking about this each night. I gotta just implement my own queue and let the occasional failures retry. At most, terminate failed workers
Does dockerfile health checks work?
I don’t think the worker is even running the container until you send a request to it.
At least that’s my theory
Hmm
I wonder how Google gets around this
google?
With cloud run gpu instances. We’ve processed lots of video using those and never ran into this problem. They’re on L4 gpus
idk about cloud run but in AWS ECS containers run inside vms
not like runpod (shared host)
Google cloud run is using docker. At least for their second gen runtimes
Maybe that’s the answer. Just run a docker container inside the docker container 😅
good news
serverless workers count as pods ig
so runpodctl remove pod <workerid> works
should work
if the env variable is correct and runpodctl is installed in the pod
Sorry. Wife aggro. I'll check this out.
So with that scenario, the job will fail, but it will take the worker out with it.
Correct. To the API a serverless worker is a pod named after the endpoint id :')
And there's no way during a deploy to tap into any of the health checks?
That should depend on the docker cmd running at worker initialization
Yeah I don’t think the container itself runs. I suspect the pod just fetches the image and makes sure everything is ready for when it receives a request
Ok new tactic...
My API forms the request, sends it to the serverless endpoint, then receives the webhook back on success or fail. On fail, it retries.
Meanwhile, on the worker, if we get the bad cuda ffmpeg response about
no supported devices, we terminate the worker.
Provided there is enough delay, the retry will go to the next worker while the last one was being terminated. So the system should only waste a few seconds on retries and terminating before eventually hitting a successful worker.
This also has the added side effect of constantly pruning bad workers.
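Roughly what that looks like on the worker side, as a sketch under the same assumptions as the probe above (INPUT and OUTPUT are placeholders for whatever the handler passes in):
# run the real encode; on the known NVENC failure, prune this worker and
# report failure so the API-side retry lands on a different one
if ! ffmpeg -y -i "$INPUT" -c:v h264_nvenc "$OUTPUT" 2>/tmp/encode.log; then
  if grep -q "No capable devices found" /tmp/encode.log; then
    runpodctl remove pod "${RUNPOD_POD_ID}"
  fi
  exit 1   # non-zero exit so the webhook triggers the retry
fi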
So far it's working as expected
Hmm