Runpod4mo ago
Barış

Running Workers Not Shutting Down & Incurring Charges

Hi, we're facing a critical issue with workers not shutting down when there's nothing in queue or in progress, which is causing significant over-billing and blocking our app launch. I'm reporting this after it has happened at least 3 times. I've observed that after all jobs are processed (finished/cancelled and nothing in queue), workers continue running for over 8 minutes doing nothing. I noticed it happening with both scaling settings:
- Queue Delay: a worker ran for 8+ minutes with an empty queue (I attached a video of this below)
- Request Count: two separate workers ran for 8+ minutes after the last job was processed (I sent these messages when it happened: https://discord.com/channels/912829806415085598/948767517332107274/1388527617510084651 https://discord.com/channels/912829806415085598/948767517332107274/1388531493768527932)
This cost me another $10 in credits over just two days. In just the two examples above, workers ran idle for a combined 24+ minutes (1 worker running over 8 mins on Queue Delay + 2 workers running over 8 mins on Request Count = 3 RTX 4090 24 GB (PRO) workers running over 24 mins in total), and we were charged while they weren't doing anything.
I've spent $100 in just two months on testing alone, and issues like these are preventing me from launching our app, since we can't rely on the platform's scaling to function properly and we will launch the app in a server with over 10K members. Everything on our app has been ready for two months; we have to launch as soon as possible once serverless endpoints work properly (please also see the other issue we have: https://discord.com/channels/912829806415085598/1375136211395547246). I'd really appreciate it if you could help with this. I can share the logs in a DM if needed. Thank you for your time!
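As a side note for anyone wanting to catch this kind of runaway worker early: below is a minimal watchdog sketch, assuming the serverless health route at https://api.runpod.ai/v2/<ENDPOINT_ID>/health reports job and worker counts roughly as shown in the comments. The endpoint ID, API key, poll interval, and alert action are placeholders, so verify the field names against the current API docs before relying on it.

```python
import os
import time

import requests

# Placeholders: set these to your own endpoint id and API key.
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]

HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def check_for_runaway_workers() -> None:
    """Warn when workers are still 'running' even though the queue is empty."""
    health = requests.get(HEALTH_URL, headers=HEADERS, timeout=10).json()
    # Assumed response shape:
    #   {"jobs": {"inQueue": 0, "inProgress": 0, ...},
    #    "workers": {"idle": 0, "running": 0, ...}}
    jobs = health.get("jobs", {})
    workers = health.get("workers", {})
    outstanding = jobs.get("inQueue", 0) + jobs.get("inProgress", 0)
    running = workers.get("running", 0)
    if outstanding == 0 and running > 0:
        # Replace the print with your own alerting (Slack webhook, email, ...).
        print(f"WARNING: {running} worker(s) running with an empty queue: {health}")


if __name__ == "__main__":
    while True:
        check_for_runaway_workers()
        time.sleep(60)  # poll once a minute; adjust as needed
```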
24 Replies
BarışOP4mo ago
deanQ4mo ago
Logs would have been helpful in this video. So much time spent on everything but the most important part. I was waiting for you to click on Logs in that worker detail view. What was going on there?
BarışOP4mo ago
Thank you for checking Dean! These are the logs from when the video was recorded:
Dj4mo ago
The pod with id 64yksjmwg97wvk just failed to start, all I see is a reload loop :thinkMan:
BarışOP4mo ago
Yeah lol. Isn't it an issue that it kept running when there wasn't anything in queue?
Dj4mo ago
Sort of, we're still allocating you a GPU even if your pod only exists for like 300ms (the average time I see just skimming the log here). The content of the error:
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6, please update your driver to a newer version, or use an earlier cuda container: unknown
You can set a minimum CUDA version in the endpoint settings. And I can issue you however much you lost in GPU time - I'd just need a little longer to get it figured out
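For anyone hitting the same cuda>=12.6 start failure, here is a small diagnostic sketch of how to check what a host driver actually supports (for example from a pod, or during cold start on a worker that does boot). It assumes nvidia-ml-py (pynvml) is available in the image, and the 12.6 threshold simply mirrors the requirement in the error above.

```python
import pynvml  # pip install nvidia-ml-py

REQUIRED = (12, 6)  # mirrors the cuda>=12.6 requirement baked into the container image


def host_cuda_version() -> tuple[int, int]:
    """Return the highest CUDA version the host driver supports, e.g. (12, 4)."""
    pynvml.nvmlInit()
    try:
        v = pynvml.nvmlSystemGetCudaDriverVersion()  # encoded as major*1000 + minor*10
        return v // 1000, (v % 1000) // 10
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    major, minor = host_cuda_version()
    print(f"Host driver supports up to CUDA {major}.{minor}")
    if (major, minor) < REQUIRED:
        print("This host cannot satisfy the image's CUDA requirement; "
              "the container runtime will refuse to start it.")
```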
BarışOP4mo ago
Sending a new log file because what I sent was wrong... It looks like the log timestamps on the RunPod website and in the file I download from there are in different timezones, which made me send you logs from different hours
BarışOP4mo ago
Here are the logs from when I recorded this
BarışOP4mo ago
I included an extra line before and after when this happened just so you have all the logs. Thank you guys for checking 🙏
Dj4mo ago
The container you show in the videos is 64yksjmwg97wvk, which is different from the 6w1168q7pwltxm pod whose logs you saw in the video; the 64yks pod never started 👀
BarışOP4mo ago
Right? 😅 It would make a bit more sense if 6w116 were the worker that had the issue, because it was the one used for running the requests. A bit strange that the 64yks worker tried to start for 8 mins
Dj4mo ago
If you change your endpoint settings to only allow your Pod to be started on CUDA 12.6 or higher, you won't have the issue again. I added a little to your account to remedy it, but I don't think anything too unusual happened. I'll see what I can do to get us a limit on how many times we let a specific worker fail to start?
BarışOP4mo ago
Thank you so much, DJ! I’ve set it to CUDA 12.7 now, will let you know if I notice it happening again
BarışOP4mo ago
This was the second time we ran out of credits in two days in June, so I wanted to report it here. The previous time was also weird like this: we hadn't even generated 20 images, if I remember correctly. I think it would be helpful to be able to see request history, just like how we can see when there's a recent request (which disappears shortly)
BarışOP4mo ago
Also, it would be really helpful to be able to see approximately how much each generation costs. It's billed per usage (running workers), but seeing roughly how much it is per generation would be great - just some data that could help users and the RunPod team notice odd usage if a similar issue happens again. The final feedback I'd like to share related to this thread is to maybe enable CUDA 12.6 or higher as the default endpoint setting. Thanks again for your help, DJ!
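On the per-generation cost point, here is a rough client-side sketch of how to estimate it today from the job status response, assuming the status payload exposes delayTime and executionTime in milliseconds and using a placeholder per-second GPU rate (check your endpoint's actual pricing).

```python
import os

import requests

# Placeholders: set these to your own endpoint id, API key, and GPU rate.
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]
PRICE_PER_SECOND = 0.00031  # assumed USD/s; substitute your worker's real rate


def estimate_generation_cost(job_id: str) -> float:
    """Estimate what a single generation cost from its reported execution time."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}"
    status = requests.get(url, headers={"Authorization": f"Bearer {API_KEY}"}, timeout=10).json()
    exec_ms = status.get("executionTime", 0)   # assumed field, in milliseconds
    delay_ms = status.get("delayTime", 0)      # assumed field, in milliseconds
    cost = (exec_ms / 1000) * PRICE_PER_SECOND
    print(f"job {job_id}: queued {delay_ms} ms, ran {exec_ms} ms, ~${cost:.4f}")
    return cost
```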
Unknown User4mo ago
Message Not Public
BarışOP4mo ago
wanted to check here to see if others also experienced it before creating a ticket. thanks to the help I got here, the issue has been fixed after selecting CUDA 12.6 or higher versions 🙌 marking it as solved
Unknown User4mo ago
Message Not Public
BarışOP4mo ago
yeah, it can be a bit confusing when users encounter CUDA-related issues for the first time (as happened to us). maybe in the future, serverless endpoints could automatically detect/sync the CUDA version from the image they're using
Unknown User4mo ago
Message Not Public
BarışOP4mo ago
thank you! I shared it with someone at Runpod and heard he created an internal ticket about it :poddy:
Unknown User4mo ago
Message Not Public
BarışOP4mo ago
yes 😅 I don't know if I was allowed to say his name but he's been super helpful
Unknown User4mo ago
Message Not Public
