Job stuck in queue and workers are sitting idle

This has been happening very often: jobs are stuck in the queue while workers sit idle. How can we improve this? There was nothing else going on with any other worker (or endpoint, for that matter).
36 Replies
hotsnr
hotsnr9mo ago
+1
kironkeyz
kironkeyz9mo ago
I am having problems with setting things up as well; we need help.
3WaD
3WaD9mo ago
It happens to me from time to time too. There are no logs to check because nothing is running on workers. I think it might be something with the orchestrator as the only solution is to cancel the request and send a new one. If you don't, the request is usually executed after a long time. This can randomly happen with a previously perfectly working worker.
Justin
Justin9mo ago
+1. My worker booted up and then sat idle, and the job was sometimes stuck for minutes.
jim
jim9mo ago
Same issue. Workers are "running" but they're not working on any requests, and requests just sit in the queue for 10m+ without anything happening. @Justin Merrell @flash-singh
flash-singh
flash-singh9mo ago
This usually means the worker isn't picking up the job. Do you have an endpoint ID or anything else we can look at on our end?
TristenHarr
TristenHarr9mo ago
Same issue! We can't move into production because of this. https://discord.com/channels/912829806415085598/1340773964397674709
jim
jim9mo ago
What is runpod doing???
TristenHarr
TristenHarr9mo ago
For me, there are no misconfigurations as far as I can tell. What happens is: a request comes in, a worker becomes active, and then the job sits in the queue for 10+ minutes before getting picked up. It's not flash-boot (that's enabled), and the logs say everything is ready from the worker's perspective. (I've checked the logs extensively; it's not stuck loading a model or anything of that nature. The worker should be ready.) This happens even with multiple workers set up, where only 1 becomes active and then everything sits in the queue for 10 minutes. Once things spin up they seem to work fine, but every time there's a new spin-up there's a risk it'll take 10+ minutes. What I've been doing is increasing the time before the worker spins down and then trying to find a "good" worker and keep it open as long as I can, even sending redundant requests just to prevent getting a "bad" or "stuck" worker. It's also intermittent/flaky: sometimes it spins up quickly and works fine, sometimes it gets stuck like this. It's not happening every time, maybe 10-30% of the time.
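A workaround like the one described above can be scripted: submit the job and, if it isn't picked up within a timeout, cancel it and resubmit. This is a minimal sketch, not an official RunPod recipe; the `submit`/`get_status`/`cancel` callables are assumed to wrap RunPod's `/run`, `/status/{id}`, and `/cancel/{id}` routes, and are injected so the retry logic itself can be tested without network access.

```python
import time
from typing import Callable

def run_with_resubmit(
    submit: Callable[[], str],          # submits a job, returns its job id
    get_status: Callable[[str], str],   # e.g. "IN_QUEUE", "IN_PROGRESS", "COMPLETED"
    cancel: Callable[[str], None],
    pickup_timeout: float = 120.0,      # seconds to wait before assuming the job is stuck
    max_attempts: int = 3,
    poll_interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a job; if it sits in IN_QUEUE past pickup_timeout, cancel and resubmit."""
    for _attempt in range(max_attempts):
        job_id = submit()
        deadline = clock() + pickup_timeout
        while clock() < deadline:
            if get_status(job_id) != "IN_QUEUE":  # picked up (or finished) - done retrying
                return job_id
            sleep(poll_interval)
        cancel(job_id)  # still queued past the deadline: cancel and try a fresh submission
    raise TimeoutError(f"job not picked up after {max_attempts} attempts")
```

With the real API, `submit` would POST to `https://api.runpod.ai/v2/<endpoint_id>/run` with an `Authorization: Bearer <key>` header, and `get_status`/`cancel` would hit the matching routes for the returned job ID.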
TristenHarr
TristenHarr9mo ago
For sure, I think this is a known issue they are looking into! Will give them some time to dig in and if I still have problems later this week I'll follow up. 🙂
blue whale
blue whaleOP9mo ago
I can share, but this has been intermittent. We are planning to roll out a production-ready application, and this has been infuriating.
deanQ
deanQ9mo ago
Is anyone still experiencing this today? Please report and indicate an endpoint or worker ID. Thanks.
getsomedata
getsomedata9mo ago
Yep, I had 58 workers idle (the logs say they're ready) and only 10 workers running. My queue had 58 jobs for over 30 mins. I killed it and tried many things before joining Discord and seeing others have the same problem. I am recreating a new endpoint and will share the endpoint ID.
getsomedata
getsomedata9mo ago
I have set 78 max workers and 78 active workers but still have only 7 workers running. Idle node log:
2/20/2025, 3:15:41 PM
loading container image from cache
Loaded image: xxxxxxx
xxx Pulling from xxxx
Digest: xxxxxx
Status: Image is up to date for xxxxxx
worker is ready
Dj
Dj9mo ago
@getsomedata Can you give me your endpoint id? We're looking into this, thank you for waiting!
blue whale
blue whaleOP9mo ago
Would love to know the findings. Don't want to give up on RunPod.
Dj
Dj9mo ago
@blue whale I'm told this incident should be resolved for most users, can you share an endpoint ID if you're still seeing this problem?
Twiix
Twiix8mo ago
Same issue here, no idea where to go from this point. If anyone has solved this, please share.
andypotato
andypotato8mo ago
This is probably the same issue as I have described in a separate report https://discord.com/channels/912829806415085598/1345960498478321735
- Jobs are simply never executed despite the worker running and even other workers being available
- Querying the job status will make it run immediately
- Only happens with jobs started via run but not runsync
- Exact same behavior in local testing environment
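If querying the status is enough to unstick a queued job, a pragmatic client-side workaround is to fire one status request immediately after submitting via run. This is a sketch of that idea, assuming RunPod's `/run` and `/status/{id}` routes (endpoint ID and API key are placeholders); the `opener` is injectable so the flow can be exercised without network access, and the nudge itself is a workaround observed in this thread, not documented behavior.

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # RunPod serverless REST base

def run_and_nudge(endpoint_id, api_key, payload, opener=urllib.request.urlopen):
    """Submit an async job via /run, then immediately poll /status once."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    run_req = urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/run",
        data=json.dumps({"input": payload}).encode(),
        headers=headers,
        method="POST",
    )
    with opener(run_req) as resp:
        job_id = json.loads(resp.read())["id"]

    # The "nudge": a single status poll right after submission.
    status_req = urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/status/{job_id}", headers=headers)
    with opener(status_req) as resp:
        status = json.loads(resp.read()).get("status")
    return job_id, status
```

From there, normal status polling (or a webhook) takes over; the extra GET costs nothing if the job was going to run anyway.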
david
david8mo ago
I have seen this issue in my own testing too; some jobs seem to just be forgotten, and active workers turn off without picking them up.
Twiix
Twiix8mo ago
h5d68prihmval0 please help! @yhlong00000 @nerdylive
amirh1541
amirh15418mo ago
Same here.
yhlong00000
yhlong000008mo ago
You might want to check your code or Docker image; it looks like it's not able to start properly and becomes inactive after some time.
slavov.tech | vidfast.ai
Same issue with the serverless fasterwhisper template
derekmaurer
derekmaurer6mo ago
I believe I'm having a similar issue. The job queue seems to get stuck and there's nothing I can do about it. The purge function isn't actually purging the jobs: the UI at the top of the screen shows 1 job, but the logs show 7 jobs. Has anyone run into something like this?
derekmaurer
derekmaurer6mo ago
Looks like there's a bug causing an issue with connecting to their job queue pool.
derekmaurer
derekmaurer6mo ago
@Jason Sure!
