Runpod · 11mo ago
rougsig

Why can it get stuck IN_PROGRESS?

21 Replies
rougsig (OP) · 11mo ago
@flash-singh I can't use RunPod because of this strange issue. I have the same Docker image, but one built nearly a month ago, and that one works perfectly.
flash-singh · 11mo ago
Do all the jobs get stuck, or just that one?
rougsig (OP) · 11mo ago
All jobs in that queue. It looks like the first job after a cold worker start is always fine; from the second job on, there's more than a 50% chance of it getting stuck.
flash-singh · 11mo ago
Ping me the endpoint ID. Either the endpoint is bad or it's using a bad SDK; can you make sure it's updated? I can see the jobs being taken from the queue but not being reported back as soon as the job is taken.
rougsig (OP) · 11mo ago
uucgkak7h76hfd. The SDK is the latest version.
rougsig (OP) · 11mo ago
I created a new endpoint with the same Docker image. The problem is almost the same: p6j8tqfojfhmll
rougsig (OP) · 11mo ago
I have an older Docker image that is used in production, and everything works fine there; its SDK version is 1.7.2. My latest image runs 1.7.4 and has these problems.
rougsig (OP) · 11mo ago
I have this pip diff: https://www.diffchecker.com/eYoE7Gm2/ where runpod 1.7.2 is the older Docker image that works fine.
rougsig (OP) · 11mo ago
So I can confirm that 1.7.4 contains some bug around this; 1.7.2 works well without any issue.
flash-singh · 11mo ago
https://github.com/runpod/runpod-python/releases/tag/1.7.5 We are testing this pre-release to confirm that the issue is a data race in the local state management in the Python package.
GitHub
Release 1.7.5 · runpod/runpod-python
What's Changed: Fix failed requests due to race conditions in the job queue vs. job progress, by @deanq in #376. Full Changelog: 1.7.4...1.7.5
Mihály · 11mo ago
Hello @flash-singh, I've been struggling with the same stuck IN_PROGRESS problem for a while now, and jumped at the opportunity to try out the 1.7.5 SDK. Unfortunately the issue persists: jobs got stuck 2 times out of 10 identical jobs. (I'm also using runpod.serverless.progress_update in this case, but removing those calls from my handler didn't change this behaviour.) Endpoint: noxhy2en39n3y3. Workers: 07lhjaqe6si5cj, gy95pb525iwdot. Let me know if I can provide any logs or code to help out!
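For context, a minimal sketch of the handler pattern being described here, assuming the documented runpod-python serverless API; the work inside the loop is a placeholder:

import runpod

def handler(job):
    # Report intermediate progress back to the endpoint while the job runs.
    for step in range(3):
        runpod.serverless.progress_update(job, f"step {step + 1}/3")
        # ... do the actual work for this step here (placeholder) ...
    return {"result": "done"}

runpod.serverless.start({"handler": handler})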
flash-singh · 11mo ago
Can you turn on the debug log level and share your endpoint ID? @Mihály
Mihály · 11mo ago
@flash-singh noxhy2en39n3y3
flash-singh · 11mo ago
Ty, let me know when you see this issue and I'll look at the logs.
Mihály · 11mo ago
I haven't submitted any jobs after the last 10 I mentioned above, and the debug ENV has already been set for weeks now. But I'll submit some more if you'd like! @flash-singh
flash-singh · 11mo ago
So what you're seeing is that some jobs stay in progress forever, and you have to manually cancel or remove them? So far I'm seeing things are normal; the request ID you pointed out, 48612f2e-dc1b-4128-8e77-870c1902e4eb-e1, was never marked as completed?
Mihály · 11mo ago
Yes, it usually stays IN_PROGRESS until it becomes a 404. I tried the webhook instead of polling, but that also never arrives in these cases. @flash-singh
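For anyone reproducing this, "polling" here refers to the serverless /status route. A rough sketch, assuming the public RunPod v2 endpoint API; the API key and job ID are placeholders. For a stuck job it keeps printing IN_PROGRESS and eventually hits a 404 once the result expires:

import time
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"   # placeholder
ENDPOINT_ID = "noxhy2en39n3y3"
JOB_ID = "YOUR_JOB_ID"            # id returned by the /run call (placeholder)

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}"
headers = {"Authorization": f"Bearer {API_KEY}"}

while True:
    resp = requests.get(url, headers=headers)
    if resp.status_code == 404:
        print("job no longer found (result expired)")
        break
    status = resp.json().get("status")
    print(status)
    if status in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(5)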
flash-singh · 11mo ago
Let's take this over DMs. For anyone else following: we got to the root cause. There is a data race that can occur when using the serverless progress feature, where you send progress updates from within a job. This feature can conflict with the job finishing and override the completed status. We will plan a fix for this within the next 2 weeks.
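Until that fix ships, one way to narrow the window for this race (a sketch of a workaround implied by the explanation above, not an official recommendation) is to stop sending progress updates before the final step, so no update can arrive after the job result has been reported:

import runpod

def handler(job):
    total_steps = 5
    for step in range(total_steps):
        # Skip the progress update on the last step so nothing races
        # with the completed status that is reported when we return.
        if step < total_steps - 1:
            runpod.serverless.progress_update(job, f"step {step + 1}/{total_steps}")
        # ... actual work for this step (placeholder) ...
    return {"result": "done"}

runpod.serverless.start({"handler": handler})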
MiroFashion · 7mo ago
Was it fixed? I was battling with this all day until I found this post. The job stays IN_PROGRESS even after completion and doesn't send the webhook event for completion.
Poddy · 7mo ago
@rougsig
Escalated To Zendesk
The thread has been escalated to Zendesk!
Lil Psycho Panda
I'm experiencing the same/similar issue as well (runpod version 1.7.13) when running multiple jobs on the same worker. The first job runs just fine, but the other jobs stay IN_PROGRESS. What's weird is that with a total of 2 jobs, the logging shows 1 IN_QUEUE and 2 IN_PROGRESS. This is shown in the logs that I got from the worker:
2025-09-05T23:17:18.099507948Z {"requestId": null, "message": "JobScaler.status | concurrency: 4; queue: 1; progress: 2", "level": "DEBUG"}
2025-09-05T23:17:18.099507948Z {"requestId": null, "message": "JobScaler.status | concurrency: 4; queue: 1; progress: 2", "level": "DEBUG"}
You can see that I have concurrency set up to 4, but only 1 of the jobs is actually running. The second job is just stuck in progress and doesn't start until the 1st job is completed. The jobs are async, of course. My endpoint ID is h3y9xfk70snov7. Thanks in advance!
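For reference, a sketch of the concurrency setup being described, assuming the documented concurrency_modifier hook in runpod-python and an async handler; the sleep stands in for real work:

import asyncio
import runpod

async def handler(job):
    # Placeholder for real async work; with working concurrency, several
    # jobs should be able to await here on the same worker at once.
    await asyncio.sleep(1)
    return {"result": "done"}

def concurrency_modifier(current_concurrency):
    # Allow the worker to take up to 4 jobs at a time.
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})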
