@flash-singh I can't use RunPod because of this strange issue. I have the same Docker image, but one built nearly a month ago, and that one works perfectly.
all the jobs get stuck or just that one?
all jobs in that queue
Looks like the first job after a cold worker start is always fine; from the second job on, there's more than a 50% chance of it getting stuck.
ping me endpoint id
the endpoint is bad, or using a bad SDK, can you make sure it's updated
i can see the jobs being taken from the queue, but they are not being reported back once the job is taken
uucgkak7h76hfd
The SDK is the latest version. I created a new endpoint with the same Docker image; the problem is almost the same:
p6j8tqfojfhmll
I have an older Docker image, used in production. Everything works fine there; its SDK version is 1.7.2.
My latest image runs 1.7.4 and has these problems.
I have this pip diff https://www.diffchecker.com/eYoE7Gm2/
where runpod 1.7.2 is the older, known-good Docker image.
So I can confirm that 1.7.4 contains some bug in this area.
1.7.2 works well without any issues.
https://github.com/runpod/runpod-python/releases/tag/1.7.5
we are testing this pre-release to confirm the issue is a data race in the local state management of the Python package
Release 1.7.5 · runpod/runpod-python
What's Changed
Fix: failed requests due to race conditions in the job queue vs job progress by @deanq in #376
Full Changelog: 1.7.4...1.7.5
Hello @flash-singh
I've been struggling with the same stuck IN_PROGRESS problem for a while now, and jumped at the opportunity to try out the 1.7.5 SDK.
Unfortunately the issue persists: 2 out of 10 identical jobs got stuck. (I'm also using runpod.serverless.progress_update in this case, but removing those calls from my handler didn't change this behaviour.)
Endpoint : noxhy2en39n3y3
Workers: 07lhjaqe6si5cj, gy95pb525iwdot
Let me know if I can provide any logs or code to help out!
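For reference, this is roughly the handler pattern in use — a minimal sketch only, where do_inference is a stand-in for the actual workload, not the real code from this endpoint:
```python
import runpod


def do_inference(job_input):
    # stand-in for the actual workload running on this endpoint
    return {"output": job_input}


def handler(job):
    job_input = job["input"]
    # intermediate progress reported back to the endpoint while the job runs
    runpod.serverless.progress_update(job, "preprocessing done")
    result = do_inference(job_input)
    runpod.serverless.progress_update(job, "inference done")
    return result


runpod.serverless.start({"handler": handler})
```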
can you turn on debug log level and share your endpoint id?
@Mihály
@flash-singh
noxhy2en39n3y3
ty let me know if you see this issue, ill look at logs
I haven't submitted any jobs since the last 10 I mentioned above, and the debug ENV has already been set for weeks now. But I'll submit some more if you'd like! @flash-singh
so what you're seeing is that some jobs stay in progress forever? you have to manually cancel or remove them?
so far I'm seeing things are normal; the request id you pointed out
48612f2e-dc1b-4128-8e77-870c1902e4eb-e1
was never marked as completed?
Yes, it usually stays as IN_PROGRESS until it becomes a 404. I tried the webhook instead of polling, but that also never arrives in these cases.
@flash-singh
lets take this over DMs
for anyone else following, we got to the root cause: there is a data race that can occur when using the serverless progress feature, where you send progress updates within a job. A progress update can conflict with the moment the job finishes and override the completed status. We plan a fix for this within the next 2 weeks.
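Until that fix ships, a possible interim mitigation (an assumption, not an official workaround) is to keep progress_update calls away from the very end of the handler, so a late update can't race with the final completed status; pinning the SDK to 1.7.2, which is reported above as unaffected, is the other option. A minimal sketch of the first approach, with do_inference again standing in for the real workload:
```python
import runpod


def do_inference(job_input):
    # stand-in for the actual workload
    return {"output": job_input}


def handler(job):
    # Only report progress at early/intermediate milestones and avoid a
    # progress_update right before returning, so it cannot race with the
    # completed status that is reported once the handler returns.
    runpod.serverless.progress_update(job, "job started")
    return do_inference(job["input"])


runpod.serverless.start({"handler": handler})
```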