Why can a job get stuck IN_PROGRESS?

rougsig
rougsigOP3w ago
@flash-singh I can't use RunPod because of this strange issue. I have the same Docker image, but one built nearly a month ago works perfectly.
flash-singh
flash-singh3w ago
do all the jobs get stuck, or just that one?
rougsig
rougsigOP3w ago
All jobs in that queue. It looks like the first job after a cold worker start is always fine; from the second job onward there's a more than 50% chance it gets stuck.
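For readers trying to reproduce this pattern, a minimal client-side sketch is below: submit several identical jobs back to back and watch their statuses. It assumes the `runpod.Endpoint` client from runpod-python; the API key and input payload are placeholders.

```python
import time
import runpod

runpod.api_key = "YOUR_API_KEY"               # placeholder
endpoint = runpod.Endpoint("uucgkak7h76hfd")  # endpoint id from this thread

# Submit several identical jobs back to back, then watch their statuses.
# With the behaviour described above, jobs after the first one served by a
# cold worker tend to stay IN_PROGRESS.
jobs = [endpoint.run({"input": {"prompt": "test"}}) for _ in range(5)]  # placeholder payload

for _ in range(10):
    print([job.status() for job in jobs])
    time.sleep(10)
```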
flash-singh
flash-singh3w ago
ping me the endpoint id. either the endpoint is bad or it's using a bad sdk, can you make sure it's updated? i can see the jobs being taken from the queue but not being reported back once the job is taken
rougsig
rougsigOP3w ago
uucgkak7h76hfd. The SDK is the latest version.
rougsig
rougsigOP3w ago
I created a new endpoint with the same Docker image. The problem is almost the same: p6j8tqfojfhmll
(screenshot attached)
rougsig
rougsigOP3w ago
I have an older Docker image used in production, and it all works fine. Its SDK version is 1.7.2. My latest image runs 1.7.4 and has these problems.
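A quick way to confirm which SDK version a given image actually ships (rather than what the Dockerfile was supposed to install) is to log it from inside the worker at startup. This sketch uses only the standard library, so it does not depend on any particular runpod release:

```python
from importlib.metadata import version

# Log the installed SDK version so it shows up in the endpoint logs;
# useful when two images built from the same Dockerfile at different
# times may have resolved different runpod releases.
print("runpod-python version:", version("runpod"))
```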
rougsig
rougsigOP3w ago
I have this pip diff: https://www.diffchecker.com/eYoE7Gm2/ where runpod 1.7.2 is the older Docker image that works fine.
rougsig
rougsigOP3w ago
So I can confirm that 1.7.4 contains some bug around this; 1.7.2 works well without any issue.
flash-singh
flash-singh3w ago
https://github.com/runpod/runpod-python/releases/tag/1.7.5 we are testing this pre-release to confirm the issue is a data race in the local state management inside the python pkg
GitHub
Release 1.7.5 · runpod/runpod-python
What's Changed: Fix failed requests due to race conditions in the job queue vs job progress, by @deanq in #376. Full Changelog: 1.7.4...1.7.5
Mihály
Mihály3w ago
Hello @flash-singh, I've been struggling with the same stuck IN_PROGRESS problem for a while now and jumped at the opportunity to try out the 1.7.5 SDK. Unfortunately the issue persists: 2 out of 10 identical jobs got stuck. (I'm also using runpod.serverless.progress_update in this case, but removing those calls from my handler didn't change this behaviour.)
Endpoint: noxhy2en39n3y3
Workers: 07lhjaqe6si5cj, gy95pb525iwdot
Let me know if I can provide any logs or code to help out!
(screenshot attached)
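For context, a handler that reports progress the way described above might look roughly like the sketch below; the work inside the loop is a placeholder, and it assumes the `runpod.serverless.progress_update(job, message)` helper from runpod-python.

```python
import time
import runpod


def handler(job):
    # job["input"] carries whatever the client submitted; the loop below
    # is a stand-in for the real workload.
    total_steps = 3
    for step in range(1, total_steps + 1):
        time.sleep(1)  # placeholder for real work
        # Report intermediate progress back to the platform; this is the
        # call this thread identifies as racing with job completion.
        runpod.serverless.progress_update(job, f"step {step}/{total_steps}")
    return {"status": "done"}


runpod.serverless.start({"handler": handler})
```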
flash-singh
flash-singh3w ago
can you turn on the debug log level and share your endpoint id? @Mihály
Mihály
Mihály3w ago
@flash-singh noxhy2en39n3y3
(screenshot attached)
flash-singh
flash-singh3w ago
ty, let me know when you see this issue again and i'll look at the logs
Mihály
Mihály3w ago
I haven't submitted any jobs since the last 10 I mentioned above, and the debug ENV has already been set for weeks now. But I'll submit some more if you'd like! @flash-singh
flash-singh
flash-singh3w ago
so what you're seeing is that some jobs stay in progress forever and you have to manually cancel or remove them? so far i'm seeing things are normal; the request id you pointed out, 48612f2e-dc1b-4128-8e77-870c1902e4eb-e1, was never marked as completed?
Mihály
Mihály3w ago
Yes, it usually stays in_progress until it becomes a 404. I tried a webhook instead of polling, but that also never arrives in these cases. @flash-singh
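Until a fix lands, one client-side mitigation is to poll with a hard deadline rather than waiting indefinitely on IN_PROGRESS. A sketch, assuming the runpod-python Job object returned by `endpoint.run()` exposes `.status()`; the API key, payload, and timeout are placeholders:

```python
import time
import runpod

runpod.api_key = "YOUR_API_KEY"               # placeholder
endpoint = runpod.Endpoint("noxhy2en39n3y3")  # endpoint id from this thread

job = endpoint.run({"input": {"prompt": "test"}})  # placeholder payload

deadline = time.time() + 300  # give up after 5 minutes
while time.time() < deadline:
    status = job.status()
    if status in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        print("finished with status:", status)
        break
    time.sleep(5)
else:
    # Treat jobs still IN_PROGRESS past the deadline as stuck and handle
    # them out of band (retry, cancel via the API, etc.).
    print("job still", job.status(), "after deadline, treating as stuck")
```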
flash-singh
flash-singh3w ago
let's take this over DMs. for anyone else following: we got to the root cause. there is a data race that can occur when using the serverless progress feature, where you send progress updates within a job; such an update can conflict with the job finishing and override the completed status. we will plan a fix for this within the next 2 weeks
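To make the described race concrete, here is a deliberately simplified toy model. This is not the actual runpod-python internals, just an illustration of how a progress update still in flight when the job finishes can land late and overwrite the terminal status:

```python
import threading
import time

# Toy model of the race described above; the real state lives in the SDK's
# job-progress machinery, not in a plain dict like this.
job_state = {"job-1": "IN_PROGRESS"}


def send_progress_update(job_id, delay):
    time.sleep(delay)  # the update is delayed in flight
    # By the time it lands, the job may already be COMPLETED,
    # yet this write flips it back to IN_PROGRESS.
    job_state[job_id] = "IN_PROGRESS"


# The handler sends a progress update shortly before finishing.
updater = threading.Thread(target=send_progress_update, args=("job-1", 0.1))
updater.start()

job_state["job-1"] = "COMPLETED"  # the job finishes and reports completion
updater.join()

print(job_state["job-1"])  # prints IN_PROGRESS: the completed status was lost
```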