@flash-singh I can't use RunPod because of this strange issue. I have the same Docker image, but one built nearly a month ago, and that one works perfectly.
all the jobs get stuck or just that one?
all jobs in that queue
Looks like the first job after a cold worker start is always fine; from the second job on, there's more than a 50% chance of it getting stuck.
ping me endpoint id
the endpoint is bad, or using a bad SDK, can you make sure it's updated
i can see the jobs being taken from the queue, but they are not being reported back once the job is taken
uucgkak7h76hfd
The SDK is the latest version. I created a new endpoint with the same Docker image; the problem is almost the same:
p6j8tqfojfhmll
I have an older Docker image, used in production. Everything works fine there; its SDK version is 1.7.2.
My latest image runs 1.7.4 and has these problems.
I have this pip diff https://www.diffchecker.com/eYoE7Gm2/
where runpod 1.7.2 is the older, known-good Docker image.
So I can confirm that 1.7.4 contains some bug in this area.
1.7.2 works well without any issues.
https://github.com/runpod/runpod-python/releases/tag/1.7.5
we are testing this pre-release to confirm the issue is a data race in the local state management of the Python package
Release 1.7.5 · runpod/runpod-python
What's Changed
Fix: failed requests due to race conditions in the job queue vs job progress by @deanq in #376
Full Changelog: 1.7.4...1.7.5
Hello @flash-singh
I've been struggling with the same stuck IN_PROGRESS problem for a while now, and jumped at the opportunity to try out the 1.7.5 SDK.
Unfortunately the issue persists: 2 out of 10 identical jobs got stuck. (I'm also using runpod.serverless.progress_update in this case, but removing those calls from my handler didn't change this behaviour.)
Endpoint : noxhy2en39n3y3
Workers: 07lhjaqe6si5cj, gy95pb525iwdot
Let me know if I can provide any logs or code to help out!
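For reference, this is roughly the handler pattern in use — a minimal sketch only, where do_inference is a stand-in for the actual workload, not the real code from this endpoint:
```python
import runpod


def do_inference(job_input):
    # stand-in for the actual workload running on this endpoint
    return {"output": job_input}


def handler(job):
    job_input = job["input"]
    # intermediate progress reported back to the endpoint while the job runs
    runpod.serverless.progress_update(job, "preprocessing done")
    result = do_inference(job_input)
    runpod.serverless.progress_update(job, "inference done")
    return result


runpod.serverless.start({"handler": handler})
```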
can you turn on debug log level and share your endpoint id?
@Mihály
@flash-singh
noxhy2en39n3y3
ty let me know if you see this issue, ill look at logs
I haven't submitted any jobs since the last 10 I mentioned above, and the debug ENV has already been set for weeks now. But I'll submit some more if you'd like! @flash-singh
so what you're seeing is that some jobs stay in progress forever? you have to manually cancel or remove them?
so far I'm seeing things are normal; the request id you pointed out
48612f2e-dc1b-4128-8e77-870c1902e4eb-e1
was never marked as completed?
Yes, it usually stays as IN_PROGRESS until it becomes a 404. I tried the webhook instead of polling, but that also never arrives in these cases.
@flash-singh
lets take this over DMs
for anyone else following, we got to the root cause: there is a data race that can occur when using the serverless progress feature, where you send progress updates within a job. A progress update can conflict with the moment the job finishes and override the completed status. We plan a fix for this within the next 2 weeks.
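Until that fix ships, a possible interim mitigation (an assumption, not an official workaround) is to keep progress_update calls away from the very end of the handler, so a late update can't race with the final completed status; pinning the SDK to 1.7.2, which is reported above as unaffected, is the other option. A minimal sketch of the first approach, with do_inference again standing in for the real workload:
```python
import runpod


def do_inference(job_input):
    # stand-in for the actual workload
    return {"output": job_input}


def handler(job):
    # Only report progress at early/intermediate milestones and avoid a
    # progress_update right before returning, so it cannot race with the
    # completed status that is reported once the handler returns.
    runpod.serverless.progress_update(job, "job started")
    return do_inference(job["input"])


runpod.serverless.start({"handler": handler})
```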