RunPod15mo ago
zero

Active worker keeps downloading images and Im being charged for it

Why is it that a worker will finish downloading, extracting, and initializing, then get into a 'worker is ready' state, only to go back to downloading when it receives a job? It's just wasting credits at this point... and it's fairly frustrating.
[screenshot attached]
57 Replies
J.
J.15mo ago
can u share ur template screenshot? if ur okay with that?
zero
zeroOP15mo ago
[template screenshot attached]
zero
zeroOP15mo ago
It's a standard template as far as I'm aware. The image itself is ~30GB.
J.
J.15mo ago
Ah ok nvm, guess it's private then. Does ur container image have a tag, just to confirm? And is the platform on Docker Hub indicating amd64? It should be like username/image:1.0 or something. Ur container disk can be like 50GB then, too.
zero
zeroOP15mo ago
Yeah, it's on Docker Hub. Tbh I'm not too concerned about it being public: teamclashofficial/dot-diff-comfy:latest
J.
J.15mo ago
hm
zero
zeroOP15mo ago
Just wondering why it has to redownload every other moment. For example, one that was just idle for a minute had to do a complete redownload. Would a network drive be more suitable despite the fewer available GPUs?
J.
J.15mo ago
It doesn't, this might be a RunPod issue, let me check my stuff. Refreshing ur page doesn't show they are idle? Let me try to deploy ur stuff so I can see too.
zero
zeroOP15mo ago
Yeah, when I refresh at times it'll show like 2 are idle, then when I push a job to it, I'll check and they'll be downloading in an active state.
k
J.
J.15mo ago
Maybe try to delete the endpoint and remake it? :/ I've seen the active downloading thing before myself.
zero
zeroOP15mo ago
I'll try now
J.
J.15mo ago
Yea, I'm deploying my own endpoint/template again to see if it's ur template vs a wider issue, will lyk.
ashleyk
ashleyk15mo ago
Seems like a bug, only max workers should redownload the docker image, not active workers.
J.
J.15mo ago
Do u mind sharing ur Dockerfile? Are u calling the Python handler? I can replicate it.
zero
zeroOP15mo ago
Just deleted and deployed a new endpoint. Here's the id so you can monitor: 07o7r7hqcd22z1. It's set to 1 active and 3 workers.
J.
J.15mo ago
Nah, no need, set it to 0,0. This might be a thing to ask flash.
zero
zeroOP15mo ago
J.
J.15mo ago
Testing my own templates now to see if it's the same.
zero
zeroOP15mo ago
k
J.
J.15mo ago
Is ur start.sh calling the Python handler? Or what is it calling?
zero
zeroOP15mo ago
It's calling a custom server.py file that has the runpod handler.
J.
J.15mo ago
# Start the handler only if this script is run directly
if __name__ == "__main__":
    runpod.serverless.start({"handler": handler})
Try to get rid of this if check, I think this is causing a bug.
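For reference, a minimal sketch of what the handler file could look like without the guard (assuming the standard runpod SDK pattern; the handler body here is just illustrative):
import runpod

def handler(job):
    # job["input"] carries whatever payload was sent to the endpoint
    return {"echo": job["input"]}

# start the worker unconditionally instead of behind an if __name__ == "__main__" guard
runpod.serverless.start({"handler": handler})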
zero
zeroOP15mo ago
k
J.
J.15mo ago
Where runpod isn't catching ur handler. Urgh, runpod error checking rlly sucks, I wish there was better error debugging.
zero
zeroOP15mo ago
lol true. This may take a while as I'll have to rebuild
J.
J.15mo ago
No worries, why don't u try a fake one for now.
ashleyk
ashleyk15mo ago
It's not, this is normal in Python, I do it all the time and have never had an issue like this.
J.
J.15mo ago
hm
J.
J.15mo ago
RunPod Blog
Serverless | Create a Custom Basic API
RunPod's Serverless platform allows for the creation of API endpoints that automatically scale to meet demand. The tutorial guides you through creating a basic worker and turning it into an API endpoint on the RunPod serverless platform. For this tutorial, we will create an API endpoint that helps us accomplish
J.
J.15mo ago
Got it, just my guess. Also maybe use ur built image as the base and just copy ur handler.py over it. Maybe flash can help then.
ashleyk
ashleyk15mo ago
Its most likely just a bug with the serverless handling of active workers and treating them like max workers, there is nothing wrong with the code, image etc. Best for @flash-singh to advise, he already asked for endpoint id in #🎤|general
J.
J.15mo ago
Got it, will leave it to @flash-singh, and I guess share ur current endpoint @black_zero6641
zero
zeroOP15mo ago
forwarded the id to him in general
J.
J.15mo ago
I think keep it at 0-0 so ur not burning cash.
zero
zeroOP15mo ago
Thank you for the help. It's much appreciated.
J.
J.15mo ago
I do find it weird that it's replicable with ur image tho, not my other ones, which is why I thought maybe it's something inherent to the image.
zero
zeroOP15mo ago
yeah I thought I was going crazy for a sec lmao
ashleyk
ashleyk15mo ago
Also avoid using latest as the tag, it's best practice to use a version tag, but that's most likely not the cause of the issue.
zero
zeroOP15mo ago
It may end up being related to the image if it's not happening to others. Will do.
ashleyk
ashleyk15mo ago
By the way this is also not the correct way of handling errors:
return {
    "status": "error",
    "message": f"An error occurred while processing the job: {e}",
}
J.
J.15mo ago
justinwlin/runpodwhisperx:1.4 https://github.com/justinwlin/runpodWhisperx
GitHub
GitHub - justinwlin/runpodWhisperx: Runpod WhisperX Docker Containe...
Runpod WhisperX Docker Container Repo. Contribute to justinwlin/runpodWhisperx development by creating an account on GitHub.
J.
J.15mo ago
[screenshot attached]
J.
J.15mo ago
An example of my template.
ashleyk
ashleyk15mo ago
Correct way of handling errors and causing the job to fail. If the error is a string:
return {
    "error": f"An error occurred while processing the job: {e}"
}
If it's a list or dict:
return {
    "error": "Some error message",
    "output": someDict | someList
}
It's important to note that the error key can only handle a string, not a list or dict.
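So a handler sketch that follows this convention (do_work is a hypothetical placeholder for ur own processing, not part of the SDK):
import runpod

def do_work(job_input):
    # hypothetical processing step, replace with ur own logic
    return {"result": job_input}

def handler(job):
    try:
        return {"output": do_work(job["input"])}
    except Exception as e:
        # returning an "error" key (a string) causes the job to be marked failed
        return {"error": f"An error occurred while processing the job: {e}"}

runpod.serverless.start({"handler": handler})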
J.
J.15mo ago
[screenshot attached]
J.
J.15mo ago
Another one.
J.
J.15mo ago
GitHub
GitHub - justinwlin/Runpod-OpenLLM-Pod-and-Serverless: A repo for O...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
zero
zeroOP15mo ago
Got it. I'll do some quick reformatting to the file. I'm guessing this is for clearer error logging on the runpod side?
J.
J.15mo ago
(in case u wanted reference)
ashleyk
ashleyk15mo ago
By the way, for version numbers I recommend semantic versioning, not arbitrary version numbers: https://semver.org/
Semantic Versioning
Semantic Versioning 2.0.0
Semantic Versioning spec and website
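For example, building and pushing a versioned tag instead of latest (image name taken from this thread, version number just illustrative):
docker build --platform linux/amd64 -t teamclashofficial/dot-diff-comfy:1.0.0 .
docker push teamclashofficial/dot-diff-comfy:1.0.0
Then point the template at the 1.0.0 tag and bump it (1.0.1, 1.1.0, ...) for each release instead of overwriting latest.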
zero
zeroOP15mo ago
Thanks! One question I have though: would it be better to split the image up into specific smaller domains for faster startup time (I think I'm limited to 5)? I'm not sure if runpod caches the images to avoid the downloading issue. Got it. Honestly I should have just asked questions here sooner. Would have caused fewer headaches 😂
J.
J.15mo ago
They are supposed to cache, my workers don't refresh. I recommend always having 2 max workers minimum, preferably three, and runpod will spin up 5 idles for u (maybe 1-2 throttled), which gives u more workers that get to download ur image and work; they still honor the max workers at any given time tho. But I do think if u wanna sanity check urself, this tutorial is a good sanity check if ur getting diverging behavior. Especially cause it's so consistent on ur image, I feel something is wrong, but I honestly cannot fathom a guess. Anyways, hopefully flash can help out.
ashleyk
ashleyk15mo ago
Sounds like the issue is due to pushing a new release to the latest tag.
zero
zeroOP15mo ago
That's what I'm going through now. Also just specified a tag and removed latest. The pod is downloading the new tag now, so I should be able to confirm in a few minutes. I think you called it. Did more tests and now Idle/Initializing pods go right to startup instead of downloading. You're the 🐐
J.
J.15mo ago
Guess that answers my never-answered question from before too xD https://discord.com/channels/912829806415085598/1208257003131113502 👁️
ashleyk
ashleyk15mo ago
Not me, thank @flash-singh , he nailed it.
J.
J.15mo ago
Wonder how come I was getting an infinite download too tho, interesting. Weird, but as long as it's working now.
