Active worker keeps downloading images and I'm being charged for it

Why does a worker finish downloading, extracting, and initializing, then get into a 'worker is ready' state, only to go back to downloading when it receives a job? It's just wasting credits at this point... and fairly frustrating.
[image attachment]
57 Replies
justin
justin4mo ago
can u share ur template screenshot? if ur okay with that?
black_zero
black_zero4mo ago
[image attachment]
black_zero
black_zero4mo ago
It's a standard template as far as I'm aware. The image itself is ~30GB.
justin
justin4mo ago
Ah ok nvm, guess it's private then. Does ur container image have a tag, just to confirm? And does Docker Hub show the platform as amd64? It should look like username/image:1.0 or something. Ur container disk can be like 50GB then too.
black_zero
black_zero4mo ago
Yeah, it's on Docker Hub. Tbh I'm not too concerned about it being public: teamclashofficial/dot-diff-comfy:latest
justin
justin4mo ago
hm
black_zero
black_zero4mo ago
Just wondering why it has to redownload every other moment. For example, one that was just idle for a minute had to do a complete redownload. Would a network drive be more suitable despite the fewer available GPUs?
justin
justin4mo ago
It doesn't; this might be a runpod issue, let me check my stuff. Refreshing ur page doesn't show they are idle? Let me try to deploy ur stuff so I can see too.
black_zero
black_zero4mo ago
Yeah, when I refresh at times it'll show like 2 are idle, then when I push a job to it, I'll check and they'll be downloading in an active state.
k
justin
justin4mo ago
Maybe try to delete the endpoint and remake it? :/ I've seen the active downloading thing before myself.
black_zero
black_zero4mo ago
I'll try now
justin
justin4mo ago
Yea, I'm deploying my own endpoint and template again to see if it's ur template vs. a wider issue. Will lyk.
ashleyk
ashleyk4mo ago
Seems like a bug, only max workers should redownload the docker image, not active workers.
justin
justin4mo ago
Do u mind sharing ur Dockerfile? Are u calling the python handler? I can replicate it.
black_zero
black_zero4mo ago
Just deleted and deployed a new endpoint. Here's the id so you can monitor: 07o7r7hqcd22z1
It's set to 1 active and 3 workers.
justin
justin4mo ago
Nah, no need. Set it to 0,0. This might be a thing to ask flash.
justin
justin4mo ago
Testing my own templates now to see if it's the same.
black_zero
black_zero4mo ago
k
justin
justin4mo ago
Is ur start.sh calling the python handler? Or what is it calling?
black_zero
black_zero4mo ago
It's calling a custom server.py file that has the runpod handler.
justin
justin4mo ago
# Start the handler only if this script is run directly
if __name__ == "__main__":
    runpod.serverless.start({"handler": handler})
Try to get rid of this if check, I think this is causing a bug
black_zero
black_zero4mo ago
k
justin
justin4mo ago
where runpod isn't catching ur handler. Urgh, runpod's error checking rlly sucks, I wish there was better error debugging.
black_zero
black_zero4mo ago
lol true. This may take a while as I'll have to rebuild
justin
justin4mo ago
No worries. Why don't u try a fake one for now?
ashleyk
ashleyk4mo ago
It's not; this is normal in Python. I do it all the time and have never had an issue like this.
justin
justin4mo ago
hm
justin
justin4mo ago
RunPod Blog
Serverless | Create a Custom Basic API
RunPod's Serverless platform allows for the creation of API endpoints that automatically scale to meet demand. The tutorial guides you through creating a basic worker and turning it into an API endpoint on the RunPod serverless platform. For this tutorial, we will create an API endpoint that helps us accomplish
justin
justin4mo ago
Got it, just my guess. Also, maybe use ur built image as the base and just copy ur handler.py over it. Maybe flash can help then.
ashleyk
ashleyk4mo ago
It's most likely just a bug with the serverless handling of active workers, treating them like max workers; there is nothing wrong with the code, image, etc. Best for @flash-singh to advise, he already asked for the endpoint id in #🎤|general.
justin
justin4mo ago
Got it, will leave it to @flash-singh, and I guess share ur current endpoint @black_zero6641
black_zero
black_zero4mo ago
forwarded the id to him in general
justin
justin4mo ago
I think keep it at 0-0 so ur not burning cash.
black_zero
black_zero4mo ago
Thank you for the help. It's much appreciated.
justin
justin4mo ago
I do find it weird that it's replicable with ur image tho, and not with my other ones, which is why I thought maybe it's something inherent to the image.
black_zero
black_zero4mo ago
yeah I thought I was going crazy for a sec lmao
ashleyk
ashleyk4mo ago
Also avoid using latest as the tag; it's best practice to use a version tag, but that's most likely not the cause of the issue.
black_zero
black_zero4mo ago
It may end up being related to the image if it's not happening to others.
Will do.
ashleyk
ashleyk4mo ago
By the way this is also not the correct way of handling errors:
return {
    "status": "error",
    "message": f"An error occurred while processing the job: {e}",
}
justin
justin4mo ago
justinwlin/runpodwhisperx:1.4 https://github.com/justinwlin/runpodWhisperx
GitHub
GitHub - justinwlin/runpodWhisperx: Runpod WhisperX Docker Containe...
Runpod WhisperX Docker Container Repo. Contribute to justinwlin/runpodWhisperx development by creating an account on GitHub.
justin
justin4mo ago
[image attachment]
justin
justin4mo ago
An example of my template
ashleyk
ashleyk4mo ago
Correct way of handling errors and causing the job to fail:
If the error is a string:
return {
    "error": f"An error occurred while processing the job: {e}"
}
If it's a list or dict:
return {
    "error": "Some error message",
    "output": some_dict_or_list,  # a dict or a list carrying the structured details
}
It's important to note that the error key can only hold a string, not a list or dict.
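As a rough illustration of the advice above, here is a minimal sketch of a handler that reports failures through a top-level string `error` key instead of a custom `status`/`message` pair. The `process` function and the `prompt` field are hypothetical stand-ins for the real workload, and the registration lines are left commented out so the snippet runs without the runpod package installed.

```python
from typing import Any, Dict


def process(job_input: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the real workload."""
    if "prompt" not in job_input:
        raise ValueError("missing required field 'prompt'")
    return {"result": f"processed: {job_input['prompt']}"}


def handler(job: Dict[str, Any]) -> Dict[str, Any]:
    """Return the output on success, or a top-level string "error" on failure."""
    try:
        return process(job.get("input", {}))
    except Exception as e:
        # Per the thread, "error" must be a plain string; structured
        # details (a dict or list) would go under "output" instead.
        return {"error": f"An error occurred while processing the job: {e}"}


# In the actual worker file the handler is registered as quoted in the thread:
# if __name__ == "__main__":
#     import runpod
#     runpod.serverless.start({"handler": handler})
```

Calling `handler({"input": {}})` locally returns a dict whose only key is `error`, which is the shape described above for marking a job as failed.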
justin
justin4mo ago
[image attachment]
justin
justin4mo ago
Another one
justin
justin4mo ago
GitHub
GitHub - justinwlin/Runpod-OpenLLM-Pod-and-Serverless: A repo for O...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
black_zero
black_zero4mo ago
Got it. I'll do some quick reformatting of the file. I'm guessing this is for clearer error logging on the runpod side?
justin
justin4mo ago
(in case u wanted reference)
ashleyk
ashleyk4mo ago
By the way, for version numbers I recommend semantic versioning, not arbitrary version numbers: https://semver.org/
Semantic Versioning
Semantic Versioning 2.0.0
Semantic Versioning spec and website
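For illustration, a small sketch of how MAJOR.MINOR.PATCH tags could be bumped between image releases. `parse_version` and `bump` are hypothetical helper names, and the sketch assumes plain three-part versions with no pre-release or build metadata.

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse a plain 'MAJOR.MINOR.PATCH' tag into three integers."""
    major, minor, patch = (int(part) for part in tag.split("."))
    return major, minor, patch


def bump(tag: str, part: str) -> str:
    """Return the next version, resetting the lower-order fields to zero."""
    major, minor, patch = parse_version(tag)
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")
```

Under this scheme a handler bug fix would move an image from `1.0.0` to `bump("1.0.0", "patch")`, and each release gets its own tag rather than overwriting an existing one.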
black_zero
black_zero4mo ago
Thanks! One question I have though is whether it would be better to split the image up into smaller, domain-specific images for faster startup time (I think I'm limited to 5)? I'm not sure if runpod caches the images to avoid the downloading issue.
Got it. Honestly I should have just asked questions here sooner. Would have caused fewer headaches 😂
justin
justin4mo ago
They are supposed to cache; my workers don't refresh. I recommend always having at least 2 max workers, preferably 3, and runpod will spin up 5 idles for u (maybe 1-2 throttled). That gives u more workers to download ur image and work; they still honor the max workers at any given time tho.
But I do think if u wanna sanity check urself, this tutorial is a good way to see if ur getting diverging behavior. Especially cause it's so consistent on ur image, I feel something is wrong, but I honestly cannot fathom a guess. Anyways, hopefully Flash can help out.
ashleyk
ashleyk4mo ago
Sounds like the issue is due to pushing a new release to the latest tag.
black_zero
black_zero4mo ago
That's what I'm going through now. Also just specified a tag and removed latest. The pod is downloading the new tag now, so I should be able to confirm in a few minutes.
I think you called it. Did more tests and now Idle/Initializing pods go right to startup instead of downloading. You're the 🐐
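The fix that worked here was replacing the `latest` tag with a specific version tag. A minimal sketch of a pre-deploy check that flags image references without a pinned tag follows; `split_image_ref` and `is_pinned` are hypothetical helpers written for this example, not part of RunPod or Docker tooling.

```python
from typing import Optional, Tuple


def split_image_ref(ref: str) -> Tuple[str, Optional[str]]:
    """Split 'repo/name:tag' into (name, tag); tag is None when omitted."""
    # Ignore any digest suffix such as '@sha256:...'.
    name_part = ref.split("@", 1)[0]
    # Only a ':' after the last '/' marks a tag, so a registry port
    # like 'localhost:5000/image' is not mistaken for one.
    slash = name_part.rfind("/")
    colon = name_part.rfind(":")
    if colon > slash:
        return name_part[:colon], name_part[colon + 1:]
    return name_part, None


def is_pinned(ref: str) -> bool:
    """True if the reference names a digest or a tag other than 'latest'."""
    if "@" in ref:
        return True
    _, tag = split_image_ref(ref)
    return tag is not None and tag != "latest"
```

With this check, `teamclashofficial/dot-diff-comfy:latest` is reported as unpinned, while a reference such as `justinwlin/runpodwhisperx:1.4` passes.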
justin
justin4mo ago
Guess that answers my never-answered question from before too xD https://discord.com/channels/912829806415085598/1208257003131113502 👁️
ashleyk
ashleyk4mo ago
Not me, thank @flash-singh, he nailed it.
justin
justin4mo ago
Wonder how come I was getting an infinite download too tho. Interesting, weird. But as long as it's working now.