Runpod•2y ago

Active worker keeps downloading images and I'm being charged for it

Why does a worker finish downloading, extracting, and initializing, reach a 'worker is ready' state, and then go back to downloading when it receives a job? It's just wasting credits at this point, and it's fairly frustrating.

57 Replies

J.•2y ago

can u share ur template screenshot? if ur okay with that?

zer∅OP•2y ago

its a standard template as far as im aware. The image itself is ~30GB

J.•2y ago

Ah ok nvm, guess it's private then. Does ur container image have a tag, just to confirm? And is the platform on Docker Hub showing amd64? It should look like username/image:1.0 or something. Ur container disk can be like 50GB then too.

zer∅OP•2y ago

yeah its on dockerhub. Tbh Im not too concerned about it being public: teamclashofficial/dot-diff-comfy:latest

J.•2y ago

zer∅OP•2y ago

just wondering why it has to redownload every other moment. For example, one that was just idle for a minute had to do a complete redownload. Would a network drive be more suitable despite the fewer available GPUs?

J.•2y ago

It doesn't, this might be a runpod issue. Let me check my stuff. Refreshing ur page doesn't show they are idle? Let me try to deploy ur stuff so i can see too.

zer∅OP•2y ago

yeah when I refresh at times, it'll show like 2 are idle--then when I push a job to it, I'll check and they'll be downloading in an active state k

J.•2y ago

Maybe try to delete the endpoint and remake it? :/ ive seen the active downloading thing before myself

zer∅OP•2y ago

I'll try now

J.•2y ago

yea im deploying my own endpoint and template again to see if it's just ur template vs a wider issue, will lyk

ashleyk•2y ago

Seems like a bug, only max workers should redownload the docker image, not active workers.

J.•2y ago

Do u mind to share ur dockerfile? are u calling the python handler? i can replicate it

zer∅OP•2y ago

just deleted and deployed a new endpoint. Here's the id so you can monitor: 07o7r7hqcd22z1. It's set to 1 active and 3 workers.

J.•2y ago

nah no need, set it to 0,0. this might be a thing to ask flash

zer∅OP•2y ago

yeah

dockerfile

J.•2y ago

testing my own templates now to see if it's the same

zer∅OP•2y ago

J.•2y ago

is ur start.sh calling python handler? or what is it calling

zer∅OP•2y ago

its calling a custom server.py file that has the runpod handler

server.py

J.•2y ago

# Start the handler only if this script is run directly
if __name__ == "__main__":
    runpod.serverless.start({"handler": handler})

try to get rid of this if check, i think this is causing a bug

zer∅OP•2y ago

J.•2y ago

where runpod isn't catching ur handler. urgh, runpod error checking rlly sucks, i wish there was better error debugging

zer∅OP•2y ago

lol true. This may take a while as I'll have to rebuild

J.•2y ago

no worries why dont u try a fake one for now

ashleyk•2y ago

It's not, this is normal in Python. I do it all the time and have never had an issue like this.
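A minimal sketch of why the `if __name__ == "__main__":` guard is harmless here, assuming a stand-in file that prints where the real server.py would call `runpod.serverless.start`. The guarded block runs when the file is executed directly (as a start script would with `python server.py`) and is skipped only when the file is imported. The file contents and helper names are illustrative, not the OP's actual code.

```python
import importlib.util
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for server.py: prints instead of calling runpod.serverless.start().
SOURCE = '''
def handler(job):
    return {"echo": job["input"]}

if __name__ == "__main__":
    print("starting worker")
'''


def run_as_script(path):
    """Run the file the way `python server.py` would and return its stdout."""
    result = subprocess.run(
        [sys.executable, str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def import_as_module(path):
    """Import the file as a module named 'server'; the guarded block is skipped."""
    spec = importlib.util.spec_from_file_location("server", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "server.py"
    script.write_text(SOURCE)
    script_output = run_as_script(script)    # guard is true: block runs
    module = import_as_module(script)        # guard is false: block skipped
    echoed = module.handler({"input": "hi"})

print(script_output)
print(echoed)
```

Since the container entrypoint runs the file directly, the guard evaluates to true and the worker still starts, which is consistent with the guard not being the cause of the re-downloads.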

J.•2y ago

https://blog.runpod.io/serverless-create-a-basic-api/

RunPod Blog

Serverless | Create a Custom Basic API

RunPod's Serverless platform allows for the creation of API endpoints that automatically scale to meet demand. The tutorial guides you through creating a basic worker and turning it into an API endpoint on the RunPod serverless platform. For this tutorial, we will create an API endpoint that helps us accomplish

J.•2y ago

Got it, just my guess. Also maybe use ur built image as the base and just copy ur handler.py over it. Maybe flash can help then.

ashleyk•2y ago

It's most likely just a bug with the serverless handling of active workers and treating them like max workers. There is nothing wrong with the code, image, etc. Best for @flash-singh to advise, he already asked for the endpoint id in #🎤｜general

J.•2y ago

Got it, will leave to @flash-singh , and i guess share ur current endpoint @black_zero6641

zer∅OP•2y ago

forwarded the id to him in general

J.•2y ago

I think keep to 0-0 so ur not burning cash

zer∅OP•2y ago

thank you for the help. Its much appreciated

J.•2y ago

i do find it weird that it's replicable with ur image tho and not my other ones, which is why i thought maybe it's something inherent to the image

zer∅OP•2y ago

yeah I thought I was going crazy for a sec lmao

ashleyk•2y ago

Also avoid using latest as the tag. It's best practice to use a version tag, but that's most likely not the cause of the issue.
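A rough sketch of a pre-deploy check for this advice, assuming plain Docker-style image references like the ones quoted in this thread. The function names are hypothetical and the parsing is deliberately simple; it is not how Runpod itself inspects images.

```python
def split_image_ref(ref):
    """Split 'repo/name:tag' into (name, tag); tag is None when omitted."""
    # Ignore a digest suffix such as name@sha256:... if one is present.
    name, _, _digest = ref.partition("@")
    # Only a colon after the last slash separates the tag; an earlier colon
    # belongs to a registry host:port such as localhost:5000/image.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    if colon > last_slash:
        return name[:colon], name[colon + 1:]
    return name, None


def is_pinned(ref):
    """True if the reference has a digest or an explicit tag other than 'latest'."""
    if "@" in ref:
        return True
    _, tag = split_image_ref(ref)
    return tag is not None and tag != "latest"


for ref in (
    "teamclashofficial/dot-diff-comfy:latest",
    "teamclashofficial/dot-diff-comfy",
    "justinwlin/runpodwhisperx:1.4",
):
    print(ref, "->", "pinned" if is_pinned(ref) else "NOT pinned")
```

A check like this could run in CI before updating an endpoint template, so a moving tag never reaches the workers.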

zer∅OP•2y ago

it may end up being related to the image if its not happening to others. will do

ashleyk•2y ago

By the way this is also not the correct way of handling errors:

return {
    "status": "error",
    "message": f"An error occurred while processing the job: {e}",
}


J.•2y ago

justinwlin/runpodwhisperx:1.4 https://github.com/justinwlin/runpodWhisperx

GitHub

GitHub - justinwlin/runpodWhisperx: Runpod WhisperX Docker Containe...

Runpod WhisperX Docker Container Repo. Contribute to justinwlin/runpodWhisperx development by creating an account on GitHub.

J.•2y ago

an example of my template

ashleyk•2y ago

Correct way of handling errors and causing the job to fail: If the error is a string:

return {
    "error": f"An error occurred while processing the job: {e}"
}


If it's a list or dict:

return {
    "error": "Some error message",
    "output": some_dict_or_list  # the structured details, a dict or a list
}


It's important to note that the error key can only hold a string, not a list or dict.
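A minimal sketch of a handler that follows the return shapes described above. The `process` function and the `fail` helper are hypothetical stand-ins, and the input field `prompt` is an assumption for illustration; only the `error` and `output` keys come from the messages in this thread.

```python
def fail(message, details=None):
    """Build a failure result: 'error' is always a string, details go in 'output'."""
    result = {"error": str(message)}
    if details is not None:
        result["output"] = details  # dict or list with structured details
    return result


def process(job_input):
    """Hypothetical stand-in for the real work; raises on bad input."""
    if "prompt" not in job_input:
        raise ValueError("missing 'prompt'")
    return {"result": job_input["prompt"].upper()}


def handler(job):
    """Return the output on success, or a dict with an 'error' string on failure."""
    try:
        return process(job["input"])
    except Exception as e:
        return fail(f"An error occurred while processing the job: {e}")


print(handler({"input": {"prompt": "hi"}}))
print(handler({"input": {}}))
```

In a real worker this `handler` would be the function passed to `runpod.serverless.start({"handler": handler})`, as in the snippet quoted earlier in the thread.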

J.•2y ago

and another one

J.•2y ago

https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless

GitHub

GitHub - justinwlin/Runpod-OpenLLM-Pod-and-Serverless: A repo for O...

A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.

zer∅OP•2y ago

got it. I'll do some quick reformatting to the file. Im guessing this is for clearer error logging on the runpod side?

J.•2y ago

(in case u wanted reference)

ashleyk•2y ago

By the way, for version numbers I recommend semantic versioning, not arbitrary version numbers: https://semver.org/
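A small sketch of what semantic version tags buy you, assuming tags use only the MAJOR.MINOR.PATCH core (pre-release and build metadata from the full spec are left out on purpose). The helper names are illustrative.

```python
import re

# Core of Semantic Versioning 2.0.0: three numeric parts without leading zeros.
SEMVER_CORE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


def parse(tag):
    """Return (major, minor, patch) as ints, or None if the tag is not MAJOR.MINOR.PATCH."""
    match = SEMVER_CORE.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def bump(tag, part):
    """Return the next version after incrementing 'major', 'minor', or 'patch'."""
    parsed = parse(tag)
    if parsed is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {tag!r}")
    major, minor, patch = parsed
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("1.4.0", "patch"))
print(sorted(["1.10.0", "1.9.0", "1.9.1"], key=parse))
```

Sorting by the parsed tuple rather than the raw string keeps `1.10.0` after `1.9.1`, which is one practical reason to stick to a fixed three-part scheme for image tags.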

Semantic Versioning

Semantic Versioning 2.0.0

Semantic Versioning spec and website

zer∅OP•2y ago

thanks! One question I have though is whether it would be better to split the image up into specific smaller domains for faster startup time (I think I'm limited to 5)? I'm not sure if runpod caches the images to avoid the downloading issue. got it. Honestly I should have just asked questions here sooner. Would have caused fewer headaches 😂

J.•2y ago

They are supposed to cache, my workers don't refresh. I recommend always having 2 max workers minimum, preferably three, and runpod will spin up 5 idles for u (maybe 1-2 throttled). That gives u more workers to download ur image and work, and they still honor the max workers at any given time tho. But i do think if u wanna sanity check urself, this tutorial is a good sanity check if ur getting diverging behavior. Especially cause it's so consistent on ur image i feel something is wrong, but i honestly cannot fathom a guess. Anyways, hopefully Flash can help out.

ashleyk•2y ago

Sounds like the issue is due to pushing a new release to the latest tag.

zer∅OP•2y ago

that's what I'm going through now. Also just specified a tag and removed latest. The pod is downloading the new tag now, so I should be able to confirm in a few minutes. I think you called it. Did more tests and now Idle/Initializing pods go right to startup instead of downloading. You're the 🐐

J.•2y ago

Guess that answers my never answered question before too xD https://discord.com/channels/912829806415085598/1208257003131113502 👁️

ashleyk•2y ago

Not me, thank @flash-singh , he nailed it.

J.•2y ago

Wonder how come i was getting an infinite download too tho. interesting, weird weird, but as long as it's working now
