Billed for endpoint stuck in state: Service not ready yet. Retrying...
Hey there, we have a serverless endpoint that seems to be stuck in the state: Service not ready yet. Retrying...
It has been stuck in this state for about 18 hours now, and it appears we're getting billed for it (already $22 for a day) even though no resources are being used. We can't get it out of this state, and we don't want to put more money on our account until this is resolved.
Is there anything we can do to stop this? Is this a common problem? Or a rare glitch?
I also submitted a support ticket online with further details about the specific endpoint ID.
63 Replies
Thanks @nerdylive, I was mistaken about a few things. The total billed time was 14 hours.
Looking through the logs, this was happening for 2 hours. But I think it was happening to multiple workers, so perhaps that is how it added up to the 14-hour charge.
Could it perhaps have to do with workers trying to connect to network storage and failing, and then we get billed for that?
Region: Oregon
Time: July 5, 11:51pm MT – July 6, 1:58am MT
We've had the same endpoint deployed for roughly a month or so now... maybe three weeks? It's been working great AFAIK, but I have seen these messages before.
Additional info, if it helps:
You can see total execution time is 164s and cold start time is 95s, so the total should be roughly
(164 + 95) * $0.00044 ≈ $0.11
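As a sanity check on that math (the $0.00044/s rate is just the per-second price shown for our endpoint), here's a quick script comparing what we should have been billed against the ~14 hours we actually were:

```python
# Rough sanity check of expected vs. actual billed time for the endpoint.
RATE_PER_SECOND = 0.00044  # USD/s, the rate shown for our endpoint

execution_s = 164   # total execution time from the dashboard
cold_start_s = 95   # total cold start time from the dashboard

expected_cost = (execution_s + cold_start_s) * RATE_PER_SECOND
print(f'Expected: ${expected_cost:.2f}')           # ~$0.11

billed_hours = 14   # what we were actually charged for
actual_cost = billed_hours * 3600 * RATE_PER_SECOND
print(f'Actually billed: ~${actual_cost:.2f}')     # ~$22
```

So ~14 billed hours at that rate lines up almost exactly with the $22 we saw for the day.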

I'm using @ashleyk's worker image and I can't find any reference to that log in the code:
https://github.com/ashleykleynhans/runpod-worker-a1111
Doh, here it is: https://github.com/ashleykleynhans/runpod-worker-a1111/blob/main/rp_handler.py#L44
So wait_for_service... not sure what this method does, tbh.
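From a quick skim, it just polls the local A1111 API until it responds; the "Service not ready yet. Retrying..." line comes from that loop. Not the exact code from the repo, but the shape is roughly this (URL, interval, and logging are my guesses):

```python
import time
import requests

def wait_for_service(url, check_interval=0.5):
    """Block until the local A1111 API answers before taking jobs."""
    while True:
        try:
            requests.get(url, timeout=2)
            return  # service is up
        except requests.exceptions.RequestException:
            print('Service not ready yet. Retrying...')
        time.sleep(check_interval)
```

So if A1111 never comes up (network volume not attaching, broken venv, etc.), a loop like this just spins while the worker sits there looking busy.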
Ah, does it have to be through Jupyter? Can I SSH?
I just started the pod without Jupyter, hah
sqlite3.DatabaseError: database disk image is malformed
Hmm
77% of the volume, 0% of container
Yeah, it might be the SQLite DB for A1111; I might just delete it and try relaunching.
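Before deleting it, something like this can confirm whether that DB is actually the corrupted one (the path is a guess; point it at wherever your A1111 install keeps its cache DB):

```python
import sqlite3

# Hypothetical path -- adjust to your A1111 install's cache DB.
DB_PATH = '/workspace/stable-diffusion-webui/cache.sqlite3'

con = sqlite3.connect(DB_PATH)
try:
    result = con.execute('PRAGMA integrity_check;').fetchone()[0]
    print(result)  # prints 'ok' for a healthy database
except sqlite3.DatabaseError as err:
    print(f'Corrupted: {err}')
finally:
    con.close()
```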
So I can relaunch A1111 no problem, no db issues. Hmm
No, just running from the Pod.
I did delete the cache DB anyway. I'm not sure that was the issue, though, because it continued to execute inferences. Very strange.
So perhaps it's possible that the network volume is taking a long time to attach, it got stuck somehow, and the endpoint starts billing right away.
But we had set the execution timeout to 600 seconds (the default), so 10 minutes max. I would think that would kill the worker, but this went on for 2 hours.
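One worker-side guard against this (my own sketch, not something from the worker repo) would be a hard deadline around the readiness loop, so a worker whose backend never comes up exits on its own instead of retrying for hours:

```python
import sys
import time
import requests

def wait_for_service_with_deadline(url, deadline_s=600, check_interval=5):
    """Give up after deadline_s so a stuck worker exits instead of spinning."""
    started = time.time()
    while time.time() - started < deadline_s:
        try:
            requests.get(url, timeout=2)
            return  # service came up in time
        except requests.exceptions.RequestException:
            print('Service not ready yet. Retrying...')
        time.sleep(check_interval)
    print(f'Service never became ready within {deadline_s}s, exiting worker.')
    sys.exit(1)
```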
Not sure where you are @nerdylive but it is getting late here. I'm really grateful for all your support! I will update the zendesk ticket to fill them in about what we learned here, and maybe that will help us understand what happened to the billing.
I think we'll try to move to direct storage vs network storage moving forward.
Thanks again @nerdylive !
@nerdylive I can start another thread about this, but I understand that the main container disk is non-persistent. However, I assumed that with templates we can spec a volume disk, which is persistent? In that case, we could have our Dockerfile (or start.sh?) load in/configure A1111 + the models onto the volume disk if they don't exist, and then reuse that between executions?
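Roughly what I have in mind for that bootstrap step, as a sketch (paths and the model URL are placeholders); start.sh would call this before launching A1111:

```python
import os
import urllib.request

# Placeholder locations/URLs -- adjust for your own template.
VOLUME_MODELS_DIR = '/workspace/models/Stable-diffusion'
MODELS = {
    'my_checkpoint.safetensors': 'https://example.com/my_checkpoint.safetensors',
}

def bootstrap_models():
    """Download models to the persistent volume disk only if missing,
    so later executions reuse what's already there."""
    os.makedirs(VOLUME_MODELS_DIR, exist_ok=True)
    for filename, url in MODELS.items():
        target = os.path.join(VOLUME_MODELS_DIR, filename)
        if os.path.exists(target):
            print(f'{filename} already present, skipping')
            continue
        print(f'Downloading {filename}...')
        urllib.request.urlretrieve(url, target)

if __name__ == '__main__':
    bootstrap_models()
```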
Anyway, sleep for me! hah
Thanks again
@nerdylive btw this is happening again. Nothing is getting picked up from the queue

It's all the stuff from Ashley's image, including ControlNet, ADetailer, etc.
But honestly, it just seems like everything is moving very slowly on RunPod right now. I haven't changed anything: same configuration and same extensions as I've had for the past month. Just the past few days have been really glitchy.
Will check logs now, one sec.
webui.log is empty
Ohhhh, geez. I'm sorry. I am out of network volume space. That must be the issue
Well, it's strange, because I had 77% usage of a 65GB network volume. So roughly 15GB free, right?
I just downloaded a checkpoint, ~7GB.
And suddenly now it's all used up.
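The math doesn't add up: 23% of 65GB is about 15GB free, and a ~7GB checkpoint should still leave roughly 8GB. A quick way to see the real numbers from inside the pod (assuming the network volume is mounted at /workspace):

```python
import shutil

# Assumes the network volume is mounted at /workspace.
total, used, free = shutil.disk_usage('/workspace')
gib = 1024 ** 3
print(f'total {total / gib:.1f} GiB | used {used / gib:.1f} GiB | free {free / gib:.1f} GiB')
```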
Yeah, I did; somehow the venv is using 14G.
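To find out what's actually eating the space, a du-style walk over the volume's top-level directories works (again assuming it's mounted at /workspace):

```python
import os

ROOT = '/workspace'  # assumed mount point of the network volume

def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

for entry in sorted(os.listdir(ROOT)):
    full = os.path.join(ROOT, entry)
    if os.path.isdir(full):
        print(f'{dir_size(full) / 1024**3:6.1f} GiB  {entry}')
```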
I didn't install any new packages, just a new model ~7G. Anyway, yeah something filled that space for sure. I will need to investigate.
I think I'd really like to find a way to stop using network storage, and have a single model per endpoint or something. Do you know what most people do?
If it needs to run from the network volume anyway, wouldn't it be the same speed, or even slower?
Ah, it's happening again! It seems only one or two workers get stuck like this. Very strange.

Yup just creating a pod again, hah
89% still (I deleted an unused model to free up some space) and it hasn't changed since
Super weird, the output of cat /workspace/logs/webui.log keeps changing every time I run it.
Like, drastically different.
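My guess is that several workers share the same network volume and all append to the same /workspace/logs/webui.log, which would explain the output jumping around between reads. A sketch of a workaround: key the log path by a per-worker ID (RUNPOD_POD_ID is my assumption for the env var; use whatever unique ID your workers actually expose):

```python
import os

# Assumed: each worker exposes a unique ID via an environment variable.
worker_id = os.environ.get('RUNPOD_POD_ID', 'unknown-worker')

# Give each worker its own log file on the shared volume.
log_path = f'/workspace/logs/webui-{worker_id}.log'
os.makedirs(os.path.dirname(log_path), exist_ok=True)
print(f'Logging to {log_path}')
```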
Got a live one.
The sqlite dbs are only 8MB
AFAIK, it's just the stable diffusion webui cache files
It's strange because we'll run all 10 workers, and it will chew through the queue, but one will get stuck in this state.
Hi, did you fix this issue? I have the same problem.

If you want to get a closer look at your network volume, run this pod (it gives you a web file explorer view):
https://runpod.io/console/deploy?template=lkpjizsb08&ref=a57rehc6
Mount the network volume you want to work with when you deploy this template; it should be mounted to /workspace. By default the username and password will be "admin" and "admin".
Everything worked fine a few hours ago.
I didn't touch anything.
Why am I getting this issue now?
This issue happens with all my storages.
It's definitely an issue with RunPod or the template itself.
I think it's more of an A1111 issue when upgrading A1111 from one version to the next.
Seems to be the case, but I am not sure.
I think the SQLite DB becomes corrupted when upgrading because the structure changes, but that is just an assumption; someone will need to test it to confirm.
I just switched to fully containerized and dropped network storage altogether; it was too buggy. That dropped my bill from runaway processes from $25/day to $5/day.
Are you using https://github.com/ashleykleynhans/runpod-worker-a1111?
This issue is due to corrupt files within the venv. It seems to happen when you use more than one template for A1111 on the same network storage. It seems you can fix it as follows (see the sketch after these steps):
Step 1: Activate the venv
Step 2: Reinstall the torch modules and clear the __pycache__ files
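A rough version of those two steps in Python (my own sketch, assuming the venv lives at /workspace/venv; running the venv's own pip stands in for activating it):

```python
import pathlib
import shutil
import subprocess

VENV = pathlib.Path('/workspace/venv')  # assumed venv location on the volume

# Reinstall the torch packages using the venv's own pip
# (equivalent to activating the venv first and running pip).
subprocess.run(
    [str(VENV / 'bin' / 'pip'), 'install', '--force-reinstall',
     'torch', 'torchvision', 'torchaudio'],
    check=True,
)

# Clear out stale __pycache__ directories under the venv.
for pycache in VENV.rglob('__pycache__'):
    shutil.rmtree(pycache, ignore_errors=True)
```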
Thanks a lot Marcus
This problem has returned, even though I'm not using network storage at all.
Two separate days with a regular number of executions, regular execution time, and regular cold start time, but a huge bill.
I tried looking at logs, but I'm not finding anything of note that could be causing this.