Runpod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Serverless Dockerfile cache

Hi everyone, I need help. My RunPod serverless endpoint frequently rebuilds Docker images, causing the cache to invalidate and forcing a re-download of over 100GB of data. However, sometimes the cache works fine. This inconsistency is driving me crazy. How can I ensure Docker caching works reliably?...

Serverless build waiting for 1h

How do I start the build of my serverless project?

error creating container: nvidia-smi: parsing output of line 0: failed to parse (pcie.link.gen.max)

I've been testing the cold start, and it works 3-4 times, but then I get the above error. I'm using the serverless load balancer endpoints.

Insane delay times as of late

Hi, I've been experiencing really long delay times the last few days (2+ minutes). I can't really afford for this to happen, and I believe I've noticed this before after the pods were active for some time. I'm not sure if this is because of some leakage or something....

Severe performance disparity on RunPod serverless (5090 GPUs)

I’ve deployed workflows on RunPod serverless with 5090 GPUs, and the performance differences I’m seeing are concerning. Same endpoint, same model, same operation — yet the results vary a lot: Sometimes the workflow finishes in around 44 seconds...

ReadOnly Filesystem

Hi Runpod, are network storage volumes mounted in READ ONLY mode on serverless endpoints while running? I get this error when the cached model on the network storage tries to get updated with changes from Hugging Face. See attached log...
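If the volume really is mounted read-only, one workaround is to stop the loader from ever attempting an update: force offline mode so the cached copy is used as-is. A minimal sketch, assuming a transformers-style model at a hypothetical path on the volume:

```python
import os

# Hypothetical path: wherever the model was cached on the network volume.
MODEL_DIR = "/runpod-volume/models/my-model"  # assumption, not from the thread

# Tell huggingface_hub/transformers to never hit the network, so nothing
# tries to refresh (and therefore write to) the read-only cache.
os.environ["HF_HUB_OFFLINE"] = "1"

from transformers import AutoModel, AutoTokenizer

# local_files_only double-guards against update checks that would
# attempt a write into the read-only mount.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, local_files_only=True)
model = AutoModel.from_pretrained(MODEL_DIR, local_files_only=True)
```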

Incorrect configuration in worker-load-balancing example

In the documentation example here: https://github.com/runpod-workers/worker-load-balancing/tree/main it says to set these: PORT = 5000, PORT_HEALTH = 5001...

how to load multiple models using model-store

See title. As far as I can tell, we can currently only cache one model from Hugging Face.
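Until the model store supports more than one entry, a common workaround is to pre-fetch each model yourself at image build or cold start with huggingface_hub. A minimal sketch; the repo IDs below are placeholders:

```python
from huggingface_hub import snapshot_download

# Hypothetical model list; replace with the repos you actually need.
MODELS = [
    "Qwen/Qwen2.5-0.5B-Instruct",
    "sentence-transformers/all-MiniLM-L6-v2",
]

for repo_id in MODELS:
    # Downloads (or reuses) each repo in the shared HF cache, so one
    # worker can serve several models.
    path = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {path}")
```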

Stuck in queue, but workers available

Hi, I have a request stuck in queue even though I have 4 workers available. What's going on?...

How to configure auto scaling for load balancing endpoints?

From the documentation: "The method used to scale up workers on the created Serverless endpoint. If QUEUE_DELAY, workers are scaled based on a periodic check to see if any requests have been in queue for too long. If REQUEST_COUNT, the desired number of workers is periodically calculated based on the number of requests in the endpoint's queue. Use QUEUE_DELAY if you need to ensure requests take no longer than a maximum latency, and use REQUEST_COUNT if you need to scale based on the number of requests." From what I understand, load balancing endpoints don't have a queue, so how do I configure auto scaling to work with these serverless endpoints?...
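For queue-based endpoints, the scaler described above maps to scalerType/scalerValue fields on the endpoint itself. A hedged sketch against the GraphQL API; the mutation and field names are assumptions taken from the queue-based docs quoted above, should be verified against the current API reference, and may not apply to load-balancing endpoints at all:

```python
import os

import requests

# Assumed mutation shape; check the current Runpod API reference.
MUTATION = """
mutation SaveEndpoint($input: EndpointInput!) {
  saveEndpoint(input: $input) { id scalerType scalerValue }
}
"""

resp = requests.post(
    "https://api.runpod.io/graphql",
    params={"api_key": os.environ["RUNPOD_API_KEY"]},
    json={
        "query": MUTATION,
        "variables": {
            "input": {
                "id": "YOUR_ENDPOINT_ID",      # placeholder
                "scalerType": "REQUEST_COUNT",  # or "QUEUE_DELAY"
                "scalerValue": 4,               # e.g. ~4 requests per worker
            }
        },
    },
)
print(resp.json())
```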

Unable to connect to a serverless load balancing workers

I'm running a serverless load balancing endpoint for my FastAPI server, but when I send a request to the endpoint I get a 400 response after over two minutes. Moreover, the HTTPS services are marked unready and the web terminal is not starting. I have set PORT in the env variables to the same value my server is running on. I can't see errors anywhere. How can I fix that?...
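One frequent cause of "unready" services on load-balancing endpoints is a server bound to 127.0.0.1, which the proxy in front of the worker can't reach. A minimal FastAPI sketch that binds to 0.0.0.0 and reads the same PORT the endpoint config uses; the /ping and /run route names are assumptions, so match whatever your endpoint expects:

```python
import os

import uvicorn
from fastapi import FastAPI

app = FastAPI()


@app.get("/ping")  # assumed health-check path; match your endpoint's config
def ping():
    return {"status": "ok"}


@app.post("/run")  # hypothetical route for the actual work
def run(payload: dict):
    return {"echo": payload}


if __name__ == "__main__":
    # Bind to 0.0.0.0, not 127.0.0.1, and read the port from the same
    # PORT env var set on the endpoint.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8000)))
```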

Builds pending for hours, then failing with no logs

I've had the situation several times where builds are pending for hours and then stall with "Build Failed" and "No logs yet...". Re-running the build then often succeeds without any change to it. Is there any way to circumvent this?

Network volume selection has disappeared from the serverless endpoint creation process.

^the title. There's only this new "Model" option now, which seems like it'll be super cool once it's out of beta. Also, can I recommend something? If there were a way to see which Hugging Face models are "cached", that would be so cool too!
Solution:
Thanks for the bug report and feature request! You can attach a network volume after the serverless endpoint is created, and I've passed the feature request forward.

Pre-cached model selection doesn't appear to exist when creating a new serverless endpoint

The docs (https://docs.runpod.io/serverless/endpoints/manage-endpoints) say: """ Model (optional): Select a model from Hugging Face to optimize worker startup times. When you specify a model, Runpod attempts to place your workers on host machines that already have the model cached locally, resulting in faster cold starts and cost savings (since you won’t be charged while the model is downloading). You can either select from the dropdown list of pre-cached models or enter a custom Hugging Face model URL. """...

What is this?

error starting container: Error response from daemon: failed to create task for container: failed to create shim task: unable to write to a control group file /sys/fs/cgroup/docker/28efdc7a49e4ef7997c32dd12467dd5fbc8d6763db2da6357c5b87e10e513ff9/memory.oom_control, value [CREATE FILE] caused by: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }

AI Toolkit with Serverless

Is there a way to get the AI toolkit image that's available as a Pod template for serverless? I'm looking for a way to train WAN LoRAs with an endpoint, any help in this would be super appreciated.
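There's no official serverless variant of that template as far as I know, but a pod-style training image can usually be wrapped in a plain serverless handler. A hedged sketch using the runpod Python SDK; the ai-toolkit entrypoint and the job input shape are assumptions:

```python
import subprocess

import runpod  # Runpod's Python serverless SDK


def handler(job):
    """Hypothetical handler: wraps a pod-style training image so it can
    run as a serverless job. The entrypoint below is an assumption;
    adjust to however the Pod template actually launches training."""
    config_path = job["input"]["config_path"]  # e.g. a config on a network volume
    result = subprocess.run(
        ["python", "run.py", config_path],  # assumed ai-toolkit entrypoint
        capture_output=True,
        text=True,
    )
    return {"returncode": result.returncode, "stdout": result.stdout[-2000:]}


runpod.serverless.start({"handler": handler})
```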

Serverless down?

There is an error saying no GPU available, yet our worker is running and being charged. What is going on?

Please resolve this really urgent issue.

I'm unable to connect to my pod due to this issue: "This server has recently suffered a network outage and may have spotty network connectivity. We aim to restore connectivity soon, but you may have connection issues until it is resolved. You will not be charged during any network downtime." My server was running and it mustn't be stopped. Could you resolve this issue ASAP? My pod ID is "vjwinhaduxgt3w"...

No workers available in EU-SE-1 (AMPERE_48)

I deployed endpoint s7gvo0eievlib3 hours ago with storage attached. The build was fine and a release was created, but I don't have any workers assigned. The GPU is set to AMPERE_48, which is listed as High Supply. What am I doing wrong and how do I fix this?

Can't load a model from a network volume.

I'm trying to load a model from a network volume with my serverless worker via the MODEL_NAME environment variable, but even when setting up the template I get this error:
Failed to save template: Unable to access model '/workspace/weights/finexts'. Please ensure the model exists and you have permission to access it. For private models, make sure the HuggingFace token is properly configured.
...
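Worth noting: network volumes mount at /runpod-volume on serverless workers (they only appear at /workspace on Pods), so a /workspace/... path saved from a Pod often needs rewriting. A minimal sketch that tries both locations before loading:

```python
import os

from transformers import AutoModel

# On serverless the volume mounts at /runpod-volume; on Pods it is
# /workspace. Try both so the same code runs in either environment.
candidates = ["/runpod-volume/weights/finexts", "/workspace/weights/finexts"]
model_dir = next((p for p in candidates if os.path.isdir(p)), None)
if model_dir is None:
    raise FileNotFoundError(f"Model not found in any of: {candidates}")

# local_files_only avoids any Hugging Face lookup for a purely local path.
model = AutoModel.from_pretrained(model_dir, local_files_only=True)
```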