Runpod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

Why does my serverless endpoint download SDXL 1.0 from the Hugging Face Hub so slowly?

My project needs to download SDXL 1.0 from the Hugging Face Hub (about 7 GB), but after waiting 50 minutes the job is still queued and the log shows nothing. I'm using a 48 GB GPU with High Availability. Is there anything wrong? See the attachment for details.
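
A hedged sketch of the usual workaround for slow downloads at request time: fetch the SDXL weights when the Docker image is built (or onto a network volume) so the worker starts with the files already on disk. The target directory below is a placeholder, and the huggingface_hub package is assumed.

```python
# Sketch: pre-download SDXL 1.0 at image build time instead of at request time.
# "/models/sdxl-base-1.0" is a placeholder path; point it wherever your worker
# expects the weights.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",
    local_dir="/models/sdxl-base-1.0",
)
```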

I am getting no response from serverless

While I am testing my image, I get no response and no errors, and CPU usage is at 100%. Is it because I am using a small machine? Should I increase the size? From the logs, it seems like the process is restarting...
Solution:
Maybe you didn't call the runpod.serverless.start function?
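
For reference, a minimal sketch of a worker that does register a handler, assuming the runpod Python SDK; without the runpod.serverless.start(...) call the container boots but never picks up jobs, which matches the "no response" symptom.

```python
import runpod

def handler(job):
    # job["input"] carries whatever JSON was sent to /run or /runsync.
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}

# Without this call, the worker never starts polling for jobs.
runpod.serverless.start({"handler": handler})
```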

secure connections

I want to ensure all traffic between my app and the serverless backend is encrypted. Does the endpoint decrypt the traffic from the internet and transmit it in plaintext to the serverless container? Specifically, is the data in my prompt in clear text, even in memory, before it reaches the container?...
Solution:
In theory you could build your own worker whose input is an encrypted payload that is decrypted on the container itself, though you would need to write that code yourself.
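
A hedged sketch of that idea, assuming a Fernet key shared out-of-band (for example baked into the worker image or injected as a secret): the client sends ciphertext, and the worker decrypts it only inside the container.

```python
import os

import runpod
from cryptography.fernet import Fernet

# Hypothetical env var holding a key generated once with Fernet.generate_key().
fernet = Fernet(os.environ["WORKER_FERNET_KEY"])

def handler(job):
    # The queue and RunPod's API only ever see the encrypted token.
    token = job["input"]["encrypted_prompt"]
    prompt = fernet.decrypt(token.encode()).decode()
    # ... run inference on the decrypted prompt here ...
    return {"ok": True}

runpod.serverless.start({"handler": handler})
```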

Serverless capability check

I want to add RunPod to a tier of load-balanced LLM models behind an app like openrouter.ai, but the routing decision will occur in our infrastructure. When I invoke a serverless instance from my app and a task is completed, how am I billed for idle time if the container unloads the model from GPU memory? In other words, I want to reduce costs and increase performance by only needing to load the model after an idle timeout, paying only for the small app footprint in storage/memory...
Solution:
You are charged for the entire time the container is running, including cold start time, execution time, and idle timeout.

GPU memory usage is at 99% when starting the task.

I started to notice some GPU OOM failures today, and it's specific to this instance: A40 - 44adfw5inhfp98. When the job starts, it says the GPU utilization is at 99%. Did something change on RP?...

Should I wait for the worker to pull my image?

I have a large image (100 GB). Should I wait for the worker to pull the image before starting any inference?

Possible memory leak on Serverless

We're testing different Mistral models (cognitivecomputations/dolphin-2.6-mistral-7b and TheBloke/dolphin-2.6-mistral-7B-GGUF) and running into the same problem regardless of what size GPU we use. After 20 or so messages the model starts returning empty responses. We've been trying to debug this every way we know how, but it just doesn't make sense: the context size is around the same for each message, so it shouldn't be due to an increasing number of prompt tokens. What I've noticed is that even when the w...

Dockerless dev and deploy: does an async handler need to use async?

In handler.py of the HelloWorld project, there is no 'async' before def handler(job):. But on a serverless endpoint there are Run and RunSync. So if I want to use an async handler, is it necessary to write it like this, or am I misunderstanding something? async def async_generator_handler(job): for i in range(5): output = f"Generated async token output {i}"...
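
A minimal sketch of what the truncated snippet appears to be aiming at, assuming the runpod Python SDK, which accepts both plain and async generator handlers: a regular def handler(job) with a return works fine for Run and RunSync, while an async generator that yields chunks is what the streaming route consumes.

```python
import runpod

async def async_generator_handler(job):
    # Each yielded chunk becomes a piece of streamed output instead of a
    # single return value.
    for i in range(5):
        yield f"Generated async token output {i}"

runpod.serverless.start({"handler": async_generator_handler})
```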

Something broken at 1am UTC

Something was broken at 1am UTC which caused a HUGE spike in my cold start and delay times.

Should I use Data Centers or a Network Volume when configuring a serverless endpoint?

My project is an AI portrait app targeting global users. The advantage of using data centers is the ability to utilize GPUs from all data centers, while a network volume can speed up model loading times; however, GPU usage is then limited to the data center where the network volume is located. How should I choose?

Are stream endpoints not working?

This is a temp endpoint just to show you all. /stream isn't available, what's up?...

Postman returns either 401 Unauthorized, or when the request can be sent it comes back as Failed with an error

Postman reads the following when I send a runsync request from the RunPod tutorial (from generativelabs): "error": "Unexpected input. api_name is not a valid input option.\nUnexpected input. cfg_scale is not a valid input option.\nUnexpected input. email is not a valid input option.\nUnexpected input. negative_prompt is not a valid input option.\nUnexpected input. num_inference_steps is not a valid input option.\nUnexpected input. override_settings is not a valid input option.\nUnexpected input. prompt is not a valid input option.\nUnexpected input. restore_faces is not a valid input option.\nUnexpected input. sampler_index is not a valid input option.\nUnexpected input. seed is not a valid input option.\napi is a required input.\npayload is a required input." When I send a request from ashleyk's worker a1111 JSON, it always returns 401 Unauthorized. How can I solve this?...
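
A hedged sketch of the same runsync call made from Python rather than Postman. Two common causes of the errors above: the API key has to go in the Authorization header (otherwise 401), and the worker-specific fields have to be nested under "input" in exactly the shape the worker's schema expects. The endpoint ID and input fields below are placeholders.

```python
import os

import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

# The worker only sees what is nested under "input"; its schema decides which
# keys (e.g. "api", "payload") are required or rejected.
body = {"input": {"prompt": "a portrait photo"}}

resp = requests.post(url, headers=headers, json=body, timeout=300)
print(resp.status_code, resp.json())
```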

Text-generation-inference on serverless endpoints

Hi, I don't have much experience with either LLMs or Python, so I always just use the image 'ghcr.io/huggingface/text-generation-inference:latest' and run my models on Pods. Now I want to try serverless endpoints, but I don't know how to launch text-generation-inference on serverless endpoints. Can someone give me some tips, or maybe there are some docs which could help me?
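
One hedged way to approach this is to start the TGI server inside the worker container (for example from the image's entrypoint) and have a small handler proxy each job to it over HTTP. The local address and input fields below are assumptions; /generate is TGI's own API.

```python
import requests
import runpod

TGI_URL = "http://127.0.0.1:8080/generate"  # assumes TGI listens locally on 8080

def handler(job):
    payload = {
        "inputs": job["input"]["prompt"],
        "parameters": {"max_new_tokens": job["input"].get("max_new_tokens", 128)},
    }
    resp = requests.post(TGI_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()

runpod.serverless.start({"handler": handler})
```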

Cold Start Time is too long

When I test a HelloWorld project and run it, it takes too much time. The worker configuration is in the attachment. I have enabled FlashBoot, which says it can reduce cold start time to 2 s. In the documentation I see: "The Delay Time should be extremely minimal, unless the API process was spun up from a cold start, then a sizable delay is expected for the first request sent." Does "a sizable delay" mean that from a cold start it may be 12 s? Is there anything I've misunderstood? Please let me know.

What happened to the webhook graph?

There was a webhook graph for serverless but I can't seem to find it anymore. Was it removed for some reason?

How can I use more than 30 workers?

I've tested my task with 30 workers and realized that I need more. Is it possible to get 40 or more?...

What is the caching mechanism for RunPod Docker images?

Our Docker image is stored in AWS ECR. We've noticed that every time we update the Docker template on RunPod, our ECR costs increase rapidly. We've identified that this is because we use 80 RunPod instances, and these instances pull the image concurrently. We would like to ask about RunPod's image caching mechanism: if we let one RunPod instance pull the Docker image completely first before starting the other instances, will the other instances pull the image from your cache instea...

serverless deployment

I want to deploy my LLM on a serverless endpoint. How can I do that?
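
A hedged sketch of one common pattern: load the model once when the container starts, then serve prompts from the handler. The model name is only an example; swap in your own.

```python
import runpod
from transformers import pipeline

# Loaded once per worker, outside the handler, so it survives across requests.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def handler(job):
    prompt = job["input"]["prompt"]
    out = generator(prompt, max_new_tokens=job["input"].get("max_new_tokens", 128))
    return {"text": out[0]["generated_text"]}

runpod.serverless.start({"handler": handler})
```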

How to know when a request has failed

Hello everyone, I am using a webhook to be notified of job completion. I am wondering if this webhook is also called when a request fails. Or is there any other way to know whether a request has failed?...
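
Besides the webhook, the endpoint's /status route reports whether a job ended as COMPLETED or FAILED, so polling it is another way to catch failures. A short sketch; the endpoint ID and job ID are placeholders.

```python
import os

import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
JOB_ID = "your-job-id"            # placeholder
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}"
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json()["status"])  # e.g. IN_QUEUE, IN_PROGRESS, COMPLETED, FAILED
```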