RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

TTL for vLLM endpoint

Is there a way to specify a TTL value when calling a vLLM endpoint via the OpenAI-compatible API?
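
I'm not aware of a TTL parameter on the OpenAI-compatible route itself; the native /run endpoint does accept an execution policy, though. A minimal sketch, assuming the documented policy.ttl field (in milliseconds) and a placeholder endpoint ID:

import os
import requests

# Sketch: set a job TTL through the native /run endpoint's execution policy.
# ENDPOINT_ID is a placeholder; policy.ttl is assumed to be in milliseconds.
ENDPOINT_ID = "your-endpoint-id"

payload = {
    "input": {
        "messages": [{"role": "user", "content": "Hello"}],
        "sampling_params": {"max_tokens": 64},
    },
    "policy": {"ttl": 300_000},  # drop the job if it waits in queue longer than 5 minutes
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=30,
)
print(resp.json())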

Terminating local vLLM process while loading safetensor checkpoints

I started using Llama 3.1 70B as a serverless function recently. I got it to work, and the setup is rather simple: 2 x A100 GPUs, a 200 GB network volume, and 200 GB of container storage...
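
For reference, a minimal sketch of querying such a deployment through the endpoint's OpenAI-compatible route with the openai client; the endpoint ID and model name below are placeholders:

from openai import OpenAI

# Sketch: call a RunPod vLLM serverless endpoint via its OpenAI-compatible API.
# ENDPOINT_ID and the model name are placeholders for your own deployment.
ENDPOINT_ID = "your-endpoint-id"

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)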

Can we set public-read with rp_upload?

Using the boto module directly, I can set the following when doing an S3 upload:
ExtraArgs={'ACL': 'public-read'}
Is there a way I can apply that when using rp_upload? Below is an example of how I am using it:...
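
If rp_upload doesn't accept ExtraArgs, one workaround is to call boto3 directly with the same S3-compatible credentials the worker already uses; a sketch with placeholder endpoint, bucket, key, and credentials:

import boto3

# Sketch of a workaround: upload with boto3 directly so ExtraArgs (and the ACL)
# can be passed. The endpoint URL, bucket name, key, and credentials are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://your-s3-endpoint",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.upload_file(
    Filename="/tmp/result.png",
    Bucket="your-bucket",
    Key="outputs/result.png",
    ExtraArgs={"ACL": "public-read", "ContentType": "image/png"},
)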

Error starting container - CPU worker

At random times, system logs show the following error and the worker does not start. For the logs below, the worker ID is 1j3jppaqs9jp0k:
------
2024-09-11T10:53:06Z worker is ready
2024-09-11T10:53:06Z start container
2024-09-11T10:53:06Z error starting container: Error response from daemon: failed to create task for container: failed to create shim task: Could not create the sandbox resource controller mkdir /sys/fs/cgroup/memory/docker/kata_171b97b3360b8a8be09e406a07d6d9d6669ceb5c147d7b7fca64cb6cd1972e04: no space left on device: unknown...

Training Flux Schnell on serverless

Hi there, I am using your pods to run ostris/ai-toolkit to train Flux on custom images. Now I want to use your serverless endpoint capabilities. Can you help me out? Do you have some kind of template or guide on how to do it?

Training flux-schnell model

How do you manage to train a flux-schnell model using serverless? I have loaded the images using an S3 bucket, but what about the waiting time of the training process? Won't I get a timeout while waiting 20 minutes for the training to finish?
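
One way to avoid client-side timeouts on a 20-minute job is to submit it asynchronously via /run and poll /status, while the training itself runs inside the handler. A rough sketch, where run_training and the input field names are hypothetical placeholders for the actual ai-toolkit invocation:

import runpod

def handler(job):
    # Hypothetical training entry point; replace with the actual ai-toolkit call.
    # Long jobs are fine inside a handler, assuming the endpoint's execution
    # timeout is raised above the expected training time.
    images_s3_prefix = job["input"]["images_s3_prefix"]
    # run_training(images_s3_prefix)  # placeholder for the real training step
    return {"status": "done", "images": images_s3_prefix}

runpod.serverless.start({"handler": handler})

Submitting to /run returns a job ID immediately, so the client never holds a connection open for the full training run; /runsync is the route that would time out on a job this long.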

Unhealthy worker created on startup: the worker runs out of memory during startup.

The issue happens at random: when a worker starts up for a job, it has no memory to complete the job, so it hangs endlessly in an error state.

Streaming support in local mode

Is there a way to enable streaming in a local development environment? If I'm not mistaken it doesn't seem possible, which makes developing on RunPod very tedious, since the behavior on the local server and on the serverless platform differ. See the code here: https://github.com/runpod/runpod-python/blob/a433a296dbb903f448f7c3e9a275e960812fb60b/runpod/serverless/modules/rp_fastapi.py#L330 Streaming from FastAPI is possible, cf. https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse....
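
For context, streaming is normally expressed with a generator handler in the Python SDK; whether the local FastAPI test server exposes an equivalent /stream route is exactly the open question above. A minimal generator-handler sketch:

import runpod

def handler(job):
    # A generator handler yields partial results; on the platform these are
    # consumed through the /stream/{job_id} route.
    prompt = job["input"].get("prompt", "")
    for token in prompt.split():
        yield {"token": token}

runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,  # also collect yielded chunks into the final output
})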

Creating endpoint through runpodctl

Hello there. I am trying to test and deploy a basic serverless endpoint. Issues that I have and don't understand:
- If I create a serverless endpoint through the web, I don't need to select a network volume. In this case I think I am forced to use Docker.
- Because I don't want to explicitly use Docker, I am using runpodctl as described here: https://blog.runpod.io/runpod-dockerless-cli-innovation/ ...

Jobs in queue for a long time, even when there is a worker available

Hello, recently I've seen a lot of jobs getting stuck in the queue for a long time, even though my serverless endpoint has free workers left and the queue delay is set to 4 seconds. Does anyone have any experience with this? Any ideas why this happens? The first screenshot depicts two jobs submitted at the same time: one is picked up by a worker, and the other sits in the queue....

status: "IN_QUEUE", what can be the issue?

Hello! I'm using Postman to send POST requests with input images encoded in base64 to a serverless RunPod instance. However, regardless of the JSON structure I use, the error persists, and the workflow doesn't trigger. Can anyone help me troubleshoot this persistent issue?
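
A common cause of requests sitting in IN_QUEUE with no handler activity is a body that isn't wrapped in the top-level "input" key the worker expects, or a missing Authorization header. A sketch of the expected shape; the "image" field name is an assumption and must match whatever key the handler or ComfyUI workflow reads:

import base64
import os
import requests

# Sketch of the request shape a RunPod serverless endpoint expects: the payload
# is wrapped in a top-level "input" object and the API key goes in the
# Authorization header. The "image" key is a placeholder for your handler's field.
ENDPOINT_ID = "your-endpoint-id"

with open("input.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json={"input": {"image": image_b64}},
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=300,
)
print(resp.status_code, resp.json())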

Getting slow workers randomly

We're running a custom ComfyUI workflow on RTX 4090 instances with a volume attached to them. Around 70% of the time we get normal workers, where the delay time is around 8-10 seconds, but sometimes we get random slow workers. I attached a screenshot of the request logs: the worker with ID hv5rbk09kzckc9 takes around 11-12 seconds to execute the exact same ComfyUI workflow on the same GPU, whereas the worker with ID lgmvs58602xe61 takes 2-3 seconds. When we get a slow worker, it's slower in every aspect: GPU inference takes 5x longer, and ComfyUI import times take 7-8x longer than on a normal worker....

Collecting logs using API

Hi! Is it possible to grab serverless logs using the API? I see the website is using a hapi endpoint, but I'm not sure it's possible to do this directly (I didn't find it in the documentation). I can only see logs from the dashboard, which is not great for gathering them into a single log cluster on our main infrastructure.

Problems with serverless trying to use instances that are not initialized

We are regularly having problems with serverless endpoints. For some reason requests get routed to instances that are not actually initialized, but RunPod seems to think they are. This causes RunPod to show the entire 10-minute initialization as "Running" even though it's not.

Upgrading vLLM to v0.6.0

I was wondering what the process would be to upgrade the serverless worker code to be compatible with the latest version of vLLM.

Active workers or Flex workers? - Stable Diffusion

I'm integrating Stable Diffusion into a mobile application where user prompts are sent to RunPod for image generation, with the results sent back to the app. The usage is highly variable, ranging from 15 to 100 image generations per day, and there may be days with no usage at all. Given this variability, should I opt for active workers or flex workers in RunPod for the most efficient scaling and cost management? And in my case, what are flex workers and active workers each suitable for?...
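
A back-of-the-envelope comparison can settle this; the numbers below are hypothetical placeholders to swap for the actual GPU rate and your measured seconds per image:

# Rough cost comparison for active vs. flex workers.
# All numbers are hypothetical placeholders; substitute your real GPU price
# and measured end-to-end time per generation.
price_per_hour = 0.50      # assumed $/hr for the chosen GPU
seconds_per_image = 30     # assumed end-to-end time per generation
images_per_day = 100       # upper end of the stated 15-100 range

# Flex: billed only while a worker is executing (cold-start overhead ignored here).
flex_cost_per_day = images_per_day * seconds_per_image / 3600 * price_per_hour

# Active: one worker billed around the clock, regardless of traffic.
active_cost_per_day = 24 * price_per_hour

print(f"flex:   ${flex_cost_per_day:.2f}/day")
print(f"active: ${active_cost_per_day:.2f}/day")

At this kind of volume the trade-off for flex is usually cold-start latency rather than cost.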

I shouldn't be paying for this

It says it's running, but in reality it's still initializing; even the system logs are empty. I should not be charged for such dead containers!

Offloading multiple models

Hi guys, does anyone have experience with an inference pipeline that uses multiple models? I'm wondering how best to manage loading models that would exceed a worker's VRAM if everything stayed on the GPU at once. Any best practices or examples for keeping model load time as minimal as possible? Thanks!...
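
One common pattern is to load every model once into CPU RAM and move only the model needed for the current step onto the GPU, so the expensive disk load happens once per worker. A rough PyTorch sketch; the loader functions are placeholders:

import torch

# Sketch: keep models resident in CPU RAM and swap only the one needed for the
# current step onto the GPU. load_model_a / load_model_b would be your own loaders.
_models = {}

def get_model(name, loader):
    # Load from disk at most once per worker, then keep the weights in CPU RAM.
    if name not in _models:
        _models[name] = loader().to("cpu").eval()
    return _models[name]

def run_on_gpu(model, inputs):
    model.to("cuda")
    try:
        with torch.inference_mode():
            return model(inputs)
    finally:
        model.to("cpu")            # free VRAM for the next model
        torch.cuda.empty_cache()

Keeping weights pinned in CPU RAM avoids re-reading them from disk or the network volume on every request; only the host-to-device copy is paid on each swap.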

Increase Max Workers

Hey there, I'm currently setting up the RunPod team/account for our agency. We're planning to test RunPod for a serverless SD deployment within our agency. For this we need to increase the maximum number of workers we can assign to a serverless endpoint. Could someone from the RunPod team reach out to me via DM if possible? We're still in the middle of setting everything up, including the automatic payment system, so we're still stuck with the default limit....

generativelabs/runpod-worker-a1111 broken

error pulling image: Error response from daemon: pull access denied for generativelabs/runpod-worker-a1111, repository does not exist or may require 'docker login': denied: requested access to the resource is denied