Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

Runpod serverless for ComfyUI with custom nodes

I want to use two custom nodes with ComfyUI on Runpod serverless: ComfyUI_CatVTON_Wrapper. It requires the following dependencies:...
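
A minimal sketch of one way to bake a custom node into a ComfyUI image for serverless: clone the node repository into ComfyUI's `custom_nodes` directory and install its requirements at image build time. The install path and the repository owner below are placeholders, not values from the thread.

```python
# install_custom_nodes.py -- hypothetical helper run during the Docker image build.
# Assumes ComfyUI is already installed under /comfyui; adjust COMFY_DIR to your layout.
import subprocess
from pathlib import Path

COMFY_DIR = Path("/comfyui")  # assumed ComfyUI install location inside the image
NODES = [
    "https://github.com/<owner>/ComfyUI_CatVTON_Wrapper",  # placeholder owner
]

for repo in NODES:
    dest = COMFY_DIR / "custom_nodes" / repo.rstrip("/").split("/")[-1]
    if not dest.exists():
        subprocess.run(["git", "clone", "--depth", "1", repo, str(dest)], check=True)
    req = dest / "requirements.txt"
    if req.exists():
        # install the node's own Python dependencies into the image
        subprocess.run(["pip", "install", "-r", str(req)], check=True)
```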

How to deploy ModelsLab/Uncensored-llama3.1-nemotron?

I have tried to deploy this model: https://huggingface.co/ModelsLab/Uncensored-llama3.1-nemotron. By the way, I am facing a CUDA memory issue (I have tried 24 GB and 48 GB); it does not work. How do I fix this?...

Almost no 48GB Workers available in the EU

It looks like you're getting rid of A40s. There's no EU region that offers both the A40 and the A6000, which is terrible if you store data on Network Volumes. Is more capacity coming soon?...

GitHub integration: "exporting to oci image format" takes forever.

It's been running for over 30 minutes on this step. The same image builds in less than 5 minutes in GitHub Actions. Why does it take so long? This is the first build; would it be faster for subsequent builds (assuming there's some caching involved)? To me this is unusable, and I'd much rather do the build and push myself and just change the endpoint image version....

vllm worker OpenAI stream

Hi everyone, I followed the Runpod documentation to write simple OpenAI client code using a serverless endpoint for the LLaVA model (llava-hf/llava-1.5-7b-hf). However, I encountered the following error:
ChatCompletion(id=None, choices=None, created=None, model=None, object='error', service_tier=None, system_fingerprint=None, usage=None, code=400, message='As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.', param=None, type='BadRequestError')
...
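
One workaround, sketched below under assumptions, is to bypass the chat endpoint and call the plain completions endpoint with a manually formatted prompt, since the chat-template requirement only applies to chat completions. The endpoint ID, API key, and the hand-written LLaVA-style prompt are placeholders, and this assumes the worker exposes vLLM's `/v1/completions` route.

```python
# Hedged sketch: use the (non-chat) completions API so no chat template is needed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # placeholder endpoint ID
    api_key="<RUNPOD_API_KEY>",                                   # placeholder API key
)

# LLaVA-1.5 style prompt formatted by hand (assumption about the expected format).
prompt = "USER: Describe the weather in one sentence. ASSISTANT:"

resp = client.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    prompt=prompt,
    max_tokens=128,
)
print(resp.choices[0].text)
```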

Trying to work with: llama3-70b-8192 and I get out of memory

Hi, I am trying to work with the model llama3-70b-8192, but I can't deploy my serverless endpoint because it runs out of memory. I have attached a config screenshot. Please recommend other settings to make it work. [rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU Thanks...
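
For rough sizing: a 70B-parameter model in 16-bit precision needs on the order of 140 GB for the weights alone, before KV cache and runtime overhead, so it cannot fit on a single 24/48/80 GB GPU without quantization or tensor parallelism. A back-of-the-envelope check (the 20% overhead figure is a crude assumption):

```python
# Rough VRAM estimate for serving a 70B model in fp16/bf16 (order-of-magnitude only).
import math

params_billion = 70
bytes_per_param = 2                            # fp16 / bf16 weights

weights_gb = params_billion * bytes_per_param  # ~140 GB of weights
total_gb = weights_gb * 1.2                    # crude allowance for KV cache, CUDA context, etc.

print(f"weights ~{weights_gb:.0f} GB, total ~{total_gb:.0f} GB")
for gpu_gb in (24, 48, 80):
    print(f"{gpu_gb} GB GPUs needed (tensor parallel): {math.ceil(total_gb / gpu_gb)}")
```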

Increase serverless worker count beyond 30.

Dear Runpod team, we are in the process of transitioning our inference operations for aisuitup.com (an AI image generation service) from AWS to Runpod. To support our growing needs, we will require an increase in our serverless capacity beyond the current limit of 30. Please let us know the steps needed to facilitate this increase and any additional information or configurations required on our end....

Consistently timing out after 90 seconds

I'm not exactly sure why this is happening, and I don't think this happened earlier, but currently I'm consistently seeing requests timeout after 90 seconds. Max. execution time is set to 300 seconds, so this shouldn't be the issue. Is this a known problem?...
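
If the 90-second limit is coming from how the request is submitted rather than from the worker itself, one pattern worth trying (a sketch with placeholder endpoint ID and API key) is to submit the job asynchronously with `/run` and poll `/status`, instead of holding a single synchronous request open:

```python
# Hedged sketch: submit via /run and poll /status instead of waiting on one long request.
import time
import requests

ENDPOINT = "https://api.runpod.ai/v2/<ENDPOINT_ID>"    # placeholder endpoint ID
HEADERS = {"Authorization": "Bearer <RUNPOD_API_KEY>"}  # placeholder API key

job = requests.post(f"{ENDPOINT}/run", json={"input": {"prompt": "hello"}}, headers=HEADERS).json()
job_id = job["id"]

while True:
    status = requests.get(f"{ENDPOINT}/status/{job_id}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(2)  # poll every couple of seconds

print(status)
```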

Upload files to network storage

I use network storage to store LoRA files. Can I automate the process of uploading the assets I need to the network storage? It is used with serverless. I can't use my own S3 storage because the speed would be much slower....
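
One way to automate this, sketched below, is to temporarily attach the network volume to a pod with SSH enabled and push the files over SFTP. The hostname, key path, username, and the assumption that the volume is mounted at `/workspace` are all placeholders to adapt.

```python
# Hedged sketch: upload LoRA files to a pod that has the network volume mounted at /workspace.
from pathlib import Path
import paramiko

HOST, PORT = "<pod-ssh-host>", 22   # placeholders: taken from the pod's SSH connection details
KEY_FILE = "~/.ssh/id_ed25519"      # placeholder private key path

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(HOST, port=PORT, username="root", key_filename=str(Path(KEY_FILE).expanduser()))

sftp = client.open_sftp()
try:
    sftp.mkdir("/workspace/loras")  # assumed target directory on the volume
except IOError:
    pass                            # directory already exists
for lora in Path("./loras").glob("*.safetensors"):
    sftp.put(str(lora), f"/workspace/loras/{lora.name}")
sftp.close()
client.close()
```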

Serverless problems since 10.12

I have been using serverless for a few months and it has been quite stable, but since 10.12 all my requests fail after 25 seconds. I have already tried all the different settings, but in the end the process stops after 25 seconds and I get an error. I changed nothing in Docker or in my files; the settings have been the same for weeks. { "delayTime": 4967, "error": "Error queuing workflow: <urlopen error [Errno 111] Connection refused>",...
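
The `[Errno 111] Connection refused` while queuing the workflow often means the handler contacted the local ComfyUI server before it finished starting. A hedged sketch of a readiness check the handler could run first, assuming ComfyUI is listening on its default `127.0.0.1:8188`:

```python
# Hedged sketch: wait for the local ComfyUI HTTP server before queuing any workflow.
import time
import urllib.error
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # ComfyUI's default address (assumption)

def wait_for_comfyui(timeout_s: float = 120.0, interval_s: float = 0.5) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(COMFY_URL, timeout=2):
                return               # server answered, safe to queue workflows now
        except (urllib.error.URLError, OSError):
            time.sleep(interval_s)   # not up yet, retry
    raise RuntimeError("ComfyUI did not become reachable in time")

wait_for_comfyui()
```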

Git LFS on Github integration

When using the new Github integration workflow, I noticed corrupted large files, so I wanted to make sure that you had Git LFS installed in the environment that pulls the Git repositories. Correct?

Using runpod serverless for HF 72b Qwen model --> seeking help

Hey all, I'm new to this and tried loading an HF Qwen 2.5 72B variant on Runpod serverless, and I'm having issues. Requesting help from Runpod veterans, please! Here's what I did:...

Docker Image EXTREMELY Slow to load on endpoint but blazing locally

This is the first time I'm encountering this issue with a serverless endpoint. I've got a Docker image that loads the model (Flux Schnell) very fast, and it runs a job fairly fast on my local machine with a 4090. When I use a 4090 on Runpod, though, the image gets stuck at loading the model: ```self.pipeline = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)...
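
One common cause of this pattern is the weights being fetched or re-resolved at runtime instead of read from a path already inside the image. A hedged sketch: pre-download the checkpoint at image build time, then load strictly from that local path in the handler; the `/models/flux-schnell` path is an assumption, not something from the thread.

```python
# Hedged sketch: bake the Flux weights into the image, then load them with no network access.
import torch
from diffusers import FluxPipeline
from huggingface_hub import snapshot_download

LOCAL_DIR = "/models/flux-schnell"   # assumed path inside the image

# --- at image build time ---
snapshot_download("black-forest-labs/FLUX.1-schnell", local_dir=LOCAL_DIR)

# --- at runtime, inside the handler ---
pipeline = FluxPipeline.from_pretrained(
    LOCAL_DIR,
    torch_dtype=torch.bfloat16,
    local_files_only=True,           # fail fast instead of silently re-downloading
)
```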

Constantly getting "Failed to return job results."

```
{
  "endpointId": "mbx86r5bhruapo",
  "workerId": "r23nc1mgj01m13",
  "level": "error",
  ...
```

Why are my serverless endpoint requests waiting in the queue when there are free workers?

This has been happening: when two people make a request at the same time, the second user's request waits in the queue until the first request is completed, instead of going to another worker. I have 4 workers available on my endpoint, so that's not the issue. I set the queue delay to 1 second, since that's the lowest possible, but it doesn't do anything. Is the serverless endpoint supposed to work in production?

Github integration

@haris I'm trying the new GitHub integration. It says it grants "Read and write access to code" permissions. Why does the GitHub integration require WRITE access to code?

Is VLLM Automatic Prefix Caching enabled by default?

Hello! I set up a serverless quick deployment for text generation, and I was wondering whether vLLM Automatic Prefix Caching is enabled by default. Also see: https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html ...
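
For reference, in vLLM's own Python API Automatic Prefix Caching is controlled by an engine argument, shown in the sketch below; whether the Runpod quick-deploy template sets it is exactly the open question here, and the default has varied across vLLM versions. The model name is a placeholder.

```python
# Sketch: enabling Automatic Prefix Caching explicitly via vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,                   # vLLM's APC switch (default varies by version)
)

params = SamplingParams(max_tokens=64)
# Requests that share a long common prefix can reuse cached KV blocks for that prefix.
out = llm.generate(["<long shared system prompt> Question: ..."], params)
print(out[0].outputs[0].text)
```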

vllm worker OpenAI stream timeout

The OpenAI client code from the tutorial (https://docs.runpod.io/serverless/workers/vllm/openai-compatibility#streaming-responses-1) is not reproducible. I'm hosting a 70B model, which usually has a ~2-minute delay per request. Using the OpenAI client with stream=True times out after ~1 minute and returns nothing. Any solutions?...
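
If the cutoff is happening on the client side, the OpenAI Python client accepts an explicit `timeout`; a hedged sketch, with a placeholder endpoint ID and model name, that makes the limit generous before streaming:

```python
# Hedged sketch: set an explicit client-side timeout so a slow cold start isn't aborted.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # placeholder endpoint ID
    api_key="<RUNPOD_API_KEY>",                                   # placeholder API key
    timeout=600,                                                  # seconds; make the limit explicit
)

stream = client.chat.completions.create(
    model="<YOUR_70B_MODEL>",                                     # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```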

VLLM model loading, TTFT unhappy path

I am looking for a way to reduce the latency on the unhappy path of vLLM endpoints. I use the quickstart vLLM template, backed by network storage for the model weights, with FlashBoot enabled. By default the worker loads the model weights on the first request. This, however, poses the risk of exposing my customers to an unhappy path of latency measured in minutes; at scale we could see this in significant absolute numbers. What would be the best way for me to make sure that a worker is considered ready only >after< it has loaded the model checkpoints, and to trigger checkpoint loading without sending the first request? Should I roll my own vLLM container image, or is there an idiomatic way to parametrize the quickstart template to achieve this? I would prefer to use the Runpod-supplied, properly supported vLLM image, if possible....
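
With a custom worker image, one way to guarantee a worker only accepts jobs after the weights are resident is to load the model at module import time, before registering the handler with the Runpod SDK; this sketch assumes a custom worker rather than the quickstart template (which is the trade-off being asked about) and uses a placeholder model path on the network volume.

```python
# Hedged sketch (custom worker): load weights eagerly so the worker isn't ready until they're in memory.
import runpod
from vllm import LLM, SamplingParams

# Module-level load runs before runpod.serverless.start(), so no job is picked up
# until the checkpoint (here, read from the attached network volume) is fully loaded.
llm = LLM(model="/runpod-volume/models/my-model")   # placeholder path on the network volume

def handler(job):
    prompt = job["input"]["prompt"]
    out = llm.generate([prompt], SamplingParams(max_tokens=256))
    return {"text": out[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```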

can't pull image from dockerhub

2024-12-11T12:28:04Z 257642480b4e Extracting [==================================================>] 33.06GB/33.06GB
2024-12-11T12:28:04Z failed to pull image: failed to register layer: archive/tar: invalid tar header
@Zeke...