Runpod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!



Help: Serverless Mixtral OutOfMemory Error

I can't get Mixtral-8x7B-Instruct to run on Serverless using the vLLM RunPod worker, neither with the base model from Mistral nor with any of the quantized models. Settings I'm using: GPU: 48GB (also tried 80GB), Container Image: runpod/worker-vllm:0.3.0-cuda11.8.0...

Can we add minimum GPU configs required for running popular models like Mistral and Mixtral?

I'm trying to find out which serverless GPU configs are required to run Mixtral 8x7B-Instruct, either quantized (https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ) or the main model from Mistral. It would be good to have this info in the README of the vLLM worker repo. I run into OutOfMemory issues when trying it on a 48GB GPU...
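
A rough back-of-the-envelope estimate explains the OutOfMemory errors in both posts above. Mixtral-8x7B has roughly 46.7B parameters in total, and all experts stay resident in GPU memory even though only two are active per token, so unquantized FP16/BF16 weights alone need close to 90 GB before the KV cache is counted. The sketch below uses approximate figures, not official requirements:

```python
# Rough VRAM estimate for Mixtral-8x7B-Instruct weights (approximate figures only).
TOTAL_PARAMS = 46.7e9  # ~46.7B parameters across all 8 experts

def weight_memory_gb(bytes_per_param: float) -> float:
    """Memory needed just for the model weights, excluding KV cache and activations."""
    return TOTAL_PARAMS * bytes_per_param / 1024**3

print(f"FP16/BF16 : {weight_memory_gb(2.0):6.1f} GB")   # ~87 GB -> OOM on 48 GB, very tight on 80 GB
print(f"GPTQ 4-bit: {weight_memory_gb(0.5):6.1f} GB")   # ~22 GB -> should fit on 48 GB with KV-cache headroom
```

In other words, the unquantized model realistically needs multiple GPUs, while the GPTQ build should fit on a single 48 GB card with room left for the KV cache.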

Serverless 404

Hi there, I'm getting a 404 error when sending requests in a development session (runpodctl project dev). Everything worked great locally using --rp_serve_api; the only difference is that I changed the URL from localhost to https://api.runpod.ai/v2/my_pod_id/runsync and added the authentication key to accommodate the deployment. I'm using Postman to send the request. Has anyone faced this problem? Can't figure out what I'm doing wrong...

Unacceptably high failed jobs suddenly

Suddenly almost 20% of my serverless jobs failed. I have never had this issue until yesterday. It is completely UNACCEPTABLE that I am being charged for this immense fuck-up and that my customers are being impacted. This needs to be resolved IMMEDIATELY and I demand a refund for this!

Two Network Volumes

I'd like to have an endpoint with a network volume, but across data centers. I don't mind if the network volumes are in a different state since they act like a cache; however, I'd like my traffic to be load balanced. Is that possible?...

container start command troubleshooting

Hello, I am trying to create a template with the following start command:
```
apt-get update && apt-get upgrade -y && \
apt-get install -y git nodejs npm jq nano vim python3-pip python3-dev && \
npm install -g pm2 && ...
```

Active worker keeps downloading images and I'm being charged for it

Why is it that a worker will finish downloading, extracting, and initializing, then get into a 'worker is ready' state, only to go back to downloading when it receives a job? It's just wasting credits at this point... and it's fairly frustrating.

Webhook problem

Some requests (about 5%) are not received by our webhook. The strange thing is that we receive updates, but they come with unknown task IDs that we can't find in our DB.

optimize ComfyUI on serverless

I have ComfyUI deployed on RunPod serverless, so I send the JSON workflows to RunPod and receive the generated images in return. Right now, all my models are stored in a network volume. However, I read that loading the models from a network volume is not optimal. In each workflow, I use either Stable Diffusion 1.5 or Stable Diffusion XL. My 1.5 and SDXL workflows always share some models (such as the checkpoint) but otherwise require different models with each request. I am thinking about the following options to optimize further: 1. bake almost all the models, except the LoRAs, into one Docker image (about 30 GB)...

Problem when writing a multiprocessing handler

Hi there! I have an issue when I try to write a handler that processes 2 tasks in parallel (I use ThreadPoolExecutor). I use the transformers library from HF for loading the models and LangChain to run the inference. I tested my handler on Google Colab and it works well, so I created my Docker template and created an endpoint in RunPod, but when it comes to the inference I constantly get an error: CUDA error: device-side assert triggered, which I don't get when I test the handler on Colab. How can I handle that, and in particular, what can cause this error? I use a 48GB GPU (which is more than sufficient for my models, which take around 18 GB in total), so it can't be a resource issue....
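
For context, a minimal sketch of a handler that fans a job out to two parallel tasks with ThreadPoolExecutor is shown below. The run_model_a / run_model_b functions are hypothetical placeholders for the transformers/LangChain inference calls, not part of the RunPod SDK; a device-side assert is usually an indexing or dtype problem inside the model code rather than a handler-structure problem, so the sketch only illustrates the parallel wiring:

```python
from concurrent.futures import ThreadPoolExecutor

import runpod


def run_model_a(prompt: str) -> str:
    # Placeholder for the first model's inference call (e.g. a transformers pipeline).
    return f"model A output for: {prompt}"


def run_model_b(prompt: str) -> str:
    # Placeholder for the second model's inference call.
    return f"model B output for: {prompt}"


def handler(event):
    prompt = event["input"]["prompt"]
    # Run both inference tasks concurrently and wait for both results before returning.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(run_model_a, prompt)
        future_b = pool.submit(run_model_b, prompt)
        return {"model_a": future_a.result(), "model_b": future_b.result()}


runpod.serverless.start({"handler": handler})
```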

Idle time: High Idle time on server but not getting tasks from queue

I'm testing workers with a high idle time so they stay alive and pick up new tasks, but the worker shows idle and finished while not getting new tasks from the queue. Is there any event or state I need to add to the handler?...

Is there a programmatic way to activate servers for high-demand / peak-hour load?

We are testing serverless for a production deployment next month. I want to ensure we will have server capacity during peak hours. We'll have some active workers, but we need to guarantee capacity for certain peak hours. Is there a way to programmatically activate the servers?...

Increasing costs?

Guys, the last few days seem to show an increase in cost without a spike in active usage. Do you have any idea why that might be?

[URGENT] EU-RO region endpoint currently only processing one request at a time

We have a production endpoint running in the EU-RO region, but despite us having 21 workers 'running', only one request seems to be getting processed at a time. This is causing delays and timeout errors for our requests. Can we get some help please?...

Returning error, but request has status "Completed"

Hello, I'm using validate() from rp_validator to validate my input data against a schema. The relevant line of code that triggers the error is:
```
validated_input = validate(input, INPUT_SCHEMA)
```
...
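
One detail worth checking for the "Completed despite an error" behaviour: the job status is driven by what the handler returns. A minimal sketch, assuming rp_validator's validate() returns a dict with an errors key on failure and that returning a top-level error field is what marks a job as failed (both worth confirming against the current SDK docs):

```python
import runpod
from runpod.serverless.utils.rp_validator import validate

INPUT_SCHEMA = {
    "prompt": {"type": str, "required": True},
}


def handler(event):
    # Validate the incoming payload against the schema.
    validated = validate(event["input"], INPUT_SCHEMA)

    if "errors" in validated:
        # Returning an "error" field (instead of normal output) is what should flag
        # the job as failed rather than leaving it marked "Completed".
        return {"error": validated["errors"]}

    job_input = validated["validated_input"]
    return {"output": f"received prompt: {job_input['prompt']}"}


runpod.serverless.start({"handler": handler})
```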

Can I emulate hitting serverless endpoints locally?

So far I've been testing my RunPod serverless code locally by executing the handler directly:
python -u handler.py
but is there any way to emulate hitting the serverless endpoint locally?...
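
One option, consistent with the --rp_serve_api flag mentioned in an earlier post: start the handler as a local HTTP server and send it ordinary POST requests. A minimal sketch, assuming the local server listens on localhost:8000 and exposes the same /runsync route as the hosted endpoint (port and path are worth confirming in the SDK docs):

```python
import requests

# Start the handler in another terminal first, e.g.:
#   python handler.py --rp_serve_api
# then hit the local endpoint the same way you would the hosted /runsync route.
payload = {"input": {"prompt": "hello from a local test"}}

response = requests.post("http://localhost:8000/runsync", json=payload, timeout=60)
print(response.status_code)
print(response.json())
```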

All 27 workers throttled

Our company needs stable availability of a minimum of 10 workers. Quite recently, most or even all of our workers have been throttled. We have already spent more than $800-1,000 on your service and would be grateful for a stable number of the workers we request. IDs: 6lxilvs3rj0fl7, 97atmaayuoyhls. Our customers have to wait for hours...

I'm using the SDXL serverless endpoint and sometimes I get an error.

The error message is this:
RuntimeError: expected scalar type Float but found Half, Stack Trace: <traceback object at 0x7f779ace2a00>
...
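
This error usually means float32 ("Float") and float16 ("Half") tensors are being mixed in the same operation, for example a VAE or custom component left in full precision while the rest of the pipeline runs in half precision. The sketch below shows the usual mitigation of loading every component in a single dtype, assuming a diffusers SDXL pipeline; the actual worker behind the managed endpoint may be implemented differently:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Loading every component in the same dtype avoids mixing float32 ("Float")
# and float16 ("Half") tensors inside a single operation.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe(prompt="a cute magical flying dog, fantasy art drawn by disney concept artists").images[0]
image.save("out.png")
```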

API Wrapper

curl -X POST https://api.runpod.ai/v2/stable-diffusion-v1/run \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' \
  -d '{"input": {"prompt": "a cute magical flying dog, fantasy art drawn by disney concept artists"}}'
...
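
For anyone calling this from Python instead of curl, an equivalent request could look like the sketch below. The endpoint path and payload are copied from the curl example above, and the API key is a placeholder:

```python
import requests

API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder, use your own key
ENDPOINT_URL = "https://api.runpod.ai/v2/stable-diffusion-v1/run"

payload = {
    "input": {
        "prompt": "a cute magical flying dog, fantasy art drawn by disney concept artists"
    }
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

# /run is asynchronous: the response contains a job id to poll via /status,
# rather than the generated image itself.
response = requests.post(ENDPOINT_URL, json=payload, headers=headers, timeout=30)
print(response.json())
```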