Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

Run Mixtral 8x22B Instruct on vLLM worker

Hello everybody, is it possible to run Mixtral 8x22B on the vLLM worker? I tried running it with the default configuration on 48 GB GPUs (A6000, A40), but it's taking too long. What are the requirements for running Mixtral 8x22B successfully? This is the model I'm trying to run: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
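For rough sizing: Mixtral 8x22B has roughly 141B parameters, so the fp16 weights alone are on the order of 280 GB, far beyond a single 48 GB A6000/A40. Below is a minimal sketch of a multi-GPU load with the offline vLLM API, assuming a pod with several 80 GB GPUs (the tensor_parallel_size value is an assumption about that hardware); on the serverless vLLM worker the equivalent settings are normally supplied as endpoint environment variables instead.

```
# Hedged sketch: Mixtral 8x22B needs its ~280 GB of fp16 weights sharded across
# several large GPUs; a single 48 GB card cannot hold it.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    tensor_parallel_size=4,   # e.g. 4 x 80 GB GPUs, adjust to your pod (assumption)
    dtype="bfloat16",
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```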

Output guidance with vLLM Host on RunPod

Greetings! I've been using vLLM on my homelab servers for a while, and I'm looking to add the ability to scale my application using RunPod. On my locally hosted vLLM instances, I use output guidance via the "outlines" guided decoder to constrain LLM output to specified JSON Schemas or regexes. One question I haven't been able to find an answer to: does RunPod support this functionality with serverless vLLM hosting through the OpenAI API? (I assume it supports it with pods if you set up your own instance of vLLM)...
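For reference, this is roughly how guided decoding is requested through vLLM's OpenAI-compatible server; whether the serverless vLLM worker forwards these extra fields is exactly the open question, and the base URL and model name below are placeholders.

```
# Hedged sketch: vLLM's OpenAI-compatible server accepts guided decoding options
# (e.g. guided_json) via extra_body. Endpoint id / model name are assumptions.
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)
resp = client.chat.completions.create(
    model="<MODEL_SERVED_BY_WORKER>",
    messages=[{"role": "user", "content": "Return a person as JSON."}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)
```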

Serverless broke for me overnight, I can't get inference to run at all.

Hi, I was using runpod/worker-vllm:stable-cuda12.1.0 in my production app with the model TheBloke/dolphin-2.7-mixtral-8x7b-AWQ. There appears to have been an update in the last 24 hours or so that broke my app completely. I have since spent the last six hours trying to get ANYTHING out of ANY endpoint, and I just can't get anything running. Prior to today, this was running uninterrupted for over a month. I have tried:
- Rolling back to runpod/worker-vllm:0.3.1-cuda12.1.0
- Swapping out models; I tried easily 8 or 9 different ones, mostly Mixtral variants, including AWQ, GPTQ, and unquantized models.
...

Please focus on usability

Guys, it's 2024. I expect a service that costs thousands of dollars a month to add the four Tailwind classes it takes to make this UX work on mobile. And if some of the settings are invalid, this dialog shouldn't close when the error is raised, forcing me to re-enter the whole lot again. Your documentation is also a gigantic mess of broken links. ...

Incredibly long queue for CPU Compute on Toy Example

I am trying to run the code in this blog post: https://blog.runpod.io/serverless-create-a-basic-api/. The wait time for this simple function to execute has been above 5 minutes. There are two items in the queue that are not moving. There are no errors in the logs. It appears stuck in the "Initializing" state with no workers spinning up. How can I fix this? Also, when I tried to create the endpoint, the UI would not allow me to select the template I created earlier....
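For context, the blog post's toy example boils down to the standard handler pattern sketched below (reconstructed from the SDK's usual shape, not copied from the post); the issue here is workers not starting, not the handler itself.

```
# Minimal RunPod serverless handler along the lines of the "basic API" example
# (a sketch; the exact input fields in the blog post may differ).
import runpod

def handler(job):
    name = job["input"].get("name", "world")
    return {"greeting": f"Hello, {name}!"}

runpod.serverless.start({"handler": handler})
```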

How to WebSocket to Serverless Pods

@Merrell / @flash-singh Wondering if I can get code pointers on how to use the API to expose ports on serverless programmatically, so that I can do things like WebSockets?

Docker build inside serverless

Hey all, I am pretty new to serverless, Python, and Docker. I am trying to build an image that I want to run on RunPod, but I am running into the issue that some of the dependencies cannot be built on my MacBook Pro or in GitHub Actions. So I thought, why not build the image on RunPod as well? I created the following handler.py (see message below)...

Running fine-tuned faster-whisper model

Hello. Is it possible to run a fine-tuned faster-whisper model using RunPod's faster-whisper endpoint? Furthermore, does it work at the scale of hundreds of users using it at the same time?...
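As far as I know, the managed faster-whisper endpoint serves the stock model sizes, so a fine-tune most likely needs a custom worker; a minimal sketch along those lines is below (the model id and input fields are placeholders). Scaling to hundreds of concurrent users is then a matter of the endpoint's max-worker and concurrency settings.

```
# Hedged sketch: loading a fine-tuned faster-whisper (CTranslate2) model in a
# custom serverless worker.
import runpod
from faster_whisper import WhisperModel

# Accepts a local CTranslate2 model directory or a Hugging Face repo id.
model = WhisperModel("your-org/your-finetuned-whisper-ct2",
                     device="cuda", compute_type="float16")

def handler(job):
    segments, info = model.transcribe(job["input"]["audio_path"])
    return {"language": info.language,
            "text": " ".join(seg.text for seg in segments)}

runpod.serverless.start({"handler": handler})
```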

Understanding serverless & pricing. Use case: a1111 --api with ControlNet on serverless

Right now I'm running some version of JuggernautXL with ControlNet. I spend about 20 seconds generating one image, sometimes less, on an A5000. However, sometimes the response time on my endpoints is very slow. I'm trying to figure out if this is purely because of cold start time, or because my request sits in the queue before landing on a GPU. I guess my questions are:
- How can I see when a request is in the queue vs. cold starting, and is queue time billed? How do I control queue time?
- Does anyone have experience reducing cold start time for a1111 serverless requests, or maybe a diffusers setup without a1111 for serverless that works really well? ...
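On the last point, a bare-diffusers pipeline is one way to cut cold start, since there is no WebUI/API layer to boot. A rough sketch follows (model ids are assumptions; swap in a JuggernautXL checkpoint and your own control image).

```
# Hedged sketch: SDXL + ControlNet directly with diffusers, no A1111.
import torch
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",   # or a JuggernautXL checkpoint
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

control_image = Image.open("canny_map.png")       # pre-computed Canny edge map
image = pipe(prompt="a product photo", image=control_image,
             num_inference_steps=30).images[0]
image.save("out.png")
```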

Problem with serverless endpoints in Sweden

My jobs are failing with "executionTimeout exceeded" after more than 5 minutes, for jobs that shouldn't take more than 2 minutes to run.
```
{
  "delayTime": 892,
  "error": "executionTimeout exceeded",
  ...
```
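For what it's worth, the execution timeout is configured on the endpoint and, as far as I know, can also be overridden per request through the job policy; the field names below are my assumption of that payload, so check the serverless docs before relying on them.

```
# Hedged sketch: submitting a job with an explicit execution-timeout policy.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/run",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={
        "input": {"prompt": "..."},
        "policy": {"executionTimeout": 600_000},  # milliseconds (assumed field name)
    },
)
print(resp.json())
```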

Serverless Error Kept Pod Active

I have an LLM deployed on RunPod serverless. Due to a certain error that occurred after a request, the pod got stuck on {"requestId": null, "message": "Failed to get job, status code: 502", "level": "ERROR"}. This kept the pod active and therefore caused me to lose money. Shouldn't the pod deactivate automatically if errors persist for a certain time?...

Is it possible to SSH into a serverless endpoint?

I can SSH into the box when I deploy Docker images to regular pods. But is it possible to SSH into them when deploying to serverless endpoints? I want to be able to troubleshoot and inspect the built image, but I'm unsure how to SSH into it. For regular pods I just click the "Connect" button, and there I can see the IP address and the command to SSH. How do I do it with serverless?

How to authenticate with Google Cloud in a Docker container running on Serverless?

I'm trying to authenticate using a service account JSON key file in a Docker container so I can store objects in GCS. I've added the JSON file's content as a Secret, but without success. Am I missing something, or how would you advise me to authenticate? Update: It looks like the entire JSON file's content doesn't fit in a Secret, which explains why it doesn't work. Still, I'd like to find a way to authenticate....
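One workaround sketch, assuming the secret size limit is the blocker: base64-encode the key file so it fits in a single secret/environment variable and decode it at startup. The env var name, bucket, and paths below are made up for illustration.

```
# Hedged sketch: rebuild service-account credentials from a base64-encoded
# secret, then write an object to GCS.
import base64, json, os
from google.cloud import storage
from google.oauth2 import service_account

info = json.loads(base64.b64decode(os.environ["GCP_SA_KEY_B64"]))
creds = service_account.Credentials.from_service_account_info(info)
client = storage.Client(credentials=creds, project=info["project_id"])

client.bucket("my-bucket").blob("outputs/result.png").upload_from_filename("result.png")
```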

When updating handler.py using "runpodctl project deploy", the old worker does not auto-update the handler

Although handler.py in the deploy directory on the volume has been updated, if I use 'run' to test, the old worker (one that has run before) will still output "world", not "world-2". I have to remove the old worker and then click 'run' again for the new worker to output correctly....

Webhooks in RunPod

I want to trigger an API call to my server when a request has finished processing, using a webhook. I found that there is a webhook parameter, but I didn't understand how to use it or how I can test it locally.
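For reference, the webhook is a URL passed alongside the input on an async /run request; RunPod then POSTs the job result to it when the job finishes. A sketch is below (endpoint id and callback URL are placeholders; to test locally you'd typically expose your server through a tunnel such as ngrok).

```
# Hedged sketch: registering a webhook on a serverless job submission.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/run",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={
        "input": {"prompt": "hello"},
        "webhook": "https://your-server.example.com/runpod-callback",
    },
)
print(resp.json()["id"])   # job id; the result is also POSTed to the webhook URL
```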

Reduce serverless execution time

My RunPod API generates images from text. Image generation takes just 1 second, but the API call to upload the generated image takes 4 seconds. How can I reduce that? We have to provide the image file to the user.
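One option (a sketch, not the only answer): skip the in-worker upload and return the image as base64 in the handler response, letting your backend store or serve it. The generate_image call below is a hypothetical stand-in for the actual text-to-image step.

```
# Hedged sketch: return the image inline instead of uploading from the worker.
import base64, io
import runpod
from PIL import Image

def generate_image(prompt: str) -> Image.Image:
    raise NotImplementedError  # placeholder for the actual text-to-image call

def handler(job):
    image = generate_image(job["input"]["prompt"])
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image_base64": base64.b64encode(buf.getvalue()).decode()}

runpod.serverless.start({"handler": handler})
```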

Everything is crashing and burning today [SOLVED] + DEV image with beta 1.0.0preview feedback

Today the testing on Serverless vLLM has been a very bad experience. It is extremely unstable. Out of the blue we started getting the error message:
The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (7456). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
...
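The message points at two vLLM engine options; for reference, this is what they look like with the offline vLLM API. On the serverless vLLM worker they are normally set as endpoint environment variables (e.g. MAX_MODEL_LEN / GPU_MEMORY_UTILIZATION, which is my assumption of the names).

```
# Hedged sketch of the two knobs the error message mentions.
from vllm import LLM

llm = LLM(
    model="<MODEL_ID>",
    max_model_len=7168,           # keep the context window under the KV-cache budget
    gpu_memory_utilization=0.95,  # or give the KV cache more headroom
)
```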

Not all workers being utilized

In the attached image you can see 11/12 workers spun up, but only 7 are being utilized, yet we're being charged for all 12 GPUs. @girishkd
[screenshot attached]

runpodctl command to display serverless endpoint id

Is there a command in the runpod package or runpodctl that outputs the serverless endpoint ID?
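For the SDK route, the runpod Python package can list endpoints and their ids; the method name below is from memory, so treat it as an assumption and verify against the SDK docs.

```
# Hedged sketch: listing serverless endpoints (and their ids) via the Python SDK.
import runpod

runpod.api_key = "<RUNPOD_API_KEY>"
for ep in runpod.get_endpoints():
    print(ep.get("id"), ep.get("name"))
```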

How to stream via OPENAI BASE URL?

Does the OPENAI BASE URL support Server-Sent Events (SSE) streaming? I was previously working with Ooba and streaming was working fine. Since we switched to vLLM/Serverless it is no longer working. If this is not done via SSE, is there perhaps any tutorial you could recommend for how to achieve streaming, please?...
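For reference, the vLLM worker's OpenAI-compatible route is meant to behave like the regular OpenAI API, so stream=True should come back as SSE chunks; the base URL pattern below is the one I understand worker-vllm uses, so treat it as an assumption for your endpoint.

```
# Hedged sketch: streaming chat completions through the OpenAI-compatible route.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)
stream = client.chat.completions.create(
    model="<MODEL_SERVED_BY_WORKER>",
    messages=[{"role": "user", "content": "Stream me a haiku."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```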