Run Mixtral 8x22B Instruct on vLLM worker
Output guidance with vLLM Host on RunPod
Serverless broke for me overnight; I can't get inference to run at all.
I'm running runpod/worker-vllm:stable-cuda12.1.0 in my production app with the model TheBloke/dolphin-2.7-mixtral-8x7b-AWQ. There appears to have been an update in the last 24 hours or so that broke my app completely. I have since spent the last six hours trying to get ANYTHING out of ANY endpoint, and I just can't get anything running. Prior to today, this had been running uninterrupted for over a month. I have tried:
- Rolling back to runpod/worker-vllm:0.3.1-cuda12.1.0
- Swapping out models; tried easily 8 or 9 different ones, mostly Mixtral variants. I have tried AWQ, GPTQ, and unquantized models.
...Please focus on usability
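To isolate whether the failure is in the endpoint plumbing or the model itself, a minimal synchronous request is a useful first probe. This is a sketch assuming the standard RunPod serverless REST API and the worker-vllm input schema; the endpoint ID is a placeholder and the sampling_params pass-through is an assumption about the worker image:

```python
import os
import requests

ENDPOINT_ID = "abc123"  # placeholder; use your endpoint ID from the RunPod console
API_KEY = os.environ["RUNPOD_API_KEY"]

# /runsync blocks until the job finishes, so failures surface immediately
# instead of hiding behind an async job id.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Say hello.",
            # Assumption: the worker forwards these fields to vLLM's SamplingParams.
            "sampling_params": {"max_tokens": 32, "temperature": 0.0},
        }
    },
    timeout=120,
)
print(resp.status_code, resp.json())
```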

Incredibly long queue for CPU Compute on Toy Example
How to WebSocket to Serverless Pods
Docker build inside serverless
handler.py (See message below)...
Running fine-tuned faster-whisper model
Understanding serverless & pricing. Use case: a1111 --api with control on serverless
Problem with serverless endpoints in Sweden
Serverless Error Kept Pod Active
Is it possible to SSH into a serverless endpoint?
How to authenticate with Google Cloud in a Docker container running Serverless?
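For Google Cloud auth inside a serverless container, one common pattern (standard Google Cloud client behavior, not RunPod-specific) is to inject the service-account JSON through an endpoint environment variable or secret, write it to disk at startup, and point GOOGLE_APPLICATION_CREDENTIALS at it. The variable name GCP_SA_KEY_JSON below is illustrative:

```python
import json
import os

# Assumed env var holding the raw service-account JSON, set as an
# endpoint secret in the RunPod console; the name is illustrative.
key_json = os.environ["GCP_SA_KEY_JSON"]

# Google client libraries discover credentials via a file path,
# so write the key out and export the standard variable.
key_path = "/tmp/gcp-sa.json"
with open(key_path, "w") as f:
    f.write(key_json)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path

# Any google-cloud-* client created after this point picks up the key.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client(project=json.loads(key_json)["project_id"])
```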
When updating handler.py using "runpodctl project deploy", the old worker does not auto-update the handler
Webhooks in RunPod
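If I recall the serverless REST API correctly, an async /run request accepts a top-level webhook field, and RunPod POSTs the job result to that URL on completion; treat the field name as an assumption and verify against the current docs:

```python
import os
import requests

ENDPOINT_ID = "abc123"  # placeholder endpoint ID

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    json={
        "input": {"prompt": "Say hello."},
        # Assumption: RunPod delivers the finished job payload to this URL
        # via HTTP POST, so no polling of /status is needed.
        "webhook": "https://example.com/runpod-callback",
    },
)
print(resp.json())  # returns the job id; the result arrives at the webhook
```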
Reduce serverless execution time
Everything is crashing and burning today [SOLVED] + DEV image with beta 1.0.0preview feedback
The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (7456). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
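The error message names both remedies. Here is a minimal sketch of applying them with vLLM's offline LLM API; the worker image likely exposes the same knobs as environment variables (e.g. MAX_MODEL_LEN and GPU_MEMORY_UTILIZATION), but those names are an assumption to check against the worker's README:

```python
from vllm import LLM

# Either cap the context below what the KV cache can hold (< 7456 here)
# or raise gpu_memory_utilization so the cache gets more VRAM.
llm = LLM(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",
    quantization="awq",
    max_model_len=4096,           # down from the model's default 8192
    gpu_memory_utilization=0.95,  # vLLM's default is 0.90
)
```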

Not all workers being utilized

runpodctl command to display serverless endpoint id
How to stream via the OpenAI base URL?
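Assuming the worker-vllm image's OpenAI-compatible route (the /openai/v1 path under the endpoint is my recollection of the URL shape; verify against the docs), streaming works through the standard openai client:

```python
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    # Assumed URL shape for the worker's OpenAI-compatible route.
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

stream = client.chat.completions.create(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",  # the model your endpoint serves
    messages=[{"role": "user", "content": "Stream a short greeting."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```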