We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Can I emulate hitting serverless endpoints locally?

So far I've been testing my RunPod serverless worker locally by executing the Python handler directly (`python -u`), but is there any way to emulate hitting the serverless endpoint locally?...
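One option is a stdlib-only sketch that wraps a handler in a tiny HTTP server to emulate the `/runsync` request/response shape locally; the handler, port, and payload fields below are hypothetical stand-ins, and this mimics the endpoint's shape, not RunPod's actual infrastructure. (Recent versions of the `runpod` Python SDK also ship a `--rp_serve_api` flag that serves a similar local test API; check your SDK version.)

```python
# Stdlib-only sketch: emulate a /runsync serverless endpoint locally by
# wrapping a handler function in a small HTTP server.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handler(job):
    # Hypothetical handler: echo the prompt back.
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}

class LocalEndpoint(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/runsync":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        job = json.loads(self.rfile.read(length))
        # Wrap the handler output in a /runsync-style envelope.
        result = {"status": "COMPLETED", "output": handler(job)}
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), LocalEndpoint).serve_forever()
```

You can then POST `{"input": {...}}` to `http://localhost:8000/runsync` with curl or your client code, the same way you would hit the real endpoint.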

All 27 workers throttled

Our company needs stable availability of a minimum of 10 workers. Recently, most or even all of our workers have been throttled. We've already spent $800-1000 on your service and would be grateful for a stable number of requested workers. IDs: 6lxilvs3rj0fl7, 97atmaayuoyhls. Our customers have to wait for hours...
Friendly 💖 Tyrese, 2/21/2024

I'm using SDXL serverless endpoint and sometimes I get an error.

error message is this:
```
RuntimeError: expected scalar type Float but found Half, Stack Trace: <traceback object at 0x7f779ace2a00>
```

API Wrapper

```
curl -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' \
  -d '{"input": {"prompt": "a cute magical flying dog, fantasy art drawn by disney concept artists"}}' ...
```

Deploy from docker hub stuck

I have a basic even/odd container that takes a number as input and responds whether it's even or odd. I've uploaded the container to Docker Hub: phmagic/runpod-test:latest. When I go to set up a new serverless pod, it asks for the container image and I put in phmagic/runpod-test:latest. All requests hang for more than 400s. I can't seem to get even the basic example to work, and the documentation is very spotty about how to do this.
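Requests hanging forever usually means the container never picks jobs off RunPod's queue: a serverless image must run a handler loop rather than just expose an HTTP port. A minimal sketch of an even/odd handler (the job payload shape is an assumption; the real image would also install and start the `runpod` SDK as shown in the trailing comment):

```python
# Sketch of an even/odd handler for a RunPod serverless worker.
# A plain web container will sit idle, because RunPod delivers jobs via
# its queue to a running handler loop, not via inbound HTTP.

def handler(job):
    """Return whether job["input"]["number"] is even or odd."""
    n = int(job["input"]["number"])
    return {"result": "even" if n % 2 == 0 else "odd"}

# In the real image (assuming the `runpod` Python SDK is installed):
# import runpod
# runpod.serverless.start({"handler": handler})
```

The Dockerfile's entrypoint then runs this script, so the worker registers with the queue as soon as the container starts.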

Serverless on Active State behaviour

Some APIs I was using on serverless worked in both active and idle state before; now switching to active seems to break the server, and the response is always the same as the previous one, or just "finished". I want to debug what is happening. Can someone explain how state works internally in the handler after it wakes up? What stays in memory? ...

LLM inference on serverless solution

Hi, I need some suggestions on serving an LLM on serverless. I have several questions: 1. Is there a guide or example project I can follow to run inference efficiently on RunPod serverless? 2. Is it recommended to use frameworks like TGI or vLLM with RunPod? If so, why? I'd like maximum control over the inference code, so I haven't tried any of those frameworks. Thanks!...
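Whatever framework you pick, the key serverless pattern is to load the model once per worker rather than once per request, so warm workers skip the expensive load. A framework-agnostic sketch (the loader is a placeholder; swap in transformers, vLLM, etc.):

```python
# Sketch: lazy, once-per-worker model loading for a serverless handler.
# The "model" here is a hypothetical placeholder callable; replace
# load_model() with your real framework's loading code.

_MODEL = None

def load_model():
    # e.g. transformers: AutoModelForCausalLM.from_pretrained(...)
    return lambda prompt: f"(generated continuation of: {prompt})"

def get_model():
    global _MODEL
    if _MODEL is None:          # runs once per worker, not per request
        _MODEL = load_model()
    return _MODEL

def handler(job):
    prompt = job["input"]["prompt"]
    return {"text": get_model()(prompt)}
```

Keeping full control of `handler` like this is what you give up some of when adopting TGI or vLLM; what those frameworks buy you in exchange is batching and optimized attention kernels for higher throughput.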

Serverless Pricing

Is the delay time also included in the charges? Is there a way to know the total time the worker was operating, excluding the delay time and execution time? I want to charge my customers for the total time they use my service....
There isn't really an accurate way of determining cold start time + execution time automatically, unfortunately. You have to look at the metrics for your endpoint and try to determine a baseline.
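For per-request numbers, the endpoint's status route does report timings you can sum yourself. A sketch, where the field names (`delayTime`, `executionTime`, both in milliseconds) are assumptions based on typical RunPod status payloads and should be checked against an actual response from your endpoint:

```python
# Sketch: estimate billable time per request from the /status payload.
import json
import urllib.request

def billable_seconds(status: dict) -> float:
    """Delay (queue/cold start) plus execution time, in seconds."""
    return (status.get("delayTime", 0) + status.get("executionTime", 0)) / 1000

def fetch_status(endpoint_id: str, job_id: str, api_key: str) -> dict:
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Summing `billable_seconds(fetch_status(...))` over a customer's jobs gives a usage figure you can bill against, even if it won't exactly match RunPod's own metering.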

Broken serverless worker - can't find GPU

Serverless worker qbw30nmknd6cmh is broken and can't find the GPU.
```json
{
  "dt": "2024-02-19 23:34:37.252459",
  "endpointid": "qbw30nmknd6cmh",
  ...
}
```

How do multiple GPU priorities assign workers to me?

I'm wondering what algorithm is behind GPU selection when I have, say, 3 GPU types selected. For example, even if 4090s are my first priority and that choice throttles 7/10 of my workers, it seems to keep them there. So I reassigned the priorities and reset my workers to see if I'd get a better distribution that doesn't rely so heavily on 4090s, but I'm wondering what the algorithm is even doing with these priorities...

RunPod API npm package doesn't work

I'm trying to call RunPod with the npm `api` package:
```
const sdk = require('api')('@runpod/v1.0#18nw21lj8lwwiy');
sdk.healthCheck({endpoint_id: 'yy'})
```
...

How do I expose my api key and use CORS instead?

I want to make it so that all requests from a certain domain to my serverless endpoint are allowed. I suppose I don't mind exposing my API key if I can ensure that only requests from that domain are allowed, right? How would I do this? I want to serve a Comfy workflow on a serverless endpoint, and I think I can use to set up the endpoint itself. It would be really helpful if a) someone could let me know whether this is possible, and if so, b) outline the general steps I need to take to accomplish it....
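One caution before the steps: CORS only tells browsers which origins may *read* a response; it does not authenticate anyone, so a key shipped to the browser can be replayed by any script or curl command regardless of CORS headers. The usual fix is a small proxy that holds the key server-side. A stdlib sketch, where the endpoint URL, allowed origin, and port are placeholders:

```python
# Sketch: keep the RunPod API key server-side behind a tiny proxy.
# The browser calls this proxy; the proxy adds the Authorization header.
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

RUNPOD_KEY = os.environ.get("RUNPOD_API_KEY", "")
ENDPOINT = "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync"  # placeholder
ALLOWED_ORIGIN = "https://your-site.example"                    # placeholder

def build_upstream_request(body: bytes) -> urllib.request.Request:
    """Attach the secret key on the server, never in the browser."""
    return urllib.request.Request(
        ENDPOINT, data=body, method="POST",
        headers={"Authorization": f"Bearer {RUNPOD_KEY}",
                 "Content-Type": "application/json"})

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        with urllib.request.urlopen(build_upstream_request(body)) as upstream:
            payload = upstream.read()
        self.send_response(200)
        # Lets pages on ALLOWED_ORIGIN read the reply; this is not auth.
        self.send_header("Access-Control-Allow-Origin", ALLOWED_ORIGIN)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Proxy).serve_forever()
```

So the general steps are: deploy the Comfy workflow to a serverless endpoint as usual, stand up a proxy like this next to your site, and have the frontend call the proxy instead of the RunPod API directly.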

Worker Errors Out When Sending Simultaneous Requests

I was benchmarking a serverless endpoint by sending 10 simultaneous requests to an endpoint that has two active workers, and one of the workers keeps erroring out with the attached stack trace. After this error happens, 9 requests become stuck In Progress, and if I terminate the errored worker and spin up a new one, I get the same stack trace unless I manually clear out the In Progress requests. This endpoint is using a Llama 2 70B model with image runpod/worker-vllm:0.2.3...
Figured my issue out: I needed MAX_CONCURRENCY set to 5, otherwise all requests were going to only one node.
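For a custom (non-vLLM) handler, the same idea is expressed in the handler config rather than an env var. A sketch following the runpod Python SDK's pattern, where the `concurrency_modifier` option name should be verified against the SDK version you're running:

```python
# Sketch: let one worker take several jobs at once. This requires an
# async handler; the concurrency callback caps jobs in flight.
import asyncio

async def handler(job):
    await asyncio.sleep(0)        # stand-in for real async inference
    return {"ok": True}

def concurrency_modifier(current_concurrency: int) -> int:
    # Allow up to 5 jobs in flight per worker (mirrors MAX_CONCURRENCY=5).
    return 5

# In the worker (assuming the `runpod` SDK):
# import runpod
# runpod.serverless.start({"handler": handler,
#                          "concurrency_modifier": concurrency_modifier})
```

Without something like this, each worker processes one job at a time, so a burst of simultaneous requests piles onto the queue even when a worker has headroom.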

Quick Deploy Serverless Endpoints with ControlNet?

Are there currently any quick deploy serverless endpoints with ControlNet? Or would it require a custom docker image?

Mixtral Possible?

Wondering if it's possible to run AWQ-quantized Mixtral on serverless with good speed.

Estimated time comparison - Comfy UI

Hi everyone, I've been looking at the various GPU options for serverless, and I'm trying to find a rough estimate of how many times faster or slower each GPU would be, if that's even possible to calculate. There's no exact formula, obviously, but I'm wondering if anyone has had similar experiences. In my case, it takes around 355 seconds to run my workflow on my local machine (RTX 3080 Ti)....

Any plans to add other inference engine?

Hi, I'm using the vLLM worker now, but it works poorly with quantized models: too much VRAM usage, slow inference, poor output quality, etc. So, are there any plans to add other engines like TGI or ExLlamaV2 (exl2)?...

Are there any options to retrieve container logs via API?

Need it for monitoring purposes.

Serverless scaling

I'm considering using RunPod for commercial use. I need reliable, relatively cheap scaling for this to work, but I've heard that, at least a few months ago, serverless was very unreliable, i.e. not allocating GPUs for hours or days at a time. I don't want to figure out how to deploy on RunPod only to realize it's unreliable. What is your take on this right now? Is there any evidence that these problems have been fixed?