Runpod · 5mo ago
3WaD

Queue Delay Time

What is a normal delay time currently? I remember that previously it was normal to have delay times in milliseconds, and container startups were near instant even on the coldest of cold starts. But lately I have been observing queue delays I don't recognize, to the point where my vLLM image can fully initialize the engine on a cold start and compute a full response in almost the same time RunPod takes just to start the container. The same goes for SDXL, although it's a bit better there. Does container image size affect this? But then why would it also happen on warm requests? This makes the platform nearly unusable. Creating a new endpoint or downgrading the RunPod SDK version didn't help, and overall there's nothing I could find that would let me influence this further. And as I said, it's not happening only on cold starts; the delay on a warm worker is even more extreme, as it's sometimes longer than the execution time itself. (See screenshots)

*Please note that the delay time in this case is truly only the queue delay, as I had to move the initialization (loading models etc.) into the handler, where it's counted as execution, because I wanted to allow users to change the vLLM configuration per cold start via the request payload.* Any help would be appreciated, since this is a major deal breaker.
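For reference, moving initialization into the handler and re-reading the engine configuration from the request payload looks roughly like the sketch below. This is only an illustration, assuming the standard `runpod` Python SDK entry point; the payload keys and vLLM arguments are made up for the example, not the official template's schema.

```python
# Sketch: per-request-configurable engine init inside the handler, assuming
# the standard `runpod` SDK. Payload keys and vLLM arguments are illustrative.
import runpod
from vllm import LLM, SamplingParams

_engine = None       # cached across warm requests on the same worker
_engine_cfg = None   # config the cached engine was built with


def _get_engine(cfg: dict):
    """(Re)build the vLLM engine only when the requested config changes."""
    global _engine, _engine_cfg
    if _engine is None or cfg != _engine_cfg:
        _engine = LLM(model=cfg.get("model", "facebook/opt-125m"),
                      max_model_len=cfg.get("max_model_len", 2048))
        _engine_cfg = cfg
    return _engine


def handler(job):
    inp = job["input"]
    # Engine init happens here, inside the handler, so it is reported as
    # execution time rather than delay time.
    engine = _get_engine(inp.get("engine_config", {}))
    params = SamplingParams(max_tokens=inp.get("max_tokens", 128))
    outputs = engine.generate(inp["prompt"], params)
    return {"text": outputs[0].outputs[0].text}


runpod.serverless.start({"handler": handler})
```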
134 Replies
3WaD
3WaDOP5mo ago
vLLM coldstarts:
[screenshots]
3WaD
3WaDOP5mo ago
SDXL (Fooocus) coldstarts:
[screenshots]
3WaD
3WaDOP5mo ago
vLLM warm request:
[screenshot]
3WaD
3WaDOP5mo ago
I normally use the Request Count scaling strategy set to 1, as recommended in the docs, but it happens with the Queue Delay strategy too.
3WaD
3WaDOP5mo ago
What queue delay times do you observe on your endpoints? I'm trying to understand if this is normal or if I'm doing something wrong. I'm using roughly 20GB images with models baked in. I thought that might affect it because of loading from disk, but then why would it happen on warm requests too?
3WaD
3WaDOP5mo ago
Maybe. But as I said, it's happening on fresh new endpoints too, so perhaps they would have to check the whole account 😀
3WaD
3WaDOP5mo ago
I will try storing the models in a network volume today to see if there's a change. But it would really help to know if others observe this too.
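For reference, a minimal sketch of what reading models from a network volume looks like from inside the worker, assuming the volume is mounted at the usual `/runpod-volume` path; the directory names are illustrative only.

```python
# Sketch: prefer models on the attached network volume, fall back to the copy
# baked into the image. /runpod-volume is the usual serverless mount point;
# the directory names below are illustrative.
import os

VOLUME_DIR = "/runpod-volume/models"
IMAGE_DIR = "/app/models"  # hypothetical path baked into the image


def resolve_model_dir() -> str:
    if os.path.isdir(VOLUME_DIR) and os.listdir(VOLUME_DIR):
        return VOLUME_DIR   # loaded over the network-attached volume
    return IMAGE_DIR        # loaded from the container's own disk layer


MODEL_DIR = resolve_model_dir()
print(f"Loading models from {MODEL_DIR}")
```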
Poddy
Poddy5mo ago
@3WaD
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #18647
3WaD
3WaDOP5mo ago
Not yet. I first wanted to know if other users have this problem too.
River
River5mo ago
So, the answer here is that we do caching on a best-effort basis, and if there is a high amount of demand from other customers, unfortunately we have to clear the cache more often, which causes higher delays.
3WaD
3WaDOP5mo ago
So the delay is caused at the job-scheduling level in your system, and has to be cached in order to be fast? There's nothing I can do on my side to reduce it? How many requests does it usually take to cache it? Because, as stated, it's also happening on warm worker requests, right after the previous ones.
River
River5mo ago
The general rule is that smaller models load faster, and it isn't about the number of requests but the delay between the requests. Can I ask how much delay there is between requests on your end?
3WaD
3WaDOP5mo ago
The usage is inconsistent and not super frequent, especially during development. That's why we use serverless in the first place, right?

@River To make it clear: based on the email you've sent me, these replies, and the RunPod AI (good thing you have at least that), the recommendation is only the standard one, to pick different GPUs and use them more frequently? Shouldn't there then be something like "delay-based priority GPU selection" that would initialize workers on the GPUs with the highest availability and lowest scheduling time on the endpoint? Why are we supposed to monitor availability (which we can't even do) and manage this ourselves on a serverless platform? Other customers' traffic on a specific GPU you picked making your product unavailable doesn't sound very production-ready.
eric.mattmann
eric.mattmann5mo ago
Indeed… we have started looking elsewhere
3WaD
3WaDOP5mo ago
@River I've created a tiny testing container with just a minimal working handler to exclude the size and software of my custom images as factors. It's still happening: on all data centres, on multiple GPU types (I even tested the recommended L40S stated in the email), on all request types (sync, async, OpenAI), and on a fresh new endpoint. On a cold start it takes 6-8+ seconds just to start the container. Even when you spam the warm worker with requests, the best-case delay is something like 10x what it used to be in the past. This state is nowhere near the marketed <250ms (which was true in the past; I remember the same endpoint with the same container performing like that), and it is unfortunately unusable for low-latency tasks. Because other users share the same problem and are even leaving the platform because of it, perhaps it deserves a bit of attention.
[screenshot]
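For context, a "minimal working handler" of the kind described above is roughly the following sketch (using the documented `runpod` SDK entry point), so any delay measured on it is essentially pure queue/platform time rather than model loading.

```python
# Minimal echo handler used to isolate platform queue delay: no models,
# no work outside the handler, response in a few milliseconds.
import runpod


def handler(job):
    # job["input"] is whatever was sent in the request payload
    return {"echo": job["input"], "status": "ok"}


runpod.serverless.start({"handler": handler})
```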
Dj
Dj5mo ago
@yhlong00000 Wanna take a look at this?
3WaD
3WaDOP5mo ago
So far, I've been able to achieve good delay times only when spamming active workers. But even in that case, the first request is still bad, which doesn't make much sense: the container should already be running and ready, that's what the user pays for.
[screenshot]
3WaD
3WaDOP5mo ago
This is the delay when spamming non-active warm workers:
[screenshot]
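A rough sketch of how these numbers can be collected from the client side, assuming the public `/runsync` endpoint and that the response body reports `delayTime` and `executionTime` in milliseconds as the console does; the endpoint ID and API key are placeholders.

```python
# Rough sketch: spam an endpoint with /runsync requests and record the
# delayTime / executionTime fields reported back (both in ms). Endpoint ID
# and API key are placeholders; field names assume the usual runsync output.
import time
import requests

ENDPOINT_ID = "your_endpoint_id"
API_KEY = "your_api_key"
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

results = []
for i in range(20):
    r = requests.post(URL, headers=HEADERS, json={"input": {"ping": i}}, timeout=120)
    body = r.json()
    results.append((body.get("delayTime"), body.get("executionTime")))
    time.sleep(2)  # mimic light, non-batch traffic

for delay_ms, exec_ms in results:
    print(f"delay={delay_ms} ms  execution={exec_ms} ms")
```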
3WaD
3WaDOP5mo ago
Also, one could ask why we care about the cold-start/first request delays so much when spamming a lot of requests seems to be usable. Serverless is supposed to scale to 0 and its core idea is for users/products with unpredictable usage patterns. Very often, especially when RunPod experiences high traffic, the workers don't stay warm for a long time, sometimes they even do only one request and shift away from the endpoint. This can result in most of the requests being cold-starts, and having up to 10s delay time on top of the actual cold-start time of the container is simply not usable for production.
yhlong00000
yhlong000005mo ago
Hey, I totally understand your frustration, let me try to break it down a bit.

When we talk about cold starts, there's actually quite a bit happening behind the scenes. After you send the initial request, our system puts it into a queue and signals a worker (somewhere in a data center) to wake up. That process includes starting the container, loading your model from disk into GPU VRAM, and initializing everything. Once the worker is ready, it pulls the request from the queue, and all of these steps contribute to the delay time. So the larger your image or model, the longer the cold start will be. There's no real magic to make that happen in just a few hundred milliseconds.

For subsequent requests, we use something called FlashBoot (or "warm workers" as you've probably seen). That helps reduce start time since we can cache things on our end. But as you noticed, it's not guaranteed; it's more of a best-effort system. It depends on factors like:
• How popular the GPU you're using is
• How frequent your traffic is

In general, if you're on a less popular GPU and sending frequent requests, your worker has a higher chance of being cached, and you could see delay times drop to around 0.25–2 seconds.

The goal of serverless is to achieve speed and low cost, but the reality is: if you want fast performance and are running large models, you'll likely need to keep active workers around and a higher idle timeout, and that comes with higher costs. Basically: fast, big model, and cheap; you can usually only pick two.
yhlong00000
yhlong000005mo ago
The delay time you see in the UI or output reflects the time from when you send the request to when the worker is fully awake and picks it up from the queue.
3WaD
3WaDOP5mo ago
*Sigh.* It really feels like all the text I'm writing doesn't matter. Container cold-start time, which includes things like model loading or initialization of the software running in it, and the job-queue delay caused by the platform are two different things. You can refer to them as one, but there's a huge difference: the user has control over the first. And if you read the previous message before your reply, I showed how I made a minimal handler based on a slim Python image, with no models, no software running in it, and just a dummy response executing in a few milliseconds (see the screenshots), yet it behaves the same. So stating that the job-queue delay is caused by model loading time, or even that the job is started only after the software in the container is initialized, seems to be simply wrong.

What is not addressed in these generic responses is why the delay can be 0.25s or as high as the 3s I observed on FlashBoot warm requests. Or the main issue I talked about: why every first request to a new worker stays 6+ seconds in the queue. This can happen even when spamming/using the endpoint very actively, or even when having active workers. A single worker can simply shift out (it happens a lot), and once the request is sent to a different one, you get this behaviour. It wasn't happening in the past, and it's not happening on competing platforms; that's why users switch to them.

I understand you have a lot of issues to solve and a lot of work to do. But could you please consider treating this issue as one coming from a developer who spends his free time and his own money to make community implementations and promote your platform? It's very hard to continue doing that otherwise.
Xeverian
Xeverian5mo ago
Could you name other platforms that have serverless endpoints that run requests from custom Docker containers? I tried looking for them, but I'm having a hard time finding one.
3WaD
3WaDOP5mo ago
I would rather see this resolved than promote the competition here. But since you asked, the users messaging me about this problem usually mentioned switching to beam.cloud.
3WaD
3WaDOP5mo ago
So it's behaving the same for you. If the delay were around 1 second for all requests, it would be fine. The biggest problem is the first one. As I said, the workers on popular GPUs shift often, so with non-batch usage you can experience this on most of the requests.
zongheng1619
zongheng16195mo ago
I saw that all of my workers are throttled.
Xeverian
Xeverian5mo ago
Inconsistent delay and execution times, for the same type of load each time.
[screenshot]
Xeverian
Xeverian5mo ago
by the way, are we billed for the delay time? Or just the execution time?
Xeverian
Xeverian5mo ago
so the cold start time is already included in execution time shown in the requests tab and we pay exactly for that, is that correct?
Xeverian
Xeverian5mo ago
is it included in the delay time?
3WaD
3WaDOP5mo ago
It's more complicated and confusing. The delay time is the job-queue delay plus anything that runs before the handler function. So if you put e.g. model initialization outside it, as recommended in the docs, it's counted as delay time but still billed. If you move it into the handler function, it's counted as execution.
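To make that split concrete, here is a minimal annotated skeleton, assuming the standard `runpod` SDK; the comments describe where each part shows up in the reported times, per the explanation above.

```python
# Annotated skeleton showing where time is attributed, per the discussion
# above: module-level work runs before the first job is picked up and shows
# up as "delay time" (but is still billed); work inside the handler is
# reported as execution time.
import runpod

# --- runs once at container start, BEFORE the first job is pulled ---
# e.g. model = load_model("/app/model.safetensors")  # hypothetical load
# -> reported under delay time, on top of the platform's own queue delay
model = None  # placeholder for an expensive load


def handler(job):
    # --- runs per job, AFTER the job is pulled from the queue ---
    # -> reported as execution time
    global model
    if model is None:
        model = object()  # stand-in for a lazy load moved into the handler
    return {"result": "done"}


runpod.serverless.start({"handler": handler})
```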
Dj
Dj5mo ago
Let me see who I can grab to help you. I personally appreciate your effort in researching this, and I want to make sure you're getting the right attention, but it's just slightly over my head in terms of the infrastructure itself, so it's hard for me to give you an effective/meaningful response. But I'll give it a try.

Personally, I think the variance you're seeing looks about like what I'd expect. The first run (the bottommost run) will always take the longest, as the server may have to download the image, load it from our cache (depending on your region), and actually start. Reusing the same image warm, it's good that you see about the same delay/execution time, especially testing with a smaller image. Reusing the same image cold, your delay time is also about what I'd expect for us to pull the image from wherever it is (even if it's just on the disk). There's a non-zero amount of time it takes to go from your device, to our "aiapi" service, then getting picked up by a given machine that matches your config, etc.

I don't see where you're seeing the 3s cold start you report; if I missed the screenshot please feel free to show me, and if it happened within the last 14 days I can take a look at a specific request based on its request ID (sync-....).

There's a tiny bit of variance in our exact server hardware across the fleet. Looking at this image I see:
n47dbkspgztynq
bzq0frn6fpij6z
k25j2gjjf46h31
zebhy1wuvn62u
77qay5aqk97zrd
which all picked up jobs and served inference. These are different machines in a datacenter without our model cache experiment, so the first load is done by a Docker pull. I don't see anything strange/unusual about the lifecycle of any of these Pods looking at the logs. For the bzq... outlier, I also don't see why it's being reported as 7.76 seconds, unless it's indicating that the job sat in queue behind the request with ID 76e6c9b2... because the Pod was already started.
Dj
Dj5mo ago
The numeric value shown here is what we logged to be the execution time of a given step (in ms)
[screenshot]
Dj
Dj5mo ago
3WaD, if you want me to take a similar look at your request lifecycle I can.
Dj
Dj5mo ago
Two different workers :frowning3:
Dj
Dj5mo ago
Just to repost it
[screenshot]
Dj
Dj5mo ago
I can check to see if this worker was paused erroneously.
3WaD
3WaDOP5mo ago
The container image download is done in the initialization state of the worker, not once it's idle.
Dj
Dj5mo ago
I have no idea why you see this reported to you in this way. :wires:
[screenshot]
Dj
Dj5mo ago
I think there's definitely something here, and I would never want to downplay the issues reported here; I just have to have a pretty good understanding.

Ehhh, grey area; there's nothing terrible in here. I removed the column that shows the IP of the machine that served the request xd
Dj
Dj5mo ago
We share graphs and stuff with enterprise customers and I believe at the very least in private you're entitled to the logs or a deeper understanding of your Worker lifecycle as a paying customer.
3WaD
3WaDOP5mo ago
@Dj You can even test it yourself if you want. Go to any serverless endpoint and send a request to a worker for the first time, and you should see these long delays. You can emulate worker shifting by terminating the warm worker that got that request. Then send another request, and it will have a long delay again. Do this a couple of times, and maybe the backend logs you have access to will tell us why.

Edit: It's good to use some minimal template without anything outside the handler function, so the delay time is clearly defined and only the job-queue delay is counted.
3WaD
3WaDOP5mo ago
I think it's the c86ccd45-c67d-4e79.. one. @Dj were you able to see any information in the logs? River replied to the ticket again today, but it looks like he's not reading the Discord threads, so I had to summarize it for the ticket again. It's been two weeks and the problem hasn't moved much 🤔
Dj
Dj5mo ago
@River Just so you're aware, this conversation has a Discord thread; I haven't checked if we added a link on the ticket
3WaD
3WaDOP5mo ago
To keep this thread informed and synced with the "progress" of the ticket via emails: I was told to upgrade to the latest RunPod SDK, 1.7.12. This made things worse, and I was able to break records with a 20+ second delay. Can anyone score higher? 😄
[screenshot]
3WaD
3WaDOP5mo ago
Still the minimal one
Xeverian
Xeverian5mo ago
well...
[screenshot]
Xeverian
Xeverian5mo ago
Not the minimal template of course, but not a big LLM either; some ComfyUI with basically nothing in it (background removal, other small models) on 4090 machines. Image size is about 12GB. Basically every start is a cold start. What frustrates me more is that the execution time varies from, say, 8 to 40 seconds, despite the work being basically the same all the time (same request, different input image URLs).
Xeverian
Xeverian5mo ago
and delay doesn't help either, yeah
[screenshot]
Xeverian
Xeverian5mo ago
It works, alright, but the unpredictability is what makes it all feel bad for the end user as well. I used to do inference on RunPod before as well, but with a bigger image with Flux it was even worse. I moved the image generation to another platform after that (I get ~10 seconds flat per generation, which is quite good), but the RunPod part that does some background removal and upscaling makes people wait 10-50 seconds more, and I'd really like to cut down on that.
3WaD
3WaDOP4mo ago
If the delay contains the container cold start, the stats aren't really comparable. That's why I'm putting everything into the handler where possible, even though it's not recommended (maybe now we see why 🤨 ). But thank you for demonstrating that these first-hit cold-start delays matter for normal usage. I also saw the difference in execution, mainly that different data centres' GPUs perform a bit differently. For example, the fastest 4090s are in EU-RO. If you have multiple locations selected, maybe that's the case here.

Today, I also realised we're being billed for this queue delay time. Based on the docs, the billing starts the moment the scheduler sends a wake-up signal to the worker, not after the container has truly started. For 21x 1s requests on a 4090/A40, this makes the difference between ~$0.007 and $0.06. My billing information confirms these numbers. This fact makes the issue even worse.

I'll also mention it here for everyone involved or checking this thread: today I got confirmation that this problem has been reproduced and is being solved internally by the dev team. Thanks everyone for your comments.
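As a back-of-the-envelope check of those figures, the arithmetic below uses an assumed flex rate of roughly $0.00033/s for a 4090-class worker and about 8 s of billed delay per request taken from the observations above; both numbers are approximations, not official pricing.

```python
# Back-of-the-envelope check of the billing difference quoted above.
# RATE_PER_SEC is an assumed flex-worker price for a 4090-class GPU;
# DELAY_SEC is the billed queue/startup delay observed per request.
RATE_PER_SEC = 0.00033   # assumed $/s
REQUESTS = 21
EXEC_SEC = 1.0           # actual work per request
DELAY_SEC = 8.0          # billed delay on top of the work

ideal = REQUESTS * EXEC_SEC * RATE_PER_SEC
with_delay = REQUESTS * (EXEC_SEC + DELAY_SEC) * RATE_PER_SEC
print(f"execution only:    ${ideal:.3f}")       # ≈ $0.007
print(f"with billed delay: ${with_delay:.3f}")  # ≈ $0.062
```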
Xeverian
Xeverian4mo ago
great to know! please share the final result here as well, especially if it involves updating the runpod sdk
Csaba8472
Csaba84724mo ago
Hello, so is that why I see both completely normal and abnormal delay times with roughly similar execution times?
[screenshot]
3WaD
3WaDOP4mo ago
It depends on what is running in the container. Suppose it has things executing outside the handler function (e.g. model/engine initialization as the official vLLM template does). In that case, it's also counted into this "delay time", and it's impossible to tell what is queue delay and what is just container cold-start time. But since you have 2-6-minute delays, I would say it's your cold-start time. You would have to move everything into the handler to be sure.
Csaba8472
Csaba84724mo ago
Everything is executed from the handler function. This is a CPU worker; it only uses two Linux binaries.
3WaD
3WaDOP4mo ago
Oof. Ok, so in that case it's probably really platform delay. Looks like you hit a new hard-to-beat record. 😀
Xeverian
Xeverian4mo ago
the real record is here:
[screenshot]
Xeverian
Xeverian4mo ago
I was charged $3 for that one 2-hour long cold start. But I've got my refund for it, so it's all good. Never happened after that
Dj
Dj4mo ago
DM me your account email^ :fbslightsmile: @Xeverian
Shaiona
Shaiona4mo ago
Hi I'm getting almost half an hour delay for a single job...
[screenshot]
Shaiona
Shaiona4mo ago
qlyu42bqx3ufrk Also, my credits were getting charged a lot probably from the long delay time. Are we supposed to get charged for the delays?
3WaD
3WaDOP4mo ago
I was also surprised that the billing starts the moment the worker gets a signal to start and not AFTER the container actually starts. You would expect not to pay for waiting in a job queue when it's stated you "only pay for what you use."
Shaiona
Shaiona4mo ago
My models were downloaded during the build, not at execution time. Looking at the logs, it also found the models and didn't re-download them.
Shaiona
Shaiona4mo ago
My job was running for 20 secs on an RTX 4090 Pro, but I'm getting charged almost a dollar. So I assume I'm getting charged for the long delay time, which doesn't seem right.
3WaD
3WaDOP4mo ago
Yes I confirmed it with this comment above: https://discord.com/channels/912829806415085598/1377137124209463336/1386116834335658116 It's tested even on the most minimal container. We're getting billed for job queue delay. 😕
Xeverian
Xeverian4mo ago
Noticing delay times spike today as well (one was a couple of minutes, I had to cancel it manually), despite having 5 workers in idle status. Execution times are higher as well (more cold starts?). RO-1, 4090. It was almost perfect just a day ago. Looks like worker turnaround is high (I'm constantly seeing throttled and initializing ones).
3WaD
3WaDOP4mo ago
My guess is that the platform became under-provisioned for the number of users that came in, or something got screwed up around the time the new UI and all the changes happened, because in the past everything was quite snappy and just worked. I still have the same endpoint from back then, with the same image and everything, which was able to return an SDXL txt2img in 11 seconds on cold starts and 6 seconds warm, very stably (including the delay; I'm talking about the total request time). Now it looks like this 😕
[screenshot]
Xeverian
Xeverian4mo ago
I have really weird gaps in my ComfyUI worker loading logs that I can't explain. Sometimes they take 15s, sometimes 0, and everything in between. And it's always near the "using MLP layer as FFN" line. This makes my execution times unpredictable: when there's no weird gap it's about 5s, and up to 25s with it. Everything else is completely the same (same tasks, just different image URLs). My guess is that something else is throttling it? Like the CPU being busy or something like that.
[screenshot]
Xeverian
Xeverian4mo ago
Maybe. It says 4090s are in high supply, but today like 5/10 of my workers are throttled. It's all quite frustrating, as I am planning to launch my game in 3 weeks or so. Delays and execution times being consistent would help a lot in that regard, even if a bit slower.
Xeverian
Xeverian4mo ago
The discrepancy is really crazy. I mean... this is the same request
[screenshot]
Xeverian
Xeverian4mo ago
The only difference is that for the 32s one, that weird hang/log gap was 25s. Looks like the slower one was on a throttled worker, which only had 8 vCPUs, while the faster ones were unthrottled, with 16 vCPUs.
3WaD
3WaDOP4mo ago
Yeah, I saw this with 3090s in EU-CZ yesterday. Some workers were straight-up unusable: 2-minute delay and then a timeout or just crashing. With the requirements to become a secure host, one would expect the performance of the workers to be closely monitored as well. If you want to dig into this yourself, I would recommend noting down the worker IDs with such slow performance and seeing if it repeats on them.
Xeverian
Xeverian4mo ago
so I've purged all the 8vCPU ones and it all became snappy enough
[screenshot]
Xeverian
Xeverian4mo ago
I had the impression these worker IDs were temporary
3WaD
3WaDOP4mo ago
Heh. Would a different location selection help then? Or are those 8 vCPU/16 vCPU workers in the same location? If they are, that would be quite bad for the user.
Xeverian
Xeverian4mo ago
All RO-1. There are even 6 vCPU ones there.
3WaD
3WaDOP4mo ago
If the IDs were static, a user could "ban" problematic workers from his endpoint via API and a bit of periodic automation (although the marketing point of serverless is not having to manage your hardware, right? 😆 )
Xeverian
Xeverian4mo ago
so looks like there are "the good" workers
[screenshot]
Xeverian
Xeverian4mo ago
and "the bad" (or the ugly, if you prefer) with the same specs
[screenshot]
3WaD
3WaDOP4mo ago
Interesting observation. I think it's worth opening a separate issue about this. And I believe there already was a similar one without resolution. Based on the tests, the cold start queue delay issue is present on all workers so far, good or bad. Although bad-performing ones like the 3090s I mentioned can indeed push this effect into extremes.
Dj
Dj4mo ago
The UI is far from the actual daemon and job scoop process. The change to the console would not have affected any other changes to the job processing time.

Worker IDs aren't [static], but you can GET /pods/:id and monitor by the machine ID, which is. You don't dislike certain workers; your trouble typically lies with specific machines.

I'm really sorry that I can't do much to make this better for either of you immediately. I know it's being worked on, and conversations like these help a lot. It's not that we don't see and empathize, it's just hard to provide actionable feedback.
3WaD
3WaDOP4mo ago
Isn't /pods/:id for pods? Where can we see the machine ID in serverless?
Dj
Dj4mo ago
A serverless worker in essence is a Pod you didn't configure
3WaD
3WaDOP4mo ago
When I do /pods/{worker-id} I get "pod not found"
Dj
Dj4mo ago
Let me make sure the REST API isn't doing anything weird, but the worker ID is effectively just a pod ID. It should just work; it may be that you're doing it while the pod is disabled. Alternatively, https://graphql-spec.runpod.io/#query-pod just to make sure it's not rprest
3WaD
3WaDOP4mo ago
Ok, the GraphQL works and returns machineId so this could be used. I also guess this should be fixed in REST?
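For anyone wanting to do the same lookup, a minimal sketch of that GraphQL query is below. The endpoint and field names follow the public spec linked above; the API key is a placeholder and the worker ID is one of the IDs mentioned earlier in the thread.

```python
# Sketch: look up the machineId behind a serverless worker via the GraphQL
# API, since a worker is effectively a pod. Field names follow the public
# GraphQL spec; API key is a placeholder.
import requests

API_KEY = "your_api_key"
WORKER_ID = "n47dbkspgztynq"  # a worker/pod ID from the endpoint

query = """
query Pod($podId: String!) {
  pod(input: { podId: $podId }) {
    id
    machineId
  }
}
"""

resp = requests.post(
    "https://api.runpod.io/graphql",
    params={"api_key": API_KEY},
    json={"query": query, "variables": {"podId": WORKER_ID}},
    timeout=30,
)
print(resp.json())
```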
Dj
Dj4mo ago
Yeah, I looked through the repo to see if we were doing any weird filtering against the pod type and we aren't but maybe I missed something reading the code on GitHub lol
River
River4mo ago
Okay, I didn't realize DJ handled this! thanks DJ!
Shaiona
Shaiona4mo ago
Hi. Is there a solution to this delay issue? I only have 1 job, set up 3 workers. It keeps staying in queue and charging credits.
3WaD
3WaDOP4mo ago
No update yet. I only know that the lead engineer started "looking into and fixing" it on the 23rd last week.
Barış
Barış4mo ago
Hi @3WaD, hope you're well, I was facing an issue where requests were stuck in queue until they were triggered by new requests (https://discord.com/channels/912829806415085598/1375136211395547246). Although it was a similar but different issue, it got fixed since I selected CUDA version 12.6 or higher in "Endpoint Settings > Advanced > Allowed CUDA Versions". Just wanted to share in case it helps, you probably enabled it already and this issue is a bit different
3WaD
3WaDOP4mo ago
Glad you were able to solve your problem, but unfortunately, CUDA versions are not related to this one.
Barış
Barış4mo ago
Thank you! Ah I see, hope this gets resolved soon too 🙏
3WaD
3WaDOP4mo ago
Quick update: a few days ago, I got some credits back on my account. 👀 There's supposed to be a PR opened for this issue internally, but it seems like the assigned developer has gone on a long vacation afterwards or something.
*"Engineering team is working to fix the issue soon". *
I've clearly expressed that they should reconsider the decision to bill users from the moment the worker gets a wake-up signal, rather than actually starting. This has been
"shared with the product team and engineering leadership team."
I continue to stand on this issue for everyone here until it's solved and the job queue is made faster and fairer. Thanks.
Dj
Dj4mo ago
Light update, confirming we're still working on this; I get a notification for the ticket every once in a while :fbslightsmile: The escalation happened and we have a tentative release time, but once I say it, it becomes invalid as a law of nature or something >.> Once the ball starts rolling I'll come back and explain the issue some more, or ask Dean to do it himself lol
3WaD
3WaDOP4mo ago
Perfect! Thanks. Do you also have any information about how the team reacted to the idea that users should not pay the GPU rent price while the job is waiting in the queue? 😅
yhlong00000
yhlong000004mo ago
When a request is in the queue and the system signals a worker to wake up, that's when billing starts. But if you have many requests queued and all workers are busy, you're not being charged for the requests waiting in the queue. We only charge from the moment a worker starts processing a request until it finishes.
3WaD
3WaDOP3mo ago
*"When a request is in the queue and the system signals a worker to wake up, that's when billing starts."*
Thanks for the confirmation. That's what I've said should be reconsidered. It currently takes 6-10 seconds, or even more, from the "wake-up signal" to the container actually starting and doing something. And as you say, it's billed. This means that, for example, CPU containers performing very fast (<1s) jobs can have a 6-10x higher price, and GPU jobs (e.g., startup-optimised vLLM or SDXL framework that can have ~10s cold starts, including task execution) can have a 2x higher price. This is also true for warm workers, even though the delay is lower there.
Dj
Dj3mo ago
An update to our serverless software was just released that affects how quickly we spawn a worker for endpoints with zero running workers. Users with a consistent load will not see the effect of this fix, but users with infrequent requests may now see a faster first time to response. In practice, I'm not sure how many users this will actually affect, and I'll try to see if I can get a more technical follow-up from someone working on this.

This does not change the actual cold start time, what goes on in your Pods/Workers, or the Pod/Worker lifecycle otherwise; this only targets how quickly after the first request we summon a worker for the user. Specifically, this fix resolves issues like:
*"Yeah, I saw this with 3090s in EU-CZ yesterday. Some workers were straight-up unusable: 2-minute delay and then a timeout or just crashing."*
Notice the bold. Even if your Pod is failing due to a driver mismatch (but we have something for this coming soon as well) we'll show you that faster :fbslightsmile:
3WaD
3WaDOP3mo ago
Unfortunately, only half of this announcement might be true. The released fix was focused on some extreme edge scenarios mentioned between the lines of the conversations above. The main queue-delay issue remains unresolved, and the workers still behave the same. This is also confirmed after chatting about it with @Dj; see the attached screenshots. The ticket support also tried to tell me via email that the issue had been resolved, while giving me even more free credits. I informed them that it's not, and asked them for an ETA for a fix that would address the main issue, while reminding them that it's been two months since the issue was reported and escalated, and a month since they said it had been reproduced and was being solved internally. I'll leave it up to each of you to decide how the communication up to this point, and the fact that everyone is losing money on every serverless request because of this, makes you feel.
[screenshots]
CodingNinja
CodingNinja3mo ago
[screenshot]
3WaD
3WaDOP3mo ago
Well, I think we all saw where this was going. It's been a fun few years, guys. See ya somewhere else, and good luck with your projects!
[screenshot]
3WaD
3WaDOP3mo ago
I am also attaching a full write-up of the ticket email conversation for those interested. 🖖
Barış
Barış3mo ago
Why was 3WaD banned? From what I saw, both he and the Runpod team were doing their best to share feedback and make progress
Dj
Dj3mo ago
The support team (which includes myself) may block users from our support channels (which includes this server) for a variety of reasons. In this case, I'm not certain I can share the specifics, just that the decision was not my own.
