Runpod · 4mo ago
Campbell

Serverless Instance Queue Stalling

Our team has encountered a fairly consistent issue where requests to our Serverless endpoints (configured as below) sit in the queue without being actioned by available idle or throttled instances. The original AI help thread with more details is here: https://discord.com/channels/912829806415085598/1392974715567607862

TL;DR notes:

Background
- We have a US-CA-1 network volume hosting a fairly large (~40GB) model file.
- We deploy on any available 24GB, 40GB, or 80GB card (Pro and non-Pro).
- There are at least 5 worker slots, and during these stalls none of them are utilized, nor are other serverless endpoints running actively.
- Requests queue for minutes to hours (we increased idle time to test) and at times randomly begin to work, especially if settings are fiddled with (worker count moved up, settings saved, count moved back down, saved again). That workaround does not always work, however.
- Once a worker actually starts running a request there is no issue: the container loads and the request is handled without problems.
- No issues with Pods, no billing issues.
- Cards are always shown as available for this datacenter + network-volume combination during these stalls; expanding the options to all available cards (up to 120GB) does not fix it.
- The only way to avoid this is to keep an active worker, which is far from ideal since these are intermittent tasks with unplanned schedules.

Looking for some assistance on the matter. The model would be difficult to package into the container image because we regularly build with buildx, which emulates the build environment and stretches builds of that scale to 20+ hours. It is not impossible to bake the model in, but we would very much like to avoid it and use the features as intended. Thank you for the help in advance, DC @ Synexis
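For context, here is a minimal sketch of how our jobs hit the endpoint, assuming the documented serverless REST routes (/run and /status/{id}); the endpoint ID, API key variable, and input payload are placeholders, and the exact response fields may differ slightly from what the API returns. During a stall, the status simply stays IN_QUEUE at the polling step even though the console shows cards available.

```python
import os
import time
import requests

# Hypothetical values for illustration; the real endpoint ID and payload differ.
ENDPOINT_ID = "abc123"
API_KEY = os.environ["RUNPOD_API_KEY"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit an async job to the serverless endpoint.
job = requests.post(f"{BASE}/run", headers=HEADERS,
                    json={"input": {"prompt": "example"}}).json()
job_id = job["id"]

# Poll the job status; during the stall this reports "IN_QUEUE" indefinitely,
# with no transition to a running state.
while True:
    status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
    print(status["status"])
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(10)
```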
9 Replies
3WaD · 4mo ago
https://discord.com/channels/912829806415085598/1377137124209463336 might be related. However, it's unusual if you don't see the worker transition to a running state after sending the request, which should occur even when it's stuck in the queue, so users can get billed for it.
Campbell (OP) · 4mo ago
Unfortunately there is no transition from the Queued state to Running when this bug occurs, and we've seen it happen around 70% of the time requests are made. The only alternative has been to keep one worker constantly alive, which is certainly not ideal; at that point, a Pod is cheaper to run. I'd like to avoid wasting money and taking up valuable datacenter GPU space that other customers could use.
3WaD · 4mo ago
When the request is sent, do you see it in the request tab in the endpoint dashboard? What is its status? I would suggest capturing that tab along with the workers tab on a video or screenshots and escalating this issue to a support ticket.
Campbell (OP) · 4mo ago
Yes. I think the main issue here is the disparity between what is shown as Available in the GPUs section vs. what is actually available. Right now is a bit of an extreme example, but you can see in these screenshots that in US-CA-1, with this network volume attached, there technically is availability, yet no workers start, period. The request is therefore queued and goes unanswered. This occurs even when availability is marginally higher and you do see workers in the Workers tab.
[Five screenshots attached showing the endpoint's GPU availability and Workers tabs; no descriptions provided]
Campbell (OP) · 4mo ago
As an alternative, our EU failover instance has high supply and happily feeds work through. But due to the network-volume datacenter lock and the limited status visibility via the API, we cannot easily shift to a multi-NA-datacenter endpoint request model without creating massive amounts of delay. In summary, this problem points to two possible solution paths (assuming I'm not misunderstanding or missing something here):

Problem Summary
Inadequate GPU availability visibility makes it appear that an endpoint can satisfy a request (GPUs available > 0), yet the platform cannot satisfy the request due to unknown issues/constraints.

Solution A - Multi-datacenter Network Volumes
Allow multiple datacenters within the same region to maintain synced network volumes (or, more easily, read-only copies of a source volume, synced on change to the master volume), and enable scheduling through a Runpod Endpoint to any datacenter within the replicated array of volumes.
Why: Allows for more diverse GPU availability within a region while maintaining the functionality of network volumes as model caches.

Solution B - Correct GPU Availability Metrics and Extend Endpoint Config
Solve the mismatch between GPU availability counts in the displays and API endpoints, and allow configuration of automatic or time-delayed rejection of endpoint requests in the specific event that scheduling cannot occur due to availability (i.e. rejection based on GPU availability plus an acceptable queue delay time). A rough client-side approximation of this behaviour is sketched below.
Why: Corrects the apparent source issue, is an appropriate SRE + chaos engineering resiliency enhancement, and improves overall availability and experience for end customers.

Final thoughts from a DevOps guy
My suspicion is that Solution B may be more difficult, as it involves understanding one or more root causes of the mismatch at the orchestrator layer (healthcheck delay, platform desync, etc.), especially given its P2P base. Solution A could be far more effective short-term; however, Solution B would also help strengthen SOC2 Type 2 reporting.
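Until something like Solution B exists platform-side, a rough client-side approximation is possible, assuming the documented /run, /status/{id}, and /cancel/{id} routes. The endpoint IDs, API key, and the 120-second threshold below are placeholders, and the failover naturally requires the model to already exist on an EU network volume, as ours does.

```python
import time
import requests

# Hypothetical IDs, key, and threshold for illustration only.
API_KEY = "YOUR_RUNPOD_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
NA_BASE = "https://api.runpod.ai/v2/NA_ENDPOINT_ID"
EU_BASE = "https://api.runpod.ai/v2/EU_ENDPOINT_ID"
MAX_QUEUE_DELAY = 120  # seconds we are willing to wait before failing over


def run_with_failover(payload: dict) -> dict:
    """Submit to the NA endpoint; if the job is still IN_QUEUE after
    MAX_QUEUE_DELAY, cancel it and resubmit to the EU failover endpoint."""
    job = requests.post(f"{NA_BASE}/run", headers=HEADERS,
                        json={"input": payload}).json()
    job_id = job["id"]
    deadline = time.time() + MAX_QUEUE_DELAY

    while time.time() < deadline:
        status = requests.get(f"{NA_BASE}/status/{job_id}", headers=HEADERS).json()
        if status["status"] != "IN_QUEUE":
            return status  # a worker picked it up (or it finished/failed)
        time.sleep(5)

    # Queue stalled: cancel the NA job and hand the work to the EU endpoint.
    requests.post(f"{NA_BASE}/cancel/{job_id}", headers=HEADERS)
    return requests.post(f"{EU_BASE}/run", headers=HEADERS,
                         json={"input": payload}).json()
```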
3WaD · 4mo ago
There are no available and idle workers in your screenshots. You have to capture the moment when there are idle workers, and yet the request doesn't go through. Otherwise, it's logical, and they'll tell you just to move the network volume to a different datacenter.
Campbell (OP) · 4mo ago
@3WaD I'm happy to add that again when I next see that occurrence, but the issue described is clear: GPU availability for a given endpoint shows > 0 GPUs available for a given configuration, yet workers do not deploy, causing our queue stall. That can only be described as an unintended discrepancy between the UI/API and reality, and therefore a bug. There is also a second bug where workers in the idle state fail to execute queued tasks at times; as soon as I see that again, I will happily add it to the ticket (a small polling sketch for capturing that evidence is below). In either case, at least one of the two bugs is presented in detail here.

TICKET NOTE
Submitted at 6:32PM EST, Jul 10
ID: 20180
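A minimal sketch for catching that second case automatically, assuming the documented serverless /health route; the endpoint ID and key are placeholders, and the response field names below are my assumption of the health payload shape and may differ slightly.

```python
import time
import requests

# Hypothetical endpoint ID and key; the /health response shape assumed here
# (workers/jobs counters) may differ slightly from the live API.
ENDPOINT_ID = "abc123"
HEADERS = {"Authorization": "Bearer YOUR_RUNPOD_API_KEY"}
HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"

# Log any window where the endpoint reports idle workers while jobs remain
# queued - the evidence requested for the support ticket.
while True:
    health = requests.get(HEALTH_URL, headers=HEADERS).json()
    idle = health.get("workers", {}).get("idle", 0)
    queued = health.get("jobs", {}).get("inQueue", 0)
    if idle > 0 and queued > 0:
        print(f"{time.ctime()}: {idle} idle worker(s) but {queued} job(s) still queued")
    time.sleep(30)
```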
3WaD · 4mo ago
I also had big expectations of this company. Let's just say the monitoring of the machines is rather "approximate".
Campbell (OP) · 4mo ago
I think they're doing a phenomenal job @3WaD. It's important to remember this isn't K8s orchestration; it uses a fairly unique peer-to-peer networking approach instead. JM and flash have really kicked ass getting this to where it is. It's a difficult approach with massive upside for datacenter flexibility (and thus lower cost to the end user), and RunPod is still easily one of the best compute providers around, even after less than 6 years of operation. I've used them now for two different startups of mine, the last of which was acquired primarily because of a cost model only really possible at the time due to the affordability and stability of RunPod, so I really can't say anything but good stuff about how on top of things these folks are. I worked with AWS for more than 10 years prior, and I don't think I'll ever go back if I can help it. Issues are going to happen, of course, but I'd cut them some slack.
