Serverless Load-balancing
Good morning,
I recently came across https://docs.runpod.io/serverless/load-balancing/overview and followed the instructions. Yet, when I attempted to make an external HTTP request using n8n, it simply did not work. I've attached my worker logs below. Please let me know if I've done something wrong, or if it's a possible issue with the documentation.
Note: I used the following Container Image: runpod/vllm-loadbalancer:dev
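For reference, this is roughly the kind of request I was trying to send (shown here as a Python sketch rather than the n8n HTTP node; the endpoint ID, route, and payload are placeholders, and the URL format is an assumption based on the load-balancing docs):

import os
import requests

# Placeholder values; swap in the real endpoint ID and a RunPod API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

# Load-balancing endpoints are called directly over HTTP, so the request
# goes to whatever route the worker's FastAPI app exposes (e.g. /generate).
resp = requests.post(
    f"https://{ENDPOINT_ID}.api.runpod.ai/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "Hello"},
    timeout=300,
)
print(resp.status_code, resp.text)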
This is not live yet to my knowledge
It's available on the website though? Within the Manage API interface.
Where?
Been asking for access
Can you post a screenshot?


How do you get there?
Select your serverless instance > Manage > It's at the top
I don't have one, but let me see what happens if I try
Sounds good, let me know!
Oh, now I got hit by the update
I doubt my software is compatible yet, but it means I can work on that this week
Haha that's good. I can't figure out how it works; I've followed the documentation yet it does not work lol.
I'm also a little stuck. How do I get container logs?
I get HTTP error 401 but I can't see any logs other than "worker is ready"
Yes, but we don't have that implemented yet
So it won't route to them if that errors?
Runpod is very arbitrary with the ping thing
So right now the build I can use for testing will 404 on the URL
If that means it won't route to it, mine won't work yet
I was hoping it would briefly work
I don't have a publicly downloadable dev build at 3am haha
But will postpone it for now and try again when I have time
I was eager to try it
But not so eager I am gonna upload a test binary xD
The lack of a log is odd though
I do have something, not stable for the public, that does have that endpoint
We've implemented it, but the regular dev builds are zipped and behind an account lock, so I can't use them from my phone
In my case the model downloads during load; it's not baked in. Is container storage persistent on serverless, or will this likely require a network volume?
What if it's idle?
Idle clears out?
Because I never understood flash boot
Flash boot sounds like black magic to me
Like, it's usually a 2-minute model download; if that happens often, network storage makes sense. If it's usually cached with flash boot, I'm fine not adding any
In my experience we download faster from HF than we load from network storage
Or at least much faster than writing to network storage xD
The saving is waaaay slower
IO in general seems to be
I doubt we hit those 400 MB/s
Or is it only the writing that's slow?
On saving yes
But on load?
Were you able to figure it out @Henky!!?
No, it looks like we need that /ping endpoint, which is impossible at 3am
I am gonna sleep
Of course haha, it's currently 2am for me; I'll keep working till 5/6am. I'm currently forking worker-sglang.
I do think my workers may be running, but since it doesn't get a 200 it's not sending jobs to them
Keep in mind those are likely for the classic worker type; this load balancer is brand new, so nobody has gotten their hands on it yet
Almost certainly. I was able to send an HTTP request, and the workers noticed it and changed to running, but it didn't get past that.
I don't even get that far
No log, not running, nothing happens
Completely dead URL
Correct, it downloaded the image, but then just died lol.

It's as if the Docker image provided is not set up to work fully. Docker Image: runpod/vllm-loadbalancer:dev
For /ping right?
It output nothing for either /ping or /generate.
The HTTP request kinda just died. I'll redeploy it now to hopefully give better insight.
What if, in my case, there's no response at all but then once loaded we send 200? I assume that's fine
Because our webserver begins working after the model loads
What port did you use?
5001, but I am not using vLLM
I am trying koboldcpp
Yes, it's 5001 for me on both, but for ping it won't return a valid response as we don't have that yet
If I recall from looking into the docs, the port is 5000, no?
if __name__ == "__main__":
    import uvicorn
    port = int(os.getenv("PORT", "5000"))
    uvicorn.run(app, host="0.0.0.0", port=port)
In my own image it's 5001
Oh right.
I'm trying to get a full UI in here haha
I think it's hard coded?
Possibly there is a var for it though.


Port 80.
I wonder what port 5000 is; I'll check the code now haha
Thing is, why am I not getting any logs about the page load?
Ah, port 5000 is the FastAPI port. It listens for incoming requests.
Shouldn't it at least show that I tried but failed?
Did you change the Worker Docker Image?
Normally that's a default command in the Docker image; the pods just run it
So I assume this is the same and the default start command would be triggered
Anyway, I give up for tonight. When I have time and actually have access to my dev tools I can do a proper attempt
Night dude. I'll keep trying; if I figure it out I'll post in here
I revisited the documentation and spotted something important (see attached image).
I've successfully forked the preconfigured repository and initiated a build on RunPod using the fork. The worker is currently deploying. I'll share updates once it's live.
Sources:
https://docs.runpod.io/serverless/load-balancing/vllm-worker
https://github.com/runpod-workers/vllm-loadbalancer-ep

Even after using the proper template I'm still getting the following error when sending an HTTP request:
INFO 08-06 03:04:04 [__init__.py:244] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/src/handler.py", line 13, in <module>
    from utils import format_chat_prompt, create_error_response
  File "/src/utils.py", line 3, in <module>
    from .models import ChatMessage, ErrorResponse
ImportError: attempted relative import with no known parent package
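That error means utils.py is being loaded as a top-level module (handler.py runs as a plain script), so the relative import in utils.py has no parent package. A likely fix, sketched here as an assumption rather than the exact edits made to the fork later, is to switch to an absolute import:

# /src/utils.py: since /src is on sys.path when handler.py runs as a script,
# the sibling module can be imported absolutely instead of relatively.
from models import ChatMessage, ErrorResponse  # instead of: from .models import ...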
GitHub: Daniel-Farmer/vllm-loadbalancer-ep
I've made two small edits to the fork, and it's launching now
Great, it works!

@Dj I can't get it to route at all
It just remains a 401 and the workers don't even start
@Jason if it's alternating between initializing and throttled, is that related or does that happen anyway?
I have all regions
I have 3 + 2
It just doesn't route at all
And there won't be a /ping endpoint when it doesn't even start the software
The version I hooked up has that, though it would happen a minute or so in
I'd rather just DM someone one on one
Since it's so new and I am also building for it, that's way easier
But I have an idea
I added 1 active worker
That one I now see boot up for the first time
Despite it running, the load balancer 401s
Oh no
I think I found what it is, they have a very big design quirk
But luckily its a design quirk that should be fixable
Yup my suspicion seems correct
That breaks a lot
They require the api key auth
I'll submit a ticket I guess xD
It's kinda doable, but it breaks a lot of core functionality in ways we can't fix
Even worse, it requires a writable API key
To make the load balancer route it
Yes
I get why they did that, because it would cost money to serve an HTTP request
But it's a private URL and we have our own auth, so I need a toggle that turns that off
It's load balanced, so if you hit it with a basic HTTP request it has to spin up a worker to reply
So if a random spambot hits it, they would spin up a worker
To prevent that, they auth-gated it
But that destroys my use case
Or at least severely cripples it
Because the whole idea is that users can have their own secret URL endpoint: bookmark it, and whenever they want to use KoboldCpp they visit the link, the instance seamlessly shows up, and it has the UI
Browsers can't do bearer auth like that
Told them in advance that KoboldCpp is such a complex use case that if it had issues I'd find them quick haha
My designer hat came up with a solution that should be super nice: password URLs
A unique URL you can generate that acts as an auth bearer bypass
If one gets compromised you invalidate it
Issue 2:

It has CORS restrictions on the RunPod side
Just got it confirmed: CORS from the origin server is not respected
@Jason do you know where I can configure the network mount location?
Meh
I'll skip that for now.
Ticket submitted with all my findings
I'm always the one to find design limitations haha
Thanks for figuring it out, let me pull the ticket for myself ❤️
#21538

I released a KoboldCpp Docker update that can detect RunPod serverless and, if present, dynamically switches from /workspace to that
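Roughly this kind of detection (a sketch under assumptions: the env variable and mount path are what RunPod serverless workers typically expose, not necessarily what the KoboldCpp image actually checks):

import os

# Assumption: serverless workers expose RUNPOD_ENDPOINT_ID in the environment,
# and network volumes are mounted at /runpod-volume instead of /workspace.
if os.environ.get("RUNPOD_ENDPOINT_ID"):
    data_dir = "/runpod-volume"
else:
    data_dir = "/workspace"
print(f"Using data directory: {data_dir}")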
I've been testing this new load-balancing option and it works for the example project but not for my own Docker image. When I ping the endpoint I see one worker running, but the logs do not show anything.
Any recommendations on a base image? I was using an NVIDIA CUDA 12.9 base, suspecting it's too "new"
I was right; using a CUDA 12.1 Ubuntu 22.04 base image solved the issue
I wanted to use the default vLLM image since it has a /ping endpoint as well. I managed to make it work after some struggle, but one quirk I found is that if you set 1 active worker and 1 max worker and send a request afterwards, RunPod doesn't route the request to the worker and times out the request. It only works if there is capacity to create new workers.
Sharing the configuration that worked for me

Thanks for the feedback here, some things suggested in here make sense, I'll see what I can turn around:
- allow CORS to be set
- allow some type of signed URLs that can be expired but have no auth
This is still an early release; we plan to make it robust as we get more use cases and see what makes sense for users to deploy with serverless. Auth is always a first no-brainer, but I do see the point around using serverless to publicly share among your team or just loading ComfyUI, etc.
For CORS it's best if it follows whatever the backend is doing; that way you always get it right
Not sure how implementable that kind of automatic behavior is combined with the bearer auth though, but in theory it only needs a CORS allowance when the request goes through
And while I didn't mention it here, as you can probably imagine the value of these requests ties in to the RunPod Hub. I want to be able to offer this as an on-demand app through the Hub where the entire UX runs on RunPod.
Basically what we do already with the pod, but for those preferring it serverless
You can bake auth in but CORS is more difficult? CORS is easy to implement if that's the case
On my end I'd just like it to mirror the webserver 1:1, hence no auth other than the backend's own and CORS allowed if the backend allows it
Live demo of what I am aiming for: https://koboldai-koboldcpp-tiefighter.hf.space
KoboldAI Lite: a powerful tool for interacting with AI directly in your browser. Chat with AI assistants, roleplay, write stories and play interactive text adventure games.
That demo instance shuts down if it's inactive for 1 hour
@flash-singh it's a bit unclear from the docs how the port and health port configuration is supposed to work. Does it need to be set as env vars AND Docker exposed ports? Also, it would be nice to have the load-balancing option on the REST API
You don't have to define any ports; if you run your FastAPI server on port 80, with /ping on port 80 too, then it works as is
The reason port and health port are separated: in some instances like vLLM it's not easy to add another /ping endpoint to the existing FastAPI, so you may need to run another thread with a separate /ping FastAPI; that's why you can define it separately
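A rough sketch of that separate-health-server pattern (the names and the health port value are illustrative, not an official example):

import threading

import uvicorn
from fastapi import FastAPI

health_app = FastAPI()

@health_app.get("/ping")
def ping():
    # Return 200 once the worker is ready to receive traffic.
    return {"status": "healthy"}

def run_health_server():
    # Separate health-check server on its own port (illustrative value);
    # the same value would go in the endpoint's health port setting.
    uvicorn.run(health_app, host="0.0.0.0", port=8080)

threading.Thread(target=run_health_server, daemon=True).start()
# ...then start the main inference server (e.g. vLLM's OpenAI-compatible
# server) on the primary port in the main thread as usual.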
You don't need to expose any ports on the Docker or container side; we handle that automatically. Just run the FastAPI server on port 80
so 0.0.0.0:80
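So in the simplest case, something like this (a minimal sketch; the /generate route and payload shape are placeholders, not part of the official template):

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/ping")
def ping():
    # Health check the load balancer uses to decide whether to route traffic here.
    return {"status": "healthy"}

@app.post("/generate")
def generate(payload: dict):
    # Placeholder inference route; replace with the real handler.
    return {"output": payload}

if __name__ == "__main__":
    # Serve everything on port 80 so no extra port configuration is needed.
    uvicorn.run(app, host="0.0.0.0", port=80)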
My current goal is to allow an override of CORS using an env variable as an option, e.g. RUNPOD_LB_CORS=*
That kinda works, unless someone makes software where only some endpoints need CORS
What's blocking using the headers from the worker?
what do you mean by this?
RUNPOD_LB_CORS would be defined in your env variables when you make the serverless endpoint; our load balancer will follow it, and you don't need to make any change to your FastAPI
I'm asking why it would have to be defined?
The API endpoints already have the correct headers
So some users can block all CORS access if they want, or allow all
Shouldn't the worker handle that?
I see your point, the HEAD calls should be going to your worker
Yeah, and that way the worker has control over which URL is CORS-enabled and which isn't, in case that ever matters
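For reference, the worker-side version of that would just be FastAPI's standard CORS middleware (a sketch; whether the load balancer passes these headers through is exactly what's being discussed here):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# The worker itself decides which origins are allowed; no load-balancer-side
# setting would be needed if these headers were passed through unchanged.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],   # or a specific list of allowed origins
    allow_methods=["*"],
    allow_headers=["*"],
)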
Currently every request counts against scale; even a HEAD call would. We'd need to make all HEAD calls not scale anything
Won't work if it then doesn't spin up at all
Currently everything is authed, so there's no way to do a proper HEAD call anyway
At least if the auth fails, CORS becomes irrelevant
Yeah true, CORS will block the actual call until the HEAD call passes
This is why it's best if the load balancer handles the CORS: it's very cheap and can return instantly rather than sending traffic. These aren't your normal FastAPI setups where a worker can spin up a FastAPI in under a second; due to the cold start of the model it can take upwards of ~20 seconds
And that spin-up can cost 10x more than a normal CPU worker
Technically, if you want to go overkill, you can cache the CORS state of the worker
But I don't see the issue with passthrough; HF does passthrough
We can do passthrough, that's not an issue, but it will wait until the worker spins up, which can take a while
That happens on the request either way if it is going to succeed
Yup, but we will need to first allow no auth, otherwise CORS won't work regardless with auth
Fair, and if CORS passthrough depends on the no-auth URLs, that makes sense
Your var for the auth version then also makes sense
We actually don't block CORS, unless you're seeing that; not explicitly at least. Whatever your server does should pass through; the problem is that auth is blocking it
We'll see what happens once we allow no auth
I see our app choose to use its CORS proxy, and if we bypass CORS limits it doesn't, hence that led me to believe it's like that
I don't think you can control headers for browser CORS calls, so the browser will initiate a HEAD request without auth and it will fail