Runpod•3w ago
Professor

Serverless Load-balancing

Good morning, I recently came across https://docs.runpod.io/serverless/load-balancing/overview and followed the instructions. Yet when I attempted to make an external HTTP request using n8n, it simply did not work. I've attached my worker logs below. Please let me know if I've done something wrong, or if it's a possible issue with the documentation. Note: I used the following container image: runpod/vllm-loadbalancer:dev
144 Replies
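[Note: for anyone reproducing the setup above, a load-balancing endpoint is called directly over HTTPS rather than through the queue-based /run API. Below is a minimal Python sketch of such a request, assuming the URL shape from the load-balancing docs; the endpoint ID, the /generate route, and the payload are placeholders, and, as established later in this thread, requests must carry a Runpod API key:]

```python
import os

import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
url = f"https://{ENDPOINT_ID}.api.runpod.ai/generate"  # assumed URL shape

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    json={"prompt": "Hello"},  # placeholder payload
    timeout=120,  # workers may cold-start, so allow a generous timeout
)
resp.raise_for_status()
print(resp.json())
```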
Henky!!•3w ago
This is not live yet to my knowledge
ProfessorOP•3w ago
It's available on the website though? Within the Manage API interface. 🙂
Henky!!•3w ago
Where? I've been asking for access. Can you post a screenshot?
ProfessorOP•3w ago
(screenshot attached)
ProfessorOP•3w ago
(screenshot attached)
Henky!!•3w ago
How do you get there?
ProfessorOP•3w ago
Select your serverless instance > Manage > it's at the top 🙂
Henky!!•3w ago
I don't have one, but let me see what happens if I try
ProfessorOP•3w ago
Sounds good, let me know! 🙂
Henky!!•3w ago
Oh, now I got hit by the update 😄 I doubt my software is compatible yet, but it means I can work on that this week 😄
ProfessorOP•3w ago
Haha that's good. I can't understand how it works; I've followed the documentation, yet it does not work lol.
Henky!!•3w ago
I'm also a little stuck. How do I get container logs?
Henky!!•3w ago
I get an HTTP 401 error, but I can't see any logs other than "worker is ready"
Henky!!•3w ago
Yes, but we don't have that implemented yet. It won't route to them if that errors?
Henky!!•3w ago
Runpod is very arbitrary with the ping thing. So right now the build I can use for testing will 404 on that URL. If that means it won't route to it, mine won't work yet; I was hoping it would briefly work. I don't have a publicly downloadable dev build at 3am haha. I'll postpone it for now and try again when I have time
Henky!!•3w ago
I was eager to try it
Henky!!•3w ago
But not so eager that I am gonna upload a test binary xD
Henky!!•3w ago
The lack of a log is odd though. I have something not stable for the public that does have that endpoint
Henky!!•3w ago
We've implemented it, but the regular dev builds are zipped and behind an account lock, so I can't use them from my phone. In my case the model downloads during load; it's not baked in. Is container storage persistent on serverless, or will this likely require a network volume?
Henky!!•3w ago
What if it's idle?
Henky!!•3w ago
Idle clears out?
Henky!!•3w ago
Because I never understood flash boot
Henky!!•3w ago
Flash boot sounds like black magic to me
Henky!!•3w ago
Like, it's usually a 2-minute model download; if that happens often, network storage makes sense. If it's usually cached with flash boot, I'm fine not adding any. In my experience we download faster from HF than we load from network storage
Henky!!•3w ago
Or at least much faster than writing to network storage xD
Henky!!•3w ago
The saving is waaaay slower. IO in general seems to be; I doubt we hit those 400 MB/s. Or is it only the writing that's slow?
Henky!!•3w ago
On saving, yes. But on load?
ProfessorOP•3w ago
Were you able to figure it out @Henky!!? 🙂
Henky!!•3w ago
No, it looks like we need that /ping endpoint, which is impossible at 3am. I am gonna sleep
ProfessorOP•3w ago
Of course haha, it's currently 2am for me; I'll keep working till 5/6am. I'm currently forking worker-sglang. 🙂
Henky!!•3w ago
I do think my workers may be running, but since it doesn't get a 200 it's not sending jobs to them
Henky!!•3w ago
Keep in mind those are likely for the classic worker type; this load balancer is brand new, so nobody has gotten their hands on it yet
ProfessorOP•3w ago
Almost certainly. I was able to send an HTTP request, and the workers noticed it and changed to running, but it didn't get past that.
Henky!!•3w ago
I don't even get that far. No log, not running, nothing happens. Completely dead URL
ProfessorOP•3w ago
Correct, it downloaded the image, but then just died lol.
Henky!!•3w ago
(screenshot attached)
ProfessorOP•3w ago
It's as if the Docker image provided is not set up to work fully. Docker image: runpod/vllm-loadbalancer:dev
Henky!!•3w ago
For /ping right?
ProfessorOP•3w ago
It output nothing for either /ping or /generate; the HTTP request kinda just died. I'll redeploy it now, to hopefully get better insight. 🙂
Henky!!•3w ago
What if in my case there's no response at all, but then once loaded we send a 200? I assume that's fine, because our webserver begins working after the model loads
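[Note: a minimal sketch of the pattern Henky!! describes, assuming a FastAPI worker. /ping returns 503 until the model has loaded and 200 afterwards, so the load balancer should only route traffic once the server is actually usable; the load_model placeholder and the model_ready flag are illustrative, not part of any Runpod SDK:]

```python
import threading

from fastapi import FastAPI, Response

app = FastAPI()
model_ready = False  # flipped to True once loading finishes

def load_model():
    global model_ready
    # ... download and load the model here (placeholder) ...
    model_ready = True

threading.Thread(target=load_model, daemon=True).start()

@app.get("/ping")
def ping():
    # 200 tells the load balancer this worker can take traffic;
    # 503 keeps requests away while the model is still loading.
    return Response(status_code=200 if model_ready else 503)
```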
ProfessorOP•3w ago
What port did you use?
Henky!!•3w ago
5001, but I am not using vLLM
Henky!!•3w ago
I am trying KoboldCpp. Yes, it's 5001 for me on both, but /ping won't return a valid response as we don't have that yet
ProfessorOP•3w ago
If I recall from looking into the docs, the port is 5000, no?
```python
if __name__ == "__main__":
    import uvicorn
    port = int(os.getenv("PORT", "5000"))
    uvicorn.run(app, host="0.0.0.0", port=port)
```
Henky!!•3w ago
In my own image it's 5001
ProfessorOP•3w ago
Oh right, 🙂
Henky!!•3w ago
I'm trying to get a full UI in here haha
ProfessorOP•3w ago
I think it's hard-coded? Possibly there is a var for it though.
Henky!!•3w ago
(screenshot attached)
ProfessorOP•3w ago
(screenshot attached)
ProfessorOP•3w ago
Port 80, 🤔
ProfessorOP•3w ago
I wonder what port 5000 is; I'll check the code now haha 🙂
Henky!!•3w ago
Thing is, why am I not getting any logs about the page load?
ProfessorOP•3w ago
Ah, port 5000 is the FastAPI server; it listens for incoming requests.
Henky!!•3w ago
Shouldn't it at least show that I tried but failed?
ProfessorOP•3w ago
Did you change the Worker Docker Image?
Henky!!•3w ago
Normally that's a default command in the Docker image; the pods just run it. So I assume this is the same and the default start command would be triggered. Anyway, I give up for tonight; when I have time and actually have access to my dev tools I can make a proper attempt
ProfessorOP•3w ago
Night dude. I'll keep trying, and if I figure it out I'll post in here 🙂
ProfessorOP•3w ago
I revisited the documentation and spotted something important (see attached image). I've successfully forked the preconfigured repository and initiated a build on RunPod using the fork. The worker is currently deploying; I'll share updates once it's live. Sources: https://docs.runpod.io/serverless/load-balancing/vllm-worker https://github.com/runpod-workers/vllm-loadbalancer-ep
(screenshot attached)
ProfessorOP•3w ago
Even after using the proper template, I'm still getting the following error when sending an HTTP request:
```
INFO 08-06 03:04:04 [__init__.py:244] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/src/handler.py", line 13, in <module>
    from utils import format_chat_prompt, create_error_response
  File "/src/utils.py", line 3, in <module>
    from .models import ChatMessage, ErrorResponse
ImportError: attempted relative import with no known parent package
```
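[Note: this is Python's standard complaint when a module using a relative import is loaded as a top-level script rather than as part of a package. A plausible fix, given the layout in the traceback; whether this matches the "two small edits" mentioned below is an assumption:]

```python
# src/utils.py
# handler.py runs as a plain script, so utils.py is imported as a
# top-level module with no parent package, and the relative form fails.
# An absolute import sidesteps that:
from models import ChatMessage, ErrorResponse
# instead of:
# from .models import ChatMessage, ErrorResponse
```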
ProfessorOP•3w ago
GitHub: Daniel-Farmer/vllm-loadbalancer-ep (https://github.com/Daniel-Farmer/vllm-loadbalancer-ep)
ProfessorOP•3w ago
I've made two small edits to the fork, and it's launching now 🙂
ProfessorOP•3w ago
Great it works! 🙂
(screenshot attached)
Henky!!•3w ago
@Dj I can't get it to route at all; it just remains a 401 and the workers don't even start. @Jason if it's alternating between initializing and throttled, is that related, or does that happen anyway?
Henky!!•3w ago
I have all regions
Henky!!•3w ago
I have 3 + 2. It just doesn't route at all. And there won't be a /ping endpoint when it doesn't even start the software; the version I hooked up has that, though it would happen a minute or so in
Henky!!•3w ago
I'd rather just DM someone one-on-one; since it's so new and I am also building for it, that's way easier. But I have an idea. I added 1 active worker, and that one I now see boot up for the first time. Despite it running, the load balancer 401's. Oh no, I think I found what it is: they have a very big design quirk. But luckily it's a design quirk that should be fixable. Yup, my suspicion seems correct, and that breaks a lot: they require the API key auth. I'll submit a ticket I guess xD. It's kinda doable, but it breaks a lot of core functionality in ways we can't fix. Even worse, it requires a writable API key
Henky!!•3w ago
To make the load balancer route it
Henky!!•3w ago
Yes
Henky!!•3w ago
I get why they did that, because an HTTP request would cost money. But it's a private URL and we have our own auth, so I need a toggle that turns that off
Henky!!•3w ago
It's load balanced, so if you hit it with a basic HTTP request it has to spin up a worker to reply. So if a random spambot hits it, it would spin up a worker. To prevent that, they auth-gated it. But that destroys my use case, or at least severely cripples it, because the whole idea is that users can have their own secret URL endpoint: bookmark it, and whenever they want to use KoboldCpp they visit the link, and the instance seamlessly shows up with the UI. Browsers can't do bearer auth like that. I told them in advance that KoboldCpp is such a complex use case that if it had issues I'd find them quick haha. My designer hat came up with a solution that should be super nice: password URLs. A unique URL you can generate that acts as an auth bearer bypass; if one gets compromised, you invalidate it
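[Note: a hypothetical sketch of the "password URL" idea, just to make the proposal concrete: a random token embedded in the path acts as the credential, so a plain browser GET can pass without an Authorization header. None of these names are a real Runpod API:]

```python
import secrets

valid_tokens: set[str] = set()

def issue_url(endpoint_base: str) -> str:
    # Generate a secret URL the user can bookmark.
    token = secrets.token_urlsafe(32)
    valid_tokens.add(token)
    return f"{endpoint_base}/u/{token}/"

def is_authorized(path: str) -> bool:
    # e.g. path == "/u/<token>/some/page": route to a worker only
    # if the embedded token is still valid.
    parts = path.split("/")
    return len(parts) > 2 and parts[1] == "u" and parts[2] in valid_tokens

def revoke(token: str) -> None:
    # Invalidate a compromised URL without touching the account API key.
    valid_tokens.discard(token)
```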
Henky!!•3w ago
Issue 2:
(screenshot attached)
Henky!!•3w ago
It has CORS restrictions on the Runpod side. Just got it confirmed: CORS from the origin server is not respected. @Jason do you know where I can configure the network mount location?
Henky!!•3w ago
Meh, I'll skip that for now. Ticket submitted with all my findings 😄 I'm always the one to find design limitations haha
Dj•3w ago
Thanks for figuring it out, let me pull the ticket for myself ❤️
Henky!!•3w ago
#21538
Henky!!•3w ago
😄
(screenshot attached)
Henky!!•3w ago
I released a KoboldCpp Docker update that can detect Runpod serverless and, if present, dynamically switch from /workspace to that
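[Note: a hedged sketch of that detection. It assumes Runpod serverless workers expose the RUNPOD_ENDPOINT_ID environment variable and that an attached network volume is mounted at /runpod-volume; both are assumptions to verify against the current docs:]

```python
import os

def pick_data_dir() -> str:
    # Treat a set RUNPOD_ENDPOINT_ID as the "running on serverless"
    # signal (assumption); pods keep using /workspace.
    if os.environ.get("RUNPOD_ENDPOINT_ID"):
        return "/runpod-volume"  # assumed serverless network-volume mount
    return "/workspace"
```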
matt•3w ago
I've been testing this new load-balancing option, and it works for the example project but not for my own Docker image. When I ping the endpoint I see one worker running, but the logs do not show anything. Any recommendations on a base image? I was using an NVIDIA CUDA 12.9 base, suspecting it's too "new". I was right: using a CUDA 12.1 / Ubuntu 22.04 base image solved the issue
Usman Yasin•3w ago
I wanted to use the default vLLM image since it has the /ping endpoint as well. Managed to make it work after some struggle, but one quirk I found is that if you set 1 active worker and 1 max worker and send a request afterwards, Runpod doesn't route the request to the worker and times out the request. It only works if there is capacity to create new workers. Sharing the configuration that worked for me
(screenshot attached)
flash-singh•3w ago
Thanks for the feedback here; some of the things suggested in here make sense, and I'll see what I can turn around:
- allow CORS to be set
- allow some type of signed URLs that can be expired but have no auth
This is still an early release; we plan to make it robust as we get more use cases and see what makes sense for users to deploy with serverless. Auth is always a first no-brainer, but I do see the point about using serverless to share publicly among your team, or just loading ComfyUI, etc.
Henky!!•3w ago
For CORS it's best if it follows whatever the backend is doing; that way you always get it right. Not sure how implementable that kind of auto-detection is combined with the bearer auth, though; in theory it only needs the CORS allowance when the request goes through. And while I didn't mention it here, as you can probably imagine, the value of these requests ties in to the Runpod hub. I want to be able to offer this as an on-demand app through the hub, where the entire UX runs on Runpod. Basically what we do already with the pod, but for those preferring it serverless
flash-singh•3w ago
You can bake auth in, but CORS is more difficult? CORS is easy to implement if that's the case
Henky!!•3w ago
On my end I'd just like it to mirror the webserver 1:1, hence no auth other than the backend's, and CORS allowed if the backend allows it
Henky!!•3w ago
Live demo of what I am aiming for: https://koboldai-koboldcpp-tiefighter.hf.space
KoboldAI Lite: a powerful tool for interacting with AI directly in your browser. Chat with AI assistants, roleplay, write stories and play interactive text adventure games.
Henky!!•3w ago
That demo instance shuts down if it's inactive for 1 hour
matt•3w ago
@flash-singh it's a bit unclear from the docs how the port and health-port configuration is supposed to work. Does it need to be set as env vars AND Docker-exposed ports? Also, it would be nice to have the load-balancing option in the REST API
flash-singh•3w ago
You don't have to define any ports: if you run your FastAPI server on port 80, and /ping on port 80 too, then it works as is. The reason the port and health port are separated is that in some cases, like vLLM, it's not easy to add another /ping endpoint to the current FastAPI app, so you may need to run a separate /ping FastAPI server in another thread; that's why you can define it separately. You don't need to expose any ports on the Docker or container side; we handle that automatically. Just run the FastAPI server on port 80, so 0.0.0.0:80. My current goal is to allow an override of CORS using an env variable as an option, e.g. RUNPOD_LB_CORS=*
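[Note: a minimal worker matching the setup flash-singh describes: one FastAPI app serving both the real route and /ping on port 80, so no PORT/PORT_HEALTH env vars or Docker EXPOSE lines are needed. The /generate route and its payload are placeholders:]

```python
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/ping")
def ping():
    # Health check the load balancer polls before routing traffic.
    return {"status": "healthy"}

@app.post("/generate")
def generate(payload: dict):
    # Placeholder for the actual inference route.
    return {"echo": payload}

if __name__ == "__main__":
    # Bind on all interfaces, port 80, as described above.
    uvicorn.run(app, host="0.0.0.0", port=80)
```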
Henky!!•3w ago
That kinda works, unless someone makes software where only some endpoints need CORS. What's blocking using the headers from the worker?
flash-singh•3w ago
What do you mean by this? RUNPOD_LB_CORS would be defined in your env variables when you make the serverless endpoint, and our load balancer will follow it; you don't need to make any change to your FastAPI app
Henky!!•3w ago
I'm asking why it would have to be defined at all. The API endpoints already have the correct headers
flash-singh•3w ago
So some users can block all CORS access if they want, or allow all
Henky!!•3w ago
Shouldn't the worker handle that?
flash-singh•3w ago
I see your point; the HEAD calls should be going to your worker
Henky!!•3w ago
Yeah, and that way the worker has control over which URL is CORS and which isn't, in case that ever matters
flash-singh•3w ago
Currently every request counts against scale; even a HEAD call would. We'd need to make all HEAD calls not scale anything
Henky!!•3w ago
Won't work if it then doesn't spin up at all
flash-singh•3w ago
Currently everything is authed, so there's no way to do a proper HEAD call anyway
Henky!!•3w ago
At least if the auth fails, CORS becomes irrelevant
flash-singh•3w ago
Yeah, true; CORS will block the actual call until the HEAD call passes. This is why it's best if the load balancer handles the CORS: it's very cheap and can return instantly rather than sending traffic. These aren't your normal FastAPI apps where a worker can spin up in under a second; due to the cold start of the model it can take upwards of ~20 seconds, and that spin-up can cost 10x more than a normal CPU worker
Henky!!•3w ago
Technically, if you want to go overkill, you can cache the CORS state of the worker. But I don't see the issue with passthrough; HF does passthrough
flash-singh•3w ago
Can do passthrough, that's not an issue; but it will wait until the worker spins up, which can take a while
Henky!!•3w ago
That happens on the request either way, if it is to succeed
flash-singh•3w ago
Yup, but we will need to allow no-auth first; otherwise CORS won't work with auth regardless
Henky!!•3w ago
Fair, and if CORS passthrough depends on the no-auth URLs, that makes sense. Your env var for the authed version then also makes sense
flash-singh•3w ago
We actually don't block CORS, unless you're seeing that; not explicitly at least. Whatever your server does should pass through; the problem is that the auth is blocking it. We'll see what happens once we allow no auth
Henky!!•3w ago
I saw our app choose to use its CORS proxy, and when we bypass CORS limits it doesn't, hence that led me to believe it's like that
flash-singh•3w ago
I don't think you can control headers for browser CORS calls, so the browser will initiate a HEAD request without auth and it will fail
