Serverless Load-balancing
Good morning,
I recently came across https://docs.runpod.io/serverless/load-balancing/overview and followed the instructions. Yet, when I attempted to make an external HTTP request using n8n, it simply did not work. I've attached my worker logs below. Please let me know if I've done something wrong, or if it's a possible issue with the documentation.
Note: I used the following Container Image: runpod/vllm-loadbalancer:dev
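For reference, this is roughly the kind of request I was trying to send (shown here as a Python sketch rather than the n8n HTTP node; the endpoint ID, route, and payload are placeholders, and the URL format is an assumption based on the load-balancing docs):

import os
import requests

# Placeholder values; swap in the real endpoint ID and a RunPod API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

# Load-balancing endpoints are called directly over HTTP, so the request
# goes to whatever route the worker's FastAPI app exposes (e.g. /generate).
resp = requests.post(
    f"https://{ENDPOINT_ID}.api.runpod.ai/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "Hello"},
    timeout=300,
)
print(resp.status_code, resp.text)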
This is not live yet to my knowledge
It's available on the website though? Within the Manage API interface.
Where?
Been asking for access
Can you post a screenshot?


How do you get there?
Select your serverless instance > Manage > It's at the top
I don't have one, but let me see what happens if I try
Sounds good, let me know!
Oh, now I got hit by the update
I doubt my software is compatible yet, but it means I can work on that this week
Haha that's good. I can't figure out how it works; I've followed the documentation yet it does not work lol.
I'm also a little stuck. How do I get container logs?
I get HTTP error 401 but I can't see any logs other than "worker is ready"
Yes, but we don't have that implemented yet
So it won't route to them if that errors?
Runpod is very arbitrary with the ping thing
So right now the build I can use for testing will 404 on the URL
If that means it won't route to it, mine won't work yet
I was hoping it would briefly work
I don't have a publicly downloadable dev build at 3am haha
But will postpone it for now and try again when I have time
I was eager to try it
But not so eager I am gonna upload a test binary xD
The lack of a log is odd though
I do have something, not stable for the public, that does have that endpoint
We've implemented it, but the regular dev builds are zipped and behind an account lock, so I can't use them from my phone
In my case the model downloads during load; it's not baked in. Is container storage persistent on serverless, or will this likely require a network volume?
What if it's idle?
Idle clears out?
Because I never understood flash boot
Flash boot sounds like black magic to me
Like, it's usually a 2-minute model download; if that happens often, network storage makes sense. If it's usually cached with flash boot, I'm fine not adding any
In my experience we download faster from HF than we load from network storage
Or at least much faster than writing to network storage xD
The saving is waaaay slower
IO in general seems to be
I doubt we hit those 400 MB/s
Or is it only the writing that's slow?
On saving yes
But on load?
Were you able to figure it out @Henky!!?
No, it looks like we need that /ping endpoint, which is impossible at 3am
I am gonna sleep
Of course haha, it's currently 2am for me; I'll keep working till 5/6am. I'm currently forking worker-sglang.
I do think my workers may be running, but since it doesn't get a 200 it's not sending jobs to them
Keep in mind those are likely for the classic worker type; this load balancer is brand new, so nobody has gotten their hands on it yet
Almost certainly. I was able to send an HTTP request, and the workers noticed it and changed to running, but it didn't get past that.
I don't even get that far
No log, not running, nothing happens
Completely dead URL
Correct, it downloaded the image, but then just died lol.

It's as if the Docker image provided is not set up to work fully. Docker Image: runpod/vllm-loadbalancer:dev
For /ping right?
It output nothing for either /ping or /generate.
The HTTP request kinda just died. I'll redeploy it now to hopefully give better insight.
What if, in my case, there's no response at all but then once loaded we send 200? I assume that's fine
Because our webserver begins working after the model loads
What port did you use?
5001, but I am not using vLLM
I am trying koboldcpp
Yes, it's 5001 for me on both, but for ping it won't return a valid response as we don't have that yet
If I recall from looking into the docs, the port is 5000, no?
if __name__ == "__main__":
    import uvicorn
    port = int(os.getenv("PORT", "5000"))
    uvicorn.run(app, host="0.0.0.0", port=port)
In my own image it's 5001
Oh right.
I'm trying to get a full UI in here haha
I think it's hard coded?
Possibly there is a var for it though.


Port 80.
I wonder what port 5000 is; I'll check the code now haha
Thing is, why am I not getting any logs about the page load?
Ah, port 5000 is the FastAPI port. It listens for incoming requests.
Shouldn't it at least show that I tried but failed?
Did you change the Worker Docker Image?
Normally that's a default command in the Docker image; the pods just run it
So I assume this is the same and the default start command would be triggered
Anyway, I give up for tonight. When I have time and actually have access to my dev tools I can do a proper attempt
Night dude. I'll keep trying; if I figure it out I'll post in here
I revisited the documentation and spotted something important (see attached image).
I've successfully forked the preconfigured repository and initiated a build on RunPod using the fork. The worker is currently deploying. I'll share updates once it's live.
Sources:
https://docs.runpod.io/serverless/load-balancing/vllm-worker
https://github.com/runpod-workers/vllm-loadbalancer-ep

Even after using the proper template I'm still getting the following error when sending an HTTP request:
INFO 08-06 03:04:04 [__init__.py:244] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/src/handler.py", line 13, in <module>
    from utils import format_chat_prompt, create_error_response
  File "/src/utils.py", line 3, in <module>
    from .models import ChatMessage, ErrorResponse
ImportError: attempted relative import with no known parent package
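That error means utils.py is being loaded as a top-level module (handler.py runs as a plain script), so the relative import in utils.py has no parent package. A likely fix, sketched here as an assumption rather than the exact edits made to the fork later, is to switch to an absolute import:

# /src/utils.py: since /src is on sys.path when handler.py runs as a script,
# the sibling module can be imported absolutely instead of relatively.
from models import ChatMessage, ErrorResponse  # instead of: from .models import ...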
GitHub: Daniel-Farmer/vllm-loadbalancer-ep
I've made two small edits to the fork, and it's launching now
Great, it works!

@Dj I can't get it to route at all
It just remains a 401 and the workers don't even start
@Jason if it's alternating between initializing and throttled, is that related or does that happen anyway?
I have all regions
I have 3 + 2
It just doesn't route at all
And there won't be a /ping endpoint when it doesn't even start the software
The version I hooked up has that, though it would happen a minute or so in
I'd rather just DM someone one on one
Since it's so new and I am also building for it, that's way easier
But I have an idea
I added 1 active worker
That one I now see boot up for the first time
Despite it running, the load balancer 401s
Oh no
I think I found what it is, they have a very big design quirk
But luckily its a design quirk that should be fixable
Yup my suspicion seems correct
That breaks a lot
They require the api key auth
I'll submit a ticket I guess xD
It's kinda doable, but it breaks a lot of core functionality in ways we can't fix
Even worse, it requires a writable API key
To make the load balancer route it
Yes
I get why they did that, because it would cost money to serve an HTTP request
But it's a private URL and we have our own auth, so I need a toggle that turns that off
It's load balanced, so if you hit it with a basic HTTP request it has to spin up a worker to reply
So if a random spambot hits it, they would spin up a worker
To prevent that, they auth-gated it
But that destroys my use case
Or at least severely cripples it
Because the whole idea is that users can have their own secret URL endpoint: bookmark it, and whenever they want to use KoboldCpp they visit the link, the instance seamlessly shows up, and it has the UI
Browsers can't do bearer auth like that
Told them in advance that KoboldCpp is such a complex use case that if it had issues I'd find them quick haha
My designer hat came up with a solution that should be super nice: password URLs
A unique URL you can generate that acts as an auth bearer bypass
If one gets compromised you invalidate it
Issue 2:

It has CORS restrictions on the RunPod side
Just got it confirmed: CORS from the origin server is not respected
@Jason do you know where I can configure the network mount location?
Meh
I'll skip that for now.
Ticket submitted with all my findings
I'm always the one to find design limitations haha
Thanks for figuring it out, let me pull the ticket for myself ❤️
#21538

I released a KoboldCpp Docker update that can detect RunPod serverless and, if present, dynamically switches from /workspace to that
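Roughly this kind of detection (a sketch under assumptions: the env variable and mount path are what RunPod serverless workers typically expose, not necessarily what the KoboldCpp image actually checks):

import os

# Assumption: serverless workers expose RUNPOD_ENDPOINT_ID in the environment,
# and network volumes are mounted at /runpod-volume instead of /workspace.
if os.environ.get("RUNPOD_ENDPOINT_ID"):
    data_dir = "/runpod-volume"
else:
    data_dir = "/workspace"
print(f"Using data directory: {data_dir}")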
I've been testing this new load-balancing option and it works for the example project but not for my own Docker image. When I ping the endpoint I see one worker running, but the logs do not show anything.
Any recommendations on a base image? I was using an NVIDIA CUDA 12.9 base, suspecting it's too "new"
I was right; using a CUDA 12.1 Ubuntu 22.04 base image solved the issue
I wanted to use the default vLLM image since it has a /ping endpoint as well. I managed to make it work after some struggle, but one quirk I found is that if you set 1 active worker and 1 max worker and send a request afterwards, RunPod doesn't route the request to the worker and times out the request. It only works if there is capacity to create new workers.
Sharing the configuration that worked for me

Thanks for the feedback here, some things suggested in here make sense, I'll see what I can turn around:
- allow CORS to be set
- allow some type of signed URLs that can be expired but have no auth
This is still an early release; we plan to make it robust as we get more use cases and see what makes sense for users to deploy with serverless. Auth is always a first no-brainer, but I do see the point around using serverless to publicly share among your team or just loading ComfyUI, etc.
For CORS it's best if it follows whatever the backend is doing; that way you always get it right
Not sure how implementable that kind of automatic behavior is combined with the bearer auth though, but in theory it only needs a CORS allowance when the request goes through
And while I didn't mention it here, as you can probably imagine the value of these requests ties in to the RunPod Hub. I want to be able to offer this as an on-demand app through the Hub where the entire UX runs on RunPod.
Basically what we do already with the pod, but for those preferring it serverless
You can bake auth in but CORS is more difficult? CORS is easy to implement if that's the case
On my end I'd just like it to mirror the webserver 1:1, hence no auth other than the backend's own and CORS allowed if the backend allows it
Live demo of what I am aiming for: https://koboldai-koboldcpp-tiefighter.hf.space
KoboldAI Lite: a powerful tool for interacting with AI directly in your browser. Chat with AI assistants, roleplay, write stories and play interactive text adventure games.
That demo instance shuts down if it's inactive for 1 hour
@flash-singh it's a bit unclear from the docs how the port and health port configuration is supposed to work. Does it need to be set as env vars AND Docker exposed ports? Also, it would be nice to have the load-balancing option on the REST API
You don't have to define any ports; if you run your FastAPI server on port 80, with /ping on port 80 too, then it works as is
The reason port and health port are separated: in some instances like vLLM it's not easy to add another /ping endpoint to the existing FastAPI, so you may need to run another thread with a separate /ping FastAPI; that's why you can define it separately
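A rough sketch of that separate-health-server pattern (the names and the health port value are illustrative, not an official example):

import threading

import uvicorn
from fastapi import FastAPI

health_app = FastAPI()

@health_app.get("/ping")
def ping():
    # Return 200 once the worker is ready to receive traffic.
    return {"status": "healthy"}

def run_health_server():
    # Separate health-check server on its own port (illustrative value);
    # the same value would go in the endpoint's health port setting.
    uvicorn.run(health_app, host="0.0.0.0", port=8080)

threading.Thread(target=run_health_server, daemon=True).start()
# ...then start the main inference server (e.g. vLLM's OpenAI-compatible
# server) on the primary port in the main thread as usual.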
You don't need to expose any ports on the Docker or container side; we handle that automatically. Just run the FastAPI server on port 80
so 0.0.0.0:80
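So in the simplest case, something like this (a minimal sketch; the /generate route and payload shape are placeholders, not part of the official template):

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/ping")
def ping():
    # Health check the load balancer uses to decide whether to route traffic here.
    return {"status": "healthy"}

@app.post("/generate")
def generate(payload: dict):
    # Placeholder inference route; replace with the real handler.
    return {"output": payload}

if __name__ == "__main__":
    # Serve everything on port 80 so no extra port configuration is needed.
    uvicorn.run(app, host="0.0.0.0", port=80)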
My current goal is to allow an override of CORS using an env variable as an option, e.g. RUNPOD_LB_CORS=*
That kinda works, unless someone makes software where only some endpoints need CORS
What's blocking using the headers from the worker?
what do you mean by this?
RUNPOD_LB_CORS would be defined in your env variables when you make the serverless endpoint; our load balancer will follow it, and you don't need to make any change to your FastAPI
I'm asking why it would have to be defined?
The API endpoints already have the correct headers
So some users can block all CORS access if they want, or allow all
Shouldn't the worker handle that?
I see your point, the HEAD calls should be going to your worker
Yeah, and that way the worker has control over which URL is CORS-enabled and which isn't, in case that ever matters
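For reference, the worker-side version of that would just be FastAPI's standard CORS middleware (a sketch; whether the load balancer passes these headers through is exactly what's being discussed here):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# The worker itself decides which origins are allowed; no load-balancer-side
# setting would be needed if these headers were passed through unchanged.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],   # or a specific list of allowed origins
    allow_methods=["*"],
    allow_headers=["*"],
)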
Currently every request counts against scale; even a HEAD call would. We'd need to make all HEAD calls not scale anything
Won't work if it then doesn't spin up at all
Currently everything is authed, so there's no way to do a proper HEAD call anyway
At least if the auth fails, CORS becomes irrelevant
Yeah true, CORS will block the actual call until the HEAD call passes
This is why it's best if the load balancer handles the CORS: it's very cheap and can return instantly rather than sending traffic. These aren't your normal FastAPI setups where a worker can spin up a FastAPI in under a second; due to the cold start of the model it can take upwards of ~20 seconds
And that spin-up can cost 10x more than a normal CPU worker
Technically, if you want to go overkill, you can cache the CORS state of the worker
But I don't see the issue with passthrough; HF does passthrough
We can do passthrough, that's not an issue, but it will wait until the worker spins up, which can take a while
That happens on the request either way if it is going to succeed
Yup, but we will need to first allow no auth, otherwise CORS won't work regardless with auth
Fair, and if CORS passthrough depends on the no-auth URLs, that makes sense
Your var for the auth version then also makes sense
We actually don't block CORS, unless you're seeing that; not explicitly at least. Whatever your server does should pass through; the problem is that auth is blocking it
We'll see what happens once we allow no auth
I see our app choose to use its CORS proxy, and if we bypass CORS limits it doesn't, hence that led me to believe it's like that
I don't think you can control headers for browser CORS calls, so the browser will initiate a HEAD request without auth and it will fail