RunPod · 4mo ago
jd24

How does the vLLM serverless worker support the OpenAI API contract?

I wonder how a serverless worker can implement a custom API contract, given that the request must be a POST and the payload is forced to be JSON with a mandatory "input" field. I understand that the vLLM worker (https://github.com/runpod-workers/worker-vllm) solved this and implements the OpenAI API endpoints, but I don't get how it bypassed these limitations.
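For reference, the default serverless contract I'm referring to looks roughly like this (sketch only; the endpoint ID, API key, and input fields are placeholders):

```python
import requests

# The standard RunPod serverless contract: a POST with a JSON body
# that must contain an "input" field.
resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={"input": {"prompt": "Hello!"}},
)
print(resp.json())
```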
8 Replies
digigoblin · 4mo ago
It bypassed it because it is built by RunPod, so they can add custom endpoints if they need to, but we as users cannot.
Alpay Ariyak · 4mo ago
It's actually possible for everyone, but really hacky. What are you looking to implement it for?
digigoblin · 4mo ago
Would it be possible to create a doc or blog post on this?
jd24 · 4mo ago
What I want to achieve is a RunPod worker (similar to the vLLM one) but for Ollama (with streaming support), since that tool allows loading quantized models that can't otherwise fit into the available GPU VRAM. For example, I couldn't load Mixtral 8x7B on serverless because via vLLM it takes too much VRAM (it works only with fp16 params).
@Alpay Ariyak could you point us to a repo where the hacky solution was implemented? I'm digging through the vLLM worker repo as well as the runpod-python library, but I don't see where the hacky magic happens.
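For context, a quick weights-only estimate of the fp16 case (assuming Mixtral 8x7B's roughly 46.7B total parameters; KV cache and activations add more on top):

```python
total_params = 46.7e9    # Mixtral 8x7B total parameter count (approximate)
bytes_per_param = 2      # fp16 = 2 bytes per parameter
weights_gb = total_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB for the weights alone")  # ~93 GB
```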
Alpay Ariyak · 4mo ago
What happens is: when you hit https://api.runpod.ai/v2/<ENDPOINT ID>/openai/abc, the handler receives two new key-value pairs in the job's input (job["input"]):
- "openai_route": everything in the URL after /openai, so in the example its value would be /abc. You use this to tell the handler to run the logic for /v1/chat/completions, /v1/models, etc.
- "openai_input": the OpenAI request as a dictionary, with messages, etc.

If you don't have stream: true in your OpenAI request, you just return the OpenAI completions/chat-completions/etc. object as a dict in the output (returned here: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L160).

If you have stream: true, then this becomes an SSE stream, for which you yield your output; but instead of yielding the dict directly, you wrap it in an SSE chunk string, which looks like f"data: {your JSON output as a string}\n\n" (stream code: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L161).

Most of the code is in this class in general: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L109

Will work on documentation soon.

vLLM has support for a few quantizations (awq, squeezellm, gptq). Do the available options work for you?
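Roughly, a handler wired up this way could look like the following sketch (illustrative only, not the worker-vllm source; fake_chat_completion is a placeholder for your real inference call):

```python
import json
import runpod


def fake_chat_completion(request):
    # Placeholder for your actual inference call (Ollama, vLLM, etc.).
    return {
        "id": "chatcmpl-demo",
        "object": "chat.completion",
        "model": request.get("model", "my-model"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "Hello from the worker"},
            "finish_reason": "stop",
        }],
    }


async def handler(job):
    job_input = job["input"]
    route = job_input.get("openai_route")            # e.g. "/v1/chat/completions" or "/v1/models"
    oai_request = job_input.get("openai_input") or {}

    if route == "/v1/models":
        yield {"object": "list", "data": [{"id": "my-model", "object": "model"}]}

    elif route == "/v1/chat/completions":
        if oai_request.get("stream"):
            # Streaming: the route is exposed as SSE, so yield chunk strings
            # in "data: <json>\n\n" format instead of raw dicts.
            chunk = fake_chat_completion(oai_request)
            yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"
        else:
            # Non-streaming: return the ChatCompletion object as a plain dict.
            yield fake_chat_completion(oai_request)

    else:
        yield {"error": f"Unsupported route: {route}"}


runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,  # aggregate streamed output for non-streaming result fetches
})
```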
jd24 · 4mo ago
Many thanks for the explanation; with this info I believe it would be possible to develop a custom worker with an OpenAI contract. According to my math, Mixtral 8x7B loaded with any of the available options that vLLM supports requires 90+ GB of VRAM, which exceeds the max VRAM available on the serverless platform (even using 2 GPUs). Also, be aware that loading quantized models (for example q4) allows us to depend on smaller, cheaper, and more available hardware.
A · 4mo ago
I'm having issues getting this to work. It says: openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: WCFGLZ0M**L96G. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}. Any help would be greatly appreciated. Thank you.
Alpay Ariyak · 4mo ago
You did not set the base URL, so the client is still hitting api.openai.com with your RunPod key.
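For reference, a sketch of pointing the OpenAI client at a RunPod endpoint instead of api.openai.com (the /openai/v1 base URL and RunPod API key here are assumptions based on the route described above; substitute your own endpoint ID, key, and model name):

```python
from openai import OpenAI

client = OpenAI(
    # Send requests to the RunPod endpoint's OpenAI-compatible route,
    # authenticating with your RunPod API key (not an OpenAI key).
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```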