RunPod•11mo ago

RUNPOD_API_KEY and MAX_CONTEXT_LEN_TO_CAPTURE

We are also starting a vLLM project and I have two questions: 1) In the environment variables, do I have to define the RUNPOD_API_KEY with my own secret key to access the final vLLM OpenAI endpoint? 2) Isn't MAX_CONTEXT_LEN_TO_CAPTURE now deprecated? Do we still need to provide it, if MAX_MODEL_LEN is already set? Thank you

14 Replies

houmieOP•11mo ago

After some try and error, I figured out the solution to 1) that the RUNPOD_API_KEY has no effect. We need to use the actual API KEY that can be generated under accounts -> Settings to access the OpenAI Url.
I'm still not quite certain how to set the model length. I'm getting this error right now:

ValueError: User-specified max_model_len (16384) is greater than the derived max_model_len (max_position_embeddings=8192 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value

ValueError: User-specified max_model_len (16384) is greater than the derived max_model_len (max_position_embeddings=8192 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value

Llama-3 supports 8192 tokens, however I was expecting that it would use RoPE to automatically increase it. Is this not how it's done? RoPE scaling is supported in vLLM: https://github.com/vllm-project/vllm/pull/555

Jason•11mo ago

yes ValueError: User-specified max_model_len (16384) set it on your env max model len to 8192 Oh not sure of how it works

houmieOP•11mo ago

Yeah that is easily done with Aphrodite-engine to increase the model length (by using more memory). vLLM is quite limited. But based on that PR it must be possible, just not so easy I guess.

digigoblin•11mo ago

You are right @nerdylive , but its called MAX_MODEL_LEN I don't see how its possible to set the max_model_len to a value thats higher than whats supported by the model, that doesn't make sense to me @houmie @Alpay Ariyak is the best person to advise on this.

Jason•11mo ago

Ill try to add the support for RoPE

houmieOP•11mo ago

In Aphrodite-engine I can set CONTEXT_LENGTH to 16384 and it automatically uses RoPE scaling, in return it requires more memory. See bullet point 3 (https://github.com/PygmalionAI/aphrodite-engine?tab=readme-ov-file#notes) I'm using that right now on production. It is really possible 🙂 Guys I really hope you can help me with bullet point 1 about API-KEYS. Is there a way I could define the API-KEY for vLLM myself instead of having RunPod creating it for me? This last one is quite urgent due a migration request.

Jason•11mo ago

ill try to apply that on vllm worker too Will you try the image to test if it works

houmieOP•11mo ago

Of course, happy to help.

Jason•11mo ago

Alright wait

houmieOP•11mo ago

Thank you. And sorry do you know by any chance about the API-KEY issue? I hope there is a way.

digigoblin•11mo ago

What is the API key issue? You have to generate an API key in the RunPod web console and use it to make requestes, you can't use a custom API key, you have to use a RunPod one for RunPod serverless to function correctly.

digigoblin•11mo ago

This is also pretty clear in the docs: https://github.com/runpod-workers/worker-vllm

GitHub

GitHub - runpod-workers/worker-vllm: The RunPod worker template for...

The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm

houmieOP•11mo ago

I see. Ok, so there is no way to set a custom key. Thanks

digigoblin•11mo ago

Nope, not possible, create your own backend as a proxy to serverless if you want to use custom API keys

Gaming

Programming

RUNPOD_API_KEY and MAX_CONTEXT_LEN_TO_CAPTURE

Did you find this page helpful?