Runpod•2y ago
Hermann

RUNPOD_API_KEY and MAX_CONTEXT_LEN_TO_CAPTURE

We are also starting a vLLM project and I have two questions: 1) Among the environment variables, do I have to define RUNPOD_API_KEY with my own secret key to access the final vLLM OpenAI endpoint? 2) Isn't MAX_CONTEXT_LEN_TO_CAPTURE deprecated by now? Do we still need to provide it if MAX_MODEL_LEN is already set? Thank you
14 Replies
Hermann
HermannOP•2y ago
After some trial and error, I figured out the answer to 1): the RUNPOD_API_KEY has no effect. We need to use the actual API key that can be generated under Account -> Settings to access the OpenAI URL.
I'm still not quite certain how to set the model length. I'm getting this error right now:
ValueError: User-specified max_model_len (16384) is greater than the derived max_model_len (max_position_embeddings=8192 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value
Llama-3 supports 8192 tokens; however, I was expecting it to use RoPE scaling to increase that automatically. Is this not how it's done? RoPE scaling is supported in vLLM: https://github.com/vllm-project/vllm/pull/555
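For context, vLLM's engine arguments do expose a rope_scaling override along the lines of the PR above. A minimal sketch, assuming a vLLM build that accepts a rope_scaling dict (the scaling type and factor are illustrative, not a confirmed recipe for the RunPod worker):

```python
# Sketch only: the exact rope_scaling schema varies across vLLM versions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=16384,  # target context: 2x Llama-3's native 8192
    rope_scaling={"type": "dynamic", "factor": 2.0},  # dynamic NTK-style RoPE scaling
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```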
Unknown User
Unknown User•2y ago
Message Not Public
Hermann
HermannOP•2y ago
Yeah, that is easily done with the Aphrodite engine: it increases the model length by using more memory. vLLM is quite limited here. But based on that PR it must be possible, just not as easy, I guess.
digigoblin
digigoblin•2y ago
You are right @nerdylive , but it's called MAX_MODEL_LEN. I don't see how it's possible to set max_model_len to a value that's higher than what's supported by the model; that doesn't make sense to me @houmie. @Alpay Ariyak is the best person to advise on this.
Unknown User
Unknown User•2y ago
Message Not Public
Hermann
HermannOP•2y ago
In the Aphrodite engine I can set CONTEXT_LENGTH to 16384 and it automatically uses RoPE scaling; in return, it requires more memory. See bullet point 3 (https://github.com/PygmalionAI/aphrodite-engine?tab=readme-ov-file#notes). I'm using that in production right now. It really is possible 🙂
Guys, I really hope you can help me with bullet point 1 about API keys. Is there a way I could define the API key for vLLM myself instead of having RunPod create it for me? This last one is quite urgent due to a migration request.
Unknown User
Unknown User•2y ago
Message Not Public
Hermann
HermannOP•2y ago
Of course, happy to help.
Unknown User
Unknown User•2y ago
Message Not Public
Hermann
HermannOP•2y ago
Thank you. And sorry, do you know anything by any chance about the API key issue? I hope there is a way.
digigoblin
digigoblin•2y ago
What is the API key issue? You have to generate an API key in the RunPod web console and use it to make requests. You can't use a custom API key; you have to use a RunPod one for RunPod serverless to function correctly.
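For reference, a minimal sketch of calling the worker's OpenAI-compatible route with a RunPod-generated key, following the pattern in the worker-vllm README (endpoint ID and model name are placeholders):

```python
# Placeholders: substitute your own endpoint ID, API key, and served model.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",  # generated under Account -> Settings
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```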
digigoblin
digigoblin•2y ago
This is also pretty clear in the docs: https://github.com/runpod-workers/worker-vllm
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
Hermann
HermannOP•2y ago
I see. Ok, so there is no way to set a custom key. Thanks
digigoblin
digigoblin•2y ago
Nope, not possible. Create your own backend as a proxy in front of serverless if you want to use custom API keys.
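A minimal sketch of such a proxy, assuming FastAPI and httpx (the route, environment variable names, and key-checking scheme are hypothetical; only the chat-completions route is shown):

```python
# Hypothetical proxy: clients authenticate with your own key, and the proxy
# forwards the request to RunPod using the real RunPod key.
import os

import httpx
from fastapi import FastAPI, Header, HTTPException, Request

RUNPOD_ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]  # your serverless endpoint ID
RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]          # key from the RunPod console
CUSTOM_API_KEY = os.environ["CUSTOM_API_KEY"]          # the key you hand to clients

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(request: Request, authorization: str = Header(...)):
    # Validate the client's custom key before touching RunPod.
    if authorization != f"Bearer {CUSTOM_API_KEY}":
        raise HTTPException(status_code=401, detail="Invalid API key")
    body = await request.json()
    async with httpx.AsyncClient(timeout=300.0) as client:
        resp = await client.post(
            f"https://api.runpod.ai/v2/{RUNPOD_ENDPOINT_ID}/openai/v1/chat/completions",
            headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
            json=body,
        )
    return resp.json()
```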
