Serverless deepseek-ai/DeepSeek-R1 setup?
How can I configure a serverless endpoint for deepseek-ai/DeepSeek-R1?
72 Replies
Unknown User•8mo ago
Message Not Public
Basic config, 2 GPU count


Once it is running, I try the default hello world request and it just gets stuck IN_QUEUE for 8 minutes..
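For anyone reproducing this, the console's default test should be equivalent to roughly this (a sketch; the endpoint ID and API key are placeholders, using the standard /runsync route):

import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"       # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"        # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello World"}},
    timeout=120,
)
print(resp.json())  # just sits at {"status": "IN_QUEUE", ...} instead of completing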
Unknown User•8mo ago
Message Not Public
Yes, but I even tried just following along with the YouTube tutorial here and got the same IN_QUEUE problem: https://youtu.be/0XXKK82LwWk?si=ZDCu_YV39Eb5Fn8A
RunPod
YouTube
Set Up A Serverless LLM Endpoint Using vLLM In Six Minutes on RunPod
Guide to setting up a serverless endpoint on RunPod in six minutes on RunPod.
Unknown User•8mo ago
Message Not Public
Oh, wait!! I just ran the 1.5B model and got this response:

When I tried running the larger model, I got errors about not enough memory:
"Uncaught exception | <class 'torch.OutOfMemoryError'>; CUDA out of memory. Tried to allocate 3.50 GiB. GPU 0 has a total capacity of 44.45 GiB of which 1.42 GiB is free"
Unknown User•8mo ago
Message Not Public
So how do I configure it?
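From what I understand, the knobs live in the endpoint template's environment variables; a sketch with names as I read them in the runpod-workers/worker-vllm README (worth double-checking against the repo):

# Environment variables on the endpoint template (names per my reading of the
# runpod-workers/worker-vllm README; verify against the current repo).
env = {
    "MODEL_NAME": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # or a smaller distill
    "MAX_MODEL_LEN": "11000",           # cap context length to shrink the KV cache
    "GPU_MEMORY_UTILIZATION": "0.95",   # fraction of VRAM vLLM is allowed to use
    "TENSOR_PARALLEL_SIZE": "2",        # shard the model across both GPUs in the worker
    "QUANTIZATION": "awq",              # only if the checkpoint is actually AWQ-quantized
}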
Unknown User•8mo ago
Message Not Public
wow
So it's not really an option to deploy?
Unknown User•8mo ago
Message Not Public
I mean, DeepSeek offers their own API keys.
I thought it could be more cost-effective to just run a serverless endpoint here, but...
Unknown User•8mo ago
Message Not Public
hmm.. I see
Thanks for your help
Unknown User•8mo ago
Message Not Public
Hey @nerdylive, I can still deploy the 7B DeepSeek R1 distill model instead of the huge model, right?

I am facing this issue.
I am not that good at resolving issues.
Did you find a solution?
Not yet...
Unknown User•8mo ago
Message Not Public
Where should I put this?
In the environment variables?
Unknown User•8mo ago
Message Not Public
Is the model you are trying to run a GGUF quant? You'll need a custom script for GGUF quants, or if there are multiple models in a single repo.
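Something along these lines should work for a GGUF quant, I think: download the single .gguf file you want and point vLLM at the local path (a sketch; the repo and file names below are placeholders, and vLLM's GGUF support is still experimental):

# Sketch for a GGUF quant: fetch one specific .gguf file from the repo and load it.
# The repo ID and file name are placeholders - adjust them to your model.
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

gguf_path = hf_hub_download(
    repo_id="someuser/DeepSeek-R1-Distill-Qwen-32B-GGUF",   # placeholder repo
    filename="deepseek-r1-distill-qwen-32b-q4_k_m.gguf",    # pick exactly one quant file
)

# The tokenizer usually has to come from the original base model when loading GGUF.
llm = LLM(model=gguf_path, tokenizer="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)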
I don't understand. This morning I did a brief test with https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B on a 24 GB VRAM GPU, but now I get a CUDA out-of-memory error. Do you guys know how I can fix this issue?

Try a 48 GB GPU, see if that helps.
Hello there, I increased the max tokens setting but I'm still only getting the beginning of the thinking. How can I fix that?

yep fixed thanks
Unknown User•7mo ago
Message Not Public
Thanks! Will let you know if it works.
Yep, increased to 3000 but still getting a short "thinking" answer 😦
Unknown User•7mo ago
Message Not Public
Basically I used the model casperhansen/deepseek-r1-distill-qwen-32b-awq with vLLM and RunPod serverless, except I lowered the model max length to 11000. I didn't modify any other settings.
My input looks like this now:
{
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant."
      },
      {
        "role": "user",
        "content": "Explain LLM models"
      }
    ],
    "max_tokens": 3000,
    "temperature": 0.7,
    "top_p": 0.95,
    "n": 1,
    "stream": false,
    "stop": [],
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "logit_bias": {},
    "user": "utilisateur_123",
    "best_of": 1,
    "echo": false
  }
}
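Re-reading the vLLM worker's README, I suspect the sampling options are supposed to be nested under a sampling_params key rather than sitting at the top level of input, so something more like this sketch (untested; endpoint ID and key are placeholders):

# Sketch of the request shape I believe the vLLM worker expects, with sampling
# options nested under "sampling_params" (please verify against the worker README).
import requests

payload = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are an AI assistant."},
            {"role": "user", "content": "Explain LLM models"},
        ],
        "sampling_params": {
            "max_tokens": 3000,
            "temperature": 0.7,
            "top_p": 0.95,
        },
    }
}

resp = requests.post(
    "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync",      # placeholder endpoint ID
    headers={"Authorization": "Bearer YOUR_RUNPOD_API_KEY"},   # placeholder API key
    json=payload,
    timeout=300,
)
print(resp.json())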
Unknown User•7mo ago
Message Not Public
Ah ok, do you have an example of a correct input for this model?
Unknown User•7mo ago
Message Not Public
Hm, I'm not very familiar with the OpenAI SDK. Is it something to configure during the creation of the serverless endpoint (with vLLM)?
Unknown User•7mo ago
Message Not Public
Nice, thank you for this info.
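So if I understand correctly, something like this should work against the endpoint's OpenAI-compatible route, with no extra setup needed at endpoint creation time (a sketch; the endpoint ID and key are placeholders):

# Sketch using the OpenAI SDK against what I believe is the endpoint's
# OpenAI-compatible route (/openai/v1). Placeholders for the endpoint ID and key.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
    api_key="YOUR_RUNPOD_API_KEY",
)

resp = client.chat.completions.create(
    model="casperhansen/deepseek-r1-distill-qwen-32b-awq",  # must match the served MODEL_NAME
    messages=[{"role": "user", "content": "Explain LLM models"}],
    max_tokens=3000,
    temperature=0.7,
)
print(resp.choices[0].message.content)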
Unknown User•7mo ago
Message Not Public
Yep, I basically created a template from https://github.com/runpod-workers/worker-vllm, then modified the model etc. via env variables, and also modified a few lines of code to be able to call the OpenAI API.
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
Can you check the Cloudflare proxy (not in serverless) for vLLM OpenAI-compatible servers? Batched requests keep getting aborted, but only on proxied connections (not on direct connections using TCP forwarding).
Related GitHub issue: https://github.com/vllm-project/vllm/issues/2484
When the problem happens, the logs look something like this:
GitHub
Aborted request without reason · Issue #2484 · vllm-project/vllm
Hi, i am trying to load test vllm on a single gpu with 20 concurrent request. Each request would pass through the llm engine twice. Once to change the prompt, the other to generate the output. Howe...
Unknown User•7mo ago
Message Not Public
1. It doesn't abort on streaming requests.
2. About 16K tokens?
3. It's LangChain's vLLM OpenAI-compatible API SDK (it just sends <batch size> requests to the API endpoint at the same time).
Also, that SDK in LangChain doesn't support streaming requests in batch mode.
Unknown User•7mo ago
Message Not Public
Can I do it tomorrow?
Unknown User•7mo ago
Message Not Public
Yeah, I think so too.
Unknown User•7mo ago
Message Not Public
On that GitHub issue,
Unknown User•7mo ago
Message Not Public
people have problems with nginx or some kind of proxy in front of the server.
Unfortunately I removed the endpoint & pod that had the issue.
Unknown User•7mo ago
Message Not Public
Thanks for the info!
Unknown User•7mo ago
Message Not Public
It was a Cloudflare problem that's covered on the blog here.
https://blog.runpod.io/when-to-use-or-not-use-the-proxy-on-runpod/
Btw, does serverless use Cloudflare proxies too?
RunPod Blog
When to Use (Or Not Use) RunPod's Proxy
RunPod uses a proxy system to ensure that you have easy accessibility to your pods without needing to make any configuration changes. This proxy utilizes Cloudflare for ease of both implementation and access, which comes with several benefits and drawbacks.
If so, how do I run long-running requests on serverless without streaming?
Unknown User•7mo ago
Message Not Public
You can stream on serverless without worrying about request times; look into the streaming section. Also, the serverless max timeout is 5 minutes, and the proxy's is about 90s.
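Roughly this pattern, if I'm reading the streaming docs right: submit with /run, then poll /stream/{job_id} for partial output (a sketch; placeholders for the endpoint ID and key, and the exact response shape may differ):

import requests, time

BASE = "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID"         # placeholder endpoint ID
HEADERS = {"Authorization": "Bearer YOUR_RUNPOD_API_KEY"}  # placeholder API key

# Submit asynchronously; "stream": True asks the vLLM worker to yield partial output
# (my understanding - check the worker's README for the exact flag).
job = requests.post(
    f"{BASE}/run",
    headers=HEADERS,
    json={"input": {"prompt": "Explain LLM models", "stream": True}},
).json()
job_id = job["id"]

# Poll the stream route until the job finishes.
while True:
    chunk = requests.get(f"{BASE}/stream/{job_id}", headers=HEADERS).json()
    for item in chunk.get("stream", []):
        print(item.get("output"), end="", flush=True)
    if chunk.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)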
Is there any difference between using the quick deploy > vLLM option or using the pre-built Docker image?
Unknown User•7mo ago
Message Not Public
Yep, exactly, but you can also pre-configure the pre-built Docker image via the env variables, right?
Unknown User•7mo ago
Message Not Public
Ok 🙂 About my issue with the DeepSeek distilled R1: it seems the system prompt is weird and tricky to use. If anyone knows a good uncensored model to use with vLLM, let me know (I'm using Llama 3.3 but it's too censored).
Unknown User•7mo ago
Message Not Public
Is it fine-tuned from a Llama model?
Unknown User•7mo ago
Message Not Public
ok:)
Unknown User•7mo ago
Message Not Public
Thanks, will try the cognitivecomputations/Dolphin3.0-Llama3.2-3B