shensmobile · 2y ago · 44 replies

vLLM streaming ends prematurely

I'm having issues with my vLLM worker ending a generation early. When I send a prompt to my API without "stream": true, the full response comes back. When "stream": true is added to the request, generation stops early, sometimes right after {"user":"assistant"} is sent. It was working earlier this morning; I see this in the system logs around the time it stopped working:

2024-06-13T15:37:10Z create pod network
2024-06-13T15:37:10Z create container runpod/worker-vllm:stable-cuda12.1.0
2024-06-13T15:37:11Z start container

Was a newer version pushed? I see there were two new updates pushed to the vllm_worker GitHub repo in the last 24 hours.
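For reference, this is roughly the pattern I'm describing, as a minimal sketch rather than my exact client. The endpoint paths (/runsync, /run, /stream) follow the RunPod serverless API, and the input keys ("prompt", "sampling_params", "stream") are what I understand the worker-vllm README to expect; ENDPOINT_ID and API_KEY are placeholders.

```python
# Sketch: compare a non-streaming /runsync call with a streaming /run + /stream
# poll against a RunPod serverless vLLM endpoint. Placeholder IDs/keys and
# assumed worker-vllm input keys; adjust to your actual setup.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

payload = {
    "input": {
        "prompt": "Write a short story about a robot.",
        "sampling_params": {"max_tokens": 512, "temperature": 0.7},
    }
}

# Non-streaming: /runsync returns the whole completion in one response.
resp = requests.post(f"{BASE}/runsync", json=payload, headers=HEADERS, timeout=300)
print("non-streaming output:", resp.json().get("output"))

# Streaming: submit the job with "stream": true, then poll /stream/{job_id}
# and print partial chunks until the job reports a terminal status.
payload["input"]["stream"] = True
job = requests.post(f"{BASE}/run", json=payload, headers=HEADERS, timeout=30).json()
job_id = job["id"]

while True:
    chunk = requests.get(f"{BASE}/stream/{job_id}", headers=HEADERS, timeout=60).json()
    for part in chunk.get("stream", []):
        print(part.get("output"), end="", flush=True)
    if chunk.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.5)
```

With this setup, the non-streaming call completes normally, while the streaming loop is where the generation cuts off early.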