Runpod•13mo ago

vllm worker OpenAI stream timeout

OpenAI client code from tutorial (https://docs.runpod.io/serverless/workers/vllm/openai-compatibility#streaming-responses-1) is not reproducible.
I'm hosting 70B model, which usualy has ~2 mins delay for request.
Using openai client with stream=True timeouts after ~1 min and returns nothing. Any solutions?

wiki•12/11/24, 11:22 PM

Did you set model name ?

wiki•12/11/24, 11:22 PM

Or it was as it is MODEL_NAME?

MisterionOP•12/12/24, 7:36 AM

MODEL_NAME is huggingface link as usual

MisterionOP•12/12/24, 11:38 AM

basically what I experience there is that server closes the connection after ~ 1 min in case stream == True, non-streaming works fine

MMisterion MODEL_NAME is huggingface link as usual

Jason•12/12/24, 12:42 PM

Eh isn't it the model repo only like
meta-llama/llama3.3-70b
something like that

MisterionOP•12/12/24, 1:37 PM

yes this is what I meant, sorry

MisterionOP•12/12/24, 1:38 PM

I'm not sure how does MODEL_NAME affect this problem at all

Jason•12/12/24, 1:53 PM

Maybe just the environment variable key name

Jason•12/12/24, 1:54 PM

Maybe was only checking for that

MMisterion OpenAI client code from tutorial (https://docs.runpod.io/serverless/workers/vllm...

Jason•12/12/24, 1:54 PM

But if not using stream does it works?

MisterionOP•12/12/24, 2:14 PM

Yes, this waits for the whole request to finish.

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1", api_key=api_key
)

stream = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": "Say hello!",
        },
    ],
)

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1", api_key=api_key
)

stream = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": "Say hello!",
        },
    ],
)

Adding stream=Truestream=True, sends the request which I can see in the dashboard, but it terminates the connection after ~1 min.

Jason•12/12/24, 2:18 PM

Oh hmm

Jason•12/12/24, 2:18 PM

And empty response? Nothing streamed back?

Jason•12/12/24, 2:20 PM

If you replicate your vllm config in a pod, try it if it works with streaming and try active workers too

Jason•12/12/24, 2:21 PM

I'm guessing it might be the cloudflare proxy limiting a request to a 100s only

JJason And empty response? Nothing streamed back?

MisterionOP•12/12/24, 2:22 PM

Nope

Jason•12/12/24, 2:22 PM

If you want you can create a ticket too to explore more on this

MMisterion OpenAI client code from tutorial (https://docs.runpod.io/serverless/workers/vllm...

PoddyAPP•12/12/24, 2:23 PM

@Misterion

Escalated To Zendesk

The thread has been escalated to Zendesk!

Justin•12/19/24, 6:43 AM

Same issue here but even without streaming

client = OpenAI( base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1", api_key=api_key ) stream = client.chat.completions.create( model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8", messages=[ { "role": "user", "content": "Say hello!", }, ], )

vllm worker OpenAI stream timeout

Similar Threads

Similar Threads

Similar Threads