Serverless Endpoint with vLLM (Qwen2.5-VL-3B-Instruct)
I’m trying to set up a Serverless Endpoint on RunPod with vLLM (running Qwen2.5-VL-3B-Instruct).
My goal is to generate a lot of image descriptions.
Here is how I set it up:
Docker Image:
GPU:
ENV vars:
This call with one image works:
Now I have several questions.
Is it worth passing multiple images to the model in a single call? Will it be more efficient?
If so, how should I pass the parameters?
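For context, here is the kind of payload I imagine for a multi-image call, assuming the endpoint is OpenAI-compatible (the URLs and the exact structure are my guess, not something I’ve confirmed works):

```python
import json

# Hypothetical multi-image request body for an OpenAI-compatible
# chat completions endpoint. The image URLs are placeholders; I'm
# not sure whether vLLM accepts several image_url parts per message.
payload = {
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe each image briefly."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/image2.jpg"}},
            ],
        }
    ],
}

# Serialize to JSON as it would be sent in the POST body.
body = json.dumps(payload)
print(body[:60])
```

Is something like this the right shape, or does each image need its own request?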
Did I miss anything in the ENV vars that would be important to go faster?
Thank you very much for any help or tips you can give me.