Runpod · 9mo ago
Erez

I am trying to deploy a "meta-llama/Llama-3.1-8B-Instruct" model on Serverless vLLM

I deploy it with the maximum possible GPU memory.
After setup, I try to run the "hello world" sample request (sketched below), but the request gets stuck in the queue and the worker fails with "[error] worker exited with exit code 1", with no other error or message in the log.
Is it even possible to run this model?
What is the problem, and can it be resolved?
(For the record, I did manage to run a much smaller model using the same procedure.)
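
For reference, this is roughly the request I am sending for the "hello world" test. The endpoint ID and API key are placeholders, and the input schema assumes the standard Runpod Serverless vLLM worker format:

import os
import requests

# Placeholders -- replace with your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ.get("RUNPOD_API_KEY", "your-api-key")

# Synchronous test request against the serverless endpoint.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello world", "sampling_params": {"max_tokens": 64}}},
    timeout=120,
)
print(resp.status_code, resp.json())

(Using /run instead of /runsync would return a job ID to poll via /status, but for this test I just want the synchronous reply.)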