config.json: 0%| | 0.00/1.06k [00:00<?, ?B/s]
config.json: 100%|██████████| 1.06k/1.06k [00:00<00:00, 3.12MB/s]
2024-01-15T21:25:12.631302229Z WARNING 01-15 21:25:12 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-15T21:25:12.631674354Z INFO 01-15 21:25:12 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
2024-01-15T21:25:12.878504399Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-01-15T21:25:12.968988083Z engine.py :56 2024-01-15 21:25:12,968 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:25:12.969011647Z Traceback (most recent call last):
2024-01-15T21:25:12.969017540Z File "/handler.py", line 7, in <module>
2024-01-15T21:25:12.969079606Z vllm_engine = VLLMEngine()
2024-01-15T21:25:12.969193802Z ^^^^^^^^^^^^
2024-01-15T21:25:12.969200856Z File "/engine.py", line 38, in __init__
2024-01-15T21:25:12.969316102Z self.llm = self._initialize_llm()
2024-01-15T21:25:12.969405051Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969414475Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:25:12.969528071Z raise e
2024-01-15T21:25:12.969535724Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:25:12.969631284Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T21:25:12.969861773Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969879906Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:25:12.970089032Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:25:12.970165425Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970203878Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:25:12.970334541Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:25:12.970462593Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970470800Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:25:12.970637632Z return engine_class(*args, **kwargs)
2024-01-15T21:25:12.970733855Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970746229Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:25:12.970891958Z self._init_workers(distributed_init_method)
2024-01-15T21:25:12.970907978Z File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:25:12.970999331Z self._run_workers(
2024-01-15T21:25:12.971006647Z File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:25:12.971260200Z self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:25:12.971432982Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
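
The traceback is cut off above, but the root cause is already visible in the engine.py error line: the CUDA driver could not be initialized, so the worker that vLLM spawns in _init_workers never sees a GPU. That points at the container being scheduled without a GPU (or without the NVIDIA driver mounted) rather than at the AWQ model or the engine config. A minimal diagnostic, assuming PyTorch is present in the image (it is a vLLM dependency), is to check CUDA visibility before constructing the engine. The check_cuda helper below is an illustrative sketch, not part of the original /handler.py:

    # Diagnostic sketch (hypothetical helper, not from the original handler).
    # Run inside the container before building the engine to confirm that a
    # CUDA-capable GPU is actually visible to PyTorch.
    import torch

    def check_cuda() -> None:
        # torch.cuda.is_available() returns False when the CUDA driver
        # cannot be initialized -- the exact failure mode in the log above.
        if not torch.cuda.is_available():
            raise RuntimeError(
                "No CUDA device visible; check that the pod was scheduled "
                "on a GPU instance and that the NVIDIA driver is mounted."
            )
        print(f"CUDA devices visible: {torch.cuda.device_count()}")
        print(f"Device 0: {torch.cuda.get_device_name(0)}")

    if __name__ == "__main__":
        check_cuda()

If the image ships nvidia-smi, running it in the pod's shell gives the same signal: no devices listed (or a driver error) means the failure happens below vLLM, at the infrastructure level.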