config.json: 0%| | 0.00/1.06k [00:00<?, ?B/s]
config.json: 100%|██████████| 1.06k/1.06k [00:00<00:00, 3.12MB/s]
2024-01-15T21:25:12.631302229Z WARNING 01-15 21:25:12 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-15T21:25:12.631674354Z INFO 01-15 21:25:12 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
2024-01-15T21:25:12.878504399Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-01-15T21:25:12.968988083Z engine.py :56 2024-01-15 21:25:12,968 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:25:12.969011647Z Traceback (most recent call last):
2024-01-15T21:25:12.969017540Z File "/handler.py", line 7, in <module>
2024-01-15T21:25:12.969079606Z vllm_engine = VLLMEngine()
2024-01-15T21:25:12.969193802Z ^^^^^^^^^^^^
2024-01-15T21:25:12.969200856Z File "/engine.py", line 38, in __init__
2024-01-15T21:25:12.969316102Z self.llm = self._initialize_llm()
2024-01-15T21:25:12.969405051Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969414475Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:25:12.969528071Z raise e
2024-01-15T21:25:12.969535724Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:25:12.969631284Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T21:25:12.969861773Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969879906Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:25:12.970089032Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:25:12.970165425Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970203878Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:25:12.970334541Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:25:12.970462593Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970470800Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:25:12.970637632Z return engine_class(*args, **kwargs)
2024-01-15T21:25:12.970733855Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970746229Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:25:12.970891958Z self._init_workers(distributed_init_method)
2024-01-15T21:25:12.970907978Z File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:25:12.970999331Z self._run_workers(
2024-01-15T21:25:12.971006647Z File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:25:12.971260200Z self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:25:12.971432982Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
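
The traceback is cut off above, but the root cause is already visible in the engine.py error line: the CUDA driver could not be initialized, so the worker that vLLM spawns in _init_workers never sees a GPU. That points at the container being scheduled without a GPU (or without the NVIDIA driver mounted) rather than at the AWQ model or the engine config. A minimal diagnostic, assuming PyTorch is present in the image (it is a vLLM dependency), is to check CUDA visibility before constructing the engine. The check_cuda helper below is an illustrative sketch, not part of the original /handler.py:

    # Diagnostic sketch (hypothetical helper, not from the original handler).
    # Run inside the container before building the engine to confirm that a
    # CUDA-capable GPU is actually visible to PyTorch.
    import torch

    def check_cuda() -> None:
        # torch.cuda.is_available() returns False when the CUDA driver
        # cannot be initialized -- the exact failure mode in the log above.
        if not torch.cuda.is_available():
            raise RuntimeError(
                "No CUDA device visible; check that the pod was scheduled "
                "on a GPU instance and that the NVIDIA driver is mounted."
            )
        print(f"CUDA devices visible: {torch.cuda.device_count()}")
        print(f"Device 0: {torch.cuda.get_device_name(0)}")

    if __name__ == "__main__":
        check_cuda()

If the image ships nvidia-smi, running it in the pod's shell gives the same signal: no devices listed (or a driver error) means the failure happens below vLLM, at the infrastructure level.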