Runpod•17mo ago
blabbercrab

Trying to load a huge model into serverless

https://huggingface.co/cognitivecomputations/dolphin-2.9.2-qwen2-72b Anyone have any idea how to do this in vLLM? I've deployed it using two 80GB GPUs and have had no luck.
13 Replies
blabbercrab
blabbercrabOP•17mo ago
2024-07-07T10:13:37.060080427Z INFO 07-07 10:13:37 ray_utils.py:96] Total CPUs: 252
2024-07-07T10:13:37.060112418Z INFO 07-07 10:13:37 ray_utils.py:97] Using 252 CPUs
2024-07-07T10:13:39.223150657Z 2024-07-07 10:13:39,222 INFO worker.py:1753 -- Started a local Ray instance.
2024-07-07T10:13:42.909013372Z INFO 07-07 10:13:42 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='cognitivecomputations/dolphin-2.9.2-qwen2-72b', speculative_config=None, tokenizer='cognitivecomputations/dolphin-2.9.2-qwen2-72b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir='/runpod-volume/huggingface-cache/hub', load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=cognitivecomputations/dolphin-2.9.2-qwen2-72b)
2024-07-07T10:13:43.234774592Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-07T10:13:48.090819086Z INFO 07-07 10:13:48 utils.py:628] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
2024-07-07T10:13:49.634162208Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:48 utils.py:628] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
2024-07-07T10:13:49.634349607Z INFO 07-07 10:13:49 selector.py:27] Using FlashAttention-2 backend.
2024-07-07T10:13:50.971622090Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:49 selector.py:27] Using FlashAttention-2 backend.
2024-07-07T10:13:50.971661235Z INFO 07-07 10:13:50 pynccl_utils.py:43] vLLM is using nccl==2.17.1
2024-07-07T10:13:51.888246699Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:50 pynccl_utils.py:43] vLLM is using nccl==2.17.1
2024-07-07T10:13:51.888281517Z INFO 07-07 10:13:51 utils.py:118] generating GPU P2P access cache for in /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:51.889113795Z INFO 07-07 10:13:51 utils.py:132] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:51.889199350Z WARNING 07-07 10:13:51 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-07T10:13:52.655130972Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:51 utils.py:132] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:52.655172182Z (RayWorkerWrapper pid=14238) WARNING 07-07 10:13:51 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-07T10:13:52.655176579Z INFO 07-07 10:13:52 weight_utils.py:200] Using model weights format ['*.safetensors']
digigoblin
digigoblin•17mo ago
There is no error; that last log line means it's still busy loading the model.
blabbercrab
blabbercrabOP•17mo ago
I wasn't able to load it using one 80GB GPU. Isn't 2 x 80GB excessive for the model size?
digigoblin
digigoblin•17mo ago
I assume you're loading it from network storage?
digigoblin
digigoblin•17mo ago
The model weights are considerably more than 80GB, so it definitely won't fit into a single 80GB GPU
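For a rough sense of scale (a back-of-the-envelope sketch, not anything Runpod-specific): a 72B-parameter model in bfloat16, the dtype shown in the log above, needs about 2 bytes per parameter for the weights alone, before vLLM reserves anything for the KV cache.

# Rough memory estimate for serving a 72B model in bfloat16.
# Figures are approximations for illustration only.

params = 72e9            # ~72 billion parameters
bytes_per_param = 2      # bfloat16 = 2 bytes per parameter

weights_gib = params * bytes_per_param / 1024**3
print(f"Weights alone: ~{weights_gib:.0f} GiB")   # ~134 GiB

# vLLM additionally pre-allocates GPU memory for the KV cache, sized
# from max_seq_len (131072 in the log above), so 2 x 80GB leaves very
# little headroom once the weights are loaded.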
Encyrption
Encyrption•17mo ago
Max is 2x80GB with serverless?
digigoblin
digigoblin•17mo ago
You can also use multiple 48GB and 24GB GPUs
yhlong00000
yhlong00000•17mo ago
With 2x80GB I am not able to run that model; it gives me an out-of-memory error. I switched to 8x48GB and that works. 😂 And btw, you have to select 1, 2, 4, 8, 16, or 32 GPUs; you can't pick 10
yhlong00000
yhlong00000•17mo ago
Also, 4x48GB = 192GB won't work either, out of memory 😂
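One knob worth checking before adding more GPUs: the log above shows max_seq_len=131072, and vLLM sizes its KV-cache reservation from that value. Below is a minimal sketch using vLLM's Python API that caps the context length so the reservation fits alongside the ~134 GiB of bf16 weights; the parameter values are illustrative, not a tested Runpod serverless config.

from vllm import LLM

llm = LLM(
    model="cognitivecomputations/dolphin-2.9.2-qwen2-72b",
    tensor_parallel_size=2,        # shard the weights across 2 GPUs
    max_model_len=8192,            # far below the 131072 default seen in the log
    gpu_memory_utilization=0.95,   # fraction of each GPU vLLM may claim
)

print(llm.generate("Hello")[0].outputs[0].text)

A quantized variant (e.g. GPTQ or AWQ, passed via the quantization argument) would shrink the weight footprint further, which is another way to fit the model on fewer or smaller GPUs.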
