RunPod · 4mo ago
marshall

vllm + Ray issue: Stuck on "Started a local Ray instance."

Trying to run TheBloke/goliath-120b-AWQ on vllm + runpod with 2x48GB GPUs:
2024-02-03T12:36:44.148649796Z The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
2024-02-03T12:36:44.149745508Z
0it [00:00, ?it/s]
0it [00:00, ?it/s]
2024-02-03T12:36:44.406220237Z WARNING 02-03 12:36:44 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-02-03T12:36:46.465465797Z 2024-02-03 12:36:46,465 INFO worker.py:1724 -- Started a local Ray instance.
It's stuck on "Started a local Ray instance." and I've tried both with and without RunPod's FlashBoot. Has anyone encountered this issue before?

requirements.txt:
vllm==0.2.7
runpod==1.4.0
ray==2.9.1
build script:
from huggingface_hub import snapshot_download

# Download the AWQ weights into ./model at image build time.
snapshot_download(
    "TheBloke/goliath-120b-AWQ",
    local_dir="model",
    local_dir_use_symlinks=False,
)
initialization code:
import os
from vllm import AsyncLLMEngine, AsyncEngineArgs

llm = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="./model",
        quantization="awq",
        tensor_parallel_size=int(os.getenv("tensor_parallel_size", 1)),
    )
)
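A minimal sketch of the endpoint configuration this implies for the 2x48GB setup: the engine above only shards across both GPUs if the tensor_parallel_size environment variable is actually set (setting it in the RunPod endpoint's environment variables is equivalent; the value 2 is an assumption for two GPUs, not something stated in the thread):

import os

# Assumed setting for 2x48GB GPUs; without it, the getenv default of 1
# tries to load the whole 120B AWQ model onto a single GPU.
os.environ.setdefault("tensor_parallel_size", "2")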
9 Replies
Alpay Ariyak · 4mo ago
Are you using a pod, or a serverless endpoint with worker vllm?
marshall · 4mo ago
Serverless endpoint with vllm (custom minimal image)
Alpay Ariyak · 4mo ago
This is because ray doesn't get initialized with the right CPU count
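One way to experiment with that observation, as a sketch only (it assumes initializing Ray manually before the engine is constructed; the 0.5 CPU fraction is an illustrative placeholder to tune, not a documented value):

import multiprocessing

import ray

# Possible workaround sketch: start Ray with a reduced CPU reservation
# before vLLM auto-initializes its own instance.
if not ray.is_initialized():
    ray.init(num_cpus=max(1, int(multiprocessing.cpu_count() * 0.5)))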
Alpay Ariyak · 4mo ago
You can try this out: https://github.com/runpod-workers/worker-vllm and play with lowering the environment variable VLLM_CPU_FRACTION, which defaults to 1.
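If you build on that template, the variable can be set in the endpoint's environment settings; a sketch of doing the same in code, assuming it has to be in place before the engine is created (the 0.8 value is only an example to lower from the default of 1):

import os

# Assumption: must be set before worker-vllm constructs its engine.
os.environ["VLLM_CPU_FRACTION"] = "0.8"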
marshall · 4mo ago
- Tensor Parallelism: Note that the more GPUs you split a model's weights across, the slower it will be due to inter-GPU communication overhead. If you can fit the model on a single GPU, it is recommended to do so.
- TENSOR_PARALLEL_SIZE: Number of GPUs to shard the model across (default: 1).
- If you are having issues loading your model with Tensor Parallelism, try decreasing VLLM_CPU_FRACTION (default: 1).
I can't really find any references to this specific env var anywhere but the readme (I tried looking in the vllm docs and the worker-vllm code)... are there any docs specifying the exact value that's required? Perhaps $(nproc), or since this is a "fraction", automagically populate it with 1 / multiprocessing.cpu_count()?
Alpay Ariyak · 4mo ago
It's because worker vllm uses a fork of vllm
marshall · 4mo ago
oh, that makes sense... I might try rebuilding an image using that fork instead. Thanks!
Alpay Ariyak · 4mo ago
Ofc! Lmk how it goes