RunPod5mo ago
Casper.

worker-vllm cannot download private model

I built my image successfully and it downloaded the model during the build. However, when I deploy it on RunPod Serverless, it fails to start upon request because it cannot download the model.
export DOCKER_BUILDKIT=1
export HF_TOKEN="your_token"

docker build -t user/app:0.0.1 \
--secret id=HF_TOKEN \
--build-arg MODEL_NAME="my_model_path" \
./worker-vllm
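For anyone following along: --secret id=HF_TOKEN only exposes the token during the image build. Assuming the Dockerfile mounts that secret for its download step, the build-time script can read it roughly like this (a sketch, not worker-vllm's actual code):

# Sketch: read the BuildKit secret inside a build-time download step.
# /run/secrets/HF_TOKEN is BuildKit's default mount path for --secret id=HF_TOKEN,
# assuming the Dockerfile's RUN step mounts it; the env fallback is illustrative.
import os
from pathlib import Path
from typing import Optional

def read_hf_token() -> Optional[str]:
    secret = Path("/run/secrets/HF_TOKEN")
    if secret.is_file():
        return secret.read_text().strip()
    return os.environ.get("HF_TOKEN")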
35 Replies
ashleyk
ashleyk5mo ago
@Alpay Ariyak any idea?
Alpay Ariyak
Alpay Ariyak5mo ago
Try also specifying the model name as an environment variable within the endpoint template
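(That is, the worker resolves the model from the endpoint's environment at runtime; a minimal sketch, assuming the variable name matches the MODEL_NAME build arg above:)

# Sketch: read the model name from the endpoint template's environment
# (variable name assumed from the MODEL_NAME build arg).
import os

model_name = os.environ.get("MODEL_NAME")
if not model_name:
    raise RuntimeError("MODEL_NAME is not set on the endpoint template")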
Casper.
Casper.5mo ago
I just followed the guide on the worker-vllm repository
Casper.
Casper.5mo ago
Shouldn't this work?
(screenshot attached)
Casper.
Casper.5mo ago
One of the serverless nodes printed this config, which seems correct (model path redacted):
{
"model":"<my_model_path>",
"download_dir":"/runpod-volume",
"quantization":"None",
"load_format":"auto",
"dtype":"auto",
"disable_log_stats":true,
"disable_log_requests":true,
"trust_remote_code":false,
"gpu_memory_utilization":0.95,
"max_parallel_loading_workers":48,
"max_model_len":"None",
But somehow, it is still not able to find the model
Alpay Ariyak
Alpay Ariyak5mo ago
Could you share the error message you get please, @casper_ai?
I think I got it, will push a fix soon.
Hi @Casper., just pushed the update to main.
Casper.
Casper.5mo ago
Thanks for making the update! I will test later today
Alpay Ariyak
Alpay Ariyak5mo ago
Of course! Pushed custom jinja chat templates as well
Casper.
Casper.5mo ago
@Alpay Ariyak I'm still getting the same error, although I see the path has changed in the vLLM config
Alpay Ariyak
Alpay Ariyak5mo ago
It seems like the issue is that your model doesn’t have a tokenizer
Casper.
Casper.5mo ago
(screenshot attached)
Casper.
Casper.5mo ago
That can't be right, the tokenizer is there. It says 401 Unauthorized.
So is the issue that you are not downloading the tokenizer into the directory, perhaps?
I thought the idea of downloading the model into the image was to 1) reduce startup time, and 2) have a secure environment with no access to your Hugging Face token.
Casper.
Casper.5mo ago
GitHub
Download tokenizer upon build by casper-hansen · Pull Request #39 ·...
This downloads the tokenizer when building the worker-vllm image. This has the following benefit: You do not have to send any network request to Huggingface during initialization. This means you d...
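For context, the PR's idea of a build-time tokenizer download amounts to something like this (a sketch, not the PR's actual code; the repo id, target path, and auth argument are placeholders, and older transformers versions use use_auth_token instead of token):

# Sketch: pull the tokenizer into the image at build time so the worker never
# needs a Hub request (or a token) during cold start.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "org/my-private-model",   # placeholder repo id
    token="hf_...",           # build-time token only, never shipped at runtime
)
tokenizer.save_pretrained("/runpod-volume/tokenizer")  # placeholder target directory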
Casper.
Casper.5mo ago
Please review @Alpay Ariyak 🙂
Alpay Ariyak
Alpay Ariyak5mo ago
Looks good, have you built an image with it and tested it?
Casper.
Casper.5mo ago
I tested that it downloads the tokenizer to the directory specified, but I have not run a deployment yet. I can run a deployment to test it.
Any tips on how to push images with models inside to Docker Hub faster? It takes like half an hour even though my internet is speedy.
Alpay Ariyak
Alpay Ariyak5mo ago
I can test it in a bit, no worries. Unfortunately not.
Is there a reason you’re baking the model in vs. using the pre-built image? I haven’t seen too much of a difference in load times thus far.
Casper.
Casper.5mo ago
It's just the best way currently. You avoid all outgoing traffic this way.
I think a better alternative could be to use the snapshot_download functionality from Huggingface Hub instead. That way, you make sure you download everything.
Tbh I don't think this PR will solve the issue, because it also needs the config and everything else.
Okay, I replaced the current download with snapshot_download. Building and deploying in a moment.
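A minimal sketch of that snapshot_download approach, assuming a placeholder repo id, cache directory, and build-time token (depending on the huggingface_hub version the auth argument may be use_auth_token instead):

# Sketch: mirror the entire Hugging Face repo into the image at build time,
# so the worker never has to contact the Hub (or hold a token) at startup.
from huggingface_hub import snapshot_download

snapshot_download(
    "org/my-private-model",      # placeholder repo id
    cache_dir="/runpod-volume",  # matches the download_dir shown in the config above
    token="hf_...",              # build-time token only
)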
Alpay Ariyak
Alpay Ariyak5mo ago
Will get back to you, unavailable for the next 2 hours
Casper.
Casper.5mo ago
I tried using the PR and I think it must be something else too
Casper.
Casper.5mo ago
I'm not sure why, but engine.py is not able to find the tokenizer.
Alpay Ariyak
Alpay Ariyak5mo ago
Yeah, tokenizer needs to point to the downloaded path, I’ll fix this up shortly
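(Roughly what that fix amounts to; a sketch only, assuming vLLM's AsyncEngineArgs, with placeholder local paths rather than worker-vllm's actual layout:)

# Sketch: point vLLM at the locally downloaded files instead of the Hub repo id,
# so no network access or token is needed at engine startup.
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="/runpod-volume/my-model",      # placeholder local model directory
    tokenizer="/runpod-volume/my-model",  # tokenizer resolved from the same local path
    download_dir="/runpod-volume",
)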
Casper.
Casper.5mo ago
Wait, I might have just built from the wrong branch. I think I built this from the main branch instead of my PR.
Thanks, that would be great! I'm still testing my PR, as the problem is also that worker-vllm currently does not download the tokenizer or any config files, which will inevitably lead to an error when dealing with private repositories.
Alpay Ariyak
Alpay Ariyak5mo ago
Of course! Will fix that for sure.
Although, what would be the issue with just specifying the env variable HF_TOKEN in the endpoint template? It would allow for the tokenizer download, and tokenizers are always tiny, so it should be a quick download.
Casper.
Casper.5mo ago
I do not want to add an HF_TOKEN to a production environment, since it grants access to everything meant to be kept private.
And it will also help with reducing delay as much as possible.
Alpay Ariyak
Alpay Ariyak5mo ago
So I think we should keep the model download the same and separately snapshot download the tokenizer.
This is because if we snapshot download the entire repo, it will download all formats of the model (e.g. both .bin and .safetensors), which would bloat the image size.
Casper.
Casper.5mo ago
It's not only the tokenizer that should cause issues, right? vLLM also needs to load the model config.
Otherwise, I do agree that we can try to minimize bloating the image size.
Alpay Ariyak
Alpay Ariyak5mo ago
Hmm, didn’t notice any issues with lack of model config yet, but it could potentially be an issue, as I did not test private models.
If anything, we can just have a priority list of formats (plus the ability to specify one directly with a LOAD_FORMAT env var) and download that way, similarly to vLLM, but also allowing JSON files and whatever else is needed.
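Something like the following could express that priority list (a sketch only; the pattern map, env handling, and helper name are assumptions, not worker-vllm code):

# Sketch: choose a single weight format by priority (or force one via LOAD_FORMAT)
# and restrict the download to that format plus config/tokenizer files.
import os
from typing import List, Optional

from huggingface_hub import list_repo_files, snapshot_download

FORMAT_PATTERNS = {
    "safetensors": "*.safetensors",
    "bin": "*.bin",
    "pt": "*.pt",
}

def pick_patterns(repo_id: str, token: Optional[str]) -> List[str]:
    files = list_repo_files(repo_id, token=token)
    forced = os.environ.get("LOAD_FORMAT", "").lower()
    order = [forced] if forced in FORMAT_PATTERNS else ["safetensors", "bin", "pt"]
    for fmt in order:
        suffix = FORMAT_PATTERNS[fmt].lstrip("*")
        if any(f.endswith(suffix) for f in files):
            return [FORMAT_PATTERNS[fmt], "*.json", "*token*", "*.model"]
    return ["*"]  # no recognised weight format found: fall back to the whole repo

# Usage (placeholders): snapshot_download("org/my-private-model", cache_dir="/runpod-volume",
#     allow_patterns=pick_patterns("org/my-private-model", token), token=token)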
Casper.
Casper.5mo ago
I would suggest you just upload opt-125m or something small to test with.
I modified the PR to only download the tokenizer and config:
from huggingface_hub import snapshot_download

# Only pull the tokenizer files and the model config into the image;
# `model` and `download_dir` are defined by the surrounding build script.
snapshot_download(
    model,
    cache_dir=download_dir,
    allow_patterns=[
        "*token*",
        "config.json",
    ],
)
Casper.
Casper.5mo ago
I updated the PR and am testing again now: https://github.com/runpod-workers/worker-vllm/pull/39
GitHub
Download full repository upon build by casper-hansen · Pull Request...
This downloads the full repository when building the worker-vllm image. This has the following benefit: You do not have to send any network request to Huggingface during initialization. This means...
Casper.
Casper.5mo ago
A feature could be an ALLOW_PATTERNS env variable that you specify as a comma-separated list if you want to override the defaults.
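Roughly, that override could look like this (a sketch; the env variable name is the one proposed above, and the defaults, repo id, and paths are illustrative):

# Sketch: let an ALLOW_PATTERNS env var (comma-separated) override the default
# set of files pulled into the image at build time.
import os
from huggingface_hub import snapshot_download

default_patterns = ["*.safetensors", "*.json", "*token*"]
raw = os.environ.get("ALLOW_PATTERNS", "")
allow_patterns = [p.strip() for p in raw.split(",") if p.strip()] or default_patterns

snapshot_download(
    "org/my-private-model",            # placeholder repo id
    cache_dir="/runpod-volume",
    allow_patterns=allow_patterns,
    token=os.environ.get("HF_TOKEN"),  # build-time secret only
)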
Alpay Ariyak
Alpay Ariyak4mo ago
Sorry, got really busy. Just pushed a commit to the PR, but I won't be able to test until tomorrow; let me know if you have any thoughts.
Pushed another commit and merged into main.
Casper.
Casper.4mo ago
Thanks! Got it working and it runs pretty smoothly now.
Alpay Ariyak
Alpay Ariyak4mo ago
Happy to hear that!
kopyl
kopyl4mo ago
@Alpay Ariyak so what was the issue?