RunPod•4mo ago
Siamak

Run Lorax on Runpod (Serverless)

I created a Docker image for LoRAX similar to the worker-tgi entrypoint (https://github.com/runpod-workers/worker-tgi/blob/main/src/entrypoint.sh), but inside the Docker image I am getting "connection refused". Could you please check it?
12 Replies
Siamak
Siamak•4mo ago
I solved the connection issue by running:
lorax-launcher --model-id $model --quantize awq --max-input-length=4096 --max-total-tokens=5096 --huggingface-hub-cache=/data --hostname=127.0.0.1 --port=8080
But now I am getting this error:
Siamak
Siamak•4mo ago
CUDA is not available!
ashleyk
ashleyk•4mo ago
This requires CUDA 12.1 or higher; your worker probably has a CUDA version lower than 12.1. I have seen this happen when the machine had the wrong CUDA version. And there still doesn't seem to be a way to filter CUDA versions in serverless like you can for GPU cloud. @JM showed me something he could access internally, but it's still not live.
Siamak
Siamak•4mo ago
I just added --gpus all to docker run, and that issue has been solved. But now I have the connection issue again!
Siamak
Siamak•4mo ago
@ashleyk
ubuntu@150-136-88-165:~/sia/test_docker$ sudo docker run --gpus all -it --rm -p 8080:8080 --name generator runpod_test --help
2024-02-21T05:41:12.402252Z INFO lorax_launcher: Args { model_id: "TheBloke/Llama-2-13B-chat-AWQ", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Awq), compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 4096, max_total_tokens: 5096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "127.0.0.1", port: 8080, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-02-21T05:41:12.402366Z INFO download: lorax_launcher: Starting download process.
--- Starting Serverless Worker | Version 1.6.2 ---
INFO | Using test_input.json as job input.
DEBUG | Retrieved local job: {'input': {'prompt': 'Hi, How are you?'}, 'id': 'local_test'}
INFO | local_test | Started.
ERROR | local_test | Captured Handler Exception
ERROR | { "error_type": "<class 'requests.exceptions.ConnectionError'>", "error_message": "HTTPConnectionPool(host='127.0.0.1', port=8080): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3bd135f040>: Failed to establish a new connection: [Errno 111] Connection refused'))",
Siamak
Siamak•4mo ago
This error occurs inside the Docker container.
JM
JM•4mo ago
Hey @ashleyk! CUDA filter did hit prod this week 🙂
JM
JM•4mo ago
@Siamak you can use that on your SLS endpoint to make sure you get workers that meet your minimum requirements.
Siamak
Siamak•4mo ago
Hi @JM, the issue was that the LoRAX server was not running yet, so adding a sleep before connecting fixed it. Thanks!
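A fixed sleep works, but it either wastes time or is still too short on a slow cold start. A more robust variant is to poll until the launcher's port actually accepts connections before the handler sends requests. This is a minimal sketch, not the worker's actual code; `wait_for_port` and the 127.0.0.1:8080 address are assumptions matching the launcher flags shown above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Poll until a TCP port accepts connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening; close it right away.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused / not ready yet: wait and retry.
            time.sleep(interval)
    return False


# In the handler's startup, before sending requests to the launcher
# (hypothetical usage; host/port taken from the lorax-launcher flags above):
# if not wait_for_port("127.0.0.1", 8080):
#     raise RuntimeError("lorax-launcher did not become ready in time")
```

This avoids the `[Errno 111] Connection refused` race entirely: the handler blocks until the server is up instead of guessing how long startup takes.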
JM
JM•4mo ago
Oh, glad you found that. Thanks!