RunPod · 2d ago
ozzie

trl vllm-serve not binding to port

I have a pod with two A6000 and I am trying to run vLLM on one of them via:
VLLM_LOGGING_LEVEL=DEBUG NCCL_DEBUG=TRACE trl vllm-serve --model meta-llama/Meta-Llama-3-8B-Instruct --gpu_memory_utilization=0.75 --max_model_len 2048 --host 0.0.0.0 --port 8000
AFAICT, the model launches fine, but it seems like there is a problem with binding to the port: I see nothing when running lsof -i :8000. Is there any obvious additional configuration I need to do?
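For reference, here is a minimal stdlib check I used alongside lsof to confirm whether anything is accepting connections on the port. This is just a sketch of my own, nothing TRL- or RunPod-specific; adjust the host and port to match whatever you passed to --host / --port:

import socket

addr = ("127.0.0.1", 8000)  # adjust to match --host / --port
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(3)
    rc = s.connect_ex(addr)  # 0 means something accepted the connection

if rc == 0:
    print("something is listening on %s:%d" % addr)
else:
    print("no listener on %s:%d (connect_ex returned %d)" % (addr[0], addr[1], rc))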
ozzie (OP) · 2d ago
This looks to be a problem with TRL, specifically related to this issue: https://github.com/huggingface/trl/issues/2923. My trl env, in case it's helpful to anyone:
- Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- TRL version: 0.17.0
- PyTorch version: 2.6.0
- CUDA device(s): NVIDIA A40
- Transformers version: 4.51.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- Datasets version: 3.5.1
- HF Hub version: 0.30.2
- bitsandbytes version: 0.45.5
- DeepSpeed version: not installed
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.76.0
- PEFT version: 0.15.2
- vLLM version: 0.8.1
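For context, this is roughly how the training side is pointed at the standalone server in this TRL version. Treat the argument names (use_vllm, vllm_server_host, vllm_server_port) and the output_dir value as assumptions to verify against your installed trl, since the vLLM server mode API was still changing around 0.17:

from trl import GRPOConfig

# Sketch only: the argument names below are what I believe TRL ~0.17
# expects for vLLM "server mode" -- double-check against your install.
training_args = GRPOConfig(
    output_dir="grpo-llama3-8b",   # hypothetical output directory
    use_vllm=True,                 # generate completions via the external vLLM server
    vllm_server_host="127.0.0.1",  # where trl vllm-serve is reachable
    vllm_server_port=8000,         # must match --port passed to trl vllm-serve
)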
