Potential L40S P2P Communication Issue via NCCL on Some Hosts in US-TX-4
I’m seeing a possible NCCL P2P issue on some L40S hosts in US-TX-4. Some pods hang indefinitely while others in the same region work fine. Here’s a reproducible example:
`runpod-vllm-nccl-diagnostic`
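For context, the core of the repro boils down to something like the sketch below (a minimal standalone version, not the full diagnostic repo): a two-process `torch.distributed` all-reduce over NCCL. On healthy hosts it finishes in seconds; on the affected hosts it hangs at the collective.

```python
# Minimal NCCL P2P smoke test (a sketch, not the full runpod-vllm-nccl-diagnostic repo).
# Run with: torchrun --nproc_per_node=2 nccl_p2p_test.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # A small tensor is enough to exercise the P2P path between the two GPUs.
    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # hangs here on the affected hosts
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print("all_reduce OK:", x[0].item())  # expect 2.0 with two ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```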
Observations
- Environment: 2 x L40S GPU pods in US-TX-4
- Behavior: One pod succeeded with full NCCL P2P; another hung indefinitely (log)
- Workaround: Setting `NCCL_P2P_DISABLE=1` prevents the hang, but degrades multi-GPU communication performance (see the snippet after this list)
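Since NCCL reads its environment variables when the communicator is created, the workaround has to be in place before `dist.init_process_group` runs (or before vLLM spins up its workers). A minimal sketch of applying it from Python:

```python
# Hedged sketch: apply the workaround before NCCL initializes.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"         # force NCCL onto shared-memory/host paths
os.environ.setdefault("NCCL_DEBUG", "INFO")  # optional: log which transport NCCL picks

# ... only now import/launch the code that creates the NCCL communicator,
# e.g. torch.distributed.init_process_group(backend="nccl") or the vLLM engine.
```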
Efficient multi-GPU comms are critical for large-scale LLM workloads. If some hosts can’t support P2P, it would help to know if we can avoid them programmatically.
Questions for Runpod & the Community
- Hardware differences: Are there known L40S configurations that impede NCCL P2P?
- Mitigation: Is there a recommended approach beyond disabling P2P?
- Filtering hosts: Can we specify a P2P-support filter in the GraphQL API? (A client-side preflight check is sketched below as a stopgap.)
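Until such a filter exists, one stopgap is a client-side preflight check at pod startup, so a bad host can be released and retried early. This is only a sketch: `torch.cuda.can_device_access_peer` reports whether the driver exposes a P2P path, which is not a guaranteed predictor of NCCL behavior, but it catches the obvious cases.

```python
# Preflight check: verify pairwise P2P capability before launching the real workload.
import sys
import torch

def p2p_matrix_ok() -> bool:
    n = torch.cuda.device_count()
    ok = True
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if not torch.cuda.can_device_access_peer(i, j):
                print(f"GPU {i} -> GPU {j}: no P2P access", file=sys.stderr)
                ok = False
    return ok

if __name__ == "__main__":
    if not p2p_matrix_ok():
        sys.exit(1)  # signal the orchestrator to terminate this pod and retry elsewhere
```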