Potential L40S P2P Communication Issue via NCCL on Some Hosts in US-TX-4
I’m seeing a possible NCCL P2P issue on some L40S hosts in US-TX-4. Some pods hang indefinitely while others in the same region work fine. Here’s a reproducible example: runpod-vllm-nccl-diagnostic
Observations
- Environment: 2 x L40S GPU pods in US-TX-4
- Behavior: One pod succeeded with full NCCL P2P; another hung indefinitely (log)
- Workaround: Setting `NCCL_P2P_DISABLE=1` prevents the hang, but it forces NCCL off the P2P path, which costs inter-GPU bandwidth
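For reference, here is a minimal sketch of the workaround as I'm applying it. The `NCCL_DEBUG=INFO` line is my own addition for diagnosis (it makes NCCL log which transport each channel selects), not part of the original repro:

```shell
# Apply before launching the multi-GPU workload.
export NCCL_P2P_DISABLE=1   # skip the P2P transport; NCCL falls back to SHM/network
export NCCL_DEBUG=INFO      # log which transport NCCL actually picks per channel
```

With `NCCL_DEBUG=INFO` set, the affected hosts should at least show in the logs what NCCL is doing when it stalls.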
Why It Matters
Efficient multi-GPU comms are critical for large-scale LLM workloads. If some hosts can’t support P2P, it would help to know if we can avoid them programmatically.
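In the meantime, one way to avoid bad hosts programmatically might be a startup probe: check pairwise peer access with PyTorch's `torch.cuda.can_device_access_peer` and refuse the pod before launching the job. This is my own sketch, not an official Runpod mechanism, and the helper names are hypothetical:

```python
# Sketch: probe pairwise GPU peer access at pod startup, before the real job.
from typing import List

def all_pairs_p2p(matrix: List[List[bool]]) -> bool:
    """True if every distinct GPU pair reports peer access."""
    n = len(matrix)
    return all(matrix[i][j] for i in range(n) for j in range(n) if i != j)

def probe_p2p_matrix() -> List[List[bool]]:
    """Build the peer-access matrix via torch.cuda.can_device_access_peer."""
    import torch
    n = torch.cuda.device_count()
    return [[i == j or torch.cuda.can_device_access_peer(i, j)
             for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    try:
        matrix = probe_p2p_matrix()
        print("P2P OK" if all_pairs_p2p(matrix)
              else "P2P unsupported on this host; bailing out")
    except ImportError:
        print("PyTorch not installed; skipping probe")
```

Note this only checks what CUDA reports; a host could still advertise peer access and hang inside NCCL, so it's a filter, not a guarantee.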
Questions for Runpod & the Community
1. Hardware differences: Are there known L40S configurations that impede NCCL P2P?
2. Mitigation: Is there any recommended approach beyond disabling P2P?
3. Filtering hosts: Can we specify a P2P-supported filter in the GraphQL API?
If others are experiencing similar NCCL P2P issues, feel free to check the repo and try to replicate. Any insights or guidance are much appreciated. Thank you!