Potential L40S P2P Communication Issue via NCCL on Some Hosts in US-TX-4
I’m seeing a possible NCCL P2P issue on some L40S hosts in US-TX-4. Some pods hang indefinitely while others in the same region work fine. Here’s a reproducible example: runpod-vllm-nccl-diagnostic
Observations
- Environment: 2 x L40S GPU pods in US-TX-4
- Behavior: One pod succeeded with full NCCL P2P; another hung indefinitely (log)
- Workaround: Setting `NCCL_P2P_DISABLE=1` prevents the hang, but it forces NCCL off the P2P path, which costs inter-GPU bandwidth
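For reference, here is a minimal sketch of the workaround as I'm applying it. The `NCCL_DEBUG=INFO` line is my own addition for diagnosis (it makes NCCL log which transport each channel selects), not part of the original repro:

```shell
# Apply before launching the multi-GPU workload.
export NCCL_P2P_DISABLE=1   # skip the P2P transport; NCCL falls back to SHM/network
export NCCL_DEBUG=INFO      # log which transport NCCL actually picks per channel
```

With `NCCL_DEBUG=INFO` set, the affected hosts should at least show in the logs what NCCL is doing when it stalls.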
Why It Matters
Efficient multi-GPU comms are critical for large-scale LLM workloads. If some hosts can’t support P2P, it would help to know if we can avoid them programmatically.
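In the meantime, one way to avoid bad hosts programmatically might be a startup probe: check pairwise peer access with PyTorch's `torch.cuda.can_device_access_peer` and refuse the pod before launching the job. This is my own sketch, not an official Runpod mechanism, and the helper names are hypothetical:

```python
# Sketch: probe pairwise GPU peer access at pod startup, before the real job.
from typing import List

def all_pairs_p2p(matrix: List[List[bool]]) -> bool:
    """True if every distinct GPU pair reports peer access."""
    n = len(matrix)
    return all(matrix[i][j] for i in range(n) for j in range(n) if i != j)

def probe_p2p_matrix() -> List[List[bool]]:
    """Build the peer-access matrix via torch.cuda.can_device_access_peer."""
    import torch
    n = torch.cuda.device_count()
    return [[i == j or torch.cuda.can_device_access_peer(i, j)
             for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    try:
        matrix = probe_p2p_matrix()
        print("P2P OK" if all_pairs_p2p(matrix)
              else "P2P unsupported on this host; bailing out")
    except ImportError:
        print("PyTorch not installed; skipping probe")
```

Note this only checks what CUDA reports; a host could still advertise peer access and hang inside NCCL, so it's a filter, not a guarantee.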
Questions for Runpod & the Community
1. Hardware differences: Are there known L40S configurations that impede NCCL P2P?
2. Mitigation: Is there any recommended approach beyond disabling P2P?
3. Filtering hosts: Can we specify a P2P-supported filter in the GraphQL API?
If others are experiencing similar NCCL P2P issues, feel free to check the repo and try to replicate. Any insights or guidance are much appreciated. Thank you!