PyTorch Lightning DDP training crashes with no error on a multi-GPU serverless worker

It looks like the serverless worker crashes when spawning new processes from the handler. The crash happens right after the first process is spawned and logs "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2". The same code works fine in a multi-GPU pod's web terminal.
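For context, the failure point can be reproduced without GPUs or Lightning: the DDP strategy launches one child process per device from inside the handler, and some serverless runtimes disallow or silently kill such children. A minimal sketch of that pattern (the `handler` signature, queue bookkeeping, and rank messages are illustrative, not RunPod's or Lightning's actual API; the `"fork"` start method is used here so the sketch stays self-contained, while Lightning's `ddp_spawn` uses `"spawn"`):

```python
import multiprocessing as mp

def _init_rank(rank, world_size, queue):
    # Stand-in for the per-rank setup Lightning logs as
    # "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2".
    queue.put((rank, world_size))

def handler(event):
    # Hypothetical serverless handler body. Lightning's DDP strategy
    # similarly launches one child process per GPU; on a restricted
    # worker the first child may die here without a Python traceback.
    world_size = 2
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    procs = [ctx.Process(target=_init_rank, args=(rank, world_size, queue))
             for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(queue.get() for _ in range(world_size))
```

If this plain-multiprocessing version also dies on the worker, the problem is the environment's process model rather than Lightning itself; in that case a non-spawning strategy (e.g. running single-GPU per worker) is a common workaround.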
Unknown User · 2y ago
(message not public)
Bell Chen (OP) · 2y ago
Oh... yes. I will try