PyTorch Lightning DDP training crashes with no error on a multi-GPU serverless worker

It looks like the serverless worker crashes when spawning new processes from the handler. The crash happens right after the first process is spawned and logs "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2". The same code works fine in a multi-GPU pod's web terminal.
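For context, the failure point can be reproduced without GPUs or Lightning: the DDP strategy launches one child process per device from inside the handler, and some serverless runtimes disallow or silently kill such children. A minimal sketch of that pattern (the `handler` signature, queue bookkeeping, and rank messages are illustrative, not RunPod's or Lightning's actual API; the `"fork"` start method is used here so the sketch stays self-contained, while Lightning's `ddp_spawn` uses `"spawn"`):

```python
import multiprocessing as mp

def _init_rank(rank, world_size, queue):
    # Stand-in for the per-rank setup Lightning logs as
    # "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2".
    queue.put((rank, world_size))

def handler(event):
    # Hypothetical serverless handler body. Lightning's DDP strategy
    # similarly launches one child process per GPU; on a restricted
    # worker the first child may die here without a Python traceback.
    world_size = 2
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    procs = [ctx.Process(target=_init_rank, args=(rank, world_size, queue))
             for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(queue.get() for _ in range(world_size))
```

If this plain-multiprocessing version also dies on the worker, the problem is the environment's process model rather than Lightning itself; in that case a non-spawning strategy (e.g. running single-GPU per worker) is a common workaround.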
Unknown User · 2y ago
(message not public)
Bell Chen (OP) · 2y ago
Oh... yes. I will try