terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802349 milliseconds before timing out.
97%|████████ | 1564/1616 [30:03<01:00, 1.16s/it]
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
./sdxl_train.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-30_18:48:48
  host      : a99b8429df2b
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1082)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1082
=====================================================
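For anyone reading the log, the key numbers are in the watchdog line. Below is a minimal sketch that pulls them out and converts them to readable durations, using only the Python standard library. The helper name `parse_watchdog` is hypothetical (it is not part of PyTorch or accelerate), and the regular expression is only an assumption based on the single message shown above.

```python
import re
from datetime import timedelta

# Watchdog line copied from the log above.
LOG_LINE = (
    "[Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=3, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 1802349 milliseconds before timing out."
)


def parse_watchdog(line):
    """Hypothetical helper: extract rank, op type, and timings from one watchdog line."""
    match = re.search(
        r"\[Rank (\d+)\].*?OpType=(\w+), Timeout\(ms\)=(\d+)\) "
        r"ran for (\d+) milliseconds",
        line,
    )
    if match is None:
        return None  # the line does not look like the message above
    rank, op, timeout_ms, ran_ms = match.groups()
    return {
        "rank": int(rank),
        "op": op,
        "timeout": timedelta(milliseconds=int(timeout_ms)),
        "ran_for": timedelta(milliseconds=int(ran_ms)),
    }


info = parse_watchdog(LOG_LINE)
print(info["rank"], info["op"], info["timeout"], info["ran_for"])
```

The configured timeout works out to exactly 30 minutes, which lines up with the 30:03 elapsed time on the progress bar at step 1564/1616. That would be consistent with rank 1 waiting in an all-reduce for the full timeout window while another rank was busy or stalled, though the log alone does not show what the other rank was doing.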
I assume you haven't trained with PixArt yet? They just updated their repo and are now collaborating with the LCM team:
(New) Nov. 30, 2023. PixArt collaborates with the LCMs team to make the fastest Training & Inference Text-to-Image Generation System. Here, Training code & Inference code & Weights & Demo are all released, and we hope users will enjoy them. Refer to the docs for more details. At the same time, we have updated the codebase for a better user experience and fixed some bugs in the newest version.