I can't train my model: out-of-memory error
Hello there,
I have rented a pod with an H100 SXM GPU and 251 GB of RAM. I tried to train my model on images and their masks, but unfortunately it returns an out-of-memory error. Please help, I am very confused.
Maybe it's your training code or app? Or the model just requires more GPU VRAM, I'm guessing.
Any logs or errors you can share?
Epoch 1/30
2025-03-21 06:26:13.742233: W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 256.00MiB (rounded to 268435456)requested by op UNetPP/X01/dropout/GreaterEqual
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
2025-03-21 06:26:13.742334: I tensorflow/tsl/framework/bfc_allocator.cc:1039] BFCAllocator dump for GPU_0_bfc
2025-03-21 06:26:13.742380: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (256): Total Chunks: 183, Chunks in use: 179. 45.8KiB allocated for chunks. 44.8KiB in use in bin. 16.4KiB client-requested in use in bin.
2025-03-21 06:26:13.742396: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (512): Total Chunks: 26, Chunks in use: 25. 13.0KiB allocated for chunks. 12.5KiB in use in bin. 12.5KiB client-requested in use in bin.
2025-03-21 06:26:13.742409: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (1024): Total Chunks: 16, Chunks in use: 15. 18.5KiB allocated for chunks. 17.2KiB in use in bin. 15.1KiB client-requested in use in bin.
2025-03-21 06:26:13.742422: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (2048): Total Chunks: 2, Chunks in use: 2. 4.0KiB allocated for chunks. 4.0KiB in use in bin. 4.0KiB client-requested in use in bin.
2025-03-21 06:26:13.742437: I tensorflow/tsl/framework/bfc_allocator.cc:1046]
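The log itself suggests trying TF_GPU_ALLOCATOR=cuda_malloc_async in case fragmentation is the cause. A minimal sketch of doing that from a Python training script, assuming nothing has imported TensorFlow yet when it runs; the helper name configure_tf_gpu_allocator is hypothetical. Note this can only help with fragmentation: it does not give the GPU more VRAM.

```python
import os


def configure_tf_gpu_allocator(allocator: str = "cuda_malloc_async") -> str:
    """Select TensorFlow's GPU allocator through the TF_GPU_ALLOCATOR env var.

    Call this before TensorFlow is imported, so the setting is already in
    place when TensorFlow initializes the GPU.
    """
    os.environ["TF_GPU_ALLOCATOR"] = allocator
    return os.environ["TF_GPU_ALLOCATOR"]


if __name__ == "__main__":
    configure_tf_gpu_allocator()
    # import tensorflow as tf  # import TensorFlow only after this point
    print("TF_GPU_ALLOCATOR =", os.environ["TF_GPU_ALLOCATOR"])
```

Equivalently, the variable can be exported in the pod's shell before launching the script, e.g. `TF_GPU_ALLOCATOR=cuda_malloc_async python train.py` (the script name here is just a placeholder).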
I don't know
I don't know either without more information, but you can try debugging to figure that out,
or try a GPU with more VRAM right away.
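One simple debugging approach before renting a bigger GPU is to find the largest batch size that fits. A minimal sketch, assuming you can wrap a single training step in a function that takes the batch size; run_step and largest_fitting_batch_size are hypothetical names, and the demo uses a fake step instead of a real model.

```python
from typing import Callable, Type


def largest_fitting_batch_size(
    run_step: Callable[[int], None],
    start_batch: int,
    oom_error: Type[BaseException] = MemoryError,
) -> int:
    """Halve the batch size until one training step runs without OOM.

    run_step(batch_size) should build a batch of that size and run a single
    training step; it is expected to raise `oom_error` when memory runs out.
    """
    batch = start_batch
    while batch >= 1:
        try:
            run_step(batch)
            return batch
        except oom_error:
            batch //= 2  # too big: halve and retry
    raise RuntimeError("even batch size 1 does not fit in GPU memory")


if __name__ == "__main__":
    # Fake step for demonstration: pretend anything above 12 samples OOMs.
    def fake_step(batch_size: int) -> None:
        if batch_size > 12:
            raise MemoryError(f"batch {batch_size} too large")

    print(largest_fitting_batch_size(fake_step, start_batch=64))  # 8
```

With TensorFlow, a GPU OOM typically surfaces as tf.errors.ResourceExhaustedError, so that would be the exception to pass as oom_error. Memory from a failed attempt may not be fully released inside the same process, so restarting training with the batch size found is the safer choice.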
Yeah, it says so in the first few lines: your GPU ran out of VRAM.
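The first log line shows a single request of 256 MiB (268435456 bytes) from a dropout op in the UNet++ model. A back-of-the-envelope check of how tensor size scales with shape helps decide between shrinking the batch or input size and renting more VRAM. A minimal sketch: the shapes below are hypothetical, since the thread does not give the image size, batch size, or filter counts, and bytes per element depends on dtype (4 for float32, 2 for float16, 1 for bool).

```python
from math import prod

MIB = 1024 * 1024


def tensor_mib(shape: tuple[int, ...], bytes_per_element: int) -> float:
    """Size of a dense tensor in MiB, given its shape and element size."""
    return prod(shape) * bytes_per_element / MIB


if __name__ == "__main__":
    # Hypothetical NHWC activation: batch 16, 512x512 images, 64 channels.
    print(tensor_mib((16, 512, 512, 64), 1))  # 1-byte elements: 256.0
    print(tensor_mib((16, 512, 512, 64), 4))  # float32: 1024.0
    print(tensor_mib((8, 512, 512, 64), 4))   # half the batch: 512.0
```

Activations are kept around for the backward pass, so total usage is many tensors like this, not one. Each one scales linearly with batch size and with H×W, so halving the batch halves it and halving both image sides cuts it to a quarter.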