Issue: CUDA out-of-memory error while loading the model checkpoint.
Details:
Model: meta-llama/Llama-4-Scout-17B-16E-Instruct with 4-bit quantization (BitsAndBytesConfig)
Device detected: cuda (GPU with ~95 GiB total, ~600 MiB free)
Error occurs at ~92% of checkpoint shard loading:
OutOfMemoryError: Tried to allocate 2.50 GiB. GPU 1 has 94.97 GiB total, 600.19 MiB free.
Including non-PyTorch memory, 94.38 GiB in use, 93.46 GiB by PyTorch.
Current setup:
Using dtype=torch.float16, gradient checkpointing enabled
Batch size: 1, gradient accumulation steps: 4
Offloading layers with device_map="auto" and offload_folder="/workspace/offload" (a sketch of the loading code follows below)
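For reference, a minimal sketch of the loading code as described above. This is a reconstruction, not the actual script: AutoModelForCausalLM as the entry point and NF4 as the 4-bit quant type are assumptions not confirmed by the details.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# 4-bit quantization; NF4 + float16 compute are assumed, not confirmed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,            # newer transformers versions also accept dtype=
    device_map="auto",                    # let accelerate place layers across GPU/CPU/disk
    offload_folder="/workspace/offload",  # spill-over location for offloaded weights
)
model.gradient_checkpointing_enable()     # per the setup above
```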
Goal: Load the model successfully without exceeding GPU memory.
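One knob that may help here (a hedged suggestion, not part of the setup above) is max_memory, which caps what device_map="auto" may place on each GPU and pushes the remainder to CPU/disk. The caps below are hypothetical and would need tuning for this machine:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,                  # model_id and bnb_config as defined in the sketch above
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
    # Hypothetical caps: leave headroom on each GPU so allocation spikes
    # during shard loading (like the 2.50 GiB one in the error) do not OOM.
    max_memory={0: "80GiB", 1: "80GiB", "cpu": "200GiB"},
    offload_folder="/workspace/offload",
)
```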
The full error log is attached below.