Hi, I am getting an error with LoRA training with Prompt Tuning

Issue: CUDA out of memory while loading the model checkpoint.

Details:
- Model: meta-llama/Llama-4-Scout-17B-16E-Instruct with 4-bit quantization (BitsAndBytesConfig)
- Device detected: cuda (GPU with ~95 GiB total, ~600 MiB free)
- Error occurs at ~92% of checkpoint shard loading: OutOfMemoryError: Tried to allocate 2.50 GiB. GPU 1 has 94.97 GiB total, 600.19 MiB free. Including non-PyTorch memory, 94.38 GiB in use, 93.46 GiB by PyTorch.

Current setup:
- dtype=torch.float16, gradient checkpointing enabled
- Batch size: 1, gradient accumulation steps: 4
- Offloading layers (device_map="auto") with offload_folder="/workspace/offload"

Goal: load the model successfully without exceeding GPU memory. The error logs are attached below.
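For reference, a minimal sketch of how a loading setup like the one described might look with an explicit max_memory cap, so device_map="auto" offloads the overflow instead of filling the GPU. The cap values and variable names here are assumptions, not the code from the post:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights, as in the post
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Cap what device_map="auto" may place on each GPU so a few GiB of headroom
# remain for activations and the CUDA context; the rest goes to CPU RAM or
# the disk offload folder. The 80GiB / 120GiB figures are illustrative.
max_memory = {i: "80GiB" for i in range(torch.cuda.device_count())}
max_memory["cpu"] = "120GiB"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory=max_memory,
    offload_folder="/workspace/offload",   # same offload dir as in the post
)
```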
5 Replies
Madiator2011 · 3w ago
Either you need a bigger GPU or you need to reduce the training batch size.
Darshan Jain (OP) · 2w ago
This is what I have right now:

    args = TrainingArguments(
        output_dir=ADAPTER_DIR,
        per_device_train_batch_size=1,   # Reduced batch size for memory
        gradient_accumulation_steps=4,   # Accumulate gradients to simulate a larger batch
        learning_rate=3e-4,
        num_train_epochs=3,
        logging_steps=10,
        save_strategy="epoch",
        save_steps=50,
        dataloader_pin_memory=False,     # Disable pin memory for MPS
        # fp16=True,                     # Not supported on MPS, using torch_dtype=float16 instead
        gradient_checkpointing=True,     # Trade compute for memory
        max_grad_norm=1.0,               # Gradient clipping
        warmup_steps=100,                # Learning rate warmup
    )

@Elder Papa Madiator
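For completeness, a minimal sketch of the adapter side that typically pairs with a 4-bit base model (QLoRA-style); a PromptTuningConfig could be swapped in for the prompt-tuning variant. The rank, alpha, and target_modules below are assumptions, not values from the original script:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Make the quantized base model trainable with an adapter:
# enables input grads and casts norm layers for stability.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],    # illustrative; depends on the model
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only adapter weights should be trainable
```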
Madiator2011 · 2w ago
I can't help with settings.
Darshan Jain (OP) · 2w ago
@Elder Papa Madiator Do you have a suggestion for which GPU I should use?
riverfog7 · 2w ago
17*16/2 = 136 GB, so maybe a B200? Or an H200 might work.
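For context, a quick sketch of the back-of-envelope behind that estimate. The 17B x 16-expert figure counts shared (non-expert) weights multiple times, so treat it as a rough upper bound for the weights alone:

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (ignores activations, optimizer state, KV cache)."""
    return params_billion * bytes_per_param

rough_total_b = 17 * 16  # ~272B, the reply's upper-bound estimate

for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    print(f"{label:>5}: ~{weight_gb(rough_total_b, bytes_per_param):.0f} GB")
# fp16: ~544 GB, int8: ~272 GB, 4-bit: ~136 GB
```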
