Hi, I am getting an error with LoRA training with Prompt Tuning.
Issue: CUDA Out of Memory while loading the model checkpoint.
Details:
Model: meta-llama/Llama-4-Scout-17B-16E-Instruct with 4-bit quantization (BitsAndBytesConfig)
Device detected: cuda (GPU with ~95 GiB total, ~600 MiB free)
Error occurs at ~92% of checkpoint shard loading:
OutOfMemoryError: Tried to allocate 2.50 GiB. GPU 1 has 94.97 GiB total, 600.19 MiB free.
Including non-PyTorch memory, 94.38 GiB in use, 93.46 GiB by PyTorch.
Current setup:
Using dtype=torch.float16, gradient checkpointing enabled
Batch size: 1, gradient accumulation steps: 4
Offloading layers (device_map="auto") with offload_folder="/workspace/offload"
Goal: Load the model successfully without exceeding GPU memory.
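For context, the load step looks roughly like this (a minimal sketch; the max_memory caps and the exact bnb_4bit_* options are assumptions, not my exact values):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# 4-bit quantization config (BitsAndBytesConfig)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # assumption: NF4 quant type
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,       # assumption
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    # Cap per-GPU usage so shard loading leaves headroom; layers that do not
    # fit are offloaded to CPU RAM / the offload folder. Caps are illustrative;
    # add one entry per visible GPU.
    max_memory={0: "80GiB", "cpu": "120GiB"},
    offload_folder="/workspace/offload",
    torch_dtype=torch.float16,
)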
Error logs are attached below.
5 Replies
Either you need a bigger GPU or you need to reduce the training batch size.
This is what I have right now:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=ADAPTER_DIR,
    per_device_train_batch_size=1,    # Reduced batch size for memory
    gradient_accumulation_steps=4,    # Accumulate gradients to simulate larger batch
    learning_rate=3e-4,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    save_steps=50,                    # Ignored while save_strategy="epoch"
    dataloader_pin_memory=False,      # Disable pin memory for MPS
    # fp16=True,                      # Not supported on MPS, using torch_dtype=float16 instead
    gradient_checkpointing=True,      # Trade compute for memory
    max_grad_norm=1.0,                # Gradient clipping
    warmup_steps=100,                 # Learning rate warmup
)
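The adapter itself is attached with PEFT, roughly like this (a sketch; the rank, alpha, and target modules below are placeholder assumptions rather than my exact values):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Required for 4-bit base models before training (prepares the quantized
# model for gradient checkpointing and training).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                                      # assumption
    lora_alpha=32,                                             # assumption
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()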
@Elder Papa Madiator
I can't help with settings.
@Elder Papa Madiator Do you have a suggestion for which GPU I should use?
17 * 16 / 2 = 136 GB
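Rough back-of-envelope behind that number, assuming 16 experts of 17B parameters each at 4-bit (0.5 bytes per parameter); it is an upper bound since the experts share the attention layers:

total_params_b = 17 * 16                  # ~272B parameters (upper bound)
bytes_per_param = 0.5                     # 4-bit quantized weights
print(total_params_b * bytes_per_param)   # ~136 GB for the weights alone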
Maybe a B200?
Or an H200 might work.