Issue: CUDA out-of-memory error while loading the model checkpoint.
Details:
Model: meta-llama/Llama-4-Scout-17B-16E-Instruct with 4-bit quantization (BitsAndBytesConfig)
Device detected: cuda (GPU with ~95 GiB total, ~600 MiB free)
Error occurs at ~92% of checkpoint shard loading:
OutOfMemoryError: Tried to allocate 2.50 GiB. GPU 1 has 94.97 GiB total, 600.19 MiB free.
Including non-PyTorch memory, 94.38 GiB in use, 93.46 GiB by PyTorch.
Current setup:
Using dtype=torch.float16, gradient checkpointing enabled
Batch size: 1, gradient accumulation steps: 4
Offloading layers with device_map="auto" and offload_folder="/workspace/offload" (a sketch of the loading code follows below)
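For reference, a minimal sketch of the loading code as described above. This is a reconstruction, not the actual script: AutoModelForCausalLM as the entry point and NF4 as the 4-bit quant type are assumptions not confirmed by the details.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# 4-bit quantization; NF4 + float16 compute are assumed, not confirmed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,            # newer transformers versions also accept dtype=
    device_map="auto",                    # let accelerate place layers across GPU/CPU/disk
    offload_folder="/workspace/offload",  # spill-over location for offloaded weights
)
model.gradient_checkpointing_enable()     # per the setup above
```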
Goal: Load the model successfully without exceeding GPU memory.
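One knob that may help here (a hedged suggestion, not part of the setup above) is max_memory, which caps what device_map="auto" may place on each GPU and pushes the remainder to CPU/disk. The caps below are hypothetical and would need tuning for this machine:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,                  # model_id and bnb_config as defined in the sketch above
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
    # Hypothetical caps: leave headroom on each GPU so allocation spikes
    # during shard loading (like the 2.50 GiB one in the error) do not OOM.
    max_memory={0: "80GiB", 1: "80GiB", "cpu": "200GiB"},
    offload_folder="/workspace/offload",
)
```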
The full error log is attached below.