Hello RunPod Support Team,
I am encountering a reproducible container crash when training a LoRA model using WAN2.2 inside the AI Toolkit – Ostris – UI – Official template.
Pod configuration:
GPU: RTX 5090
Template: AI Toolkit – Ostris – UI – Official
Pod RAM: 96 GB
GPU VRAM: 32 GB (RTX 5090)
Behavior:
Container starts normally.
Ostris UI loads correctly.
Training begins successfully.
After some time during LoRA training with WAN2.2, the container becomes unhealthy and crashes with:
WARN: container is unhealthy: triggered memory limits (OOM)
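Since the warning mentions container memory limits, I assume it refers to the pod's system RAM rather than GPU VRAM. To help narrow this down, I can run a small logger like the sketch below inside the pod during training. It is only a diagnostic sketch: it assumes the container exposes the standard Linux cgroup memory files under /sys/fs/cgroup (cgroup v2 first, then cgroup v1), which I have not verified for this template, and the function names are my own.

```python
import time
from pathlib import Path
from typing import Optional, Tuple

# Candidate (usage, limit) file pairs relative to the cgroup root.
# Assumption: standard Linux layouts, cgroup v2 first, then cgroup v1.
_CANDIDATES = [
    ("memory.current", "memory.max"),
    ("memory/memory.usage_in_bytes", "memory/memory.limit_in_bytes"),
]


def read_container_memory(
    root: str = "/sys/fs/cgroup",
) -> Optional[Tuple[int, Optional[int]]]:
    """Return (used_bytes, limit_bytes), or None if no cgroup files are found.

    limit_bytes is None when the cgroup reports no limit ("max").
    """
    base = Path(root)
    for usage_name, limit_name in _CANDIDATES:
        usage_file = base / usage_name
        limit_file = base / limit_name
        if usage_file.is_file() and limit_file.is_file():
            used = int(usage_file.read_text().strip())
            raw_limit = limit_file.read_text().strip()
            limit = None if raw_limit == "max" else int(raw_limit)
            return used, limit
    return None


def format_report(used: int, limit: Optional[int]) -> str:
    """Format a reading as a human-readable line in GiB."""
    gib = 1024 ** 3
    if limit is None:
        return f"used={used / gib:.2f} GiB, limit=unlimited"
    percent = 100 * used / limit
    return f"used={used / gib:.2f} GiB, limit={limit / gib:.2f} GiB ({percent:.1f}%)"


def log_memory(samples: int = 1, interval_s: float = 0.0,
               root: str = "/sys/fs/cgroup") -> None:
    """Print `samples` readings, sleeping `interval_s` seconds between them."""
    for i in range(samples):
        reading = read_container_memory(root)
        if reading is None:
            print("no cgroup memory files found", flush=True)
        else:
            print(format_report(*reading), flush=True)
        if i < samples - 1:
            time.sleep(interval_s)
```

For example, calling log_memory(samples=360, interval_s=10) in a second terminal would record usage every 10 seconds for about an hour. If the reported limit is well below 96 GB, or usage climbs steadily until the crash, I can attach that output to this ticket.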