I would use 7 repeats and classification images with 10 epochs, no captions, and 128/64 weights. For Adafactor, use the classic parameters: LR and Unet LR: 0.0001, text encoder (TE) LR: 5e-05, token length: 225, model: RealisticVision 5.1, optimizer args: scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01
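
If it helps to keep these values in one place, here is a rough Python sketch that collects them and turns the space-separated optimizer args into typed keyword arguments. The dictionary keys and the `parse_optimizer_args` helper are my own names for illustration, not the field names of any particular trainer, so map them to whatever your tool expects. Note that with relative_step=False Adafactor does not compute its own step size, which is why the learning rates are set explicitly.

```python
import ast

# Optimizer args exactly as given above, space-separated key=value pairs.
OPTIMIZER_ARGS = (
    "scale_parameter=False relative_step=False "
    "warmup_init=False weight_decay=0.01"
)


def parse_optimizer_args(arg_string):
    """Hypothetical helper: turn 'k=v k=v' into a dict of typed values."""
    kwargs = {}
    for item in arg_string.split():
        key, value = item.split("=", 1)
        # literal_eval converts 'False' -> False and '0.01' -> 0.01 safely.
        kwargs[key] = ast.literal_eval(value)
    return kwargs


# Key names below are illustrative assumptions, not a real config schema.
settings = {
    "repeats": 7,
    "epochs": 10,
    "learning_rate": 1e-4,
    "unet_lr": 1e-4,
    "text_encoder_lr": 5e-5,
    "max_token_length": 225,
    "optimizer": "Adafactor",
    "optimizer_kwargs": parse_optimizer_args(OPTIMIZER_ARGS),
}

if __name__ == "__main__":
    for name, value in settings.items():
        print(f"{name}: {value}")
```

Parsing with `ast.literal_eval` keeps the booleans and the float as real Python values instead of strings, which matters if you pass them straight to an optimizer constructor.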