Training jobs using script
Hey, can anyone tell me whether RunPod lets me create a training script that can be run from anywhere, which I can use to create a GPU instance and load and save my data to external cloud storage, just like AWS SageMaker's training script mode? I need to train multiple models this way, with different architectures, to see which one performs best.
19 Replies
https://docs.runpod.io/sdks/overview
https://docs.runpod.io/cli/overview
https://docs.runpod.io/pods/configuration/export-data
Here are some useful links.
Overview | RunPod Documentation
Unlock serverless functionality with RunPod SDKs, enabling developers to create custom logic, simplify deployments, and programmatically manage infrastructure, including Pods, Templates, and Endpoints.
Overview | RunPod Documentation
RunPod CLI (runpodctl) is a command-line interface tool designed to automate and manage GPU pods on RunPod.
Export data | RunPod Documentation
Export RunPod data to various cloud providers, including Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, Backblaze B2 Cloud Storage, and Dropbox, with secure key and access token management.
I'm fairly new to RunPod. Can you please point me to a tutorial where a remote training job is run on a pod, the model weights are stored on S3, and the pod automatically kills itself once the training is complete?
You'll probably have to write some code to pull data from S3, and after training you can terminate the pod using our CLI. Btw, ChatGPT is really good at writing code😀
https://docs.runpod.io/cli/overview
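For anyone reading later, here is a rough sketch of what that code might look like when run inside the pod. It assumes boto3 is installed with AWS credentials available, that runpodctl is on the PATH, and that the RUNPOD_POD_ID environment variable is set, as in the command quoted further down the thread. The function names (run_job, s3_fetch, s3_save) are made up for illustration and are not part of any RunPod API.

```python
import os
import subprocess


def remove_command(pod_id):
    """Build the CLI call mentioned in this thread: runpodctl remove pod <id>."""
    return ["runpodctl", "remove", "pod", pod_id]


def run_job(fetch_data, train, save_weights, run=subprocess.run, env=os.environ):
    """Fetch data, train, save the weights, then remove the pod.

    The pod is removed only after save_weights returns without raising.
    If any step fails, the exception propagates and the pod keeps running
    so the failure can be inspected (a design choice: it also keeps billing).
    """
    data_path = fetch_data()
    weights_path = train(data_path)
    save_weights(weights_path)

    pod_id = env.get("RUNPOD_POD_ID")
    if pod_id is None:
        return None  # no pod ID in the environment, so nothing to remove
    cmd = remove_command(pod_id)
    run(cmd, check=True)
    return cmd


def s3_fetch(bucket, key, dest):
    """Download one object from S3 to a local path (hypothetical helper)."""
    import boto3  # imported lazily so the rest of the module works without it

    boto3.client("s3").download_file(bucket, key, dest)
    return dest


def s3_save(bucket, key, src):
    """Upload a local weights file to S3 (hypothetical helper)."""
    import boto3

    boto3.client("s3").upload_file(src, bucket, key)
```

You would call it with something like run_job(lambda: s3_fetch("my-bucket", "data.zip", "/workspace/data.zip"), my_train_fn, lambda p: s3_save("my-bucket", "unet/model.pt", p)), where the bucket, keys, and my_train_fn are placeholders for your own values.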
Sorry, it is still unclear. Does RunPod have a tutorial on training a custom model on a GPU instance? I have tried searching for one, but I have not found any.
Probably not working anymore since the Dreambooth endpoint used TheLastBen's code
I recommend using Kohya_ss, EveryDream2Trainer or OneTrainer
This guy has some videos for training image models:
https://www.youtube.com/@SECourses/videos
What kind of model are you training?
Well, I'm training different kinds of segmentation models for my tasks, varying from simple U-Net to Attention U-Net, and might also go for transformer-based segmentation models. I'd like to run an instance for each model, so I can compare their performance in as little time as possible.
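One way to fan this out might be to build one pod spec per architecture and pass each to the SDK. This is only a sketch: it assumes the RunPod Python SDK's runpod.create_pod accepts name, image_name, gpu_type_id, and env keyword arguments (check the SDK docs linked above), and the image name, GPU type ID, bucket, and environment variable names are placeholders, not real values.

```python
def build_pod_specs(architectures, image_name, gpu_type_id, weights_bucket):
    """Return one keyword-argument dict per architecture to train."""
    specs = []
    for arch in architectures:
        specs.append({
            "name": f"train-{arch}",
            "image_name": image_name,    # placeholder: your training image
            "gpu_type_id": gpu_type_id,  # placeholder: a GPU type ID from RunPod
            # Hypothetical variables that the training script would read.
            "env": {
                "ARCH": arch,
                "WEIGHTS_BUCKET": weights_bucket,
                "WEIGHTS_KEY": f"{arch}/model.pt",
            },
        })
    return specs


def launch_all(specs, create_pod):
    """Call create_pod once per spec and collect whatever it returns.

    In real use create_pod would be runpod.create_pod (an assumption about
    the SDK); it is passed in so the logic can be dry-run with a stub.
    """
    return [create_pod(**spec) for spec in specs]
```

For a dry run, launch_all(specs, lambda **kw: kw["name"]) just echoes the pod names without creating anything.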
A big problem is auto-killing the pod once training is complete, and saving the model weights before that happens.
Can you please shed some light on how to auto-kill the instance?
Okay, thanks!
If I just stop my pod and do not remove it, will I still be billed? And once I'm inside the pod, can I stop it from there? Will the command
runpodctl remove pod $RUNPOD_POD_ID
work from inside the pod?