Offloading multiple models
Hi guys, does anyone have experience with an inference pipeline that uses multiple models? I'm wondering how best to manage loading models whose combined size exceeds a worker's VRAM if everything is kept on the GPU. Any best practices / examples for keeping model load time as short as possible?
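To make it concrete, here's roughly the kind of swap-to-CPU pattern I have in mind (just a minimal sketch with tiny placeholder models in PyTorch, not the actual pipeline):

```python
# Minimal sketch: keep every model in CPU RAM and move only the one
# that's currently needed onto the GPU, so VRAM never holds more than
# one model at a time. Models here are tiny placeholders.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

models = {
    "stage_a": nn.Sequential(nn.Linear(512, 512), nn.ReLU()),
    "stage_b": nn.Sequential(nn.Linear(512, 512), nn.ReLU()),
}

def run_stage(name, x):
    model = models[name].to(device)      # load onto GPU just-in-time
    with torch.no_grad():
        out = model(x.to(device))
    model.to("cpu")                      # offload back to CPU RAM
    if device == "cuda":
        torch.cuda.empty_cache()         # release the cached VRAM blocks
    return out

x = torch.randn(1, 512)
x = run_stage("stage_a", x)
x = run_stage("stage_b", x)
print(x.shape)
```

The idea is to only pay the host-to-device copy instead of a full reload from disk each time, but I'm not sure this is the best approach.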
Thanks!
Unknown User•15mo ago
btw, you can also select multiple GPUs per worker if you need to load large models.
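e.g. with two GPUs attached to the worker you could pin each model to its own device and just pass the activations between them, so nothing ever has to be swapped or reloaded (rough sketch with placeholder models, assuming a 2-GPU worker):

```python
# Rough sketch: on a worker with 2 GPUs, keep each model resident on
# its own device and only move the intermediate tensors between them.
import torch
import torch.nn as nn

# Placeholder models standing in for the real (much larger) ones.
model_a = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
model_b = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:1")

@torch.no_grad()
def run_pipeline(x):
    h = model_a(x.to("cuda:0"))
    return model_b(h.to("cuda:1"))   # move activations, not weights

out = run_pipeline(torch.randn(1, 512))
print(out.shape, out.device)
```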
Some tips to reduce start time: