I want to add RunPod as one tier of load-balanced LLM models behind an app similar to openrouter.ai, but with the routing decision made inside our own infrastructure. When my app invokes a serverless instance and the task completes, how am I billed for idle time if the container unloads the model from GPU memory?
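For context, the routing side would look roughly like this. It's only a minimal sketch, assuming RunPod's serverless `/runsync` REST endpoint; the environment variables and the `call_runpod` helper are placeholders of mine, not part of any existing code:

```python
import os
import requests

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]      # placeholder credentials
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]     # placeholder endpoint id

def call_runpod(prompt: str) -> dict:
    """Send one completion request to a RunPod serverless endpoint,
    chosen by our own load-balancing logic upstream of this call."""
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
        headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
        json={"input": {"prompt": prompt}},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()
```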
In other words, I want to reduce costs and improve performance by only reloading the model after an idle timeout, and while idle pay only for the small app footprint in storage/memory.
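On the worker side, this is the pattern I have in mind: load the model lazily on the first job after a cold start, so an idle or unloaded worker only carries the small app footprint. A minimal sketch, assuming the RunPod Python SDK's serverless worker API; `load_model` and the returned payload are placeholders:

```python
import runpod

MODEL = None  # stays empty until the first job after a (re)start

def load_model():
    # placeholder: load weights onto the GPU here (e.g. vLLM / transformers)
    ...

def handler(job):
    global MODEL
    if MODEL is None:
        # cold start: pay the model load cost once, then reuse it
        MODEL = load_model()
    prompt = job["input"]["prompt"]
    return {"output": f"generated text for: {prompt}"}  # placeholder inference

runpod.serverless.start({"handler": handler})
```

The question is how the billing meter behaves between jobs in that setup: is the idle window before the worker scales down charged at the full GPU rate even though the model has been unloaded?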