Questions on large LLM hosting

1. I see mentions of keeping a model in a Network Volume to share between all endpoints. But if I already have my model inside my container image, wouldn't the model already be cached in that image? Which would be faster for cold boots?
2. My workload is not consistent, so I understand FlashBoot is unlikely to help much, but is there any reason not to enable it? When I hover over it, it says to test output quality first. What does this mean, and why?
3. What is "container disk"? My models are already inside my image and they seem to load fine, so what is the purpose of this? Additional space to be used at runtime, e.g. if I were downloading a model when the container starts?
6 Replies
Xangelix (OP) · 2y ago
4. An extra one: how much VRAM does vLLM typically use beyond the weights? I'm testing a model now whose weights are only 38 GB, but I'm getting OOM on 48 GB GPUs...
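Much of the gap between weight size and total VRAM use in vLLM is the paged KV cache it preallocates, plus activation and CUDA graph overhead. A rough back-of-envelope sketch (the model dimensions below are illustrative assumptions, not the dimensions of the model in this thread):

```python
# Rough KV-cache sizing sketch: beyond the weights, vLLM reserves GPU
# memory for the paged KV cache. Per cached token, that is one key and
# one value vector per layer per KV head.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, one entry per layer, fp16 by default
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 70B-class model with grouped-query attention, fp16
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
gib = 1024 ** 3
total = per_token * 32_768  # budget for 32k cached tokens
print(f"{per_token} bytes/token, {total / gib:.1f} GiB for 32k tokens")  # 10.0 GiB
```

So even a few GiB of "headroom" above the weights can disappear quickly once vLLM carves out cache space, which is why a 38 GB model can OOM on a 48 GB card.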
Xangelix (OP) · 2y ago
Could this be fixed with ENFORCE_EAGER=1?
Unknown User · 2y ago
Message Not Public
Xangelix (OP) · 2y ago
This didn't seem to lower it enough. Is this message a typo? Wouldn't I want to raise GPU_MEMORY_UTILIZATION if I'm getting OOM? https://discord.com/channels/912829806415085598/1211740161948524564/1212674202465869864
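For context on the question: in vLLM, `gpu_memory_utilization` caps the fraction of total VRAM the engine may claim; whatever remains inside that cap after weights and runtime overhead becomes KV cache. A simplified sketch of that budget arithmetic (not vLLM's actual code; the 4 GiB overhead figure is an assumption):

```python
# Simplified sketch of vLLM's memory budget, not its actual implementation.
# vLLM caps its allocation at gpu_memory_utilization * total VRAM; the
# KV cache gets what is left after weights and runtime overhead.

def kv_cache_budget_gib(total_vram_gib: float, weights_gib: float,
                        overhead_gib: float,
                        gpu_memory_utilization: float = 0.90) -> float:
    budget = total_vram_gib * gpu_memory_utilization
    return budget - weights_gib - overhead_gib

# Hypothetical numbers matching the thread: 48 GB card, 38 GB of weights,
# assumed 4 GiB of activation/graph overhead
low = kv_cache_budget_gib(48, 38, 4, gpu_memory_utilization=0.90)
high = kv_cache_budget_gib(48, 38, 4, gpu_memory_utilization=0.95)
print(f"at 0.90: {low:.1f} GiB KV cache; at 0.95: {high:.1f} GiB")
```

Under this model, raising the value does give vLLM a larger slice of the card, which helps when the failure is "not enough room for KV cache"; lowering it only helps when something else on the GPU needs the remaining memory.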
Unknown User · 2y ago
Message Not Public