Questions on large LLM hosting

1. I see mentions of keeping a model in a Network Volume to share between all endpoints. But if I already have my model inside my container image, wouldn't the model already be cached in that image? Which would be faster for cold boots?
2. My workload is not consistent, so I understand FlashBoot is unlikely to help much, but is there any reason not to enable it? When I hover over it, it says to test output quality first. What does this mean, and why?
3. What is "container disk"? My models are already inside my image and they seem to load fine, so what is the purpose of this? Additional space to be used at runtime, e.g. if I were downloading a model when the container starts?
6 Replies
Xangelix (OP) · 2y ago
4. An extra one: how much VRAM does vLLM typically use beyond the weights? I'm testing a model now whose weights are only 38 GB, but I'm getting OOM on 48 GB GPUs...
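Much of the gap between weight size and total VRAM use in vLLM is the paged KV cache it preallocates, plus activation and CUDA graph overhead. A rough back-of-envelope sketch (the model dimensions below are illustrative assumptions, not the dimensions of the model in this thread):

```python
# Rough KV-cache sizing sketch: beyond the weights, vLLM reserves GPU
# memory for the paged KV cache. Per cached token, that is one key and
# one value vector per layer per KV head.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, one entry per layer, fp16 by default
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 70B-class model with grouped-query attention, fp16
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
gib = 1024 ** 3
total = per_token * 32_768  # budget for 32k cached tokens
print(f"{per_token} bytes/token, {total / gib:.1f} GiB for 32k tokens")  # 10.0 GiB
```

So even a few GiB of "headroom" above the weights can disappear quickly once vLLM carves out cache space, which is why a 38 GB model can OOM on a 48 GB card.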
Xangelix (OP) · 2y ago
Could this be fixed with ENFORCE_EAGER=1?
Unknown User · 2y ago
Message Not Public
Xangelix (OP) · 2y ago
This didn't seem to lower it enough. Is this message a typo? Wouldn't I want to raise GPU_MEMORY_UTILIZATION if I'm getting OOM? https://discord.com/channels/912829806415085598/1211740161948524564/1212674202465869864
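For context on the question: in vLLM, `gpu_memory_utilization` caps the fraction of total VRAM the engine may claim; whatever remains inside that cap after weights and runtime overhead becomes KV cache. A simplified sketch of that budget arithmetic (not vLLM's actual code; the 4 GiB overhead figure is an assumption):

```python
# Simplified sketch of vLLM's memory budget, not its actual implementation.
# vLLM caps its allocation at gpu_memory_utilization * total VRAM; the
# KV cache gets what is left after weights and runtime overhead.

def kv_cache_budget_gib(total_vram_gib: float, weights_gib: float,
                        overhead_gib: float,
                        gpu_memory_utilization: float = 0.90) -> float:
    budget = total_vram_gib * gpu_memory_utilization
    return budget - weights_gib - overhead_gib

# Hypothetical numbers matching the thread: 48 GB card, 38 GB of weights,
# assumed 4 GiB of activation/graph overhead
low = kv_cache_budget_gib(48, 38, 4, gpu_memory_utilization=0.90)
high = kv_cache_budget_gib(48, 38, 4, gpu_memory_utilization=0.95)
print(f"at 0.90: {low:.1f} GiB KV cache; at 0.95: {high:.1f} GiB")
```

Under this model, raising the value does give vLLM a larger slice of the card, which helps when the failure is "not enough room for KV cache"; lowering it only helps when something else on the GPU needs the remaining memory.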
Unknown User · 2y ago
Message Not Public