Optimizing vLLM for serverless cold starts

Hello. I am trying to optimize vLLM for a serverless endpoint. The default vLLM settings are blazing fast on warm (cached) workers (~1s) but unusable on a cold start, where initialization takes 40-60 seconds or more.

Forcing eager mode skips CUDA graph capture and brings cold-start initialization down to ~20s, at the cost of slower generation. Beyond that, I am stuck on what else could be improved: the longest remaining steps are creating the LLM engine and vLLM's memory profiling stage, each taking up to 6 seconds. I am attaching the complete log file, annotated with timings, from one such job.
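
For anyone who wants to reproduce this, here is a minimal sketch of an eager-mode setup with a simple stage timer. It assumes vLLM's offline `LLM` entry point; the helper names (`stage_timer`, `cold_start_engine_kwargs`, `build_engine`) and the model name in the usage line are placeholders for illustration, not part of vLLM or of my actual worker code.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name, timings):
    """Record the wall-clock seconds spent in a named stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def cold_start_engine_kwargs(model):
    """Engine arguments biased toward a fast cold start (illustrative values)."""
    return {
        "model": model,
        # Skip CUDA graph capture: faster startup, slower generation.
        "enforce_eager": True,
        # vLLM's default value, listed only to make it explicit.
        "gpu_memory_utilization": 0.90,
    }


def build_engine(model):
    """Create the engine and return it together with the measured init time."""
    # Imported lazily so the import cost is paid only when an engine is built.
    from vllm import LLM

    timings = {}
    with stage_timer("engine_init", timings):
        llm = LLM(**cold_start_engine_kwargs(model))
    return llm, timings
```

Usage would look like `llm, timings = build_engine("your-org/your-model")`, after which `timings["engine_init"]` holds the engine creation time in seconds. This only measures the total; the per-stage breakdown still has to come from vLLM's own log output.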

I am wondering if anyone has found a sweet spot in the settings that gives the fastest cold starts with acceptable generation speed, or if there is a way to skip the initialization for newly spawned workers. I have already looked into several options, from automatic caching on a network volume (which didn't work at all, and with bitsandbytes models no cache is saved in the first place) to snapshotting and sharing the initialized state between workers (which is probably not possible).

Any ideas or help would be appreciated.