I'm running a queue-based serverless endpoint that handles LLM (Large Language Model) workloads. To avoid re-fetching model weights on every worker, I attached a network volume so the weights could be cached and shared across workers. Since making that change, worker availability has degraded significantly: most workers now show as throttled for the majority of the day.
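For context, here is a minimal sketch of the caching pattern I mean. The mount path, cache directory, and model name are placeholders, not my actual configuration; the idea is simply that the first worker to need the weights downloads them onto the shared volume and later workers reuse the files:

```python
import os
from huggingface_hub import snapshot_download

# Hypothetical mount point where the platform attaches the network volume.
VOLUME_MOUNT = "/mnt/network-volume"
MODEL_ID = "org/some-llm"  # placeholder model identifier

def ensure_weights_cached() -> str:
    """Return a local path to the model weights, downloading them
    to the shared volume only if they are not already there."""
    cache_dir = os.path.join(VOLUME_MOUNT, "hf-cache")
    # snapshot_download is idempotent: if the snapshot already exists
    # under cache_dir, it returns the path without re-downloading.
    return snapshot_download(repo_id=MODEL_ID, cache_dir=cache_dir)
```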
Is there a recommended workaround or alternative approach that achieves model weight caching without triggering this throttling? Any guidance on best practices for shared storage in serverless setups would be greatly appreciated.