Worker Throttling Issues After Attaching Network Volume to Serverless Endpoint
I'm running a queue-based serverless endpoint that handles LLM workloads. To cache model weights across workers, I attached a network volume. Since then, worker availability has degraded significantly: most workers show as throttled for the majority of the day.
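For context, my worker resolves model weights from the volume roughly like this. This is a minimal sketch: the `/runpod-volume` mount path, the `MODEL_CACHE_DIR` variable, and the local fallback directory are assumptions about my setup, not anything guaranteed by the platform.

```python
import os
from pathlib import Path

# Assumed cache location on the attached network volume; in my setup the
# volume is mounted at /runpod-volume (overridable via MODEL_CACHE_DIR).
VOLUME_CACHE = Path(os.environ.get("MODEL_CACHE_DIR", "/runpod-volume/models"))


def resolve_model_path(model_name: str) -> Path:
    """Return the shared cached weights dir if present, else a local fallback."""
    cached = VOLUME_CACHE / model_name
    if cached.is_dir():
        # Reuse weights already downloaded to the shared volume by another worker.
        return cached
    # Fallback: container-local path, meaning each worker re-downloads weights.
    return Path("/tmp/models") / model_name
```

The intent was that only the first worker pays the download cost and everyone else reads from the volume, which is why losing workers to throttling defeats the purpose.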
Is there a recommended workaround or alternative approach to achieve model weight caching without causing these throttling issues? Any guidance on best practices for shared storage in serverless setups would be greatly appreciated.