Hi,
I’m running a RunPod Serverless endpoint and occasionally see it get throttled during inference.
The workload is GPU-heavy (video / image generation), and throttling seems to occur when requests run longer than usual or when CPU-side processing (e.g. ffmpeg preprocessing) is involved.
I’d like to understand:
• What are the common causes of throttling on Serverless endpoints?
• What are the recommended ways to mitigate throttling?
(e.g. limiting concurrency, splitting CPU/GPU workloads, adjusting endpoint settings, or using a different deployment type)
Any guidance or best practices would be appreciated. Thanks!