We are using two serverless endpoints on RunPod, and the "Delay Time" (which I assume measures end-to-end time) varies drastically between them. Both use the same hardware (the A5000 option), yet one has sub-second delay times while the other ranges from ~50 seconds up to 180 seconds.
On the slow endpoint, the worst cold start time is reported as 13s and the execution time is ~2s, which do not add up to the delay time. There are ~50 seconds unaccounted for.
The other endpoint, using the same hardware, does not show such long delay times.
Solution
Delay time is NOT end-to-end time. It is the cold start time plus the time your request spends in the queue before a worker picks it up. Delay time can increase dramatically if all of your workers are throttled.
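
To make the breakdown concrete, here is a minimal sketch of the arithmetic using the numbers from the question. The function names are hypothetical and not part of any RunPod API. It assumes that delay time is exactly cold start plus queue wait, as described above, and that end-to-end time is roughly delay time plus execution time (network overhead is ignored).

```python
def queue_wait_seconds(delay_time_s: float, cold_start_s: float) -> float:
    """Estimate time spent queued, assuming delay = cold start + queue wait.

    Clamped at zero because a warm worker may pick up a request
    without paying the worst-case cold start.
    """
    return max(delay_time_s - cold_start_s, 0.0)


def approx_end_to_end_seconds(delay_time_s: float, execution_time_s: float) -> float:
    """Rough end-to-end estimate (assumption): delay time plus execution time."""
    return delay_time_s + execution_time_s


if __name__ == "__main__":
    # Values reported for the slow endpoint in the question.
    worst_cold_start_s = 13.0
    execution_s = 2.0
    for delay_s in (50.0, 180.0):
        wait_s = queue_wait_seconds(delay_s, worst_cold_start_s)
        total_s = approx_end_to_end_seconds(delay_s, execution_s)
        print(f"delay={delay_s}s queue_wait>={wait_s}s end_to_end~{total_s}s")
```

Under these assumptions, most of the observed delay on the slow endpoint is queue wait rather than cold start, which is consistent with requests waiting for a worker that is not throttled.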