Issue with WebSocket latency over serverless HTTP proxy since RunPod outage
We have a RunPod serverless endpoint that we use to stream frames over a direct one-to-one WebSocket connection. There is a lightweight version of this endpoint that streams simple diagnostic images, and a production version that streams AI-generated frames. Both are configured to stream at 18 fps to produce an animation.
Both versions of this endpoint now fail to stream frames at a reasonable rate, hovering around 1 fps. The lightweight diagnostic frames take virtually no time to generate, and we have confirmed with logging that the AI-generated frames in the production version are not generating any slower and should still be able to meet the 18 fps target. However, the time to send each frame over the WebSocket is on the order of 1 s per frame and is very unstable. See below a snippet from our logs showing fast image generation times but slow times for sending images over the WebSocket.
Compare this to the attached screenshot of a previously working version, where the logs show us receiving many more than one frame within a one-second window.
We only started seeing this issue after RunPod came back up from the outage earlier today. We have been testing this setup in a variety of configurations over the last two weeks, and the problem first appeared today, right after the outage.
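For context, the per-frame numbers in our logs could be gathered with instrumentation roughly like the sketch below. This is illustrative only, not our production code: the URL, frame size, and `generate_frame` helper are placeholders, and it uses the `websockets` package to time generation separately from the send over the proxy.

```python
import asyncio
import time

import websockets  # pip install websockets


def generate_frame(i: int) -> bytes:
    """Placeholder frame generator; the real endpoints produce diagnostic or AI frames."""
    return b"\x00" * 50_000  # ~50 kB dummy payload


async def stream_frames(ws_url: str, num_frames: int = 100) -> None:
    """Generate frames and log both generation time and WebSocket send time."""
    async with websockets.connect(ws_url) as ws:
        for i in range(num_frames):
            t0 = time.perf_counter()
            frame = generate_frame(i)
            t1 = time.perf_counter()
            await ws.send(frame)  # send one binary frame; returns once the data is written out
            t2 = time.perf_counter()
            print(f"frame {i}: generate={t1 - t0:.3f}s send={t2 - t1:.3f}s")


if __name__ == "__main__":
    # Placeholder URL; substitute your endpoint's proxy WebSocket URL.
    asyncio.run(stream_frames("wss://example-pod-8888.proxy.runpod.net/ws"))
```

With this kind of logging, the generate times stay well under the 18 fps budget while the send times are where the ~1 s per frame shows up.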
We would very much appreciate some attention on this issue, @Dj; it is having a significant impact on our org at the moment. Could you let us know if there are other tests we could run on our end that would provide helpful data to assess the root cause and identify a solution? Thanks very much for your help.
Tagging @huemin for visibility.

7 Replies
The timing of the outage and this specific bug (assuming you're seeing the issue I think it is) are unrelated. We deliver traffic to/from your pod through the RunPod Proxy, which is about 6 or so servers deployed in the US and EU. We know the actual IP of your host and tunnel that traffic through whichever server would be the fastest.
It's interesting that I only started seeing this issue around the last time we had an outage affecting serverless, and that more users are affected after another serverless outage. Those events may be related, but since I'm not certain I won't confirm that yet. Do you also see the issue with the proxy when testing locally? If so, can you help me by grabbing an mtr to the URL you have in that screenshot as "WebSocket URL"? You don't have to share the output of the mtr here - you can DM me.

DM'ed! Thanks
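If it helps anyone else reproduce this, a local probe along these lines could be used alongside the mtr to time WebSocket round trips through the proxy. The URL is a placeholder, and this is just a sketch using the `websockets` package's ping/pong, not an official diagnostic tool.

```python
import asyncio
import statistics
import time

import websockets  # pip install websockets


async def probe(ws_url: str, count: int = 20) -> None:
    """Time WebSocket ping/pong round trips through the proxy."""
    rtts = []
    async with websockets.connect(ws_url) as ws:
        for _ in range(count):
            t0 = time.perf_counter()
            pong_waiter = await ws.ping()  # send a ping frame
            await pong_waiter              # resolves when the matching pong arrives
            rtts.append(time.perf_counter() - t0)
            await asyncio.sleep(0.5)
    print(f"rtt min={min(rtts):.3f}s median={statistics.median(rtts):.3f}s max={max(rtts):.3f}s")


if __name__ == "__main__":
    # Placeholder URL; use the "WebSocket URL" shown for your own endpoint.
    asyncio.run(probe("wss://example-pod-8888.proxy.runpod.net/ws"))
```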
My use case is identical, and I think I've had the same issue crop up over the last 24 hours. Did you two ever figure out what this issue was, @abush @Dj?
I'm sending frames over a WebSocket connection to and from the instance spun up by serverless at about 24 fps bidirectionally. This had been working great until today, when frames barely ever seem to make it back over the WebSocket connection. No changes on my end.
I spoke with DJ about this over DMs and provided some logging while the issue was occurring. DJ informed me that this was most likely a resource issue on the web proxy servers, and that they could allocate more RAM to address the problem. Since then, we haven't seen the issue again, but maybe DJ can provide some further clarification.
I'm working on figuring this one out. It's a complicated problem to debug, and while I have a theory it's equally complicated for me to test it.
My belief is that the servers we're using to proxy traffic are in need of a reboot due to stress. These machines are just under heavy utilization, and the timing of outages exacerbates it. A couple have shown a noticeable improvement, but the majority ride their usage near the top. There are about 16 of these servers, and because they're behind Cloudflare and then doing routing of their own, I can't know which one a given user is connecting through at a given moment.
For similar reasons, I also can't know which servers are safe to reboot. I'll see how easy it is to convince someone to make these VMs more powerful, but that's really just a theory for now.
Is the situation the same for serverless vs pods? I used to do this by manually orchestrating pods to spin up and down and never had the issue. Startup time was way slower and it would be a ton of work to switch back to that, and I don't know whether the same issue would still be present even if I did switch back.
It's all traffic routed through the HTTP Proxy (podid-port.proxy.runpod.net).