Runpod3mo ago
Tenofas

EUR-IS-1 extremely slow

Since today, Aug 13th, the EUR-IS-1 datacenter seems extremely slow. It was working fine yesterday. Today, using ComfyUI with my usual template, generation times are 10x slower, and I keep getting "Disconnected" messages... is anyone else facing the same trouble?
31 Replies
gufisha
gufisha3mo ago
Yes. This is painful, can't even get the template running.
Aron
Aron3mo ago
same here.
gufisha
gufisha3mo ago
Yeah, I hope someone sees this, as nothing is showing on https://uptime.runpod.io/.
Aron
Aron3mo ago
Answer to my ticket: Thank you for the detailed report and for sharing the logs. We're aware of an ongoing issue affecting network volumes in the EUR-IS-1 datacenter, which is causing slow read speeds and, in some cases, long startup times or unresponsive behavior in applications like ComfyUI. I'll keep this ticket updated as soon as we have progress or a resolution to share. In the meantime, if you notice any change in performance, positive or negative, please let us know so we can include it in our investigation.
Dj
Dj3mo ago
I'm glad support was made aware; I wasn't, so I wasn't able to update the uptime page, sorry :( We're still reporting this as fine though, I just found their conversation.
CodingNinja
CodingNinja3mo ago
Ohh, so this uptime page needs to be updated manually on the website? No automated health checks as of now? 🐧
Dj
Dj3mo ago
Not for the storage clusters, they are sort of unpingable.
Michael Chang
Michael Chang3mo ago
Is the EUR-IS-1 data center shutting down?
Michael Chang
Michael Chang3mo ago
[image attachment]
Michael Chang
Michael Chang3mo ago
This alert popped up under the L40 pods.
Dj
Dj3mo ago
No, the owner of that machine intends to shut it down. The rest of the DC is still available :)
gufisha
gufisha3mo ago
Yet again, issues with IS-1.
mitchken
mitchken2mo ago
These are being replaced by PRO6000 cards
Michael Chang
Michael Chang2mo ago
Thanks for answering. The loading time is so long it just times me out eventually.
SUUUUIIIIIII
SUUUUIIIIIII2mo ago
The same problem
extrems69
extrems692mo ago
Same problem here. Such a waste of money, that's not fair.
Michael Chang
Michael Chang2mo ago
If the EUR-IS-1 owner is a service provider that has signed a contract with Runpod, I believe Runpod should take action against EUR-IS-1 for its disappointing performance. It's like a supplier providing rotten meat to a restaurant: if it makes customers sick, the restaurant should act before it becomes the restaurant's fault. I reported the issue in the feedback section, feel free to support my statement in order to prompt action from Runpod.
extrems69
extrems692mo ago
I did
Garðar
Garðar2mo ago
I'm experiencing severe stalls on the network volumes on EUR-IS-1, which is probably connected to why you got long loading times. Processes get stuck in D-state (request_wait_answer / fuse_direct_IO). I started seeing this yesterday, but it was also a problem last month. Any I/O on /workspace hangs; shells become unresponsive (Ctrl-C/Z doesn't work). Local disk I/O is fine. Is something wrong with the MooseFS setup?
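(For anyone wanting to check for the same symptom from inside a pod, here is a minimal sketch; the wait-channel names and the /workspace mount point are assumptions taken from the report above, not something Runpod documents.)

# List processes stuck in uninterruptible sleep (D state) and the kernel
# function they are blocked in. On a pod with a hung FUSE/MooseFS mount you
# would expect wait channels like request_wait_answer or fuse_direct_IO.
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# Show which filesystem backs /workspace (a fuse.* type indicates a network volume)
grep /workspace /proc/mounts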
AmirKerr
AmirKerr2mo ago
Most likely EU-SE-1 is getting shut down as well. It hasn't worked for 2 days.
rasmus
rasmus2mo ago
On EU-RO I had the same stalls and hangs. What's the workaround? Not using /workspace?
tacle2
tacle22mo ago
Yeah, same for me. I haven't seen any post from the team about this server, and I'm not the only one complaining about it... We should at least get our money back for this day. I don't understand why they don't communicate about this; that way I wouldn't waste my time, and therefore my money, waiting for Comfy to start, or waste time figuring out the cause of the problem when it's just the server.
Dj
Dj2mo ago
It seems like our infrastructure team is aware, but at this time we have no action items. We'll continue to monitor. We implemented a solution on September 3rd. If you have a support ticket open for this, please let me know. If you don't have a support ticket open, message me your account email.
Michael Chang
Michael Chang2mo ago
The solution doesn't seem to be working. It's either stuck on "the port is not up yet" for an entire hour (still ongoing), or completely unable to run any workflow, not even able to load the checkpoint. We are suffering losses from all the delayed tasks waiting to be done with the Runpod service, or simply sitting there, paying Runpod and waiting for a miracle. I hate to say it, but I don't see how this is not a fraud. Looking forward to it all returning to normal.
mitchken
mitchken2mo ago
@Michael Chang to provide some feedback on the EUR-IS-1 cluster used for network storage: none of the included nodes surpassed even 50% utilization compared to their available capacity over the last week. Let me know if we can be of any assistance; however, support tickets are the best option to get swift results, I guess.
Michael Chang
Michael Chang2mo ago
It's happening again. The pod is up, but I cannot connect to it, it just gives me a blank screen. I thought it was solved because the past 1 to 2 days were working fine.
nailonge
nailonge2mo ago
And again, I think. Everything was fine, and a second later everything started to take FOREVER out of nowhere.
Tenofas
TenofasOP2mo ago
Yes, confirmed. I tested the RTX 5090 and RTX PRO 6000... both are extremely slow and get stuck after a few minutes.
gufisha
gufisha2mo ago
Can confirm as well..
jojje
jojje2mo ago
Same issue. I tested both 5090 and 4090 nodes. Here's the most forgiving read pattern imaginable: sequential, with no other I/O going on in the pod, and massive 1 MiB read blocks. You can't be any kinder to storage infrastructure than that. And still, performance is not great. With standard 4k reads, it's without doubt unusable.
root@79b7da44cdc6:~# for f in $(find /opt/comfyui/models/ -name "*.safetensors");do echo $f; dd if=$f bs=1M of=/dev/null;done
/opt/comfyui/models/clip_vision/clip_vision_h.safetensors
1205+1 records in
1205+1 records out
1264219396 bytes (1.3 GB, 1.2 GiB) copied, 24.3915 s, 51.8 MB/s
/opt/comfyui/models/vae/wan_2.1_vae.safetensors
242+1 records in
242+1 records out
613561776 bytes (614 MB, 585 MiB) copied, 9.68587 s, 63.3 MB/s
/opt/comfyui/models/loras/Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors
585+1 records in
585+1 records out
613561776 bytes (614 MB, 585 MiB) copied, 14.8218 s, 41.4 MB/s
/opt/comfyui/models/loras/Wan2.2-Lightning_I2V-A14B-4steps-lora_LOW_fp16.safetensors
^C50+0 records in
49+0 records out
I had a tiny sqlite DB on a network volume: the comfy update-manager DB. This file gets read and written a couple of hundred times whenever one opens the install view. It's just a few bytes per I/O operation, so barely any data. It took several minutes for that page to open up as a result. So the network issue is latency, not throughput. If someone at runpod is "monitoring", they ought to be looking at router packet loss and misconfiguration. It shouldn't be hard to find, as it's the path between the SAN devices and the servers. Just trace the paths node by node. If you need specific pod IDs to trace to and from, just ask. I can offer some.
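(A minimal sketch for separating the two failure modes; the file path is a placeholder, and iflag=direct may be refused on some FUSE mounts, so treat this as an illustration rather than a definitive benchmark.)

F=/workspace/models/some_model.safetensors   # hypothetical file on the network volume
# Throughput: one large sequential read, like the dd loop above
dd if="$F" of=/dev/null bs=1M count=512 iflag=direct 2>&1 | tail -1
# Latency: 200 tiny reads at random offsets within the first ~128 MiB; the wall
# time is dominated by per-request round trips rather than data volume
time for i in $(seq 1 200); do
  dd if="$F" of=/dev/null bs=4k count=1 skip=$RANDOM iflag=direct status=none
done
# If the big read is fast but the 4k loop is slow, the bottleneck is latency
# (e.g. packet loss and retransmits), which matches the sqlite symptom above.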
BlackWhiteAsian
BlackWhiteAsian2mo ago
Same on EUR-RO-1, 5090 and 4090. ComfyUI used to start in ~15 seconds. Now 5 minutes, if it actually starts. Can't even get to the point of making videos. It's been like this for three days now, on and off. BTW, is anyone having the slowness issue with serverless, or is it just a pods thing? Considering switching to serverless workers if it is better there.
