very slow 5090 pod
hello, this pod
a02462e46395
seems to be terribly slow. i'm trying to install flash_attn and it's been building for more than 30 minutes. can someone please check?
flash_attn is slow to build from source on any GPU, install from whl files instead
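(For reference, a minimal sketch of the wheel route; the filename below is a placeholder, not a real release asset — pick the wheel matching your CUDA, torch, python, and C++ ABI versions from the flash-attention GitHub releases page.)
```
# Install a prebuilt flash-attn wheel instead of compiling.
# Filename is illustrative -- download the matching one from the
# project's releases page for your CUDA/torch/python/ABI combo.
pip install flash_attn-<version>+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
```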
it never takes this long to build on other providers, maybe the cpu is misbehaving
i need to build it from source; i am developing extensions
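(For anyone following along, a sketch of the source build; MAX_JOBS is the knob the flash-attention README documents for capping parallel ninja jobs, and the value here is an assumption to tune for your vcore and RAM budget.)
```
# Build flash-attn from source (sketch). The compile is the slow part;
# MAX_JOBS caps parallel ninja jobs -- too high can exhaust RAM,
# too low leaves vcores idle.
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
MAX_JOBS=8 pip install -e . --no-build-isolation
```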
it is still building, i don't think it will ever complete
still building
@Dj
Let me look, we had this problem yesterday but I thought we fixed it
You could also be facing another issue, building flash_attn just sort of sucks
Are you sure that's the pod id?
I'm thinking you meant gth4o6vnnoowy8.
oh
yes,
gth4o6vnnoowy8
I wonder what we can do to change the hostname of pods at the prompt
This id is the first few digits of the container hash
export PS1 :KEKLEO:
So it's trackable for me, but it's not as straightforward
could :Hmm:
i should have double checked, my fault
No it's fine, it works but it would just be nice in general
No actually we totally can :thinkMan:
We know the pod id as an env var
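(A minimal sketch of that idea, assuming the standard RUNPOD_POD_ID env var is present in the pod:)
```
# Put the pod id in the bash prompt so the right id gets reported.
# RUNPOD_POD_ID is set inside RunPod pods; fall back to the hostname
# (the container hash) if it's missing.
export PS1="[${RUNPOD_POD_ID:-$(hostname)}] \u@\w\$ "
```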

But the server your pod is on is still responsive, I can see 2 vcores are under load from about when you complained
Can you tell when you started your job

i think that's the job yea
it looks like it's at 1.5GHz Curf in atop
maybe the scheduler is struggling
like it's throttled to min_freq

I'm not sure what the unit is for "System Load". I didn't even have access to this tool in particular until the other day
it's probably sysioload, includes cpu load and iowait
that ^
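(Context: Linux's load average counts runnable tasks plus tasks in uninterruptible I/O sleep, which is why iowait inflates it. A quick way to see it raw:)
```
# 1-, 5- and 15-minute load averages, then running/total tasks and
# the last pid. Tasks in uninterruptible (I/O) sleep are included,
# so heavy iowait shows up here too.
cat /proc/loadavg
```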
@Dj any way you can try and set the CPU governor to performance mode to see if the problem disappears?
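(A sketch of what that check and change look like; these are the standard cpufreq sysfs paths and writing them needs root on the host, not inside an unprivileged container:)
```
# Inspect the scaling governor and the current per-core clocks.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
grep "cpu MHz" /proc/cpuinfo | sort -u

# Switch every core to the performance governor (host root required).
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"
done
```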
ok it took 4 hours to build flash_attn at lowest freq and then i terminated it because the performance of the 5090s is very poor vs 4090
@bghira use prebuilt wheel
For flash attention
It takes about 30min even if the cpu is normal
i have to build it.
then you will have to wait 30 min+ as it takes that long on my local pc too
tried to set:
TORCH_CUDA_ARCH_LIST=
it took 4 hours and the cpu was at 1.5ghz
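(If the goal was to cut build time, a hedged sketch: limiting TORCH_CUDA_ARCH_LIST to just the 5090's architecture stops nvcc compiling kernels for every other GPU. Compute capability 12.0 for the RTX 5090 is an assumption here; verify what your toolchain reports.)
```
# Build for the RTX 5090 (Blackwell) only -- 12.0 is assumed;
# check with: nvidia-smi --query-gpu=compute_cap --format=csv
export TORCH_CUDA_ARCH_LIST="12.0"
MAX_JOBS=8 pip install -e . --no-build-isolation
```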
@dj or someone else can handle the issue guys, you don't have to keep commenting
the cpu should have been around 3.8ghz and is unnecessarily throttled
when you guys get a chance, can i please get credits for this pod since the hardware was not functioning correctly
@yhlong00000 Can you take a look at this?
sad
well i did preserve the one i built, since it took so long. 🙂
but it'll only be so useful until next build
why do you have to build it tho
working on updates to it 😄
i needed to test it on a 5090
ah 5090 makes sense
i don't have one but i see hosts with 8.. in one server.. i'm like.. how lol
isn't it standard
H100s are 8 per server
A100s too
any update? (cc @yhlong00000)
We use different CPUs in EU-RO-1 and EUR-IS-1. I’m curious whether you’ve tried running your workload in EU-RO-1, and whether there would be a difference.
no @yhlong00000 i'm pretty convinced you guys are purposely throttling everything
why on EARTH is this CPU only ever at 1.5GHz?
this is a $7.99/hr B200 instance!
need this escalated please
I do want to note that we're not applying any throttling to any of our hardware - especially not Secured Cloud servers as they're fully under our control.
It makes sense to see the CPU min reported as 1500, but it doesn't make sense that the CPU didn't speed up to help you during code compilation. I'm getting this looked into now (and maybe into tomorrow, not sure how fast we can move here on a Sunday).
Just to clarify, we set CPU limits, but we’re limiting CPU time, not CPU clock speed. The processor still runs at full speed, like 4.5 GHz, but we’re controlling how much of the time your container is allowed to use it.
For example, imagine a physical machine with two GPUs, and your pod is assigned one of them. The pod is also limited to 50% of the CPU, it means you’re allowed to use the CPU for only half of the time, say, 50 milliseconds out of every 100.
Now, if you’re running CPU-intensive tasks for an extended period, it might feel like you’re running at half the clock speed, but what’s really happening is: you’re running at full speed during your quota window, and then getting throttled or paused for the rest. So it’s not slower per cycle, you’re just not allowed to run the whole time
:SadgeBusiness: yeah, understood, so it's not like a super low latency platform, just more for like, bulk operations or "just need vram, not speed"

AMD 9555 is actually way faster and more powerful than the reference architecture suggested by Nvidia for the B200 DGX pods.
Personally, I am a big fan of the 9575F, or 9655 and 9755 for CPU heavy workloads.
That said, AMD 9555 in dual configuration is a serious system, and the latest gen available for HPC purposes.
To add to Hailong's point: if access to an entire 9655 CPU is expected, then given there are 2 of those CPUs per 8-GPU system, 4 GPUs have to be rented.
Hope this helps.
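(Worked out: 2 CPUs / 8 GPUs = 0.25 of a CPU per GPU, so one full 9655 corresponds to 1 / 0.25 = 4 GPUs rented.)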
Are there those NUMA node things
That affect performance
When a process is spread across multiple NUMA nodes
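(Cross-node memory access is slower, so NUMA spread can indeed hurt. A sketch for inspecting the topology and pinning a job to one node, assuming numactl is installed; ./my_job is a placeholder for your command:)
```
# Show how many NUMA nodes the box has and which cpus belong to each.
lscpu | grep -i numa

# Pin a process's cpus and memory to node 0 so it doesn't pay
# cross-node memory latency. "./my_job" is a placeholder.
numactl --cpunodebind=0 --membind=0 ./my_job
```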