RunPod•6d ago
bghira

very slow 5090 pod

hello, this pod a02462e46395 seems to be terribly slow. i'm trying to install flash_attn and it's been building for more than 30 minutes. can someone please check?
38 Replies
Madiator2011•6d ago
flash_attn is slow to build from source on any GPU; install it from the prebuilt .whl files instead.
bghiraOP•6d ago
it never takes this long to build on other providers. maybe the cpu is misbehaving?
i need to build it from source; i am developing extensions.
it is still building. i don't think it will ever complete.
still building @Dj
Dj•6d ago
Let me look. We had this problem yesterday, but I thought we fixed it.
You could also be facing another issue; building flash_attn just sort of sucks.
Are you sure that's the pod ID? I'm thinking you meant gth4o6vnnoowy8.
bghiraOP•6d ago
oh yes, gth4o6vnnoowy8
Dj•6d ago
I wonder what we can do to change the hostname shown at the pod's prompt. The ID shown there is the first few digits of the container hash.
bghiraOP•6d ago
export PS1 :KEKLEO:
Dj•6d ago
So it's trackable for me, but it's not as straightforward as it could be :Hmm:
bghiraOP•6d ago
i should have dbl checked, my fault
Dj•6d ago
No, it's fine; it works, but it would just be nice in general.
No, actually, we totally can :thinkMan: We know the pod ID as an env var.
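Following up on the point that the pod ID is available as an environment variable, here is a minimal Python sketch that prints a bash `PS1` line labeling the prompt with the pod ID instead of the container hash. The variable name `RUNPOD_POD_ID` is an assumption (the thread doesn't name it), and `prompt_label` and `ps1_line` are hypothetical helpers.

```python
import os
import socket


def prompt_label(env=None, hostname=None):
    """Prefer the pod ID from the environment; fall back to the container hostname.

    RUNPOD_POD_ID is an assumed variable name, not confirmed in the thread.
    """
    env = os.environ if env is None else env
    return env.get("RUNPOD_POD_ID") or hostname or socket.gethostname()


def ps1_line(label):
    """Build a bash PS1 export showing user@<label>:<cwd>$ (label must not contain quotes)."""
    return f"export PS1='\\u@{label}:\\w\\$ '"


if __name__ == "__main__":
    # Appending this output to ~/.bashrc would apply it to new shells.
    print(ps1_line(prompt_label()))
```

`\u`, `\w`, and `\$` are standard bash prompt escapes, so the shell expands them at display time rather than when the line is written.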
Dj•6d ago
But the server your pod is on is still responsive. I can see 2 vCores have been under load since about when you reported this.
Dj•6d ago
Can you tell me when you started your job?
bghiraOP•6d ago
i think that's the job, yeah. it looks like it's at 1.5GHz (Curf in atop); maybe the scheduler is struggling
bghiraOP•6d ago
like it's throttled to min_freq
Dj•6d ago
I'm not sure what the unit is for "System Load"; I didn't even have access to this particular tool until the other day.
bghiraOP•6d ago
it's probably sysioload, which includes cpu load and iowait
21:16:37 up 27 days, 12:53, 0 users, load average: 2.06, 2.04, 2.04
that ^
@Dj any way you can try setting the CPU governor to performance mode to see if the problem disappears?
ok, it took 4 hours to build flash_attn at the lowest frequency, and then i terminated it because the performance of the 5090s is very poor vs the 4090.
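The governor and clock questions can be checked from inside a pod. This Python sketch prints the load average (the same three figures `uptime` shows) and each core's cpufreq governor and frequencies. It assumes the standard Linux cpufreq files under /sys/devices/system/cpu/ are visible, which isn't guaranteed in a container, and `read_text`, `khz_to_ghz`, and `summarize_cpufreq` are hypothetical helpers. It only observes; switching the governor to `performance` normally needs host-level access.

```python
import glob
import os
from pathlib import Path


def read_text(path):
    """Return a sysfs file's stripped contents, or None if it can't be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def khz_to_ghz(khz):
    """cpufreq sysfs files report frequencies in kHz."""
    return int(khz) / 1_000_000


def summarize_cpufreq(cpu_dirs):
    """Return (cpu, governor, cur_ghz, min_ghz, max_ghz) for each CPU exposing cpufreq."""
    rows = []
    for d in sorted(cpu_dirs):
        freq_dir = os.path.join(d, "cpufreq")
        cur = read_text(os.path.join(freq_dir, "scaling_cur_freq"))
        if cur is None:
            continue  # cpufreq not exposed for this CPU (common in containers/VMs)
        governor = read_text(os.path.join(freq_dir, "scaling_governor"))
        lo = read_text(os.path.join(freq_dir, "cpuinfo_min_freq"))
        hi = read_text(os.path.join(freq_dir, "cpuinfo_max_freq"))
        rows.append((
            os.path.basename(d),
            governor,
            khz_to_ghz(cur),
            khz_to_ghz(lo) if lo else None,
            khz_to_ghz(hi) if hi else None,
        ))
    return rows


if __name__ == "__main__":
    # 1-, 5- and 15-minute load averages, as printed by `uptime`.
    print("load average: %.2f, %.2f, %.2f" % os.getloadavg())
    rows = summarize_cpufreq(glob.glob("/sys/devices/system/cpu/cpu[0-9]*"))
    if not rows:
        print("cpufreq not visible here; /proc/cpuinfo 'cpu MHz' may still help")
    for cpu, governor, cur, lo, hi in rows[:8]:  # a sample of cores is enough to spot a pattern
        print(f"{cpu}: governor={governor} cur={cur:.2f} GHz (min={lo}, max={hi} GHz)")
```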
riverfog7•6d ago
@bghira use a prebuilt wheel for flash attention. It takes about 30 min to build even if the CPU is normal.
bghiraOP•6d ago
i have to build it.
Madiator2011•6d ago
then you will have to wait 30+ min, as it takes that long on my local PC too.
tried to set: TORCH_CUDA_ARCH_LIST=
bghiraOP•5d ago
it took 4 hours and the cpu was at 1.5ghz. @dj or someone else can handle the issue.
guys, you don't have to keep commenting; the cpu should have been around 3.8ghz and is unnecessarily throttled.
when you guys get a chance, can i please get credits for this pod, since the hardware was not functioning correctly?
Dj•5d ago
@yhlong00000 Can you take a look at this?
riverfog7•5d ago
sad
bghiraOP•5d ago
well, i did preserve the one i built, since it took so long 🙂 but it'll only be so useful until the next build
riverfog7•5d ago
why do you have to build it tho
bghiraOP•5d ago
working on updates to it 😄 i needed to test it on a 5090
riverfog7•5d ago
ah 5090 makes sense
bghiraOP•5d ago
i don't have one but i see hosts with 8.. in one server.. i'm like.. how lol
riverfog7•5d ago
isn't it standard? H100s are 8 per server, A100s too
bghiraOP•4d ago
any update? (cc @yhlong00000)
yhlong00000•4d ago
We use different CPUs in EU-RO-1 and EUR-IS-1. I'm curious whether you've tried running your workload in EU-RO-1, and whether there's a difference.
bghiraOP•3d ago
no @yhlong00000 i'm pretty convinced you guys are purposely throttling everything
Model name: AMD EPYC 9555 64-Core Processor
CPU family: 26
Model: 2
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 4409.3750
CPU min MHz: 1500.0000
BogoMIPS: 6399.98
why on EARTH is this CPU only ever at 1.5GHz? this is a $7.99/hr B200 instance! need this escalated please
Dj•3d ago
I do want to note that we're not applying any throttling to any of our hardware, especially not Secure Cloud servers, as they're fully under our control. It makes sense to see the CPU min reported as 1500, but it doesn't make sense that the CPU didn't speed up to help you during code compilation. I'm getting this looked into now (and maybe into tomorrow; I'm not sure how fast we can move on a Sunday).
yhlong00000•3d ago
Just to clarify: we set CPU limits, but we're limiting CPU time, not CPU clock speed. The processor still runs at full speed, like 4.5 GHz, but we're controlling how much of the time your container is allowed to use it. For example, imagine a physical machine with two GPUs, and your pod is assigned one of them. If the pod is also limited to 50% of the CPU, you're allowed to use the CPU for only half of the time, say, 50 milliseconds out of every 100. Now, if you're running CPU-intensive tasks for an extended period, it might feel like you're running at half the clock speed, but what's really happening is that you're running at full speed during your quota window and then getting throttled, or paused, for the rest. So it's not slower per cycle; you're just not allowed to run the whole time.
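The "50 ms out of every 100" example maps onto Linux CFS bandwidth control, the usual mechanism behind container CPU limits; that RunPod uses it is an assumption, since the thread doesn't name the mechanism. This Python sketch reads the quota from cgroup v2's `cpu.max` and models how long a fully busy workload sits paused each period. `parse_cpu_max` and `stall_per_period_ms` are hypothetical helpers, and the throttling model is deliberately simplified.

```python
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 `cpu.max` ("<quota_us> <period_us>" or "max <period_us>").

    Returns CPUs' worth of time allowed per period, or None if unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def stall_per_period_ms(quota_us, period_us, busy_threads):
    """Simplified model of CFS bandwidth throttling.

    Assumes `busy_threads` threads are runnable the whole period, each on its
    own core. Returns how many ms per period the group sits paused after
    spending its quota at full clock speed.
    """
    if busy_threads * period_us <= quota_us:
        return 0.0  # demand fits inside the quota: never throttled
    run_wall_us = quota_us / busy_threads  # wall time needed to burn the whole quota
    return (period_us - run_wall_us) / 1000


if __name__ == "__main__":
    cpu_max = Path("/sys/fs/cgroup/cpu.max")  # cgroup v2 path as seen inside a container
    if cpu_max.exists():
        limit = parse_cpu_max(cpu_max.read_text())
        print("no CPU-time limit" if limit is None else f"limit: {limit:g} CPUs of time")
    else:
        print("no cgroup v2 cpu.max (cgroup v1 uses cpu.cfs_quota_us / cpu.cfs_period_us)")
    # The 50 ms out of every 100 ms example with one busy thread.
    print(stall_per_period_ms(50_000, 100_000, busy_threads=1))
```

With one busy thread, a 50000/100000 quota gives 50 ms of running and 50 ms of waiting per period, matching the "half the time" description. With two busy threads the same quota is used up in 25 ms, so a parallel compile spends more of each period paused even though every cycle it does get runs at full clock.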
bghiraOP•3d ago
:SadgeBusiness: yeah, understood, so it's not like a super low latency platform, just more for like, bulk operations or "just need vram, not speed"
JM•3d ago
AMD 9555 is actually way faster and more powerful than the reference architecture suggested by Nvidia for the B200 DGX pods. Personally, I am a big fan of the 9575F, 9655, or 9755 for CPU-heavy workloads. That said, the AMD 9555 in a dual-socket configuration is a serious system, and the latest generation available for HPC purposes. To add to Hailong's point, if access to an entire 9655 CPU is expected, given there are 2 of those CPUs per 8-GPU system, 4 GPUs have to be rented. Hope this helps.
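The sizing argument above is proportional arithmetic. This Python sketch assumes CPU is allocated in proportion to GPUs rented (as in Hailong's two-GPU, 50% example) and uses the lscpu figures posted earlier in the thread (2 sockets × 56 cores × 2 threads); `gpus_per_socket` and `cpu_threads_for` are hypothetical helpers, not a RunPod API.

```python
def gpus_per_socket(gpus_per_server, sockets_per_server):
    """GPUs to rent for one full socket, if CPU is split in proportion to GPUs."""
    if gpus_per_server % sockets_per_server:
        raise ValueError("GPUs don't divide evenly across sockets")
    return gpus_per_server // sockets_per_server


def cpu_threads_for(gpus_rented, gpus_per_server, threads_per_server):
    """Proportional share of the host's hardware threads for a GPU count."""
    return threads_per_server * gpus_rented // gpus_per_server


if __name__ == "__main__":
    threads = 2 * 56 * 2  # sockets x cores per socket x threads per core, from lscpu
    print(gpus_per_socket(8, 2))           # 4
    print(cpu_threads_for(1, 8, threads))  # 28
```

Under that assumption a single-GPU pod on this host gets roughly 28 hardware threads' worth of time, and 4 GPUs correspond to one socket.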
riverfog7•3d ago
Is there that NUMA node stuff that affects performance when a process is spread across multiple NUMA nodes?
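The NUMA question can be checked from inside a pod by comparing the CPUs a process is allowed to run on with each NUMA node's CPU list. A minimal Python sketch, assuming Linux exposes the topology under /sys/devices/system/node/ (some containers hide it); `parse_cpulist`, `numa_nodes`, and `nodes_spanned` are hypothetical helpers.

```python
import glob
import os
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes():
    """Map NUMA node name -> set of CPU ids, read from sysfs (empty if hidden)."""
    nodes = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*"):
        try:
            nodes[os.path.basename(path)] = parse_cpulist(Path(path, "cpulist").read_text())
        except OSError:
            continue
    return nodes


def nodes_spanned(allowed, nodes):
    """Sorted names of the NUMA nodes containing at least one allowed CPU."""
    return sorted(name for name, cpus in nodes.items() if cpus & allowed)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)  # CPUs this process may be scheduled on
    spanned = nodes_spanned(allowed, numa_nodes())
    print(f"{len(allowed)} allowed CPUs, NUMA nodes: {spanned or 'not visible'}")
```

If the allowed CPUs span more than one node, a job could be confined to one node with `os.sched_setaffinity(0, nodes["node0"] & allowed)`; whether that helps under a CPU-time quota isn't established in this thread.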
