very slow 5090 pod
hello, this pod
a02462e46395
seems to be terribly slow. i'm trying to install flash_attn and it's been building for more than 30 minutes. can someone please check?
flash_attn is slow to build from source on any GPU, install from whl files instead
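(For reference, a minimal sketch of the wheel route; the filename below is a placeholder, not a real release asset — pick the wheel matching your CUDA, torch, python, and C++ ABI versions from the flash-attention GitHub releases page.)
```
# Install a prebuilt flash-attn wheel instead of compiling.
# Filename is illustrative -- download the matching one from the
# project's releases page for your CUDA/torch/python/ABI combo.
pip install flash_attn-<version>+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
```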
it never takes this long to build on other providers, maybe the cpu is misbehaving
i need to build it from source; i am developing extensions
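(For anyone following along, a sketch of the source build; MAX_JOBS is the knob the flash-attention README documents for capping parallel ninja jobs, and the value here is an assumption to tune for your vcore and RAM budget.)
```
# Build flash-attn from source (sketch). The compile is the slow part;
# MAX_JOBS caps parallel ninja jobs -- too high can exhaust RAM,
# too low leaves vcores idle.
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
MAX_JOBS=8 pip install -e . --no-build-isolation
```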
it is still building, i don't think it will ever complete
still building
@Dj
Let me look, we had this problem yesterday but I thought we fixed it
You could also be facing another issue, building flash_attn just sort of sucks
Are you sure that's the pod id?
I'm thinking you meant gth4o6vnnoowy8.
oh
yes,
gth4o6vnnoowy8
I wonder what we can do to change the hostname of pods at the prompt
This id is the first few digits of the container hash
export PS1 :KEKLEO:
So it's trackable for me, but it's not as straightforward
could :Hmm:
i should have double checked, my fault
No it's fine, it works but it would just be nice in general
No actually we totally can :thinkMan:
We know the pod id as an env var
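(A minimal sketch of that idea, assuming the standard RUNPOD_POD_ID env var is present in the pod:)
```
# Put the pod id in the bash prompt so the right id gets reported.
# RUNPOD_POD_ID is set inside RunPod pods; fall back to the hostname
# (the container hash) if it's missing.
export PS1="[${RUNPOD_POD_ID:-$(hostname)}] \u@\w\$ "
```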

But the server your pod is on is still responsive, I can see 2 vcores are under load from about when you complained
Can you tell when you started your job

i think that's the job yea
it looks like it's at 1.5GHz Curf in atop
maybe the scheduler is struggling
like it's throttled to min_freq

I'm not sure what the unit is for "System Load". I didn't even have access to this tool in particular until the other day
it's probably sysioload, includes cpu load and iowait
that ^
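(Context: Linux's load average counts runnable tasks plus tasks in uninterruptible I/O sleep, which is why iowait inflates it. A quick way to see it raw:)
```
# 1-, 5- and 15-minute load averages, then running/total tasks and
# the last pid. Tasks in uninterruptible (I/O) sleep are included,
# so heavy iowait shows up here too.
cat /proc/loadavg
```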
@Dj any way you can try and set the CPU governor to performance mode to see if the problem disappears?
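(A sketch of what that check and change look like; these are the standard cpufreq sysfs paths and writing them needs root on the host, not inside an unprivileged container:)
```
# Inspect the scaling governor and the current per-core clocks.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
grep "cpu MHz" /proc/cpuinfo | sort -u

# Switch every core to the performance governor (host root required).
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"
done
```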
ok it took 4 hours to build flash_attn at lowest freq and then i terminated it because the performance of the 5090s is very poor vs 4090
@bghira use prebuilt wheel
For flash attention
It takes about 30min even if the cpu is normal
i have to build it.
then you will have to wait 30 min+ as it takes that long on my local pc too
tried to set:
TORCH_CUDA_ARCH_LIST=
it took 4 hours and the cpu was at 1.5ghz
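(If the goal was to cut build time, a hedged sketch: limiting TORCH_CUDA_ARCH_LIST to just the 5090's architecture stops nvcc compiling kernels for every other GPU. Compute capability 12.0 for the RTX 5090 is an assumption here; verify what your toolchain reports.)
```
# Build for the RTX 5090 (Blackwell) only -- 12.0 is assumed;
# check with: nvidia-smi --query-gpu=compute_cap --format=csv
export TORCH_CUDA_ARCH_LIST="12.0"
MAX_JOBS=8 pip install -e . --no-build-isolation
```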
@dj or someone else can handle the issue guys, you don't have to keep commenting
the cpu should have been around 3.8ghz and is unnecessarily throttled
when you guys get a chance, can i please get credits for this pod since the hardware was not functioning correctly
@yhlong00000 Can you take a look at this?
sad
well i did preserve the one i built, since it took so long. 🙂
but it'll only be so useful until next build
why do you have to build it tho
working on updates to it 😄
i needed to test it on a 5090
ah 5090 makes sense
i don't have one but i see hosts with 8.. in one server.. i'm like.. how lol
isn't it standard
H100s are 8 per server
A100s too
any update? (cc @yhlong00000)
We use different CPUs in EU-RO-1 and EUR-IS-1. I’m curious whether you’ve tried running your workload in EU-RO-1, and whether there would be a difference.
no @yhlong00000 i'm pretty convinced you guys are purposely throttling everything
why on EARTH is this CPU only ever at 1.5GHz?
this is a $7.99/hr B200 instance!
need this escalated please
I do want to note that we're not applying any throttling to any of our hardware - especially not Secured Cloud servers as they're fully under our control.
It makes sense to see the CPU min reported as 1500, but it doesn't make sense that the CPU didn't speed up to help you during code compilation. I'm getting this looked into now (and maybe into tomorrow, not sure how fast we can move here on a Sunday).
Just to clarify, we set CPU limits, but we’re limiting CPU time, not CPU clock speed. The processor still runs at full speed, like 4.5 GHz, but we’re controlling how much of the time your container is allowed to use it.
For example, imagine a physical machine with two GPUs, and your pod is assigned one of them. The pod is also limited to 50% of the CPU, it means you’re allowed to use the CPU for only half of the time, say, 50 milliseconds out of every 100.
Now, if you’re running CPU-intensive tasks for an extended period, it might feel like you’re running at half the clock speed, but what’s really happening is: you’re running at full speed during your quota window, and then getting throttled or paused for the rest. So it’s not slower per cycle, you’re just not allowed to run the whole time
:SadgeBusiness: yeah, understood, so it's not like a super low latency platform, just more for like, bulk operations or "just need vram, not speed"

AMD 9555 is actually way faster and more powerful than the reference architecture suggested by Nvidia for the B200 DGX pods.
Personally, I am a big fan of the 9575F, or 9655 and 9755 for CPU heavy workloads.
That said, AMD 9555 in dual configuration is a serious system, and the latest gen available for HPC purposes.
To add to Hailong's point: if access to an entire 9655 CPU is expected, then given there are 2 of those CPUs per 8-GPU system, 4 GPUs have to be rented.
Hope this helps.
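(Worked out: 2 CPUs / 8 GPUs = 0.25 of a CPU per GPU, so one full 9655 corresponds to 1 / 0.25 = 4 GPUs rented.)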
Are there those NUMA node things
That affect performance
When a process is spread across multiple NUMA nodes
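(Cross-node memory access is slower, so NUMA spread can indeed hurt. A sketch for inspecting the topology and pinning a job to one node, assuming numactl is installed; ./my_job is a placeholder for your command:)
```
# Show how many NUMA nodes the box has and which cpus belong to each.
lscpu | grep -i numa

# Pin a process's cpus and memory to node 0 so it doesn't pay
# cross-node memory latency. "./my_job" is a placeholder.
numactl --cpunodebind=0 --membind=0 ./my_job
```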