RunPod•4w ago
JohnTheNerd

Pod ran out of CPU RAM

I somehow managed to run out of RAM (not VRAM, system RAM)... right after a very compute-heavy operation (calculating quantized KV-Cache scales)... while running model.save_pretrained... while the weights are still in VRAM... The pod is still running, but completely unresponsive. Now that you're done laughing at my misfortune, is there anything at all I can do to save those weights? Even enabling some swap would be completely fine... I just want the weights to save to the networked drive... Pod ID: tybrzp4aphrz3d
351 Replies
riverfog7
riverfog7•4w ago
You should contact support on their website without terminating the pod
JohnTheNerd
JohnTheNerdOP•4w ago
OK - thanks. Hopefully they get back to me soon...
riverfog7
riverfog7•4w ago
If the process got killed, there is no way to recover data soo
JohnTheNerd
JohnTheNerdOP•4w ago
I know that the process is alive and the data is still stored in VRAM. I ran into similar issues with local containers that ran out of memory; simply adding some memory (whether it's RAM or swap) will immediately bring it back to life. It's merely thrashing as it tries to clear the disk cache while new data is being written. Still don't know how it managed to eat that much, the weights are 140GB and I have 283GB of RAM...
riverfog7
riverfog7•4w ago
Wow, if it's an H100 you are burning money fast. Hope support reaches you soon
JohnTheNerd
JohnTheNerdOP•4w ago
It's a B200. I'm burning more money than I'd like.....
riverfog7
riverfog7•4w ago
Lol
JohnTheNerd
JohnTheNerdOP•4w ago
It would be very funny if it wasn't my pod lol
No description
riverfog7
riverfog7•4w ago
Maybe 2 instances of the model loaded to system RAM?
JohnTheNerd
JohnTheNerdOP•4w ago
That's very possible. I guess it might be trying to load it to RAM while it writes to disk or something. Sad part is that the file I want is only a few megabytes, but the only way to get it is to call model.save_pretrained
riverfog7
riverfog7•4w ago
Ohh you running quantization?
JohnTheNerd
JohnTheNerdOP•4w ago
Not quite. I'm calculating quantized KV scale factors. The idea is to be able to quantize the KV-cache down to 8 bits while losing very, very little accuracy. You can take an extra bit out of the exponent, making KV-cache values e4m3 (with one sign bit) instead of e5m2. However, this shrinks the numerical range in which you can represent values, since you just removed a whole bit from the floating point exponent. If you happened to have some magic numbers you can multiply each value by, calibrated by running thousands of inference passes without any quantization on a very powerful GPU... you still wouldn't quite get to non-quantized quality, but you'd get quite close
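To make that range trade-off concrete, a minimal sketch (assuming PyTorch 2.1+ for the float8 dtypes) comparing what the two fp8 formats can represent:

import torch

# e5m2: 5 exponent bits -> wide range, little precision
# e4m3: 4 exponent bits -> much smaller range, one extra mantissa bit
for dtype in (torch.float8_e5m2, torch.float8_e4m3fn):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny)
# e5m2 tops out around 57344, e4m3 around 448 - hence the calibrated scales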
riverfog7
riverfog7•4w ago
yeah saw that on vllm docs
JohnTheNerd
JohnTheNerdOP•4w ago
Unfortunately I do not have 140GB of VRAM at home to go calculate my own scale factors lol
riverfog7
riverfog7•4w ago
doesnt it work on cpu?
JohnTheNerd
JohnTheNerdOP•4w ago
... I don't have 140GB of RAM, either. It's also painfully slow on CPU, and Flash Attention won't work. AFAICT Qwen really wants Flash Attention - people are saying the model breaks pretty badly without it
riverfog7
riverfog7•4w ago
maybe the "running thousands of inference without any quantization on a very powerful GPU" part is a bottleneck if you can run it on CPU you can always rent some high mem machines
JohnTheNerd
JohnTheNerdOP•4w ago
It's not very different than simply running the model a few thousand times. But that's not very fast when you are running a 70b at full precision lol
riverfog7
riverfog7•4w ago
yeah
JohnTheNerd
JohnTheNerdOP•4w ago
Apparently they're based in New Jersey, and it's 11:30PM there ... maybe I should just stop the money burning
riverfog7
riverfog7•4w ago
i think so too
JohnTheNerd
JohnTheNerdOP•4w ago
Well, I tried lol
riverfog7
riverfog7•4w ago
I have a question: if it is running inference over and over can it do the calibration layer by layer?
JohnTheNerd
JohnTheNerdOP•4w ago
It can, actually, yeah. That's a very good point. Qwen 2.5 has 80 layers... One layer at a time would probably easily fit on a GPU I have at home
riverfog7
riverfog7•4w ago
probably hurts to implement tho (if there is no implementation)
JohnTheNerd
JohnTheNerdOP•4w ago
There is definitely no implementation. Even the code in vllm's docs is broken lol, I had to modify it to get it to work at all. Then I had the rude awakening of "you can't do this with a quantized model"... and here we are
riverfog7
riverfog7•4w ago
lol no multi-gpu implementation too?
JohnTheNerd
JohnTheNerdOP•4w ago
Nope
riverfog7
riverfog7•4w ago
sad
JohnTheNerd
JohnTheNerdOP•4w ago
Well, maybe actually. Not that it matters with 140GB of weights lol
JohnTheNerd
JohnTheNerdOP•4w ago
GitHub
GitHub - vllm-project/llm-compressor: Transformers-compatible libra...
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM - vllm-project/llm-compressor
JohnTheNerd
JohnTheNerdOP•4w ago
They just pass in an AutoModelForCausalLM.from_pretrained model to the library
riverfog7
riverfog7•4w ago
GitHub
The new version 0.3.0 takes a long time for quantization and eventu...
Describe the bug I used the sample code (W8A16) to quantize THUDM/glm-4-9b-chat-hf on an Nvidia 4090, and the entire process was very slow (nearly 24 hours), with extremely high memory usage, to th...
JohnTheNerd
JohnTheNerdOP•4w ago
Interesting - that allows multi-GPU. I wonder if I could implement some sort of per-layer processing... It would be miserably slow for sure, especially since I can't do much batching without seriously modifying the library
riverfog7
riverfog7•4w ago
GitHub
llm-compressor/examples/big_models_with_accelerate/README.md at mai...
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM - vllm-project/llm-compressor
JohnTheNerd
JohnTheNerdOP•4w ago
Hmm... combining them all, I just might be able to fit it in a lot of small GPUs... saving lots of money. Probably still undoable at home though, since I don't think I can pass a quantized model through at all. Thank you!
riverfog7
riverfog7•4w ago
try 4x A40s 192GB VRAM and about 1/6 of B200 pricing
JohnTheNerd
JohnTheNerdOP•4w ago
How can one get more RAM in RunPod? Even if it's swap. My understanding is that because I'm in a container, I can't just add in swap
riverfog7
riverfog7•4w ago
Same for me
riverfog7
riverfog7•4w ago
GitHub
OOM during save_pretrained of compressed model · Issue #1183 · vl...
Describe the bug The OOM was for CPU RAM. GPU RAM usage was normal, the model takes up less than half of the GPU. This was hitting the llmcompressor's modified save_pretrained_wrapper from llm-...
riverfog7
riverfog7•4w ago
same issue. looks like quanting on CPU should be possible
JohnTheNerd
JohnTheNerdOP•4w ago
Yes, I can see the same frustration in the comments section lol. Maybe I'll just go make an EC2 instance with a lot of EBS storage, enable swap, and go away for a month lol. Probably cheaper...
riverfog7
riverfog7•4w ago
try these
JohnTheNerd
JohnTheNerdOP•4w ago
Hm? I didn't see any suggestions in the GitHub issue
riverfog7
riverfog7•4w ago
they are cheap with spot requests
No description
riverfog7
riverfog7•4w ago
no GPUs tho
JohnTheNerd
JohnTheNerdOP•4w ago
I suspect if I'm going CPU, I can go much much cheaper
riverfog7
riverfog7•4w ago
yeah, and if you are going spot, look for spot savings = 90%. no one is using them and they don't get terminated as often
JohnTheNerd
JohnTheNerdOP•4w ago
Spot seems iffy. I use AWS at work and have been evicted before - especially for long-running workloads. But nothing stops me from getting like a c7a.medium for a month, just letting it churn all day all night, with some EBS as swap
riverfog7
riverfog7•4w ago
that's right, and go with instance store rather than EBS if that's possible
JohnTheNerd
JohnTheNerdOP•4w ago
That's true - it's gonna be a lot faster
riverfog7
riverfog7•4w ago
NVMe power šŸ˜„
JohnTheNerd
JohnTheNerdOP•4w ago
Yes lol. Anyway, thanks a lot for your help! Although the results are gone, hope my mistake at least gave people a laugh lol
riverfog7
riverfog7•4w ago
I have the same experience with 70B models on an H100, can relate
Jason
Jason•4w ago
Filter out when you create a pod. Well, I feel kind of sad for your loss of progress
JohnTheNerd
JohnTheNerdOP•4w ago
I... couldn't find enough RAM. Maybe it speaks to my horrifying setup, but the pod I was on had 260 something gigabytes and I OOM'd it... I do too, but such is life. I want to re-run it regardless but I need a genuinely stupid amount of RAM to assure myself this will never ever happen again lol
Jason
Jason•4w ago
Hmm I think the only way is More gpus
JohnTheNerd
JohnTheNerdOP•4w ago
My guess is that torch tried to copy the weights to RAM... twice. No idea why it would happen. Seeing as I am working with a 72b at bf16, that's 288+GB of RAM I need
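As a rough sanity check on that number (simple arithmetic on the stated sizes):

params = 72e9                                  # Qwen2.5-72B parameter count, approximate
bytes_per_param = 2                            # bf16
one_copy_gb = params * bytes_per_param / 1e9   # ~144 GB for one copy of the weights
print(one_copy_gb, 2 * one_copy_gb)            # two copies -> ~288 GB, matching the estimate above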
Jason
Jason•4w ago
Or other gpu types
JohnTheNerd
JohnTheNerdOP•4w ago
I had a B200, VRAM isn't an issue. System RAM is. It can even be swap tbh, but since I'm in a container, I can't have my own swap. Someone has to give it to me during the docker run command. And RunPod doesn't have such an option sadly...
Jason
Jason•4w ago
Yeah, I mean try sliding that GPU count slider right. And then you'll see the pod will have more RAM. Well, if you use it for RAM only it'll be a waste too
JohnTheNerd
JohnTheNerdOP•4w ago
That's a good point I could get lots of cheaper GPUs and tensor parallelize, but the RAM was what killed my workflow from the start
Jason
Jason•4w ago
I think there is shm in /dev/shm, don't know if that's usable for you. Check your training script again heheh
JohnTheNerd
JohnTheNerdOP•4w ago
I suspect not. It's all abstracted away from me - torch is what eats the RAM. the line that killed my pod was model.save_pretrained(), and it's hard to avoid that lol
Jason
Jason•4w ago
Can ChatGPT provide a reasonable explanation of why? Maybe it can explain HF's code lol
JohnTheNerd
JohnTheNerdOP•4w ago
Possibly lol I should ask after work
Jason
Jason•4w ago
Maybe got to do with training and then saving it
JohnTheNerd
JohnTheNerdOP•4w ago
The training code is rather brief - nothing crazy, some open source code from vllm that I modified to work with another LLM It's not even training code - it's quantization code
Jason
Jason•4w ago
How long did it take you? I see. Calibrating?
JohnTheNerd
JohnTheNerdOP•4w ago
A few hours on a B200, with another few spent on various failures. I'm getting scales for a kv-cache quantization. I run LLMs at home and I need to quantize my kv-cache down to 8bpw with minimal loss. https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache.html except the example code is broken
riverfog7
riverfog7•4w ago
Ways to go bankrupt fast 😦
Jason
Jason•4w ago
Oh people actually do use high-end gpu for that
JohnTheNerd
JohnTheNerdOP•4w ago
You need to run inference on unquantized model for it... and I am running a 72b at home
riverfog7
riverfog7•4w ago
AWS does it on an 8xH100
JohnTheNerd
JohnTheNerdOP•4w ago
You don't need a high-end GPU for vllm. Well, I don't. But I have an insane setup lol
Jason
Jason•4w ago
Ooh For which of their models
JohnTheNerd
JohnTheNerdOP•4w ago
This runs qwen 2.5 72b with 14k context on 2x3090, with the nice PagedAttention that lets you serve many people at once:
Jason
Jason•4w ago
What about the new Llama? oh wait, even a pair of 4090s doesn't run it
JohnTheNerd
JohnTheNerdOP•4w ago
#!/bin/bash

. /mnt/disk/vllm-venv/bin/activate

export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda-12.4/bin:$PATH
export CUDACXX=/usr/local/cuda-12.4/bin/nvcc

export RAY_memory_monitor_refresh_ms=0
#export OMP_NUM_THREADS=4

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
#export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
export VLLM_FLASHINFER_FORCE_TENSOR_CORES=1
export VLLM_USE_FLASHINFER_SAMPLER=1
#export VLLM_USE_RAY_SPMD_WORKER=1
#export VLLM_USE_RAY_COMPILED_DAG=1
#export VLLM_USE_RAY_COMPILED_DAG_NCCL_CHANNEL=1
export VLLM_USE_TRITON_FLASH_ATTN=1
export VLLM_USE_TRITON_AWQ=1
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_USE_V1=0
export VLLM_ENABLE_V1_MULTIPROCESSING=1
export VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON=1
export VLLM_CUDA_MEM_ALIGN_KV_CACHE=1
#export VLLM_MLA_DISABLE=1

set -e

cd /mnt/disk/models/

vllm serve "./Qwen2.5-72B-Instruct-AWQ" \
--served-model-name="qwen2.5-72b" \
--max-model-len="14000" \
--dtype="auto" \
--gpu-memory-utilization="0.993" \
--distributed-executor-backend="mp" \
--enable-chunked-prefill=false \
--kv-cache-dtype=fp8_e5m2 \
--quantization="awq_marlin" \
--enforce-eager \
--scheduling-policy="priority" \
--tensor-parallel-size="2" \
--swap-space="1" \
--enable-prefix-caching \
--disable-log-requests \
--disable-log-stats \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--host="0.0.0.0" --port="5000"
Jason
Jason•4w ago
That's cool
JohnTheNerd
JohnTheNerdOP•4w ago
Too big and not worth the extra VRAM. 104B for what roughly benchmarks the same as qwen 2.5/llama 3.3
Jason
Jason•4w ago
Ohh
JohnTheNerd
JohnTheNerdOP•4w ago
I could squeeze even more quality out of this poor server if I could get the KV-cache scales. Unfortunately for that I need to put a lot more money in my account lol. Maybe next paycheck... If I can sanely get the RAM...
riverfog7
riverfog7•4w ago
I have a question: why do you need to run a full model for that KV cache scaling
Jason
Jason•4w ago
Has no one ever done this for Qwen before?
riverfog7
riverfog7•4w ago
If you are running a quantized model
JohnTheNerd
JohnTheNerdOP•4w ago
I found no example of such. I am just too short for fp16. I can run an AWQ quant of a 70b with 13k context on an fp16 kv-cache, but those 2 billion extra parameters make it not fit at all. I am 900MB short, and going below AWQ is a significant hit in answer quality. I can get 14k context on fp8 with a 72B model, but at that point I have another choice: mantissa bits vs exponent bits. I currently run with 5 exponent bits and 2 mantissa bits. It visibly impacts quality. If I can get the scales, I can cut another bit out of the exponent and give it to the mantissa, while still being very close to an fp16 KV-cache. ... it's completely insane, I've been working on this setup for years
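For scale, a back-of-the-envelope KV-cache size estimate (assuming Qwen2.5-72B's published attention config of 80 layers, 8 KV heads, head_dim 128; purely illustrative):

layers, kv_heads, head_dim = 80, 8, 128              # assumed Qwen2.5-72B attention config
tokens = 14000
per_token_elems = 2 * layers * kv_heads * head_dim   # K and V entries per token
print(per_token_elems * tokens * 2 / 2**30)          # ~4.3 GiB of KV cache at fp16
print(per_token_elems * tokens * 1 / 2**30)          # ~2.1 GiB at fp8 - roughly the headroom at stake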
riverfog7
riverfog7•4w ago
Hmm
JohnTheNerd
JohnTheNerdOP•4w ago
Years of /r/LocalLLaMA lol. I am still amazed that we can run something in our house that somewhat rivals cloud LLMs
riverfog7
riverfog7•4w ago
Imma try the cache scaling
JohnTheNerd
JohnTheNerdOP•4w ago
It's expensive on large models. Definitely make sure you have more than 2x system RAM vs your weight size lol. Don't make the mistake I made. Would you like my script?
Jason
Jason•4w ago
Sure
riverfog7
riverfog7•4w ago
Sure (proceeds to try it on an A40)
JohnTheNerd
JohnTheNerdOP•4w ago
installing flash-attn takes a long time. The MAX_JOBS I set is for the RAM I have. Might OOM your system (it took over 100GB of RAM during compilation afaict). A networked drive is super useful. You can use a CPU instance to download model weights into /workspace, and set up a virtual env to run the pip commands without eating precious GPU machine hours.
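For reference, a minimal sketch of capping the flash-attn source build so ninja doesn't OOM the machine (the MAX_JOBS value is an assumption to tune to available RAM; the variable and the --no-build-isolation flag come from flash-attn's install instructions):

import os, subprocess

os.environ["MAX_JOBS"] = "8"   # limit parallel compile jobs; each can use several GB of RAM
subprocess.run(["pip", "install", "flash-attn", "--no-build-isolation"], check=True)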
riverfog7
riverfog7•4w ago
Probably because the venv is in a network volume
JohnTheNerd
JohnTheNerdOP•4w ago
No, it is just a stupidly compute-heavy process. I didn't use the networked venv for it, hindsight is 20/20. It eats up a huge amount of RAM. Because of RunPod systems you see a huge number of CPU cores available, and this causes ninja to run lots of tasks, making you OOM, so MAX_JOBS is a must. I found that it ate 16 CPU cores consistently for 30ish minutes - hence I recommend the networked venv
riverfog7
riverfog7•4w ago
you should install a prebuilt wheel
JohnTheNerd
JohnTheNerdOP•4w ago
I couldn't find one that works Maybe it's the B200
JohnTheNerd
JohnTheNerdOP•4w ago
GitHub
GitHub - mjun0812/flash-attention-prebuild-wheels: Provide with pre...
Provide with pre-build flash-attention package wheels using GitHub Actions - mjun0812/flash-attention-prebuild-wheels
riverfog7
riverfog7•4w ago
u need a 4bit model with 8bit kv cache ig? @JohnTheNerd (sort of) good news: I think it works with 1x A40
JohnTheNerd
JohnTheNerdOP•4w ago
yep how? that won't even fit a single layer
riverfog7
riverfog7•4w ago
bad news is
riverfog7
riverfog7•4w ago
No description
riverfog7
riverfog7•4w ago
with CPU offload turned on. maybe it will work with your local machine
JohnTheNerd
JohnTheNerdOP•4w ago
oh... yeah...
riverfog7
riverfog7•4w ago
# Select model and load it.
# (imports assumed here, following llm-compressor's big-model offload example)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "/workspace/models/Qwen2.5-72B-Instruct"
device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=False,
    num_gpus=1,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
JohnTheNerd
JohnTheNerdOP•4w ago
what's the speed like?
riverfog7
riverfog7•4w ago
better(sort of) now
No description
JohnTheNerd
JohnTheNerdOP•4w ago
how many seconds per iteration?
riverfog7
riverfog7•4w ago
idk cuz it didnt even complete 1 iteration
JohnTheNerd
JohnTheNerdOP•4w ago
hahahahahaahaha
riverfog7
riverfog7•4w ago
to be fair, only about 3 minutes have passed
JohnTheNerd
JohnTheNerdOP•4w ago
assuming 5 minutes per iteration, that's... over a week for 2048 samples
riverfog7
riverfog7•4w ago
maybe ill try with 4xA40s
riverfog7
riverfog7•4w ago
lol
No description
riverfog7
riverfog7•4w ago
flash attention cannot run on meta device so it will be slower than that
Jason
Jason•4w ago
Create a new pod
riverfog7
riverfog7•4w ago
About 10 sec per iteration with 4x 3090. 5 hrs total
JohnTheNerd
JohnTheNerdOP•4w ago
that's actually really good how much RAM do you get on that pod?
Jason
Jason•4w ago
Try 4090
riverfog7
riverfog7•4w ago
Im broke 😦 200gigs? Im gonna pray for no OOM
JohnTheNerd
JohnTheNerdOP•4w ago
I think you'll get an OOM. I had more and I got an OOM
Jason
Jason•4w ago
What is the process that you're doing called?
JohnTheNerd
JohnTheNerdOP•4w ago
I'm not sure it has an official name. I'm collecting KV-cache quantization scaling factors. the vllm link above has more information
Jason
Jason•4w ago
Okay thanks!
Jason
Jason•4w ago
seems like its this part
No description
JohnTheNerd
JohnTheNerdOP•4w ago
yes it is. but the code there didn't work for me. see my script above for what does work at least until the OOM lol
riverfog7
riverfog7•4w ago
Im trying with 32 samples To see if it saves
JohnTheNerd
JohnTheNerdOP•4w ago
that makes sense. the OOM doesn't kill your process, it freezes the entire pod
Jason
Jason•4w ago
Okk
riverfog7
riverfog7•4w ago
Currently praying
JohnTheNerd
JohnTheNerdOP•4w ago
I'll get a pod of my own and keep trying once I get paid. until then, I'll go to sleep since I work in the AM lol
riverfog7
riverfog7•4w ago
I can't save either because of a bug. it complains when a model is offloaded
riverfog7
riverfog7•4w ago
trying this
No description
riverfog7
riverfog7•4w ago
No description
riverfog7
riverfog7•4w ago
that is the maximum usage so you probably needed like 10 more gigs of ram 😦
JohnTheNerd
JohnTheNerdOP•4w ago
😭 is this all you changed for it to work with multi GPUs?
riverfog7
riverfog7•4w ago
riverfog7
riverfog7•4w ago
and your model config(?) part is wrong. you need to quant it to 4-bit int for it to fit in 2x RTX 3090
JohnTheNerd
JohnTheNerdOP•4w ago
I was just hoping to get the kv cache stuff. I use the AWQ quant because it's much, much better than a straight 4bpw quant. maybe I don't even need to quantize the model lol
riverfog7
riverfog7•4w ago
yeah, you can do only kv cache quants. it's somewhere in the llmcompressor repo, in their test suite
riverfog7
riverfog7•4w ago
GitHub
llm-compressor/tests/e2e/vLLM/recipes/kv_cache/default.yaml at main...
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM - vllm-project/llm-compressor
riverfog7
riverfog7•4w ago
here idk why they hid it so deep
JohnTheNerd
JohnTheNerdOP•4w ago
that's sure hidden deep interesting thanks!
Jason
Jason•4w ago
That's for tests.. Maybe it should be documented in vllm
Madiator2011
Madiator2011•4w ago
if pod is using too much ram it will throw oom errors
JohnTheNerd
JohnTheNerdOP•4w ago
it does not. it just completely freezes
riverfog7
riverfog7•4w ago
The process doesnt get killed and just freezes the entire thing
Madiator2011
Madiator2011•4w ago
Yup
riverfog7
riverfog7•4w ago
With no errors
JohnTheNerd
JohnTheNerdOP•4w ago
the entire pod simply freezes - no OOM errors. I do wish we could have swap in any way... Docker supports it, one would only need it implemented in the docker run command RunPod executes. the fact that system RAM is limited by GPU count without any way of swapping is extremely limiting :/
Madiator2011
Madiator2011•4w ago
You can deploy pod with higher ram
JohnTheNerd
JohnTheNerdOP•4w ago
it's especially compounded by having a limit of 6 GPUs
Madiator2011
Madiator2011•4w ago
Use filter option
JohnTheNerd
JohnTheNerdOP•4w ago
even the B200 system doesn't have enough RAM for this workload. and the RAM is simply used once, at the end, not even continuously. the only way is to get 6 GPUs, which is very wasteful when you just need system RAM
Madiator2011
Madiator2011•4w ago
For B200 you can get 283 GB RAM
JohnTheNerd
JohnTheNerdOP•4w ago
and even then, you cap out at some point. if you wanted to do this on slightly larger models, say Mistral Large, you're out of luck. yes, that was my pod, which froze. hence I wish there was some way to swap - RAM is expensive, swap is cheap. obviously I have to pay up for it, but paying for a second B200 hurts when all you want is RAM lol
Madiator2011
Madiator2011•4w ago
If it requires more than that it could be problematic. Not that simple, as swap basically uses SSD storage, causing faster wear
JohnTheNerd
JohnTheNerdOP•4w ago
that's very fair - I appreciate the honesty
Madiator2011
Madiator2011•4w ago
So in both cases there is a technical loss. And usually people rent pods for GPUs with a lot of VRAM šŸ˜…
JohnTheNerd
JohnTheNerdOP•4w ago
I think I am the only person who needs both lol it's because of such a stupid bug, too...
Madiator2011
Madiator2011•4w ago
What kind of bug? Tried submitting an issue on their GitHub?
JohnTheNerd
JohnTheNerdOP•4w ago
model.save_pretrained tries to write the weights to RAM. twice. you can imagine the joy it is to find that out with 150GB of weights sitting in VRAM. I'm guessing it's deep in the transformers library - which is what loads the weights initially. I suspect there's no chance of a GitHub issue being seen lol
Madiator2011
Madiator2011•4w ago
Diffusers? Or something else?
JohnTheNerd
JohnTheNerdOP•4w ago
transformers
riverfog7
riverfog7•4w ago
Cpu offloading breaks save too
JohnTheNerd
JohnTheNerdOP•4w ago
also can't have flash attention with CPU offloading. my understanding from qwen 2 (not necessarily 2.5) is that it really, really likes flash attention. heard many reports of broken output without flash attention
Madiator2011
Madiator2011•4w ago
So what are you doing?
riverfog7
riverfog7•4w ago
Quantization of KV cache To fp8
JohnTheNerd
JohnTheNerdOP•4w ago
short version: I'm trying to get some magical "scales" to quantize my kv cache more optimally
riverfog7
riverfog7•4w ago
With scale factors
Madiator2011
Madiator2011•4w ago
GitHub
safetensor/mmap memory leak when per-layer weights are converted do...
System Info While working on GTPQModel which does gptq quantization of hf models and load each layer on to gpu, quantize, and then move layer back to cpu for vram reduction, we noticed a huge cpu m...
JohnTheNerd
JohnTheNerdOP•4w ago
this requires me to run inference on the whole model in fp16 thousands of times to calibrate a set of scalars. that's interesting, but I suspect it's not the issue I have. I don't have any issues moving weights to the GPU, and do not convert dtypes at all. the entire process runs just fine. right at the end, when I call save, it eats the entire system RAM
Madiator2011
Madiator2011•4w ago
Anyway late here so bed time for me
JohnTheNerd
JohnTheNerdOP•4w ago
fair enough - maybe I'll get another pod today and try again with a lot more RAM this time. I'll post here how it goes lol
riverfog7
riverfog7•4w ago
8x 3090 100 percent works
JohnTheNerd
JohnTheNerdOP•4w ago
can you get 8? i thought cap was 6
riverfog7
riverfog7•4w ago
CUDA OOMed it tho
JohnTheNerd
JohnTheNerdOP•4w ago
huh, I'll give it a shot. oh, CUDA OOM'ed it?
riverfog7
riverfog7•4w ago
So 8 should work Yeah for 7x3090
JohnTheNerd
JohnTheNerdOP•4w ago
oh ok, that's during weight loading. also I'll have flash attention, which saves a bit
riverfog7
riverfog7•4w ago
No it happened while quantizing and i had flash attn on
JohnTheNerd
JohnTheNerdOP•4w ago
huh, ok then 8 should work. lots of RAM too
riverfog7
riverfog7•4w ago
Use the wheels here it works well
JohnTheNerd
JohnTheNerdOP•4w ago
perfect thank you!
riverfog7
riverfog7•4w ago
No build time magic šŸ˜„
JohnTheNerd
JohnTheNerdOP•4w ago
I'll share the scales if I get it working. I wasted at least an hour of B200 time on just this lol. only the 4090 can give me 8 at a time it seems. still workable - and a whopping 880GB RAM, which should definitely be enough
riverfog7
riverfog7•4w ago
Yeah, you need about 300 gigs
Jason
Jason•4w ago
btw why do you use GPTQModifier in the quant instead of only kv_cache_scheme
riverfog7
riverfog7•4w ago
im using this now
No description
riverfog7
riverfog7•4w ago
he uses AWQ, which llmcompressor does not support. GPTQ was just for testing
Jason
Jason•4w ago
ohh ic, thanks. oh, now to fp8? what model are you doing
riverfog7
riverfog7•4w ago
kv cache to fp8, weights to int4 (with AWQ). Qwen2.5-72B-Instruct, this one
Jason
Jason•4w ago
ooh is the process still running?
riverfog7
riverfog7•4w ago
Yes
riverfog7
riverfog7•4w ago
No description
Jason
Jason•4w ago
ah using what gpu?
riverfog7
riverfog7•4w ago
8x A-something with 24 gig VRAM
Jason
Jason•4w ago
ohh a5000*?
riverfog7
riverfog7•4w ago
I think thats right
riverfog7
riverfog7•4w ago
No description
Jason
Jason•4w ago
Seems like a good deal
riverfog7
riverfog7•4w ago
actually it's not my money šŸ˜„
Jason
Jason•4w ago
Ohh That's nice
riverfog7
riverfog7•4w ago
its saving now
riverfog7
riverfog7•4w ago
No description
Jason
Jason•4w ago
Wohooo, will you publish it? how long did it take in total
riverfog7
riverfog7•4w ago
yes maybe about 1.5hrs?
Jason
Jason•4w ago
oh, quite efficient
riverfog7
riverfog7•4w ago
it says 1 hr 20 min for quantizing only. pod uptime is 2 hr due to model downloading and installing dependencies
Jason
Jason•4w ago
ic
riverfog7
riverfog7•4w ago
and bc of my stupidity
Jason
Jason•4w ago
yeah still faster than a few hours in b200 wow
riverfog7
riverfog7•4w ago
i selected the wrong version of pytorch šŸ˜„ saving takes a lot of time tho
riverfog7
riverfog7•4w ago
the code
riverfog7
riverfog7•4w ago
./models/Qwen2.5-72B-Instruct-W4A16-FP8-KV (this should be ./models/Qwen2.5-72B-Instruct-FP8-KV)
Jason
Jason•4w ago
what's the difference? what's the second one? so who paid for this run haha
riverfog7
riverfog7•4w ago
W4A16 means weights quantized to 4-bit and activations 16-bit, but i didn't quantize any. my (previous) school ig?
Jason
Jason•4w ago
woah
riverfog7
riverfog7•4w ago
got about 500$ for research funds
Jason
Jason•4w ago
ig? why i guess noicee
riverfog7
riverfog7•4w ago
technically the school's property, but only i can use it
Jason
Jason•4w ago
hahah okay i see
riverfog7
riverfog7•4w ago
it's writing to disk now, almost done
JohnTheNerd
JohnTheNerdOP•4w ago
ooo awesome! i just filled my RunPod account with $15 without checking lol. I can do another model for you if you want
riverfog7
riverfog7•4w ago
its finished now
Jason
Jason•4w ago
ā˜ŗļøAnother time hahah
JohnTheNerd
JohnTheNerdOP•4w ago
meanwhile I did something that may be useful. I'm running a benchmark suite on my qwen setup. I will re-run it with the scales too
Jason
Jason•4w ago
Did you estimate after this quant it'll run on your home server or what
JohnTheNerd
JohnTheNerdOP•4w ago
unfortunately, because it's local it's slooooow lol - 12000 prompts to run on two 3090s. I'm already running fp8 kv cache - just with e5m2. that's what the benchmarks are running on
Jason
Jason•4w ago
Oh is it a bigger version of this?
JohnTheNerd
JohnTheNerdOP•4w ago
no
riverfog7
riverfog7•4w ago
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      kv_cache_scheme: {num_bits: 8, type: float, symmetric: true, strategy: tensor}
is this right tho?
JohnTheNerd
JohnTheNerdOP•4w ago
yes it is
riverfog7
riverfog7•4w ago
what does symmetric mean
Jason
Jason•4w ago
It's fp8 kv cache too?
JohnTheNerd
JohnTheNerdOP•4w ago
yes e5m2 is one sign bit, 5 exponent bits, 2 mantissa bits
Jason
Jason•4w ago
And this?
riverfog7
riverfog7•4w ago
e4m3 is one sign bit, 4 exponent bits, 3 mantissa bits
JohnTheNerd
JohnTheNerdOP•4w ago
you can choose to do e4m3 instead. but the exponent in a float determines the range of numbers you can represent, which makes it awful. this (the scaling) helps e4m3
Jason
Jason•4w ago
I have the response from openai's model if you want lol
JohnTheNerd
JohnTheNerdOP•4w ago
I don't know what symmetric is so I'm curious too lol
Jason
Jason•4w ago
So this one is e4m3? Explain what's the effect of the exponent and mantissa bit counts lol
JohnTheNerd
JohnTheNerdOP•4w ago
the idea is that you can try to have a list of numbers you multiply the kv-cache entirely by. this lets you get a little closer to fp16 even with the smaller range. let's consider a floating point number: -35x10^6. the minus is the sign bit, plus or minus. we're left with 7 bits. the 35 would be the mantissa, and the 6 would be the exponent (roughly so)
Jason
Jason•4w ago
- strategy: tensor
Indicates that the quantization is applied at the tensor level (as opposed to, say, per channel), meaning the same quantization parameters might be used for entire tensors.
- dynamic: false
This means that the quantization parameters (such as scaling factors) are fixed after calibration rather than being computed on the fly ("dynamic quantization"). Fixed parameters can sometimes lead to more stable behavior.
- symmetric: true
Signals that the quantization should be symmetric around zero. In symmetric quantization, the "zero point" is fixed to zero, and the range is symmetric (e.g., -X to +X). This can simplify the arithmetic and sometimes improve performance.
JohnTheNerd
JohnTheNerdOP•4w ago
if I only have two bits for the mantissa, I cannot represent 35 anymore. I must round it to a number I can represent
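A small illustration of that rounding (assuming PyTorch 2.1+; the cast rounds to the nearest representable fp8 value):

import torch

x = torch.tensor([35.0])
print(x.to(torch.float8_e5m2).float())    # 2 mantissa bits: 35 isn't representable, rounds to 32
print(x.to(torch.float8_e4m3fn).float())  # 3 mantissa bits: rounds to 36, much closer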
Jason
Jason•4w ago
ic, still hard to understand. maybe i'm lacking the bits thing, is ok. let me chatgpt it to dig deeper, thanks!
JohnTheNerd
JohnTheNerdOP•4w ago
the 6 is the exponent. this effectively determines the range in which I can represent numbers - as I cannot, say, have a 10^500 with a 4-bit exponent
JohnTheNerd
JohnTheNerdOP•4w ago
since 500 doesn't fit in 4 bits. oooooo thank you! I'll try it out after the benchmarks run
Jason
Jason•4w ago
wow in s3?
riverfog7
riverfog7•4w ago
that's only the log and code files. uploading the model now
Jason
Jason•4w ago
i thought it was the model 🤣🤣
riverfog7
riverfog7•4w ago
no way 140gigs is uploading that fast
Jason
Jason•4w ago
yes way, lucky connections
JohnTheNerd
JohnTheNerdOP•4w ago
can I have the kv scales? should be much smaller
Jason
Jason•4w ago
wait you can do that?
JohnTheNerd
JohnTheNerdOP•4w ago
so the operation we are doing doesn't actually care about the model weight outputs. to explain, I must go back here
Jason
Jason•4w ago
the output isnt the whole model? is just the scales?
JohnTheNerd
JohnTheNerdOP•4w ago
yep, the scales are basically a lot of numbers. say I happened to have a lot of GPU power. I can run everything at its full precision for a little bit, paid by the hour
riverfog7
riverfog7•4w ago
how tho i saved the full model
JohnTheNerd
JohnTheNerdOP•4w ago
it's just a json file, kv_cache_something i think. kv_cache_scales.json https://docs.vllm.ai/en/v0.6.3/quantization/fp8_e4m3_kvcache.html
Jason
Jason•4w ago
ahh i see
JohnTheNerd
JohnTheNerdOP•4w ago
I would run thousands of prompts. I would check how much that fp8 kv cache actually differs from the fp16 kv cache and come up with a set of numbers that, when multiplied with parts of the fp8 cache, get as close to the fp16 versions as possible. those are my scales
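A minimal sketch of that per-tensor, symmetric scaling idea (toy data; not the llm-compressor implementation):

import torch

kv_slice = torch.randn(4, 128)                    # stand-in for calibration KV activations
fp8_max = torch.finfo(torch.float8_e4m3fn).max    # 448 for e4m3
scale = kv_slice.abs().amax() / fp8_max           # symmetric: one scale, zero point fixed at 0
quantized = (kv_slice / scale).to(torch.float8_e4m3fn)
reconstructed = quantized.float() * scale         # what inference would see
print((kv_slice - reconstructed).abs().max())     # calibration tries to keep this error small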
Jason
Jason•4w ago
in short batch processing-like right?
riverfog7
riverfog7•4w ago
it didnt save that maybe its fused
JohnTheNerd
JohnTheNerdOP•4w ago
correct interesting
Jason
Jason•4w ago
yeah because the code is like
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save quantized model
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV"
model.save_pretrained(SAVE_DIR, save_compressed=True)
means its saves the whole modified model isnt it?
JohnTheNerd
JohnTheNerdOP•4w ago
yep, that's right
riverfog7
riverfog7•4w ago
yeah
JohnTheNerd
JohnTheNerdOP•4w ago
I found this
JohnTheNerd
JohnTheNerdOP•4w ago
GitHub
vllm/examples/fp8/extract_scales.py at v0.6.6 · vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs - vllm-project/vllm
JohnTheNerd
JohnTheNerdOP•4w ago
GitHub
[Bug]: The FP8 models and FP8 KV-Cache-Scales loaded together faile...
Your current environment Collecting environment information... PyTorch version: 2.3.1+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.3...
JohnTheNerd
JohnTheNerdOP•4w ago
I believe they extract it from the model
Jason
Jason•4w ago
but yeah that isn't using the llmcompressor, yeah nvm don't really know how this extractor thing works
JohnTheNerd
JohnTheNerdOP•4w ago
llm-compressor is by vllm too. I suspect it'll work fine. what happens if you don't set compressed=true, I wonder
riverfog7
riverfog7•4w ago
I think It doesnt use compressed tensor format
JohnTheNerd
JohnTheNerdOP•4w ago
I see. in any case I will take a better look at the weights tomorrow - I should go to bed, it's 2am here lol. I suspect it's still uploading on your end anyway
riverfog7
riverfog7•4w ago
Yeah
Jason
Jason•4w ago
GitHub
llm-compressor/src/llmcompressor/args/README.md at c5dbf0cdb1364c40...
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM - vllm-project/llm-compressor
riverfog7
riverfog7•4w ago
I think it is fused. Got an error related to k_scales when saving with CPU offload last time
JohnTheNerd
JohnTheNerdOP•4w ago
ugh, any ideas how to extract it out?
riverfog7
riverfog7•4w ago
riverfog7/Qwen2.5-72B-Instruct-FP8-KV
JohnTheNerd
JohnTheNerdOP•4w ago
yep, it's fused
riverfog7
riverfog7•4w ago
"model.layers.14.self_attn.k_scale": "model-00006-of-00031.safetensors",
JohnTheNerd
JohnTheNerdOP•4w ago
I'll look in detail tomorrow
riverfog7
riverfog7•4w ago
okay so
JohnTheNerd
JohnTheNerdOP•4w ago
yes
riverfog7
riverfog7•4w ago
have to extract that šŸ˜„
JohnTheNerd
JohnTheNerdOP•4w ago
this can be extracted, just a pain
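A rough sketch of pulling the fused scales back out of the saved shards (directory name is hypothetical, and the resulting JSON is not necessarily in the exact layout vllm's old kv_cache_scales.json loader expects):

import glob, json
from safetensors import safe_open

scales = {}
for shard in sorted(glob.glob("Qwen2.5-72B-Instruct-FP8-KV/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            if key.endswith(("k_scale", "v_scale")):   # e.g. model.layers.14.self_attn.k_scale
                scales[key] = f.get_tensor(key).item()

with open("kv_cache_scales.json", "w") as out:
    json.dump(scales, out, indent=2)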
riverfog7
riverfog7•4w ago
how about just quantizing the model to int4 with fp8 kv cache and loading that instead
JohnTheNerd
JohnTheNerdOP•4w ago
I could do that, but I suspect it'll reduce quality significantly. I have a different idea... I'm thinking of just taking those scales and injecting them into the safetensors file for the awq quant, and throwing that all in. AWQ is nice because it relies on calibration to determine the most important 1.5% of weights. then it leaves those at fp16, quantizing the rest to int4. maybe a support thread in the runpod discord isn't the best place to discuss this tho lol
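A very rough sketch of that "inject the scales into the AWQ checkpoint" idea (untested; the shard file names are hypothetical, and the AWQ model's safetensors index would also need entries added for the new keys):

from safetensors.torch import load_file, save_file

# hypothetical shard paths, for illustration only
awq_shard = load_file("Qwen2.5-72B-Instruct-AWQ/model-00001-of-00011.safetensors")
scale_shard = load_file("Qwen2.5-72B-Instruct-FP8-KV/model-00006-of-00031.safetensors")

for key, tensor in scale_shard.items():
    if key.endswith(("k_scale", "v_scale")):
        awq_shard[key] = tensor    # copy the calibrated KV scales across

save_file(awq_shard, "Qwen2.5-72B-Instruct-AWQ/model-00001-of-00011.safetensors")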
riverfog7
riverfog7•4w ago
No description
riverfog7
riverfog7•4w ago
its actually better than AWQ
JohnTheNerd
JohnTheNerdOP•4w ago
interesting I should try it
riverfog7
riverfog7•4w ago
its for qwen2 tho and you can calibrate while quantizing the weights
JohnTheNerd
JohnTheNerdOP•4w ago
that's true
riverfog7
riverfog7•4w ago
like the kv cache yeah
JohnTheNerd
JohnTheNerdOP•4w ago
I'll do that, yes. good thing I have the $17 in my account lol, that should be way more than enough
Jason
Jason•4w ago
I don't see the scales in the model index. isn't it just an index referencing layers to the model file splits? I wonder what the scales file looks like
riverfog7
riverfog7•4w ago
yeah actual data is in the .safetensors file
Jason
Jason•4w ago
Ic. Are there scripts to run these benchmarks
riverfog7
riverfog7•4w ago
it's from the qwen docs i think. there is some llm benchmarking software, so maybe use that?
Jason
Jason•4w ago
I ma look that up and try that someday lol
riverfog7
riverfog7•4w ago
btw this thread has become VERY massive
Jason
Jason•4w ago
Hahah no worries right Great job
riverfog7
riverfog7•4w ago
KV cache calibration is finished. GPTQ quanting left
riverfog7
riverfog7•4w ago
No description
riverfog7
riverfog7•4w ago
I think he will need more than 15$ for the quantizing
riverfog7
riverfog7•4w ago
No description
riverfog7
riverfog7•4w ago
No description
Jason
Jason•4w ago
Btw why do you choose the gptq quant instead like fp8 or anything else
riverfog7
riverfog7•4w ago
his gpu is 2x 3090. fp8 doesn't fit in that
Jason
Jason•4w ago
int4
riverfog7
riverfog7•4w ago
and the benchmarks said that gptq 4-bit performs better than awq, so i went with int4 weights, fp16 activations, with GPTQ. it's int4 - W4A16
Jason
Jason•4w ago
oh the int4 is using gptq in vllm docs
riverfog7
riverfog7•4w ago
idk about that tho there's two types of quantization methods in llmcompressor
Jason
Jason•4w ago
i see hmm what is the other one?
riverfog7
riverfog7•4w ago
QuantizationModifier and GPTQModifier
Jason
Jason•4w ago
meta's ai is free in whatsapp wow
riverfog7
riverfog7•4w ago
idk whats the difference but i used GPTQModifier
Jason
Jason•4w ago
i think gptq is more complicated
riverfog7
riverfog7•4w ago
okay its finished finally
Jason
Jason•4w ago
Yay
riverfog7
riverfog7•4w ago
fuck disk quota exceeded
Jason
Jason•4w ago
Oof Delete some other model
riverfog7
riverfog7•4w ago
(needs to wait another 5 hours)
Jason
Jason•4w ago
Hmm What for? You mean repeat the process?
riverfog7
riverfog7•4w ago
yeah, it's in a py file. the process got killed
Jason
Jason•4w ago
The importance of error handling šŸ˜… that's sad
riverfog7
riverfog7•4w ago
I'm on 2x H200 now, much faster. iterations per second instead of seconds per iteration
Jason
Jason•4w ago
Hahah
riverfog7
riverfog7•3w ago
about 1 hr left, i hate myself. It sort of finished, but why is the safetensors file size similar to the original model if it is a 4-bit quantized model? something's wrong
Jason
Jason•3w ago
Howd you save it
riverfog7
riverfog7•3w ago
it saves by itself
No description
riverfog7
riverfog7•3w ago
the recipe
No description
Jason
Jason•3w ago
Ah try load it then
riverfog7
riverfog7•3w ago
after uploading ill try loading with 2xA40 should work
JohnTheNerd
JohnTheNerdOP•3w ago
that doesn't sound right
JohnTheNerd
JohnTheNerdOP•3w ago
that quantization config looks wrong
riverfog7
riverfog7•3w ago
its this
JohnTheNerd
JohnTheNerdOP•3w ago
I failed to figure it out lol
riverfog7
riverfog7•3w ago
Lol
