Pod ran out of CPU RAM

I somehow managed to run out of RAM (not VRAM, system RAM)... right after a very compute-heavy operation (calculating quantized KV-Cache scales)... while running model.save_pretrained... while the weights are still in VRAM... The pod is still running, but completely unresponsive. Now that you're done laughing at my misfortune, is there anything at all I can do to save those weights? Even enabling some swap would be completely fine... I just want the weights to save to the networked drive... Pod ID: tybrzp4aphrz3d
350 Replies
riverfog7
riverfog76mo ago
You should contact support on their website without terminating the pod
JohnTheNerd
JohnTheNerdOP6mo ago
OK - thanks. Hopefully they get back to me soon...
riverfog7
riverfog76mo ago
If the process got killed, there is no way to recover data soo
JohnTheNerd
JohnTheNerdOP6mo ago
I know that the process is alive and the data is still stored in VRAM. I ran into similar issues with local containers that ran out of memory; simply adding some memory (whether it's RAM or swap) will immediately bring it back to life. It's merely thrashing as it tries to clear the disk cache while new data is being written. Still don't know how it managed to eat that much, the weights are 140GB and I have 283GB of RAM...
riverfog7
riverfog76mo ago
Wow, if it's an H100 you're burning money fast. Hope support reaches you soon
JohnTheNerd
JohnTheNerdOP6mo ago
It's a B200. I'm burning more money than I'd like.....
riverfog7
riverfog76mo ago
Lol
JohnTheNerd
JohnTheNerdOP6mo ago
It would be very funny if it wasn't my pod lol
No description
riverfog7
riverfog76mo ago
Maybe 2 instances of the model loaded into system RAM?
JohnTheNerd
JohnTheNerdOP6mo ago
That's very possible. I guess it might be trying to load it into RAM while it writes to disk or something. Sad part is that the file I want is only a few megabytes, but the only way to get it is to call model.save_pretrained
riverfog7
riverfog76mo ago
Ohh you running quantization?
JohnTheNerd
JohnTheNerdOP6mo ago
Not quite. I'm calculating quantized KV scale factors. The idea is to be able to quantize the KV-cache down to 8 bits while losing very, very little accuracy. You can take an extra bit out of the exponent, making KV-cache entries e4m3 (with one sign bit) instead of e5m2. However, this shrinks the numerical range you can represent, since you just removed a whole bit from the floating-point exponent. If you happen to have some magic numbers you can multiply each KV-cache entry by, calibrated by running thousands of inferences without any quantization on a very powerful GPU... you still wouldn't quite get to non-quantized quality, but you'd get quite close
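As a rough sketch of that "magic numbers" idea (illustrative only - the names and shapes below are made up, and this is not vLLM's actual code path): a single calibrated scale squeezes the observed fp16 range into e4m3's much smaller range, and dequantization multiplies it back out.

import torch

def fake_quant_e4m3(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # quantize with the calibrated scale, then dequantize back to fp16
    q = (x / scale).to(torch.float8_e4m3fn)
    return q.to(torch.float16) * scale

kv = torch.randn(4, 128, dtype=torch.float16) * 300            # pretend KV-cache slice
scale = kv.abs().max() / torch.finfo(torch.float8_e4m3fn).max  # what calibration produces
err = (kv - fake_quant_e4m3(kv, scale)).abs().mean().item()
print(f"mean abs error with a calibrated scale: {err:.4f}")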
riverfog7
riverfog76mo ago
yeah saw that on vllm docs
JohnTheNerd
JohnTheNerdOP6mo ago
Unfortunately I do not have 140GB of VRAM at home to go calculate my own scale factors lol
riverfog7
riverfog76mo ago
doesnt it work on cpu?
JohnTheNerd
JohnTheNerdOP6mo ago
... I don't have 140GB of RAM, either. It's also painfully slow on CPU, and Flash Attention won't work. AFAICT Qwen really wants Flash Attention - people are saying the model breaks pretty badly without it
riverfog7
riverfog76mo ago
maybe the "running thousands of inference without any quantization on a very powerful GPU" part is a bottleneck if you can run it on CPU you can always rent some high mem machines
JohnTheNerd
JohnTheNerdOP6mo ago
It's not very different than simply running the model a few thousand times. But that's not very fast when you are running a 70b at full precision lol
riverfog7
riverfog76mo ago
yeah
JohnTheNerd
JohnTheNerdOP6mo ago
Apparently they're based in New Jersey, and it's 11:30PM there ... maybe I should just stop the money burning
riverfog7
riverfog76mo ago
i think so too
JohnTheNerd
JohnTheNerdOP6mo ago
Well, I tried lol
riverfog7
riverfog76mo ago
I have a question: if it is running inference over and over can it do the calibration layer by layer?
JohnTheNerd
JohnTheNerdOP6mo ago
It can, actually, yeah. That's a very good point. Qwen 2.5 has 80 layers... one layer at a time would probably easily fit on a GPU I have at home
riverfog7
riverfog76mo ago
probably hurts to implement tho (if there is no implementation)
JohnTheNerd
JohnTheNerdOP6mo ago
There is definitely no implementation. Even the code in vllm's docs is broken lol, I had to modify it to get it to work at all. Then I had the rude awakening of "you can't do this with a quantized model"... and here we are
riverfog7
riverfog76mo ago
lol no multi-gpu implementation too?
JohnTheNerd
JohnTheNerdOP6mo ago
Nope
riverfog7
riverfog76mo ago
sad
JohnTheNerd
JohnTheNerdOP6mo ago
Well, maybe actually Not that it matters with 140GB of weights lol
JohnTheNerd
JohnTheNerdOP6mo ago
GitHub
GitHub - vllm-project/llm-compressor: Transformers-compatible libra...
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM - vllm-project/llm-compressor
JohnTheNerd
JohnTheNerdOP6mo ago
They just pass in an AutoModelForCausalLM.from_pretrained model to the library
riverfog7
riverfog76mo ago
GitHub
The new version 0.3.0 takes a long time for quantization and eventu...
Describe the bug I used the sample code (W8A16) to quantize THUDM/glm-4-9b-chat-hf on an Nvidia 4090, and the entire process was very slow (nearly 24 hours), with extremely high memory usage, to th...
JohnTheNerd
JohnTheNerdOP6mo ago
Interesting - that allows multi-GPU. I wonder if I could implement some sort of per-layer processing... It would be miserably slow for sure, especially since I can't do much batching without seriously modifying the library
riverfog7
riverfog76mo ago
GitHub
llm-compressor/examples/big_models_with_accelerate/README.md at mai...
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM - vllm-project/llm-compressor
JohnTheNerd
JohnTheNerdOP6mo ago
Hmm... combining them all, I just might be able to fit it in a lot of small GPUs... saving lots of money. Probably still un-doable at home though since I don't think I can pass a quantized model through at all Thank you!
riverfog7
riverfog76mo ago
try 4x A40s 192GB VRAM and about 1/6 of B200 pricing
JohnTheNerd
JohnTheNerdOP6mo ago
How can one get more RAM in runpod? even if it's swap. My understanding is that because I'm in a container, I can't just add in swap
riverfog7
riverfog76mo ago
Same for me
riverfog7
riverfog76mo ago
GitHub
OOM during save_pretrained of compressed model · Issue #1183 · vl...
Describe the bug The OOM was for CPU RAM. GPU RAM usage was normal, the model takes up less than half of the GPU. This was hitting the llmcompressor's modified save_pretrained_wrapper from llm-...
riverfog7
riverfog76mo ago
same issue. looks like quanting on CPU should be possible
JohnTheNerd
JohnTheNerdOP6mo ago
Yes, I can see the same frustration in the comments section lol Maybe I'll just go make an EC2 instance with a lot of EBS storage, enable swap, and go away for a month lol Probably cheaper...
riverfog7
riverfog76mo ago
try these
JohnTheNerd
JohnTheNerdOP6mo ago
Hm? I didn't see any suggestions in the GitHub issue
riverfog7
riverfog76mo ago
they are cheap with spot requests
No description
riverfog7
riverfog76mo ago
no GPUs tho
JohnTheNerd
JohnTheNerdOP6mo ago
I suspect if I'm going CPU, I can go much much cheaper
riverfog7
riverfog76mo ago
yeah, and if you are going spot, look for spot savings of ~90% - no one is using them and they don't get terminated as often
JohnTheNerd
JohnTheNerdOP6mo ago
Spot seems iffy. I use AWS at work and have been evicted before - especially for long-running workloads But nothing stops me from getting like a c7a.medium for a month, just letting it churn all day all night, with some EBS as swap
riverfog7
riverfog76mo ago
thats right and go with instance store rather than EBS if that's possible
JohnTheNerd
JohnTheNerdOP6mo ago
That's true - it's gonna be a lot faster
riverfog7
riverfog76mo ago
NVME powerr 😄
JohnTheNerd
JohnTheNerdOP6mo ago
Yes lol Anyway thanks a lot for your help! Although the results are gone, hope my mistake at least gave people a laugh lol
riverfog7
riverfog76mo ago
I have the same experience with 70B models on a H100 can relate
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
I... couldn't find enough RAM. Maybe it speaks to my horrifying setup, but the pod I was on had 260 something gigabytes and I OOM'd it... I do too, but such is life. I want to re-run it regardless but I need a genuinely stupid amount of RAM to assure myself this will never ever happen again lol
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
My guess is that torch tried to copy the weights to RAM... twice. No idea why it would happen. Seeing as I'm working with a 72b at bf16, that's 288+GB of RAM I need
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
I had a B200, VRAM isn't an issue. System RAM is It can even be swap tbh, but since I'm in a container, I can't have my own swap. Someone has to give it to me during the docker run command. And runpod doesn't have such an option sadly...
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
That's a good point I could get lots of cheaper GPUs and tensor parallelize, but the RAM was what killed my workflow from the start
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
I suspect not. It's all abstracted away from me - torch is what eats the RAM. The line that killed my pod was model.save_pretrained(), and it's hard to avoid that lol
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
Possibly lol I should ask after work
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
The training code is rather brief - nothing crazy, some open source code from vllm that I modified to work with another LLM. It's not even training code - it's quantization code
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
A few hours on a B200, with another few spent on various failures. I'm getting scales for kv-cache quantization. I run LLMs at home and I need to quantize my kv-cache down to 8bpw with minimal loss: https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache.html except the example code is broken
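For reference, the calibration flow on that page is roughly the following (a sketch based on the linked vLLM/llm-compressor docs - import paths and the dataset name vary between llm-compressor versions, so treat it as an outline rather than the exact script used in this thread):

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot  # newer releases: from llmcompressor import oneshot

MODEL_ID = "Qwen/Qwen2.5-72B-Instruct"  # example model id

# Only the KV cache is quantized; weights are left alone.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                symmetric: true
"""

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Run calibration prompts to collect the k/v scales, then save.
# save_pretrained is the step that OOM'd the pod here, so leave plenty of system RAM free.
oneshot(
    model=model,
    dataset="open_platypus",        # any registered calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
model.save_pretrained("Qwen2.5-72B-Instruct-FP8-KV", save_compressed=True)
tokenizer.save_pretrained("Qwen2.5-72B-Instruct-FP8-KV")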
riverfog7
riverfog76mo ago
Ways to go bankrupt fast 😦
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
You need to run inference on unquantized model for it... and I am running a 72b at home
riverfog7
riverfog76mo ago
Aws does it on a 8xH100
JohnTheNerd
JohnTheNerdOP6mo ago
You don't need a high-end GPU for vllm. Well, I don't. But I have an insane setup lol
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
This runs qwen 2.5 72b with 14k context on 2x3090, with the nice PagedAttention that lets you serve many people at once:
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
#!/bin/bash

. /mnt/disk/vllm-venv/bin/activate

export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda-12.4/bin:$PATH
export CUDACXX=/usr/local/cuda-12.4/bin/nvcc

export RAY_memory_monitor_refresh_ms=0
#export OMP_NUM_THREADS=4

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
#export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
export VLLM_FLASHINFER_FORCE_TENSOR_CORES=1
export VLLM_USE_FLASHINFER_SAMPLER=1
#export VLLM_USE_RAY_SPMD_WORKER=1
#export VLLM_USE_RAY_COMPILED_DAG=1
#export VLLM_USE_RAY_COMPILED_DAG_NCCL_CHANNEL=1
export VLLM_USE_TRITON_FLASH_ATTN=1
export VLLM_USE_TRITON_AWQ=1
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_USE_V1=0
export VLLM_ENABLE_V1_MULTIPROCESSING=1
export VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON=1
export VLLM_CUDA_MEM_ALIGN_KV_CACHE=1
#export VLLM_MLA_DISABLE=1

set -e

cd /mnt/disk/models/

vllm serve "./Qwen2.5-72B-Instruct-AWQ" \
--served-model-name="qwen2.5-72b" \
--max-model-len="14000" \
--dtype="auto" \
--gpu-memory-utilization="0.993" \
--distributed-executor-backend="mp" \
--enable-chunked-prefill=false \
--kv-cache-dtype=fp8_e5m2 \
--quantization="awq_marlin" \
--enforce-eager \
--scheduling-policy="priority" \
--tensor-parallel-size="2" \
--swap-space="1" \
--enable-prefix-caching \
--disable-log-requests \
--disable-log-stats \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--host="0.0.0.0" --port="5000"
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
Too big and not worth the extra VRAM. 104B for what roughly benchmarks the same as qwen 2.5/llama 3.3
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
I could squeeze even more quality out of this poor server if I could get the KV-cache scales. Unfortunately for that I need to put a lot more money in my account lol. Maybe next paycheck... if I can sanely get the RAM...
riverfog7
riverfog76mo ago
I have a question: why do you need to run the full model for that kv cache scaling
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
If you are running a quantized model
JohnTheNerd
JohnTheNerdOP6mo ago
I found no example of such. I am just too short for fp16: I can run an AWQ quant of a 70b with 13k context on an fp16 kv-cache, but those 2 billion extra parameters make it not fit at all - I am 900MB short, and going below AWQ is a significant hit in answer quality. I can get 14k context on fp8 with a 72B model, but at that point I have another choice: mantissa bits vs exponent bits. I currently run with 5 exponent bits and 2 mantissa bits, and it visibly impacts quality. If I can get the scales, I can cut another bit out of the exponent and give it to the mantissa, while still being very close to an fp16 KV-cache ... it's completely insane, I've been working on this setup for years
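For anyone following the exponent-vs-mantissa tradeoff being described here, a quick way to see it (assumes a PyTorch build with the float8 dtypes):

import torch

for dtype in (torch.float8_e5m2, torch.float8_e4m3fn):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "eps:", info.eps)
# e5m2 tops out around 57344 with coarse steps; e4m3 tops out at 448 with finer
# steps - hence the per-tensor scales to squeeze real KV values into that range.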
riverfog7
riverfog76mo ago
Hmm
JohnTheNerd
JohnTheNerdOP6mo ago
Years of /r/LocalLLaMA lol I am still amazed that we can run something in our house that somewhat rivals cloud LLM's
riverfog7
riverfog76mo ago
Imma try the cache scaling
JohnTheNerd
JohnTheNerdOP6mo ago
It's expensive on large models. Definitely make sure you have more than 2x system RAM vs your weight size lol. Don't make the mistake I made. Would you like my script?
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
Sure (proceeds to try it on an A40)
JohnTheNerd
JohnTheNerdOP6mo ago
installing flash-attn takes a long time. The MAX_JOBS I set is for the RAM I have - it might OOM your system (it took over 100GB of RAM during compilation afaict). A networked drive is super useful: you can use a CPU instance to download model weights into /workspace, and set up a virtual env to run the pip commands without eating precious GPU machine hours.
riverfog7
riverfog76mo ago
Probably because the venv is in a network volume
JohnTheNerd
JohnTheNerdOP6mo ago
No, it is just a stupidly compute-heavy process. I didn't use the networked venv for it, hindsight is 20/20. It eats up a huge amount of RAM. Because of RunPod systems you see a huge number of CPU cores available, and this causes ninja to run lots of tasks, making you OOM, so MAX_JOBS is a must. I found that it ate 16 CPU cores consistently for 30ish minutes - hence I recommend the networked venv
riverfog7
riverfog76mo ago
you should install a prebuilt wheel
JohnTheNerd
JohnTheNerdOP6mo ago
I couldn't find one that works Maybe it's the B200
JohnTheNerd
JohnTheNerdOP6mo ago
GitHub
GitHub - mjun0812/flash-attention-prebuild-wheels: Provide with pre...
Provide with pre-build flash-attention package wheels using GitHub Actions - mjun0812/flash-attention-prebuild-wheels
riverfog7
riverfog76mo ago
you need a 4-bit model with 8-bit kv cache I guess? @JohnTheNerd (sort of) good news: I think it works with 1xA40
JohnTheNerd
JohnTheNerdOP6mo ago
yep how? that won't even fit a single layer
riverfog7
riverfog76mo ago
bad news is
riverfog7
riverfog76mo ago
No description
riverfog7
riverfog76mo ago
with CPU offload turned on. maybe it will work on your local machine
JohnTheNerd
JohnTheNerdOP6mo ago
oh... yeah...
riverfog7
riverfog76mo ago
# Select model and load it, splitting what doesn't fit on the GPU into CPU RAM.
# (imports added for completeness; calculate_offload_device_map comes from
# llm-compressor's offloading helpers, path as in their big-model examples)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "/workspace/models/Qwen2.5-72B-Instruct"

device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=False,
    num_gpus=1,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
JohnTheNerd
JohnTheNerdOP6mo ago
what's the speed like?
riverfog7
riverfog76mo ago
better(sort of) now
No description
JohnTheNerd
JohnTheNerdOP6mo ago
how many seconds per iteration?
riverfog7
riverfog76mo ago
idk cuz it didnt even complete 1 iteration
JohnTheNerd
JohnTheNerdOP6mo ago
hahahahahaahaha
riverfog7
riverfog76mo ago
to be fair, only about 3 minutes have passed
JohnTheNerd
JohnTheNerdOP6mo ago
assuming 5 minutes per iteration, that's... over a week for 2048 samples
riverfog7
riverfog76mo ago
maybe ill try with 4xA40s
riverfog7
riverfog76mo ago
lol
No description
riverfog7
riverfog76mo ago
flash attention cannot run on meta device so it will be slower than that
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
About 10 sec per iteration with 4x 3090. 5 hrs total
JohnTheNerd
JohnTheNerdOP6mo ago
that's actually really good how much RAM do you get on that pod?
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
Im broke 😦 200gigs? Im gonna pray for no OOM
JohnTheNerd
JohnTheNerdOP6mo ago
I think you'll get an OOM I had more and I got an OOM
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
I'm not sure it has an official name. I'm collecting KV-cache quantization scaling factors - the vllm link above has more information
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
yes it is. but the code there didn't work for me. see my script above for what does work at least until the OOM lol
riverfog7
riverfog76mo ago
Im trying with 32 samples To see if it saves
JohnTheNerd
JohnTheNerdOP6mo ago
that makes sense the OOM doesn't kill your process. it freezes the entire pod
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
Currently praying
JohnTheNerd
JohnTheNerdOP6mo ago
I'll get a pod of my own and keep trying once I get paid. until then, I'll go to sleep since I work in the AM lol
riverfog7
riverfog76mo ago
I can't save either because of a bug - it complains when a model is offloaded
riverfog7
riverfog76mo ago
trying this
No description
riverfog7
riverfog76mo ago
No description
riverfog7
riverfog76mo ago
that is the maximum usage so you probably needed like 10 more gigs of ram 😦
JohnTheNerd
JohnTheNerdOP6mo ago
😭 is this all you changed for it to work with multi GPUs?
riverfog7
riverfog76mo ago
riverfog7
riverfog76mo ago
and your model config(?) part is wrong - you need to quant it to 4-bit int for it to fit in 2x RTX3090
JohnTheNerd
JohnTheNerdOP6mo ago
I was just hoping to get the kv cache stuff. I use the AWQ quant because it's much, much better than a straight 4bpw quant. maybe I don't even need to quantize the model lol
riverfog7
riverfog76mo ago
yeah, you can do only kv cache quants. it's somewhere in the llmcompressor repo, in their test suite
riverfog7
riverfog76mo ago
GitHub
llm-compressor/tests/e2e/vLLM/recipes/kv_cache/default.yaml at main...
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM - vllm-project/llm-compressor
riverfog7
riverfog76mo ago
here idk why they hid it so deep
JohnTheNerd
JohnTheNerdOP6mo ago
that's sure hidden deep interesting thanks!
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
Madiator2011
Madiator20116mo ago
if pod is using too much ram it will throw oom errors
JohnTheNerd
JohnTheNerdOP6mo ago
it does not. it just completely freezes
riverfog7
riverfog76mo ago
The process doesnt get killed and just freezes the entire thing
Madiator2011
Madiator20116mo ago
Yup
riverfog7
riverfog76mo ago
With no errors
JohnTheNerd
JohnTheNerdOP6mo ago
the entire pod simply freezes - no OOM errors. I do wish we could have swap in any way... Docker supports it, one would only need it implemented in the docker run command runpod executes. the fact that system RAM is limited by GPUs without any way of swapping is extremely limiting :/
Madiator2011
Madiator20116mo ago
You can deploy pod with higher ram
JohnTheNerd
JohnTheNerdOP6mo ago
it's especially compounded by having a limit of 6 GPUs
Madiator2011
Madiator20116mo ago
Use filter option
JohnTheNerd
JohnTheNerdOP6mo ago
even the B200 system doesn't have enough RAM for this workload. and the RAM is simply used once, at the end, not even continuously. the only way is to get 6 GPUs, which is very wasteful when you just need system RAM
Madiator2011
Madiator20116mo ago
For B200 you can get 283 GB RAM
JohnTheNerd
JohnTheNerdOP6mo ago
and even then, you cap out at some point. if you wanted to do this on slightly larger models, say Mistral Large, you're out of luck. yes, that was my pod, which froze - hence I wish there was some way to swap. RAM is expensive, swap is cheap. obviously I have to pay up for it, but paying for a second B200 hurts when all you want is RAM lol
Madiator2011
Madiator20116mo ago
If it requires more than that it could be problematic. Not that simple, as swap basically uses SSD storage, causing faster wear
JohnTheNerd
JohnTheNerdOP6mo ago
that's very fair - I appreciate the honesty
Madiator2011
Madiator20116mo ago
So in both cases there is a technical loss. And usually people rent pods for GPUs with a lot of VRAM 😅
JohnTheNerd
JohnTheNerdOP6mo ago
I think I am the only person who needs both lol it's because of such a stupid bug, too...
Madiator2011
Madiator20116mo ago
What kind of bug? Tried submitting an issue on their GitHub?
JohnTheNerd
JohnTheNerdOP6mo ago
model.save_pretrained tries to write the weights to RAM. twice. you can imagine the joy it is to find that out with 150gb of weights sitting in VRAM. I'm guessing it's deep in the transformers library - which is what loads the weights initially. I suspect no chance of a GitHub issue being seen lol
Madiator2011
Madiator20116mo ago
Diffusers? Or something else?
JohnTheNerd
JohnTheNerdOP6mo ago
transformers
riverfog7
riverfog76mo ago
Cpu offloading breaks save too
JohnTheNerd
JohnTheNerdOP6mo ago
also can't have flash attention with cpu offloading. my understanding from qwen 2 (not necessarily 2.5) is that it really, really likes flash attention - heard many reports of broken output without it
Madiator2011
Madiator20116mo ago
So what are you doing?
riverfog7
riverfog76mo ago
Quantization of KV cache To fp8
JohnTheNerd
JohnTheNerdOP6mo ago
short version: I'm trying to get some magical "scales" to quantize my kv cache more optimally
riverfog7
riverfog76mo ago
With scale factors
Madiator2011
Madiator20116mo ago
GitHub
safetensor/mmap memory leak when per-layer weights are converted do...
System Info While working on GTPQModel which does gptq quantization of hf models and load each layer on to gpu, quantize, and then move layer back to cpu for vram reduction, we noticed a huge cpu m...
JohnTheNerd
JohnTheNerdOP6mo ago
this requires me to run inference on the whole model in fp16 thousands of times to calibrate a set of scalars. that's interesting, but I suspect it's not the issue I have. I don't have any issues moving weights to the GPU, and I do not convert dtypes at all - the entire process runs just fine. right at the end when I call save, it eats the entire system RAM
Madiator2011
Madiator20116mo ago
Anyway late here so bed time for me
JohnTheNerd
JohnTheNerdOP6mo ago
fair enough - maybe I'll get another pod today and try again with a lot more RAM this time I'll post here how it goes lol
riverfog7
riverfog76mo ago
8x 3090 100percent works
JohnTheNerd
JohnTheNerdOP6mo ago
can you get 8? i thought cap was 6
riverfog7
riverfog76mo ago
Cuda OOMed it tho
JohnTheNerd
JohnTheNerdOP6mo ago
huh I'll give it a shot oh cuda oom'ed it?
riverfog7
riverfog76mo ago
So 8 should work. Yeah, for 7x3090
JohnTheNerd
JohnTheNerdOP6mo ago
oh ok, that's during weight loading. also I'll have flash attention which saves a bit
riverfog7
riverfog76mo ago
No it happened while quantizing and i had flash attn on
JohnTheNerd
JohnTheNerdOP6mo ago
huh, ok then 8 should work. lots of RAM too
riverfog7
riverfog76mo ago
Use the wheels here it works well
JohnTheNerd
JohnTheNerdOP6mo ago
perfect thank you!
riverfog7
riverfog76mo ago
No build time magic 😄
JohnTheNerd
JohnTheNerdOP6mo ago
I'll share the scales if I get it working. I wasted at least an hour of B200 time on just this lol. only the 4090 can give me 8 at a time it seems. still workable - and a whopping 880GB RAM, which should definitely be enough
riverfog7
riverfog76mo ago
Yeah, you need about 300 gigs
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
im using this now
No description
riverfog7
riverfog76mo ago
he uses AWQ, which llmcompressor does not support. GPTQ was just for testing
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
kv cache to fp8, weights to int4 (with AWQ), Qwen2.5-72B-Instruct - this one
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
Yes
riverfog7
riverfog76mo ago
No description
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
8x Asomething With 24gig vram
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
I think thats right
riverfog7
riverfog76mo ago
No description
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
actually its not my money 😄
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
its saving now
riverfog7
riverfog76mo ago
No description
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
yes maybe about 1.5hrs?
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
it says 1hr 20min for quantizing only. pod uptime is 2hr due to model downloading and installing dependencies
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
and bc of my stupidity
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
i selected the wrong version of pytorch 😄 saving takes a lot of time tho
riverfog7
riverfog76mo ago
the code
riverfog7
riverfog76mo ago
./models/Qwen2.5-72B-Instruct-W4A16-FP8-KV this should be ./models/Qwen2.5-72B-Instruct-FP8-KV
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
W4A16 means weights quantized to 4-bit and activations 16-bit, but I didn't quantize any. my (previous) school ig?
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
got about 500$ for research funds
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
technically it's the school's property, but only I can use it
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
its writing to disk now almost done
JohnTheNerd
JohnTheNerdOP6mo ago
ooo awesome! i just filled my runpod account with $15 without checking lol. I can do another model for you if you want
riverfog7
riverfog76mo ago
its finished now
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
meanwhile I did something that may be useful. I'm running a benchmark suite on my qwen setup. I will re-run it with the scales too
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
unfortunately because it's local it's slooooow lol - 12000 prompts to run on two 3090s. I'm already running fp8 kv cache - just with e5m2. that's what the benchmarks are running on
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
no
riverfog7
riverfog76mo ago
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      kv_cache_scheme: {num_bits: 8, type: float, symmetric: true, strategy: tensor}
is this right tho?
JohnTheNerd
JohnTheNerdOP6mo ago
yes it is
riverfog7
riverfog76mo ago
what does symmetric mean
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
yes e5m2 is one sign bit, 5 exponent bits, 2 mantissa bits
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
e4m3 is one sign bit, 4 exponent bits, 3 mantissa bits
JohnTheNerd
JohnTheNerdOP6mo ago
you can choose to do e4m3 instead. but the exponent in a float determines the range of numbers you can represent, which makes it awful. this helps e4m3
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
I don't know what symmetric is so I'm curious too lol
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
the idea is that you can try to have a list of numbers you multiply the kv-cache entries by. this lets you get a little closer to fp16 even with the smaller range. let's consider a floating point number: -35x10^6. the minus is the sign bit, plus or minus. we're left with 7 bits. the 35 would be the mantissa, and the 6 would be the exponent (roughly so)
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
if I only have two bits for the mantissa, I cannot represent 35 anymore. I must round it to a number I can represent
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
the 6 is the exponent. this effectively determines the range in which I can represent numbers - as I cannot, say, have 10^500 with a 4-bit exponent
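A tiny concrete version of this rounding story (assumes PyTorch float8 support; 35 here stands in for the -35x10^6 example, ignoring the out-of-range exponent):

import torch

x = torch.tensor(35.0)
for dtype in (torch.float8_e5m2, torch.float8_e4m3fn):
    print(dtype, x.to(dtype).to(torch.float32).item())
# with 2 mantissa bits, 35 has to snap to a coarser grid than with 3 mantissa bits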
JohnTheNerd
JohnTheNerdOP6mo ago
since 500 doesn't fit in 4 bits. oooooo thank you! I'll try it out after the benchmarks run
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
that's only the log and code files. uploading the model now
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
no way 140gigs is uploading that fast
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
can I have the kv scales? should be much smaller
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
so the operation we are doing doesn't actually care about model weight outputs. to explain, I must go back here
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
yep, the scales are basically a lot of numbers. say I happened to have a lot of GPU power - I can run everything at its full precision for a little bit, paid by the hour
riverfog7
riverfog76mo ago
how tho i saved the full model
JohnTheNerd
JohnTheNerdOP6mo ago
it's just a json file, kv_cache_something I think. kv_cache_scales.json https://docs.vllm.ai/en/v0.6.3/quantization/fp8_e4m3_kvcache.html
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
I would run thousands of prompts. I would check how much that fp8 kv cache actually differs from the fp16 kv cache, and come up with a set of numbers that, when multiplied with parts of the fp8 cache, get as close to the fp16 versions as possible. those are my scales
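In other words, something like this per K/V tensor (purely illustrative - the real implementation lives in llm-compressor's calibration observers, and the names below are made up):

import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def update_amax(running_amax: float, kv_activations: torch.Tensor) -> float:
    # fold one calibration batch's K or V activations into the running max magnitude
    return max(running_amax, kv_activations.abs().max().item())

def final_scale(running_amax: float) -> float:
    # dequantization then does: fp8_value * scale ~= original fp16 value
    return running_amax / E4M3_MAX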
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
it didnt save that maybe its fused
JohnTheNerd
JohnTheNerdOP6mo ago
correct interesting
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
yep, that's right
riverfog7
riverfog76mo ago
yeah
JohnTheNerd
JohnTheNerdOP6mo ago
I found this
JohnTheNerd
JohnTheNerdOP6mo ago
GitHub
vllm/examples/fp8/extract_scales.py at v0.6.6 · vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs - vllm-project/vllm
JohnTheNerd
JohnTheNerdOP6mo ago
GitHub
[Bug]: The FP8 models and FP8 KV-Cache-Scales loaded together faile...
Your current environment Collecting environment information... PyTorch version: 2.3.1+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.3...
JohnTheNerd
JohnTheNerdOP6mo ago
I believe they extract it from the model
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
JohnTheNerd
JohnTheNerdOP6mo ago
llm-compressor is by vllm too, I suspect it'll work fine. what happens if you don't set compressed=true, I wonder
riverfog7
riverfog76mo ago
I think it doesn't use the compressed-tensors format
JohnTheNerd
JohnTheNerdOP6mo ago
I see. in any case I will take a better look at the weights tomorrow - I should go to bed, it's 2am here lol. I suspect it's uploading on your end anyway
riverfog7
riverfog76mo ago
Yeah
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
I think it is fused. Got an error related to k_scales when saving with cpu offload last time
JohnTheNerd
JohnTheNerdOP6mo ago
ugh, any ideas how to extract it out?
riverfog7
riverfog76mo ago
riverfog7/Qwen2.5-72B-Instruct-FP8-KV
JohnTheNerd
JohnTheNerdOP6mo ago
yep, it's fused
riverfog7
riverfog76mo ago
"model.layers.14.self_attn.k_scale": "model-00006-of-00031.safetensors",
JohnTheNerd
JohnTheNerdOP6mo ago
I'll look in detail tomorrow
riverfog7
riverfog76mo ago
okay so
JohnTheNerd
JohnTheNerdOP6mo ago
yes
riverfog7
riverfog76mo ago
have to extract that 😄
JohnTheNerd
JohnTheNerdOP6mo ago
this can be extracted, just a pain
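One way to do the extraction without loading the whole 140GB model: read just the k_scale / v_scale entries out of the safetensors shards (paths and output layout here are illustrative; the kv_cache_scales.json format from the older vLLM flow linked above has its own schema on top of this):

import glob, json
from safetensors import safe_open

scales = {}
for shard in sorted(glob.glob("Qwen2.5-72B-Instruct-FP8-KV/model-*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for key in f.keys():
            if key.endswith((".k_scale", ".v_scale")):
                scales[key] = f.get_tensor(key).item()

with open("kv_cache_scales_raw.json", "w") as out:
    json.dump(scales, out, indent=2)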
riverfog7
riverfog76mo ago
how about just quantizing the model to int4 with fp8 kv cache and loading that instead
JohnTheNerd
JohnTheNerdOP6mo ago
I could do that, but I suspect it'll reduce quality significantly. I have a different idea... I'm thinking of just taking those scales and injecting them into the safetensors file for the awq quant, throwing that all in. AWQ is nice because it relies on calibration to determine the most important 1.5% of weights, then it leaves those at fp16, quantizing the rest to int4. maybe a support thread in the runpod discord isn't the best place to discuss this tho lol
riverfog7
riverfog76mo ago
No description
riverfog7
riverfog76mo ago
its actually better than AWQ
JohnTheNerd
JohnTheNerdOP6mo ago
interesting I should try it
riverfog7
riverfog76mo ago
its for qwen2 tho and you can calibrate while quantizing the weights
JohnTheNerd
JohnTheNerdOP6mo ago
that's true
riverfog7
riverfog76mo ago
like the kv cache yeah
JohnTheNerd
JohnTheNerdOP6mo ago
I'll do that, yes good thing I have the 17$ on my account lol that should be way more than enough
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
yeah actual data is in the .safetensors file
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
it's from the qwen docs i think. there is some llm benchmarking software, so maybe use that?
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
btw this thread has become VERY massive
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
KV cache calibration is finished GPTQ quanting left
riverfog7
riverfog76mo ago
No description
riverfog7
riverfog76mo ago
I think he will need more than 15$ for the quantizing
riverfog7
riverfog76mo ago
No description
riverfog7
riverfog76mo ago
No description
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
his gpu is 2x3090, fp8 doesn't fit in that
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
and the benchmarks said that gptq 4bit performs better than awq, so i went with int4 weights, fp16 activations, with GPTQ. it's int4 - W4A16
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
idk about that tho there's two types of quantization methods in llmcompressor
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
QuantizationModifier and GPTQModifier
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
idk whats the difference but i used GPTQModifier
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
okay its finished finally
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
fuck disk quota exceeded
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
(needs to wait another 5 hours)
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
yeah it's in a py file. the process got killed
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
I'm on 2xH200 now, much faster - iterations per second instead of seconds per iteration
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
about 1hr left, i hate myself. It sort of finished, but why is the safetensors file size similar to the original model if it is a 4bit quantized model? something's wrong
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
it saves by itself
No description
riverfog7
riverfog76mo ago
the recipe
No description
Unknown User
Unknown User6mo ago
Message Not Public
Sign In & Join Server To View
riverfog7
riverfog76mo ago
after uploading ill try loading with 2xA40 should work
JohnTheNerd
JohnTheNerdOP6mo ago
that doesn't sound right
JohnTheNerd
JohnTheNerdOP6mo ago
that quantization config looks wrong
riverfog7
riverfog76mo ago
its this
JohnTheNerd
JohnTheNerdOP6mo ago
I failed to figure it out lol
riverfog7
riverfog76mo ago
Lol
