Pod ran out of CPU RAM
I somehow managed to run out of RAM (not VRAM, system RAM)... right after a very compute-heavy operation (calculating quantized KV-Cache scales)... while running
model.save_pretrained
... while the weights are still in VRAM... The pod is still running, but completely unresponsive.
Now that you're done laughing at my misfortune, is there anything at all I can do to save those weights? Even enabling some swap would be completely fine... I just want the weights to save to the networked drive...
Pod ID: tybrzp4aphrz3d351
Replies
You should contact support on their website without terminating the pod
OK - thanks. Hopefully they get back to me soon...
If the process got killed, there is no way to recover data soo
I know that the process is alive and the data is still stored in VRAM. I ran into similar issues with local containers that ran out of memory, simply adding some memory (whether it's RAM or swap) will immediately bring it back to life. It's merely thrashing as it tries to clear the disk cache while new data is being written to.
Still don't know how it managed to eat that much, the weights are 140GB and I have 283GB of RAM...
Wow, if it's an H100 you are burning money fast
Hope support reaches you soon
It's a B200. I'm burning more money than I'd like.....
Lol
It would be very funny if it wasn't my pod lol

Maybe 2 instances of the model loaded to system RAM?
That's very possible. I guess it might be trying to load it to RAM while it writes to disk or something
Sad part is that the file I want is only a few megabytes, but the only way to get it is to call
model.save_pretrained
Ohh you running quantization?
Not quite. I'm calculating quantized KV scale factors. The idea is to be able to quantize the KV-cache down to 8 bits while losing very very little in accuracy.
You can take a bit out of the exponent and give it to the mantissa, making the kv-cache values e4m3 (with one sign bit) instead of e5m2. However, this shrinks the numerical range you can represent, since you just removed a whole bit from the floating point exponent. If you happened to have some magic numbers you can multiply the kv-cache values by, calibrated by running thousands of inferences without any quantization on a very powerful GPU... you still wouldn't quite get back to non-quantized quality, but you'd get quite close
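(For reference, a minimal PyTorch sketch of what a per-tensor scale does for an e4m3 KV-cache. This is only an illustration, not the actual calibration pipeline, and the tensor here is made up:)
import torch

kv = torch.randn(4, 128, dtype=torch.float16) * 20        # stand-in for some fp16 KV-cache values
E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max            # 448.0, a much smaller range than e5m2
scale = kv.abs().amax().float() / E4M3_MAX                  # the calibrated "magic number" (here from one tensor)
kv_fp8 = (kv.float() / scale).to(torch.float8_e4m3fn)       # quantize into e4m3
kv_dq = kv_fp8.float() * scale                              # dequantize at attention time
print((kv.float() - kv_dq).abs().max())                     # the error stays small because of the scale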
yeah saw that on vllm docs
Unfortunately I do not have 140GB of VRAM at home to go calculate my own scale factors lol
doesnt it work on cpu?
... I don't have 140GB of RAM, either. It's also painfully slow on CPU, and Flash Attention won't work. AFAICT Qwen really wants Flash Attention - people are saying the model breaks pretty badly without it
maybe the "running thousands of inference without any quantization on a very powerful GPU" part is a bottleneck
if you can run it on CPU you can always rent some high mem machines
It's not very different than simply running the model a few thousand times. But that's not very fast when you are running a 70b at full precision lol
yeah
Apparently they're based in New Jersey, and it's 11:30PM there
... maybe I should just stop the money burning
i think so too
Well, I tried lol
I have a question: if it is running inference over and over can it do the calibration layer by layer?
It can, actually, yeah
That's a very good point
Qwen 2.5 has 80 layers... One layer at a time would probably easily fit on a GPU I have at home
probably hurts to implement tho (if there is no implementation)
There is definitely no implementation. Even the code in vllm's docs is broken lol
I had to modify it to get it to work at all
Then I had the rude awakening of "you can't do this with a quantized model"... and here we are
lol
no multi-gpu implementation either?
Nope
sad
Well, maybe actually
Not that it matters with 140GB of weights lol
[GitHub link preview: vllm-project/llm-compressor - Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM]
They just pass in an
AutoModelForCausalLM.from_pretrained
model to the library
https://github.com/vllm-project/llm-compressor/issues/965
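(For context, the library's README-style flow is roughly the following. This is only a sketch: import paths and argument names have moved around between llm-compressor versions, and the dataset choice is a placeholder:)
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot                      # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-72B-Instruct", torch_dtype="auto")
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(
    model=model,                       # the plain from_pretrained model is handed straight to the library
    dataset="open_platypus",           # calibration dataset (placeholder choice)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen2.5-72B-Instruct-W4A16",
)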
maybe this is similar to your case
[GitHub issue preview: "The new version 0.3.0 takes a long time for quantization and eventu..." - used the sample code (W8A16) to quantize THUDM/glm-4-9b-chat-hf on an Nvidia 4090; the entire process was very slow (nearly 24 hours), with extremely high memory usage, to th...]
Interesting - that allows multi-GPU. I wonder if I could implement some sort of per-layer processing...
It would be miserably slow for sure, especially since I can't do much batching without seriously modifying the library
[GitHub link preview: llm-compressor/examples/big_models_with_accelerate/README.md at mai... - Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM]
Hmm... combining them all, I just might be able to fit it in a lot of small GPUs... saving lots of money. Probably still un-doable at home though since I don't think I can pass a quantized model through at all
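(A rough sketch of the "many small GPUs plus CPU offload" loading that README describes, via transformers/accelerate. The memory budgets below are made-up placeholders:)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct",
    torch_dtype="auto",
    device_map="auto",                                   # let accelerate spread layers across devices
    max_memory={0: "40GiB", 1: "40GiB", 2: "40GiB", 3: "40GiB", "cpu": "180GiB"},
)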
Thank you!
try 4x A40s
192GB VRAM and about 1/6 of B200 pricing
How can one get more RAM in runpod?
even if it's swap
My understanding is that because I'm in a container, I can't just add in swap
Same for me
[GitHub issue preview: "OOM during save_pretrained of compressed model" · Issue #1183 · vl... - the OOM was for CPU RAM; GPU RAM usage was normal, the model takes up less than half of the GPU; this was hitting llmcompressor's modified save_pretrained_wrapper]
same issue
looks like quanting on cpu should be possible
Yes, I can see the same frustration in the comments section lol
Maybe I'll just go make an EC2 instance with a lot of EBS storage, enable swap, and go away for a month lol
Probably cheaper...
try these
Hm? I didn't see any suggestions in the GitHub issue
they are cheap with spot requests

no GPUs tho
I suspect if I'm going CPU, I can go much much cheaper
yeah and if you are going spot, look for spot savings=90%
noone is using them and they dont get terminated as often
Spot seems iffy. I use AWS at work and have been evicted before - especially for long-running workloads
But nothing stops me from getting like a c7a.medium for a month, just letting it churn all day all night, with some EBS as swap
thats right
and go with instance store rather than EBS if that's possible
That's true - it's gonna be a lot faster
NVME powerr
Yes lol
Anyway thanks a lot for your help! Although the results are gone, hope my mistake at least gave people a laugh lol
I have the same experience with 70B models on an H100
can relate
Filter out when you create a pod
Well I feel kind of sad for your loss of progress
I... couldn't find enough RAM. Maybe it speaks to my horrifying setup, but the pod I was on had 260 something gigabytes and I OOM'd it...
I do too, but such is life. I want to re-run it regardless but I need a genuinely stupid amount of RAM to assure myself this will never ever happen again lol
Hmm I think the only way is more GPUs
My guess is that torch tried to copy the weights to RAM... twice. No idea why it would happen. Seeing I am working with a 72b at bf16, that's 288+GB of RAM I need
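(The back-of-the-envelope math for that guess:)
params = 72e9            # Qwen2.5 72B
bytes_per_param = 2      # bf16
copies = 2               # suspected duplicate copy in system RAM during save_pretrained
print(params * bytes_per_param * copies / 1e9)   # ~288 GB, just over the pod's 283 GB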
Or other gpu types
I had a B200, VRAM isn't an issue. System RAM is
It can even be swap tbh, but since I'm in a container, I can't have my own swap. Someone has to give it to me during the docker run command. And runpod doesn't have such an option sadly...
Yeah I mean try sliding right that gpu count slider
And then you'll see the pod will have more RAM, well, if you use it for RAM only it'll be wasteful too
That's a good point
I could get lots of cheaper GPUs and tensor parallelize, but the RAM was what killed my workflow from the start
I think there is shm
In /dev/shm, don't know if that's usable for you
Check your training script again heheh
I suspect not. It's all abstracted away from me - torch is what eats the RAM
the line that killed my pod was
model.save_pretrained()
and it's hard to avoid that lol
Can chatgpt provide a reasonable explanation of why?
Maybe it can explain hf's code lol
Possibly lol I should ask after work
Maybe it's got to do with training and then saving it
The training code is rather brief - nothing crazy, some open source code from vllm that I modified to work with another LLM
It's not even training code - it's quantization code
How long did it take you?
Ic
Calibrating?
A few hours on a B200, with another few spent on various failures
I'm getting scales for a kv-cache quantization. I run LLMs at home and I need to quantize my kv-cache down to 8bpw with minimal loss
https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache.html except the example code is broken
Ways to go bankrupt fast
Oh people actually do use high-end gpu for that
You need to run inference on unquantized model for it... and I am running a 72b at home
Aws does it on an 8xH100
You don't need a high-end GPU for vllm. Well, I don't. But I have an insane setup lol
Ooh
For which of their models
This runs qwen 2.5 72b with 14k context on 2x3090, with the nice PagedAttention that lets you serve many people at once:
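(The attached setup isn't preserved in this log; a hypothetical vLLM config in that spirit, with the model path and numbers assumed, might look like:)
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",   # 4-bit AWQ quant
    tensor_parallel_size=2,                  # split across the two 3090s
    max_model_len=14336,                     # ~14k context
    kv_cache_dtype="fp8_e5m2",               # the fp8 KV-cache being discussed
    gpu_memory_utilization=0.95,
)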
What about the new llama... oh wait, even a pair of 4090s doesn't run it
That's cool
Too big and not worth the extra VRAM. 104B for what roughly benchmarks the same as qwen 2.5/llama 3.3
Ohh
I could squeeze even more quality out of this poor server if I could get the KV-cache scales
Unfortunately for that I need to put a lot more money in my account lol
Maybe next paycheck... If I can sanely get the RAM...
I have a question
Why do you need to run a full model for that kv cache scaling
Has no one else even done this for qwen before?
If you are running a quantized model
I found no example of such
I am just too short for fp16
I can run an AWQ quant of a 70b with 13k context on fp16 kv-cache. But those 2 billion extra parameters make it not fit at all
I am 900MB short, and going below AWQ is a significant hit in answer quality. I can get 14k context on fp8 with a 72B model, but at that point I have another choice: mantissa bits vs exponent bits
I currently run with 5 exponent bits and 2 mantissa bits. It visibly impacts quality. If i can get the scales, I can cut out another bit from the exponent and give it to the mantissa, while still being very close to a fp16 KV-cache
... it's completely insane, I've been working on this setup for years
Hmm
Years of /r/LocalLLaMA lol
I am still amazed that we can run something in our house that somewhat rivals cloud LLM's
Imma try the cache scaling
It's expensive on large models
Definitely make sure you have more than 2x system RAM vs your weight size lol
Don't make the mistake I made
Would you like my script?
Sure
Sure
(Proceeds to try it on an A40)
installing flash-attn takes a long time
The MAX_JOBS I set is for the RAM I have. Might OOM your system
(it took over 100GB ram during compilation afaict)
A networked drive is super useful. You can use a CPU instance to download model weights into /workspace, and set up a virtual env to run the pip commands without eating precious GPU machine hours.
Probably because the venv is in a network volume
No it is just a stupidly compute heavy process. I didn't use the networked venv for it, hindsight is 20/20
It eats up a huge amount of RAM. Because of RunPod's systems you see a huge number of CPU cores available, which causes ninja to run lots of tasks and makes you OOM, so MAX_JOBS is a must. I found that it ate 16 CPU cores consistently for 30ish minutes - hence I recommend the networked venv
you should install a prebuilt wheel
I couldn't find one that works
Maybe it's the B200
https://github.com/mjun0812/flash-attention-prebuild-wheels exists but is not for CUDA 12.8
[GitHub link preview: mjun0812/flash-attention-prebuild-wheels - pre-built flash-attention package wheels via GitHub Actions]
u need a 4bit model with 8bit kv cache ig?
@JohnTheNerd (sort of) good news
I think it works with 1xA40
yep
how?
that won't even fit a single layer
bad news is

with CPU offload turned on
maybe it will work with your local machine
oh... yeah...
what's the speed like?
better(sort of) now

how many seconds per iteration?
idk cuz it didnt even complete 1 iteration
hahahahahaahaha
to be fair only about 3 minutes have passed
assuming 5 minutes per iteration, that's... over a week for 2048 samples
maybe ill try with 4xA40s
lol

flash attention cannot run on meta device so it will be slower than that
Create a new pod
About 10sec per iteration with 4x 3090
5hrs
Total
that's actually really good
how much RAM do you get on that pod?
Try 4090
I'm broke
200gigs?
Im gonna pray for no OOM
I think you'll get an OOM
I had more and I got an OOM
What is the process that you're doing called?
I'm not sure it has an official name. I'm collecting KV-cache quantization scaling factors
the vllm link above has more information
Okay thanks!
seems like its this part

yes it is. but the code there didn't work for me. see my script above for what does work
at least until the OOM lol
Im trying with 32 samples
To see if it saves
that makes sense
the OOM doesn't kill your process. it freezes the entire pod
Okk
Currently praying
I'll get a pod of my own and keep trying once I get paid
until then, I'll go to sleep since I work in the AM lol
I can't save either because of a bug
it complains when a model is offloaded
trying this


that is the maximum usage
so you probably needed like 10 more gigs of ram
is this all you changed for it to work with multi GPUs?
the code
and your model config(?) part is wrong
you need to quant it to 4bit int for it to fit in 2xRTX3090
I was just hoping to get the kv cache stuff. I use the AWQ quant because it's much much better than a straight 4bpw quant
maybe I don't even need to quantize the model lol
yeah you can do only kv cache quants
its on somewhere at the llmcompressor repo
in their test suite
[GitHub link preview: llm-compressor/tests/e2e/vLLM/recipes/kv_cache/default.yaml at main... - Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM]
here
idk why they hid it so deep
that's sure hidden deep
interesting
thanks!
That's for tests.. Maybe it should be documented in vllm
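(For anyone following along, a sketch of driving that KV-cache-only recipe through llm-compressor, modeled on the vllm quantized-kvcache docs. The import path, dataset, sample count, and output directory are placeholders that may differ by version:)
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot   # older releases: from llmcompressor.transformers import oneshot

recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      kv_cache_scheme:
        num_bits: 8
        type: float
        strategy: tensor
        dynamic: false
        symmetric: true
"""

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct", torch_dtype="auto", device_map="auto")
oneshot(
    model=model,
    dataset="open_platypus",            # placeholder calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen2.5-72B-Instruct-FP8-KV",
)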
if pod is using too much ram it will throw oom errors
it does not. it just completely freezes
The process doesnt get killed and just freezes the entire thing
Yup
With no errors
the entire pod simply freezes - no OOM errors. I do wish we could have swap in any way... Docker supports it, one would only need it implemented in the docker run command runpod executes. the fact that system RAM is limited by GPUs without any way of swapping is extremely limiting :/
You can deploy pod with higher ram
it's especially compounded by having a limit of 6 GPUs
Use filter option
even the B200 system doesn't have enough RAM for this workload. and the RAM is simply used once, at the end, not even continuously
the only way is to get 6 GPUs which is very wasteful when you just need system RAM
For B200 you can get 283 GB RAM
and even then, you cap out at some point. if you wanted to do this on slightly larger models, say Mistral Large, you're out of luck
yes. that was my pod, which froze
hence I wish there was some way to swap - RAM is expensive, swap is cheap. obviously I have to pay up for it, but paying for a second B200 hurts when all you want is RAM lol
If it requires more than that it could be problematic. It's not that simple, as swap basically uses SSD storage, causing faster wear
that's very fair - I appreciate the honesty
So in both cases there is a technical downside. And usually people rent pods for GPUs with lots of VRAM
I think I am the only person who needs both lol
it's because of such a stupid bug, too...
What kind of bug? Tried submitting an issue on their GitHub?
model.save_pretrained tries to write the weights to RAM. twice.
you can imagine the joy it is to find that out with 150gb of weights sitting in VRAM
I'm guessing it's deep in the transformers library - which is what loads the weights initially. I suspect there's no chance a GitHub issue would be seen lol
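(If it really is an extra host-side copy, one heavily hedged workaround sketch, untested at this scale, is to build the single CPU copy yourself and write it with safetensors instead of calling save_pretrained. This skips the config/tokenizer and sharding, and the output path is hypothetical:)
from safetensors.torch import save_file

def save_weights_one_copy(model, path="weights.safetensors"):
    cpu_state = {}
    for name, tensor in model.state_dict().items():
        # pull tensors off the GPU one at a time, so only one CPU copy ever exists
        cpu_state[name] = tensor.detach().to("cpu").contiguous()
    # note: save_file refuses tied/shared weights; those would need deduplicating first
    save_file(cpu_state, path)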
Diffusers? Or something else?
transformers
Cpu offloading breaks save too
also can't have flash attention with cpu offloading
my understanding from qwen 2 (not necessarily 2.5) is that it really, really likes flash attention
heard many reports of broken output without flash attention
So what are you doing?
Quantization of KV cache
To fp8
short version: I'm trying to get some magical "scales" to quantize my kv cache more optimally
With scale factors
[GitHub issue preview: "safetensor/mmap memory leak when per-layer weights are converted do..." - while working on GPTQModel, which does gptq quantization of hf models and loads each layer onto the GPU, quantizes, then moves it back to CPU for VRAM reduction, we noticed a huge cpu m...]
this requires me to run inference on the whole model in fp16 thousands of times to calibrate a set of scalars
that's interesting but I suspect is not the issue I have. I don't have any issues moving weights to the GPU, and do not convert dtypes at all
the entire process runs just fine. right at the end when I call save, it eats the entire system RAM
Anyway late here so bed time for me
fair enough - maybe I'll get another pod today and try again with a lot more RAM this time
I'll post here how it goes lol
8x 3090
100percent works
can you get 8?
i thought cap was 6
Cuda oomed it tho
huh
I'll give it a shot
oh
cuda oom'ed it?
So 8 should work
Yeah for 7x3090
oh ok that's during weight loading
also I'll have flash attention which saves a bit
No it happened while quantizing and i had flash attn on
huh, ok then 8 should work. lots of RAM too
Use the wheels here it works well
perfect thank you!
No build time magic
I'll share the scales if I get it working
I wasted at least an hour of B200 time on just this lol
only 4090 can give me 8 at a time it seems. still workable - and a whopping 880GB RAM which should definitely be enough
Yeah you need about 300gigs
btw why do you use GPTQModifier in the quant instead of only kv_cache_scheme
im using this now

he uses AWQ which llmcompressor does not support
GPTQ was just for testing
ohh ic thanks
oh now to fp8?
what model are you doing
kv cache to fp8, weights to int4 (with AWQ)
Qwen2.5-72B-Instruct this one
ooh is the process still running?
Yes

ah using what gpu?
8x Asomething
With 24gig vram
ohh a5000*?
I think thats right

Seems like a good deal
actually it's not my money
Ohh
That's nice
its saving now

Wohooo
will you publish it
how long did it take in total
yes
maybe about 1.5hrs?
oh quite efficient
it says 1hr 20min for quantizing only
pod uptime is 2hr due to model downloading and installing dependencies
ic
and bc of my stupidity
yeah still faster than a few hours in b200 wow
i selected the wrong version
of pytorch
saving takes a lot of time tho
./models/Qwen2.5-72B-Instruct-W4A16-FP8-KV this should be
./models/Qwen2.5-72B-Instruct-FP8-KV
what's the difference? what's the second one?
so who paid for this run haha
W4A16 means weights quantized to 4bit and activations 16bit, but I didn't quantize any weights
my (previous) school ig?
woah
got about 500$ for research funds
ig? why i guess
noicee
technically it's the school's property but only I can use it
hahah okay i see
its writing to disk now
almost done
ooo awesome! i just filled my runpod account with 15$ without checking lol
I can do another model for you if you want
its finished now
Another time hahah
meanwhile I did something that may be useful. I'm running a benchmark suite on my qwen setup. I will re-run it with the scales too
Did you estimate after this quant it'll run on your home server or what
unfortunately because it's local it's slooooow lol - 12000 prompts to run on two 3090s
I'm already running fp8 kv cache - just with e5m2
that's what benchmarks are running on
Oh is it a bigger version of this?
no
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      kv_cache_scheme:
        {num_bits: 8, type: float, symmetric: true, strategy: tensor}
is this right tho?
yes it is
what does symmetric mean
It's fp8 kv cache too?
yes
e5m2 is one sign bit, 5 exponent bits, 2 mantissa bits
And this?
e4m3 is one sign bit 4 exp 3 mantissa
you can choose to do e4m3 instead. but the exponent in a float determines the range of numbers you can represent, which makes e4m3's range awful
this helps e4m3
I have the response from openai's model if you want lol
I don't know what symmetric is so I'm curious too lol
So this one is e4m3?
Explain what's the effect of the exponent and mantissa bit counts lol
the idea is that you can try to have a list of numbers you multiply the kv-cache entirely by. this lets you get a little closer to fp16 even with the less range
let's consider a floating point number.
-35x10^6
the minus is the sign bit. plus or minus. we're left with 7 bits
the 35 would be the mantissa. and the 6 would be the exponent
(roughly, so)
strategy: tensor
    Indicates that the quantization is applied at the tensor level (as opposed to, say, per channel), meaning the same quantization parameters might be used for entire tensors.
dynamic: false
    This means that the quantization parameters (such as scaling factors) are fixed after calibration rather than being computed on the fly ("dynamic quantization"). Fixed parameters can sometimes lead to more stable behavior.
symmetric: true
    Signals that the quantization should be symmetric around zero. In symmetric quantization, the "zero point" is fixed to zero, and the range is symmetric (e.g., -X to +X). This can simplify the arithmetic and sometimes improve performance.
if I only have two bits for the mantissa, I cannot represent 35 anymore. I must round it to a number I can represent
ic, still hard to understand, maybe I'm missing the bits background. it's ok, let me chatgpt it to dig deeper
thanks!
the 6 is the exponent. this effectively determines the range in which I can represent numbers - as I cannot, say, have a 10^500 with a 4-bit exponent
since 500 doesn't fit in 4 bits
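(The range difference in concrete numbers, via PyTorch's fp8 dtypes:)
import torch
print(torch.finfo(torch.float8_e5m2).max)     # 57344.0 -> huge range, coarse mantissa
print(torch.finfo(torch.float8_e4m3fn).max)   # 448.0   -> small range, finer mantissa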
oooooo
thank you!
I'll try it out after the benchmarks run
wow in s3?
thats only the log and code files
uploading the model now
i thought it was the model
no way 140gigs is uploading that fast
yes way, lucky connections
can I have the kv scales? should be much smaller
wait you can do that?
so
the operation we are doing doesn't actually care about model weight outputs
to explain I must go back here
the output isn't the whole model? it's just the scales?
yep
the scales are basically a lot of numbers
say I happened to have a lot of GPU power. I can run everything with its full precision for a little bit, paid by the hour
how tho
i saved the full model
it's just a json file
kv_cache_something i think
kv_cache_scales.json
https://docs.vllm.ai/en/v0.6.3/quantization/fp8_e4m3_kvcache.html
ahh i see
I would run thousands of prompts
I would check how much that fp8 kv cache actually differs from the fp16 kv cache
and come up with a set of numbers, when multiplied with parts of the fp8 cache, get as close to the fp16 versions as possible
those are my scales
in short, it's batch-processing-like, right?
it didnt save that
maybe its fused
correct
interesting
yeah because the code is like
means it saves the whole modified model, doesn't it?
yep, that's right
yeah
I found this
[GitHub link preview: vllm/examples/fp8/extract_scales.py at v0.6.6 · vllm-project/vllm - a high-throughput and memory-efficient inference and serving engine for LLMs]
[GitHub issue preview: "[Bug]: The FP8 models and FP8 KV-Cache-Scales loaded together faile..." - environment: PyTorch 2.3.1+cu121, CUDA 12.1, Ubuntu 22.04.3...]
I believe they extract it from the model
but yeah that isn't using llmcompressor... yeah nvm, I don't really know how this extractor thing works
llm-compressor is by vllm too. I suspect it'll work fine
what happens if you don't set compressed=true I wonder
I think it doesn't use the compressed-tensors format
I see
in any case I will take a better look at the weights tomorrow - I should go to bed it's 2am here lol
I suspect it's uploading on your end anyway
Yeah
[GitHub link preview: llm-compressor/src/llmcompressor/args/README.md at c5dbf0cdb1364c40... - Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM]
I think it is fused
Got an error related to k_scales when saving with cpu offload last time
ugh, any ideas how to extract it out?
riverfog7/Qwen2.5-72B-Instruct-FP8-KV
yep, it's fused
"model.layers.14.self_attn.k_scale": "model-00006-of-00031.safetensors",
I'll look in detail tomorrow
okay so
yes
have to extract that
this can be extracted
just a pain
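(A rough sketch of pulling the fused k_scale / v_scale values back out of the safetensors shards. The local directory name and output file are hypothetical, and this flat dump is not the exact kv_cache_scales.json schema vLLM expects:)
import glob, json
from safetensors import safe_open

scales = {}
for shard in sorted(glob.glob("Qwen2.5-72B-Instruct-FP8-KV/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            if key.endswith(".k_scale") or key.endswith(".v_scale"):
                scales[key] = f.get_tensor(key).item()   # per-tensor scales are scalars

with open("kv_cache_scales_raw.json", "w") as out:
    json.dump(scales, out, indent=2)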
how about just quantizing the model to int4 with fp8 kv cache and loading that instead
I could do that but I suspect it'll reduce quality significantly
I have a different idea...
I'm thinking of just taking those scales and injecting them into the safetensors file for the awq quant
throwing that all in
AWQ is nice because it relies on calibration to find the most important ~1% of weights and protect them (by scaling them up before quantization), so quality survives going to int4
maybe a support thread in the runpod discord isn't the best place to discuss this tho lol

its actually better than AWQ
interesting
I should try it
its for qwen2 tho
and you can calibrate while quantizing the weights
that's true
like the kv cache
yeah
I'll do that, yes
good thing I have the 17$ on my account lol
that should be way more than enough
I don't see the scales in the model index. isn't it just an index referencing layers to the model file splits?
I wonder how the scales file look like
yeah actual data is in the .safetensors file
Ic
Are there scripts to run these benchmarks?
its from the qwen docs
i think there is some llm benchmarking software so maybe use that?
Imma look that up and try it someday lol
btw this thread has become VERY massive
Hahah no worries right
Great job
KV cache calibration is finished
GPTQ quanting left

I think he will need more than 15$ for the quantizing


Btw why did you choose the gptq quant instead of, like, fp8 or anything else
his gpu is 2x3090
fp8 doesnt fit in that
int4
and the benchmarks said that gptq 4bit performs better than awq so
i went with int4 weights fp16 activations with GPTQ
its int4-W4A16
oh the int4 is using gptq in vllm docs
idk about that tho
there's two types of quantization methods in llmcompressor
i see
hmm what is the other one?
QuantizationModifier and GPTQModifier
meta's ai is free in whatsapp wow
idk whats the difference but i used GPTQModifier
i think gptq is more complicated
okay its finished
finally
Yay
fuck
disk quota exceeded
Oof
Delete some other model
(needs to wait another 5 hours)
Hmm What for?
You mean repeat the process?
yeah
its on a py file
the process got killed
The importance of error handling
that's sad
Im on 2xH200 now
much faster
iterations per second instead of seconds per iteration
Hahah
about 1hr left
i hate myself
It sort of finished
but why is the safetensors file size similar to the original model if it is a 4bit quantized model
something's wrong
Howd you save it
it saves by itself

the recipe

Ah try load it then
after uploading
ill try loading with 2xA40
should work
that doesn't sound right
that quantization config looks wrong
its this
I failed to figure it out
lol
Lol