Runpod•2mo ago
akiratoya13

How to add more RAM to the Pod

Hi! Is there a way to set RAM manually? I mean, the current options are limited; for example, the max is 283 GB RAM, but I need 1 TB. How do I add more RAM? Thank you.
137 Replies
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
Is this what you mean? It's empty, I don't know why.. CPU pods are limited to only about 250 GB RAM.
(image attachment)
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
Ah okay, I finally understand the parameter for it. Thank you so much.. So.... for about $7/hour, I can try Kimi K2 for myself? hahahaha. And try the quantized version at a much cheaper price? :D:D Anyway, thank you!
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
APIs? Do you mean like OpenRouter or Chutes?
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
I see... I thought of using Runpod because I hit the rate limit on the free one. And if I'm paying anyway, I wonder whether an hour of Kimi usage on OpenRouter works out cheaper than an hour on Runpod, factoring in speed. Basically it comes back to price vs. value.. If Runpod is cheaper for 1 hour of full usage then it's really worth it (based on my tests over a couple of days and for my use cases). What do you think? If you say OpenRouter will be cheaper, then I won't bother trying, since it's also some work to make the template for the GGUF one first. << Never tried this; I also want to know if the GGUF one (from Unsloth) is good enough for Roo Code usage.
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
wow... This is really a hard question hahaha. Honestly I don't know, but I assume it's actually a lot, because when I tried Gemini Pro it used maybe about 80M tokens within 3 hours, so around 27M tokens per hour, or more depending on the task. I see there are cases that don't use many tokens, but others use so many that I didn't even expect it was possible. Based on your comment, I think it's wise for me to just pay for 1 full hour on Runpod to see how much it costs (I didn't want to do this at first). hahahah. But still, on OpenRouter total tokens matter a lot, while on Runpod they don't; that's another thing to consider. If we're using maximum tokens for a full hour, Runpod will of course be cheaper, right? I need maybe 50 t/s, since that's the speed I get when I use Roo Code (based on what the Chutes provider reports on OpenRouter).
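The per-token vs. per-hour comparison above can be sketched as a back-of-envelope calculation. The per-million-token price below is a made-up placeholder (check the provider's real pricing page), and the $7/hour figure is the pod price mentioned earlier in the thread:

```python
# Rough cost comparison: per-token API billing vs. flat hourly pod billing.
# The USD-per-million-tokens price is a hypothetical placeholder, not a quote.

def api_cost_per_hour(tokens_per_hour: int, usd_per_million_tokens: float) -> float:
    """Cost of one hour of usage on a per-token API."""
    return tokens_per_hour / 1_000_000 * usd_per_million_tokens

# Observed above: ~80M tokens in 3 hours, i.e. ~27M tokens/hour.
tokens_per_hour = 80_000_000 // 3

openrouter_cost = api_cost_per_hour(tokens_per_hour, usd_per_million_tokens=1.0)
runpod_cost = 7.0  # the ~$7/hour pod discussed earlier

print(f"API: ${openrouter_cost:.2f}/h vs pod: ${runpod_cost:.2f}/h")
```

At sustained heavy token usage the flat hourly pod wins; at light or bursty usage the per-token API wins. The break-even point is simply the tokens-per-hour rate at which the two costs match.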
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
I was also shocked by it.. I don't know how Gemini CLI worked at the time, but it really burns through tokens.. hahaha.
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
No no.. after I got billed about $35 within 4 days, I tried Roo Code right away, on the free tier, and honestly it's good. But these past 2 days I've been getting rate-limited a lot (maybe Chutes is doing the limiting, since OpenRouter says 1000 requests/day, so it should be okay, but it isn't). That's why, if I have to pay, I want no restrictions and of course low cost, since it's all for my personal research.
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
This is also hard to say, since it's a relative question. If it's 50 t/s, I feel that's already pretty good, and I'm okay with paying $1.5/hour (note: I'm not some crazy person who leaves this on 24 hours HAHAHA. I only use it when I'm at the computer, so 3 hours a day maximum). 50 t/s with 64K context already feels like enough. Even 30 t/s is okay. So if I can get it at a much, much lower rate, I'm even okay with 15 t/s 🤣. Sorry to be a cheapskate, but I'm not rich enough to burn money. Is it too much to ask?
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
When I said 50t/s I mean processing
akiratoya13
akiratoya13OP•2mo ago
Depends on the t/s. If it's acceptable, of course I'm okay offloading to RAM. On my local PC I tried it with offloading, but I can only run up to a 14B model at Q4, and with high context the speed gets sooo poor.
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
If you load a full 64K prompt at 50 t/s pp it would take ~21 mins. But I'm currently testing how fast it is on Nvidia; the 50 was on AMD.
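The ~21-minute figure is straightforward arithmetic: prompt tokens divided by prompt-processing speed.

```python
# Sanity check of the "64K prompt at 50 t/s pp takes ~21 min" figure above.
prompt_tokens = 64 * 1024   # 65,536 tokens
pp_speed = 50               # prompt-processing tokens per second

seconds = prompt_tokens / pp_speed
print(f"{seconds / 60:.1f} minutes")  # ~21.8 minutes
```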
akiratoya13
akiratoya13OP•2mo ago
yeah hahahaah! That's why I want to try the quantized one from the Unsloth team (the ~280 GB one).
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Which quant is that?
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF The IQ1_S one. But I also want to try the 2-bit one, since it should be much better with only a small difference.
Henky!!
Henky!!•2mo ago
I doubt those will be coherent. You'd be better off with a smaller model. Either way, my template is https://koboldai.org/runpodcpp and it can run these. Just link to the 00001-of file and it picks up the rest.
akiratoya13
akiratoya13OP•2mo ago
Yeah, for the full one, I already gave up once I saw the Runpod price to run it. Anyway, it's not impossible that I'd use it, just unlikely. If the quantized one is already enough, I won't use the full one at all.
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
1-bit tends to be incoherent
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Wouldn't be surprised if a good 100B beats it in 4-bit
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Hope my 32K context fits
(image attachment)
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Probably very slow 4xB200
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
hahaha. It's not all about agentic use though... hmm. I tried SmolLM3, and it's really good at tool calling etc., it can run Roo Code somehow, but..... it's not good enough for coding @_@.
Henky!!
Henky!!•2mo ago
I am testing it with Q4 though
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
I wanna see what the speed is on a beefy nvidia
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
My community doesn't like the distils
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
For code GLM / Devstral maybe?
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
Already tried... It's not good enough. So far I've tried many kinds: DeepSeek & R1, Devstral, Kimi, Qwen. And nothing beats Kimi K2 so far... It's really, really far ahead of the others. I think I've only felt performance like that with Gemini Pro 2.5; never tried Claude though (of course Gemini is better, but Kimi is cheaper and can be free, rate-limited 😄).
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
Devstral is ALMOST good... but not enough. Already tried it too :D:D
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
You won't be able to beat Kimi (maybe full-sized DeepSeek gets close), but the smaller ones are way cheaper.
akiratoya13
akiratoya13OP•2mo ago
Btw, @Jason you told me about 4x A5000 << the A5000 is the previous generation; is its performance still great though? If yes, it's quite cheap @_@... yeah yeah hahaha, I'm sorry hahaha
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
For the big models, no. If I find the MI300x too slow for Kimi, an A5000 certainly will be slow. I do have to test the MI300x again after AMD's PR lands.
akiratoya13
akiratoya13OP•2mo ago
I see...........
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
What's the speed that you thought was too slow, btw?
Henky!!
Henky!!•2mo ago
50 tokens per second prompt processing. Token generation speed was good at 30 t/s, but the 50 t/s pp is very slow; feels like a beefy CPU. But it's just a massive model: during generation it's a 32B, but during processing it's the full 1T.
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Yup
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
I see...........
Henky!!
Henky!!•2mo ago
(image attachment)
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Expensive benchmark though; I really hope this 32K fits. Not gonna download it a second time, it's $10 for the download in this config.
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
crazyyyyyyyyyy hahahaa. And here I just said I'd accept $1.5/hour.. lol
Henky!!
Henky!!•2mo ago
I am using 4xB200 because I want to speed-test the best-case scenario. It's $23 an hour.
akiratoya13
akiratoya13OP•2mo ago
So far, if there's no problem from OpenRouter, Roo Code + Kimi K2 is really, really more than enough. << But I always use test-driven development to make sure the development is high quality. That's maybe why the token usage is so high. lol, I want to cry.... hahahahaahha.
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Best case scenario, 3 times faster than MI300x
(image attachment)
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
Because it needs to start from zero code and a dummy test. It runs the test, and it fails. After that, it writes 1 basic test, then runs it again << fails again. Then it writes the simplest possible code to pass that test, and runs it again << if it fails, it fixes it until it works; once it works, it moves to the next cycle.
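The red-green cycle described above can be sketched with a trivial function: the test is first written against a deliberately wrong stub (red), then the simplest passing implementation replaces it (green). The `add` function and its test are purely illustrative.

```python
# Minimal TDD red-green illustration using plain asserts.

def add(a, b):        # stub: wrong on purpose (the "red" phase)
    return 0

def test_add():
    return add(2, 3) == 5

assert not test_add()  # red: the dummy implementation fails the test

def add(a, b):        # simplest code that passes (the "green" phase)
    return a + b

assert test_add()      # green: the same test now passes; on to the next cycle
```

Each cycle is one more test plus the minimum code to pass it, which is why the number of model/API round trips multiplies compared to writing the final code in one shot.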
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
Basically, I'm not letting the model do complex coding before its time, so it doesn't need super high intelligence. yeah.. So the API requests are about 3 times more than usual, at minimum I think.
Henky!!
Henky!!•2mo ago
They have a free kimi
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
Even with that beast, it only gets 27 t/s?? hahahahaah
Henky!!
Henky!!•2mo ago
Its a big model
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
I don't know why I cap out at 30t/s on all of them
akiratoya13
akiratoya13OP•2mo ago
yeah, this free Kimi is the one that I use.
Henky!!
Henky!!•2mo ago
Interesting the PP sped up a bit
(image attachment)
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Probably do
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
There is no batching. For batching you're gonna use stuff like vLLM, not koboldcpp.
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Probably; this isn't exactly what llamacpp/koboldcpp was designed to do lol. But it did work, so that's something 😄. And on the B200 at 32K single-user it was usable for me: 2 mins to build the cache.
akiratoya13
akiratoya13OP•2mo ago
Hmm. If I want to try vLLM + a GGUF model, I need to make a custom Docker image, right? Because I need a script to download the model first, then run the start command loading that downloaded model. Am I right?
Henky!!
Henky!!•2mo ago
vLLM isn't that good at gguf
Unknown User
Unknown User•2mo ago
Message Not Public
akiratoya13
akiratoya13OP•2mo ago
So, what should I use? Plain llama.cpp?
Henky!!
Henky!!•2mo ago
For gguf you can use the koboldcpp template
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Extremely easy to use; just don't expect Kimi to be economical.
akiratoya13
akiratoya13OP•2mo ago
This one can download even if it's not GGUF, by the way.. << I just read the documentation this afternoon.
Henky!!
Henky!!•2mo ago
(image attachment)
akiratoya13
akiratoya13OP•2mo ago
🤔 wait, let me check what it's about.. I've never read about koboldcpp..
Henky!!
Henky!!•2mo ago
We're based on llamacpp. Comes with an optional bundled UI and a super easy Runpod template.
akiratoya13
akiratoya13OP•2mo ago
Yeah, I already read this.. That's a good point; I haven't checked whether the Kimi GGUF download is 1 file or multiple files. lol
Henky!!
Henky!!•2mo ago
It's 13 files in the Q4. HF's upload limit is 50GB.
akiratoya13
akiratoya13OP•2mo ago
I see...... Ah okay, will check it out..
Henky!!
Henky!!•2mo ago
With KoboldCpp you give the first link and it automatically finds the others if you use a split. https://get.runpod.io/koboldcpp (customize before deploying). Has built-in download acceleration, no manual steps once deployed. And we have official support for Runpod.
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
😄
akiratoya13
akiratoya13OP•2mo ago
HAHAHAHAAH. nice.. So, can I download the model from the UI, or do I need SSH? And does it already support an OpenAI-compatible API? "Provides many compatible API endpoints for many popular webservices (KoboldCppApi OpenAiApi OllamaApi A1111ForgeApi ComfyUiApi WhisperTranscribeApi XttsApi OpenAiSpeechApi)" << oh, this, right?
Henky!!
Henky!!•2mo ago
Only yes to the last one
Henky!!
Henky!!•2mo ago
You edit these to pick a model
(image attachment)
Henky!!
Henky!!•2mo ago
In the args you can define the max context size (or reduce the layers for a partial CPU/GPU split). And of course, if you don't need them you can delete the image-generation one, etc.
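Putting this exchange together, a deploy config might look like the fragment below. The variable names `KCPP_MODEL` and `KCPP_ARGS` come from the thread itself; the model URL placeholder and flag values are illustrative, not recommendations (check the template's own defaults before changing anything):

```shell
# Example KoboldCpp Runpod template overrides (illustrative values only).
# KCPP_MODEL: link to the FIRST split file; the rest are found automatically.
KCPP_MODEL="https://huggingface.co/<org>/<repo>/resolve/main/<model>-00001-of-000NN.gguf"

# KCPP_ARGS: context size, GPU layer offload, and backend flags.
KCPP_ARGS="--contextsize 32768 --gpulayers 99 --flashattention --usecuda"
```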
akiratoya13
akiratoya13OP•2mo ago
Where can I see all the parameters available?
Henky!!
Henky!!•2mo ago
Like the screenshot or every KCPP_ARGS variable?
akiratoya13
akiratoya13OP•2mo ago
Oh, I mean like a link to the documentation, so I know all the parameter names and what they do. hahahahahaha. Or is that already all of them?
Henky!!
Henky!!•2mo ago
These are for the KCPP_ARGS
Henky!!
Henky!!•2mo ago
The defaults are quite complete though
akiratoya13
akiratoya13OP•2mo ago
Sorry >,<.. I mean the parameters for the Kobold pod itself. hahahaah. These "KCPP_MODEL", "KCPP_ARGS", etc...
Henky!!
Henky!!•2mo ago
All of them are there except for KCPP_MMPROJ
akiratoya13
akiratoya13OP•2mo ago
Ah okay... Thank you thank you :D.. Will try it soon.
Henky!!
Henky!!•2mo ago
Should be a breeze to set up for GGUF, as long as what you're doing fits the specs.
akiratoya13
akiratoya13OP•2mo ago
Henky!!
Henky!!•2mo ago
The full link of part 1
akiratoya13
akiratoya13OP•2mo ago
Sorry for asking more: do I need to use --usecuda? The doc says "if you're on Windows", but on Runpod we're on Linux. Does this pod already accept "--flashattention"? Ah, I see.
Henky!!
Henky!!•2mo ago
If you don't intend to put everything on the GPU, you also need to customize --gpulayers, but I don't know the optimal partial offload for this model. --usecuda you need, yes; it forces Nvidia GPU mode. --flashattention should be specified in KCPP_ARGS already.
akiratoya13
akiratoya13OP•2mo ago
From the doc, I read I can just put "-1" so it automatically offloads what it needs. Okay okay, thanks!!!
Henky!!
Henky!!•2mo ago
Don't do that. It can currently only see the VRAM of one GPU when it picks them, so it will put too few. -1 is meant for home users.
akiratoya13
akiratoya13OP•2mo ago
I see.. Okay... let's say I have 10GB of VRAM in total, and the model is 40GB. How do I know the max GPU-offload value to set? << I never knew how to calculate this.
Henky!!
Henky!!•2mo ago
10GB VRAM on a single GPU?
akiratoya13
akiratoya13OP•2mo ago
Let's say it's across 4 GPUs, to make the example more relevant. hhahaha
Henky!!
Henky!!•2mo ago
We usually trial and error. Let's say it has 40 layers, to keep it simple. You're 4 times over, so then it's around 10 layers. MoEs are a bit of a special case: Unsloth recommends keeping it at 99 but using override tensors. --overridetensors ".ffn_.*_exps.=CPU" is one they say you can try.
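The trial-and-error rule above amounts to scaling the layer count by the fraction of the model that fits in VRAM. A rough first-guess helper (the function is illustrative; real offload still needs tuning, and MoEs follow the override-tensors approach instead):

```python
# First guess for --gpulayers: scale layers by how much of the model fits.

def gpulayers_estimate(total_layers: int, model_gb: float, vram_gb: float) -> int:
    """Starting point for --gpulayers; refine by trial and error."""
    if vram_gb >= model_gb:
        return total_layers  # everything fits: offload all layers
    return int(total_layers * vram_gb / model_gb)

# The thread's example: a 40-layer, 40 GB model with 10 GB of total VRAM.
print(gpulayers_estimate(40, model_gb=40, vram_gb=10))  # 10
```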
Henky!!
Henky!!•2mo ago
(image attachment)
Henky!!
Henky!!•2mo ago
For us -ot is --overridetensors
akiratoya13
akiratoya13OP•2mo ago
Oh wow... Okay2. Will read that first too...
Henky!!
Henky!!•2mo ago
That puts specific tensors on the GPU so it's more performant. Thing is, on Runpod, because RAM and VRAM are tied, it's not always a good idea. MoEs usually consume a lot of RAM; non-MoEs only consume RAM for what's not on the GPU.
akiratoya13
akiratoya13OP•2mo ago
hooo.. I see, I see.. Okay, I think I'll need a little trial and error. At least I'll post the results here and whether it's good enough to run Roo Code :D.
Unknown User
Unknown User•2mo ago
Message Not Public
Henky!!
Henky!!•2mo ago
Personally I doubt it. Roo Code switches prompts all the time, so there's a ton of reprocessing. You'll get long delays, and tunnel timeouts if you don't edit the template so you can connect to port 5001 over TCP. And I think in general, unless you hyper-optimize it, it will be more expensive than API providers. A model this large single-user just isn't economical. For the smaller ones like GLM / Devstral, Runpod + koboldcpp works very well. Same if you use a 100B.
akiratoya13
akiratoya13OP•2mo ago
I see..... >,<................. noted....
