What is the recommended GPU_MEMORY_UTILIZATION?
Most LLM inference frameworks, such as Aphrodite or OobaBooga's text-generation-webui, take a parameter that specifies how much of the GPU's memory should be allocated to the LLM.
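For reference, this is a minimal sketch of how that knob is typically exposed; it uses vLLM's Python API (which Aphrodite is forked from), and the model name is just a placeholder:

```python
from vllm import LLM

# Reserve ~90% of total VRAM for the model weights plus KV cache.
# 0.90 is the common default this question is about.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model
    gpu_memory_utilization=0.90,
)
```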
1) What is the right value? By default, most frameworks use 90% (0.9) or 95% (0.95) of the GPU memory. Why not use the full 100%?
2) Is my assumption correct that raising the allocation to 0.99 would improve performance, at the cost of a slightly higher risk of an out-of-memory error? This seems paradoxical: if the model doesn't fit into VRAM, I would expect the out-of-memory error to be thrown immediately at load time. Yet I have seen out-of-memory errors occur even after the model has successfully loaded at 0.99. Could it be that memory usage sometimes spikes above the configured allocation during inference, so a bit of buffer room is needed?
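For context, this is how I've been checking VRAM headroom while experimenting; `torch.cuda.mem_get_info` is a real PyTorch call, and the helper around it is just my own:

```python
import torch

def vram_headroom_gib(device: int = 0) -> float:
    """Return free VRAM in GiB as reported by the CUDA driver."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 2**30

# Check headroom before and after loading the model / generating,
# to see how close the allocation comes to the 0.99 ceiling.
print(f"free before: {vram_headroom_gib():.2f} GiB")
# ... load model / run generation here ...
print(f"free after:  {vram_headroom_gib():.2f} GiB")
```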