Hey Runpod Support team — hitting OOM on LLaMA 3.1 70B Serverless and looking for advice from anyone who's solved this.
Setup:
- Endpoint: Serverless, vllm/vllm-openai:v0.8.0, custom handler.py
- GPU: A100 80GB
- Model: BF16 Meta LLaMA 3.1 70B stored on a network volume at /runpod-volume/models/llama-3.1-70b
- LoRA adapter (rank=64, alpha=128) at /runpod-volume/llama-ea-finetuned
- max-model-len: 8192, gpu-memory-utilization: 0.95
Error: CUDA OOM at 78.65GB during weight loading. The model is ~140GB in BF16 and the GPU has 80GB, so we know it doesn't fit as-is.
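For anyone hitting the same wall, here's the back-of-envelope math that convinced us quantization is required (param count rounded to 70B; the 80 layers / 8 KV heads / head_dim 128 figures are from the published Llama 3.1 70B config):

```python
# Rough memory math for Llama 3.1 70B on a single 80GB GPU
params = 70e9  # rounded; actual count is slightly higher

bf16_gb = params * 2 / 1e9   # 2 bytes/param -> ~140 GB, can't fit in 80 GB
awq_gb = params * 0.5 / 1e9  # 4-bit weights -> ~35 GB (plus scales/zeros overhead)

# KV cache per 8192-token sequence in BF16:
# tokens * layers * kv_heads * head_dim * 2 (K and V) * 2 bytes
kv_gib = 8192 * 80 * 8 * 128 * 2 * 2 / 1024**3

print(bf16_gb, awq_gb, kv_gib)  # 140.0 35.0 2.5
```

So after AWQ there should be roughly 80 − 35 ≈ 45 GB left for KV cache, activations, and the LoRA weights, which is why we think 4-bit is the way to go.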
Plan: Convert to AWQ (4-bit) on our HPC cluster using autoawq, then redeploy with --quantization awq.
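For reference, the conversion step we're planning looks roughly like this — a sketch based on the AutoAWQ README examples, assuming its default 4-bit GEMM config (we haven't validated these settings on 3.1 yet; the output path is our own):

```python
# Sketch of the planned AWQ conversion on the HPC cluster (AutoAWQ defaults assumed)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/runpod-volume/models/llama-3.1-70b"
quant_path = "/runpod-volume/models/llama-3.1-70b-awq"  # hypothetical output path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```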
Questions:
1. Has anyone successfully run AWQ LLaMA 3.1 70B + LoRA on vLLM v0.8.0 on a single 80GB GPU?
2. Any gotchas with LoRA on an AWQ base in vLLM — does it actually work cleanly?
3. Any alternative that's worked for you to fit 70B on 80GB while keeping LoRA?
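In case it helps, this is roughly the redeploy invocation we have in mind (flag names taken from vLLM's OpenAI server docs; the quantized model path is hypothetical, and we'd drop gpu-memory-utilization a bit since the AWQ kernels need some headroom). Note --max-lora-rank, since vLLM's default max rank is below our adapter's rank=64:

```shell
# Planned vLLM launch flags after the AWQ conversion
python -m vllm.entrypoints.openai.api_server \
  --model /runpod-volume/models/llama-3.1-70b-awq \
  --quantization awq \
  --enable-lora \
  --lora-modules ea=/runpod-volume/llama-ea-finetuned \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```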
Thanks