Hey Runpod Support team — hitting OOM on LLaMA 3.1 70B Serverless and looking for advice from anyone who's solved this.
Setup:
- Endpoint: Serverless, vllm/vllm-openai:v0.8.0, custom handler.py
- GPU: A100 80GB
- Model: BF16 Meta LLaMA 3.1 70B stored on a network volume at /runpod-volume/models/llama-3.1-70b
- LoRA adapter (rank=64, alpha=128) at /runpod-volume/llama-ea-finetuned
- max-model-len: 8192, gpu-memory-utilization: 0.95
Error: CUDA OOM at 78.65GB during weight loading. The model is ~140GB in BF16 and the GPU has 80GB, so we know it doesn't fit as-is.
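For anyone hitting the same wall, here's the back-of-envelope math that convinced us quantization is required (param count rounded to 70B; the 80 layers / 8 KV heads / head_dim 128 figures are from the published Llama 3.1 70B config):

```python
# Rough memory math for Llama 3.1 70B on a single 80GB GPU
params = 70e9  # rounded; actual count is slightly higher

bf16_gb = params * 2 / 1e9   # 2 bytes/param -> ~140 GB, can't fit in 80 GB
awq_gb = params * 0.5 / 1e9  # 4-bit weights -> ~35 GB (plus scales/zeros overhead)

# KV cache per 8192-token sequence in BF16:
# tokens * layers * kv_heads * head_dim * 2 (K and V) * 2 bytes
kv_gib = 8192 * 80 * 8 * 128 * 2 * 2 / 1024**3

print(bf16_gb, awq_gb, kv_gib)  # 140.0 35.0 2.5
```

So after AWQ there should be roughly 80 − 35 ≈ 45 GB left for KV cache, activations, and the LoRA weights, which is why we think 4-bit is the way to go.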
Plan: Convert to AWQ (4-bit) on our HPC cluster using autoawq, then redeploy with --quantization awq.
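For reference, the conversion step we're planning looks roughly like this — a sketch based on the AutoAWQ README examples, assuming its default 4-bit GEMM config (we haven't validated these settings on 3.1 yet; the output path is our own):

```python
# Sketch of the planned AWQ conversion on the HPC cluster (AutoAWQ defaults assumed)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/runpod-volume/models/llama-3.1-70b"
quant_path = "/runpod-volume/models/llama-3.1-70b-awq"  # hypothetical output path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```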
Questions:
1. Has anyone successfully run AWQ LLaMA 3.1 70B + LoRA on vLLM v0.8.0 on a single 80GB GPU?
2. Any gotchas with LoRA on an AWQ base in vLLM — does it actually work cleanly?
3. Any alternative that's worked for you to fit 70B on 80GB while keeping LoRA?
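In case it helps, this is roughly the redeploy invocation we have in mind (flag names taken from vLLM's OpenAI server docs; the quantized model path is hypothetical, and we'd drop gpu-memory-utilization a bit since the AWQ kernels need some headroom). Note --max-lora-rank, since vLLM's default max rank is below our adapter's rank=64:

```shell
# Planned vLLM launch flags after the AWQ conversion
python -m vllm.entrypoints.openai.api_server \
  --model /runpod-volume/models/llama-3.1-70b-awq \
  --quantization awq \
  --enable-lora \
  --lora-modules ea=/runpod-volume/llama-ea-finetuned \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```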
Thanks