Possible memory leak on Serverless

We're testing different Mistral models (cognitivecomputations/dolphin-2.6-mistral-7b and TheBloke/dolphin-2.6-mistral-7B-GGUF) and running into the same problem regardless of what size GPU we use. After 20 or so messages, the model starts returning empty responses. We've been trying to debug this every way we know how, but it doesn't make sense: the context size is around the same for each message, so it can't be due to an increasing number of prompt tokens. What I've noticed is that even when the worker isn't processing any requests, the GPU memory stays (nearly) maxed out. The only thing I can think of is that new requests don't have enough memory to be processed because it's already full.
Alpay Ariyak43d ago
The GPU memory being near max usage is expected with vLLM. As for the empty messages, that seems more like a problem with the model or with vLLM itself. Have you tried 20+ messages with regular vLLM on a pod, or with any other inference engine?
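For context: vLLM pre-allocates most of the GPU's VRAM up front for its KV cache, which is why memory looks maxed out even at idle. A minimal sketch of tuning this on a serverless endpoint, assuming the worker-vllm image exposes the usual vLLM settings as environment variables (the variable names here are assumptions and should be checked against the worker-vllm README for your image version):

```shell
# Hypothetical endpoint environment variables for runpod/worker-vllm.
# GPU_MEMORY_UTILIZATION caps the fraction of VRAM vLLM pre-allocates
# for its KV cache; MAX_MODEL_LEN bounds the context window.
MODEL_NAME=cognitivecomputations/dolphin-2.6-mistral-7b
MAX_MODEL_LEN=4096
GPU_MEMORY_UTILIZATION=0.90
```

Lowering the utilization fraction leaves headroom for activations and other processes, at the cost of a smaller KV cache.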
AdamLGPT43d ago
It's a very popular model and I can't find anyone else complaining about this problem, so I can only assume it's related to the Docker container or the hardware. I'm using the runpod/worker-vllm:0.3.0-cuda11.8.0 image, which is also popular, and I haven't found anyone reporting empty messages after X number of messages with it. It seems like it must be hardware-related, as the messages being sent are all very similar to each other, yet after a while it just starts returning "\r\n\r\n..." in a long string.
Zack12d ago
Following up here: I think I might be seeing the same issue. I see a slow creep of memory usage and eventually empty outputs with a CUDA OOM. Was there any resolution or progress on understanding this? In my case, I'm pretty sure there was a leak in my inference code. Switching wholesale over to vLLM did resolve it, even if I didn't end up getting a root cause.
Alpay Ariyak11d ago
Hardware-related issues can’t affect the output tokens, so it’s vLLM-related. Also, I’m not sure what you mean by "wholesale"; could you please elaborate?
Zack8d ago
Switching all my inference code over to vLLM is what I meant by "wholesale". I did find out that the empty responses were caused by inputs that were too large. The memory leak seems to have been something in my own inference code that went away once I switched to vLLM.
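A common way to avoid the oversized-input failure mode described above is to trim old chat turns to a token budget before each request, so the prompt never exceeds the model's context window. A minimal illustrative sketch (the helper names and the 4-characters-per-token estimate are assumptions for the example, not part of any library; a real implementation should count with the model's actual tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Swap in the model's real tokenizer for accurate counts.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], max_prompt_tokens: int) -> list[dict]:
    """Keep the most recent messages whose estimated token total fits
    within max_prompt_tokens, preserving chronological order."""
    kept = []
    total = 0
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if total + cost > max_prompt_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore oldest-first order
```

Calling `trim_history(history, budget)` before building the prompt keeps each request's token count roughly constant instead of growing with every message, which is the behavior the original poster expected.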