Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Worker log says remove container, remove network?

Not even sure this is an issue, but one of the endpoints I'm testing has a throttled worker with an odd output in its log. I'm not sure if it's crashed and been removed, or just deallocated or something? ``` 2024-01-10T14:00:00Z create pod network 2024-01-10T14:00:00Z create container ghcr.io/bartlettd/worker-vllm:main ...
Solution:
That's normal unless the worker is running

Hi all. I created a pod, started it, but can't ssh, can't start its "web terminal", can't do anything

I've created a new pod, started it, added the RSA keys, etc… however, can't ssh; Error response from daemon: Container f3aeaa504300180e74107f909c00ece20c4e18925c55c45793c83c9d3dc52852 is not running Connection to 100.65.13.88 closed. Connection to ssh.runpod.io closed....

Should I be getting billed during initialization?

Trying to understand exactly how serverless billing works with respect to workers initialising. From the GUI, behaviour is inconsistent and I can't find an explanation in the docs. I have an example where workers are pulling a Docker image: one of the workers says it's ready despite still pulling the image, while the other two are in the initialising state. The indicator in the bottom right shows the per-second pricing for one worker, which would make sense if it's active, but it clearly isn't ready to accept jobs. Also, pulling images from the GitHub Container Registry takes an absolute age; I'd be disappointed about getting charged more because of network congestion....
Solution:
We have seen this happen if you update your container image using the same tag

[RUNPOD] Minimize Worker Load Time (Serverless)

Hey fellow developers, I'm currently facing a challenge with worker load time in my setup. I'm using a network volume for models, which is working well. However, I'm struggling with the Dockerfile re-installing Python dependencies, which takes around 70 seconds. API request handling is smooth, clocking in at 15 seconds, but if the worker goes inactive, the 70-second wait for the next request is a bottleneck. Any suggestions on optimizing this process? Can I use a network volume for Python dependencies like I do for models, or are there any creative solutions out there? Sadly, no budget for an active worker....
Solution:
Initializing models over a network volume can inherently be slow because you're loading from a different drive. If you can, it's easier to bake them into the Docker image, as ashelyk said. Your other option is to increase the idle timeout after a worker is active; that way your first request initializes the model into VRAM and subsequent requests are easy for the worker to pick up...
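As a rough illustration of that warm-worker approach, here is a minimal sketch assuming the runpod Python SDK; MODEL_PATH, load_model and run_inference are hypothetical placeholders. The model is loaded once per worker rather than once per request, so with a longer idle timeout only the first request pays the load cost.
```python
import runpod

MODEL_PATH = "/runpod-volume/models/my-model"  # assumed network-volume mount point

_model = None  # cached for the lifetime of this worker process


def load_model(path):
    # Placeholder: replace with your real loader (torch.load, vLLM init, etc.).
    return object()


def run_inference(model, payload):
    # Placeholder inference call.
    return {"echo": payload}


def get_model():
    # Load the model only on the first request this worker handles; while the
    # worker stays warm (within its idle timeout), later requests reuse it.
    global _model
    if _model is None:
        _model = load_model(MODEL_PATH)
    return _model


def handler(job):
    return run_inference(get_model(), job["input"])


runpod.serverless.start({"handler": handler})
```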

Runpod VLLM Context Window

Hi, I've been using this template in my serverless endpoint: https://github.com/runpod-workers/worker-vllm I'm wondering what my context window is and how it's handling chat history? ...

Real time transcription using Serverless

Creation of a handler file for a real-time transcription app
Solution:
Here's the high-level function handler:
- Edit the template and add ports to it; the range can be anything, e.g. 5000-5010, depending on whether your workload can do parallel inference or is 1 per GPU (you will need our help to do this, it's not available to the public yet)
- Start with the first port, 5000, and open a websocket server
- Look in the env for the external port number for 5000, and get the IP from the env
- Use the in-progress hook in our SDK to send the IP, port and any other info you want...
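A hedged sketch of those steps, assuming the runpod Python SDK (with runpod.serverless.progress_update as the in-progress hook) and the websockets package; the RUNPOD_PUBLIC_IP and RUNPOD_TCP_PORT_5000 environment variable names and transcribe_chunk are assumptions, not confirmed behaviour for serverless exposed ports.
```python
import asyncio
import os

import runpod
import websockets

INTERNAL_PORT = 5000  # first port of the range added to the template


def transcribe_chunk(chunk) -> str:
    # Placeholder for the real speech-to-text model call.
    return ""


async def stream_handler(websocket, path=""):
    # Receive audio chunks from the client and stream transcripts back.
    # (`path` is kept optional for compatibility across websockets versions.)
    async for chunk in websocket:
        await websocket.send(transcribe_chunk(chunk))


async def serve(job):
    # Send the externally reachable ip/port to the client via the in-progress hook.
    connection_info = {
        "ip": os.environ.get("RUNPOD_PUBLIC_IP"),                    # assumed env var
        "port": os.environ.get(f"RUNPOD_TCP_PORT_{INTERNAL_PORT}"),  # assumed env var
    }
    runpod.serverless.progress_update(job, connection_info)

    # Open the websocket server on the internal port and keep the job alive.
    async with websockets.serve(stream_handler, "0.0.0.0", INTERNAL_PORT):
        await asyncio.Future()  # run until the job is cancelled or times out


def handler(job):
    asyncio.run(serve(job))
    return {"status": "done"}


runpod.serverless.start({"handler": handler})
```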

Failed to load library libonnxruntime_providers_cuda.so

Here is the full error: [E:onnxruntime:Default, provider_bridge_ort.cc:1480 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcufft.so.10: cannot open shared object file: No such file or directory I am running AUTOMATIC1111 on Serverless Endpoints using a Network Volume. I am using the faceswaplab extension. In this extension, there is the option to use GPU (by default, the extension only uses CPU). When I turn on the Use GPU option, I get the error....

Setting up MODEL_BASE_PATH when building worker-vllm image

I'm a little confused about this parameter in setting up worker-vllm. It seems to default to /runpod-volume, which to me implies a network volume, instead of getting baked into the image, but I'm not sure. A few questions: 1) If set to "/runpod-volume", does this mean that the model will be downloaded to that path automatically, and therefore won't be a part of the image (resulting in a much smaller image)? 2) Will I therefore need to set up a network volume when creating the endpoint? 3) Does the model get downloaded every time workers are created from a cold start? If not, then will I need to "run" a worker for a given amount of time at first to download the model?...
Solution:
Hey, if you are downloading the model at build time, it will create a local folder within the image at whatever the model base path is and store it there. If you want to download onto the network volume, you can do the first option, or build the image without the model-related arguments and specify the env variables mentioned in the docs. For example: 1. sudo docker build -t xyz:123 . (and add the CUDA version arg if you need it)...

What does the delay time and execution mean in the request page?

Hey all, I'm not sure what the delay time means on the Requests page. Is it about the cold start? Could someone help me understand it? Also, the execution time seems to be way larger than the duration I've logged. Does the execution time mean the execution time of the handler function? Thanks!
Solution:
Yes, execution time is the execution time of the handler function you pass to runpod.serverless.start(). Delay time is not only cold start time; it also includes the time that your request sits in the queue before a worker picks it up. Delay time can be dramatically impacted if all of your workers are throttled.
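As a minimal sketch of where those numbers come from (the handler and do_work below are illustrative, not RunPod's own code): the reported execution time corresponds to the handler call, while delay time covers queue wait plus cold start before it runs.
```python
import time

import runpod


def do_work(payload):
    # Placeholder for the real workload.
    return {"echo": payload}


def handler(job):
    started = time.time()
    result = do_work(job["input"])
    # Compare this logged duration with the "execution time" shown on the
    # Requests page; queue wait and cold start show up as delay time instead.
    print(f"handler body took {time.time() - started:.2f}s")
    return result


runpod.serverless.start({"handler": handler})
```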

Extremely slow Delay Time

We are using 2 serverless endpoints on RunPod and the "Delay Time" (which I assume measures end-to-end time) varies drastically between the endpoints. They both use the same hardware (the A5000 option); one of them has sub-second delay times and the other ~50 seconds, up to 180s. On the slow endpoint, the worst cold start time is reported as 13s and the execution time is ~2s, which don't add up to the delay time. There are ~50 seconds unaccounted for. The other endpoint using the same hardware does not observe such drastic delay times....
Solution:
Delay time is NOT end-to-end time. It is the cold start time plus the time that your request is in the queue before a worker picks it up. Delay time can be dramatically impacted if all of your workers are throttled.

Custom template: update environment variables?

I have configured environment variables in my custom endpoint template. When I edit the template to change their contents, the workers still seem to be using the old ENV values. What ultimately works is removing and recreating the whole endpoint, but I don't want to do that repeatedly. I've tried triggering a refresh using the "Create new release" functionality but it didn't seem to help. What is the recommended way of making sure that the workers are using the latest environment variables from the template?...
Solution:
Scale workers down to zero and back up again for environment variable changes to take effect

Delay on startup: How long for low usage?

I am trying to gauge the actual cold start for a 7B LLM deployed with vLLM. My ideal configuration is something like this: 0 active workers, 5 requests/hour, and up to 100-200 seconds of generation time. How long would it take for RunPod to do a cold start, with delay time and everything? Essentially, what are the min, avg, and max in terms of time to first token generated?...

Why not push results to my webhook??

Why not push results to my webhook

Restarting without error message

I'm deploying some code to serverless and it seems the code crashes and restarts the process without an error message. The logs just show that it has restarted; I can tell from my own startup logging. In the end I could make it work by using a specific version of CUDA and a specific version of a dependency, but I would like to know why it crashes, so I can fix it. Everything works fine locally with nvidia-docker...

Set timeout on each job

Hello, is there any way to set a hard timeout limit for each job? Thank you!

issues using serverless with webhook to AWS API Gateway

For some reason my API Gateway does not receive any of the requests from RunPod. My API Gateway does not require any authorization, so I can't think of why it doesn't go through. My RunPod endpoint id is duy9bf9dm50ag7...
Solution:
Turned out it was due to the payload size limit that AWS has

Monitor Logs from command line

Hello all, is there any command line tool to monitor an endpoint's logs without opening up the webpage? Thanks!
Solution:
You can use your browser's console to check the API calls that are made and then use the API to get the logs

What does "throttled" mean?

My endpoint dashboard sometimes shows "1 Throttled" worker, and 0 other workers, except for queued ones. What does the "throttled" status mean, and how do I prevent the condition?
Solution:
From my understanding, and this is by no means official: throttled means that other services are using the GPU. I recommend having at least 2 max workers (RunPod will then allocate 5 workers to your endpoint), which will have the ability to "potentially" pick up jobs, with the maximum number of workers ever working being the amount you chose. There is no way to prevent it unless you require some "minimum" number of workers to always be active. ...

Error building worker-vllm docker image for mixtral 8x7b

I'm running the following command to build and tag a docker worker image based off of worker-vllm: docker build -t lesterhnh/mixtral-8x7b-instruct-v0.1-runpod-serverless:1.0 --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" --build-arg MODEL_BASE_PATH="/models" . I'm getting the following error:...

qt.qpa.plugin error with sd-scripts/sdxl_gen_img.py

I implemented SDXL LoRa training similar to https://github.com/runpod-workers/worker-lora_trainer using https://github.com/kohya-ss/sd-scripts which seems to work fine now. So I figured it would be very simple to also use the provided https://github.com/kohya-ss/sd-scripts/blob/main/sdxl_gen_img.py script for image generation, but even a very basic call always results in this error and kills my worker: qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "/usr/local/lib/python3.8/..." even though it was found. qt.qpa.xcb: could not connect to display ...
Solution:
RunPod has no physical display, so I suggest logging an issue in the kohya-ss sd-scripts GitHub repo. This is not a RunPod issue; you are asking for help in the wrong place.