Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Serverless Unable to SSH / Use Jupyter Notebook Anymore

When I first started using Runpod, if I had an active worker I could SSH in or use a Jupyter notebook, as long as I had SSH open / a notebook launched on the pod. But now when I try to SSH, it just throws an error: ``` Justins-MBP ~ % ssh m3k8sad75isko8-64410faa@ssh.runpod.io -i ~/.ssh/id_ed25519...

Editing Serverless Template ENV Variable

When I edit an env variable on a serverless template, does it update in real time? I can't quite tell, and I'm wondering what happens under the hood. Do I need to refresh the workers myself, or will idle workers automatically pick up the new env variables when they go active?

llama.cpp serverless endpoint

https://github.com/ggerganov/llama.cpp
llama.cpp is AFAIK the only setup that supports quantized LLaVA-1.6, which is why I use it. On some workers the Docker image works; on others it crashes with an "illegal instruction" error. https://github.com/ggerganov/llama.cpp/issues/537...
Solution:
I don't know why you would want to use llama.cpp; it's more for offloading onto CPU than for GPU. You can look at using this instead: https://github.com/ashleykleynhans/runpod-worker-llava...

comfyui + runpod serverless

I'm looking to host my ComfyUI workflow via Runpod serverless, and I'm curious how the ComfyUI startup process works with serverless. For example, in my local setup, every time I restart my ComfyUI localhost it takes a while to get up and running; let's call this the "ComfyUI cold start". But once it is set up, it's relatively quick to run many generations one after another. My question: ...
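
One common pattern for this (a sketch under assumptions, not Runpod's official recipe): launch ComfyUI once at module import time, outside the handler, so the "ComfyUI cold start" is paid once per worker rather than once per request; each job then just posts a workflow to the already-running local server. The start command, the path /comfyui/main.py, port 8188, and the /prompt route below reflect a default ComfyUI install and are assumptions.

```python
# Sketch: launch ComfyUI once per worker, then reuse it for every job.
# Install path, port, and the /prompt API are assumptions for a default ComfyUI setup.
import subprocess
import time

import requests
import runpod

COMFY_URL = "http://127.0.0.1:8188"

# Started at module import, i.e. once per worker, not once per request.
comfy_proc = subprocess.Popen(["python", "/comfyui/main.py", "--listen", "127.0.0.1"])

def wait_for_comfy(timeout=120):
    """Poll until the ComfyUI HTTP server answers."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            requests.get(COMFY_URL, timeout=2)
            return
        except requests.RequestException:
            time.sleep(1)
    raise RuntimeError("ComfyUI did not come up in time")

wait_for_comfy()

def handler(job):
    # Each job only submits a workflow to the already-running server.
    workflow = job["input"]["workflow"]
    resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow}, timeout=30)
    return resp.json()

runpod.serverless.start({"handler": handler})
```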

ECC errors on serverless workers using L4

We are currently using L4 machines in the EU-RO region for our production environment (30-70 workers). Based on the request data, we have seen increasing hardware issues related to ECC errors and were wondering if we could get help mitigating these failures. ``` "handler: CUDA error: uncorrectable ECC error encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions....

Does Runpod Autoupdate Images now for non-matching hashes?

I had only idle workers, and I sent a request to do some testing; suddenly, it started downloading a new image. The only explanation I have is a CI/CD pipeline I'm testing that pushed a new image with the same name. Is Runpod now downloading new images if the hashes don't match? You can see it because instead of being in "initializing" it's a green worker....

vLLM Memory Error / Runpod Error?

https://pastebin.com/vjSgS4up
Error initializing vLLM engine: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
...
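
For reference, a minimal sketch of what the error message is asking for, using the plain vLLM Python API (the model name and exact values are placeholders; if you're on the prebuilt vLLM worker image, the same knobs are typically exposed through the endpoint template's environment variables):

```python
from vllm import LLM

# Either shrink the context window below the KV-cache limit reported in the error,
# or give vLLM more of the GPU to enlarge the KV cache (default utilization is 0.90).
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    max_model_len=16384,                          # below the 24144-token limit from the error
    gpu_memory_utilization=0.95,
)
```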

How do I correctly stream results using runpod-python?

Currently, I'm doing the following: ------- import runpod...
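
A minimal sketch of one way to do it with the runpod-python SDK, assuming the endpoint's handler is a generator and that the SDK's job object exposes a stream() iterator (endpoint ID, API key, and payload are placeholders):

```python
import runpod

runpod.api_key = "YOUR_API_KEY"            # placeholder
endpoint = runpod.Endpoint("ENDPOINT_ID")  # placeholder endpoint ID

# run() submits the job asynchronously; stream() then yields partial outputs
# as the generator handler produces them (assumed SDK behaviour).
job = endpoint.run({"prompt": "Hello"})

for chunk in job.stream():
    print(chunk)

print(job.output())  # final/aggregated result once the job completes
```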

Status endpoint only returns "COMPLETED" but no answer to the question

I'm currently using the v2/model_id/status/run_id endpoint and the result I get is as follows: {"delayTime": 26083, "executionTime": 35737, "id": **, "status": "COMPLETED"}. My stream endpoint works fine, but for my purposes I'd rather wait longer and retrieve the entire result at once. How am I supposed to do that? ...
Solution:
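
A hedged guess at the cause, sketched below: with a generator (streaming) handler, the /status endpoint only carries the full output if the worker is told to aggregate the stream; in runpod-python that is the return_aggregate_stream option (option name assumed).

```python
import runpod

def handler(job):
    # Generator handler: yields chunks for the /stream endpoint.
    for token in ["partial ", "answer ", "here"]:
        yield token

runpod.serverless.start({
    "handler": handler,
    # Assumed flag: collects the yielded chunks so /status also returns the full output.
    "return_aggregate_stream": True,
})
```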

24GB PRO availability in RO

I switched from the 24GB tier in RO to 24GB PRO to benefit from the higher availability of the 4090s in RO, but most of my workers are becoming throttled again.

Deepseek coder on serverless

Hello, new serverless user here. I would be using the vLLM worker, so whenever it gets spun up from a cold start, does it have to download the model every time? I'd be running it in fp16, which means about 14 GB of data to download.

How to write a file to persistent storage on Serverless?

Hey guys, can someone help me write a file to persistent storage on Serverless? I want to then allow users to download it directly from the storage, and clean up the volume after 24 hours. Any help here would be great!...
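
A minimal sketch of the writing part, assuming the attached network volume is mounted at /runpod-volume on serverless workers (the mount path is an assumption; serving the download to users and the 24-hour cleanup would need a separate process, e.g. a pod or an external object store):

```python
import os
import uuid

import runpod

# Assumed mount point of the attached network volume on serverless workers.
VOLUME_ROOT = "/runpod-volume"

def handler(job):
    out_dir = os.path.join(VOLUME_ROOT, "outputs")
    os.makedirs(out_dir, exist_ok=True)

    # Write the result under a unique name so concurrent jobs don't collide.
    filename = f"{uuid.uuid4()}.txt"
    with open(os.path.join(out_dir, filename), "w") as f:
        f.write(job["input"]["content"])

    # Return the filename (or a key) so another service can serve and clean it up later.
    return {"file": filename}

runpod.serverless.start({"handler": handler})
```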

Run LLM Model on Runpod Serverless

Hi there, I have an LLM model built into a Docker image, and the image is 40GB+. I'm wondering, can I mount the model as a volume instead of adding the model to the Docker image?...

Safetensor safeopen OS Error device not found

Running inference on a serverless endpoint, and this line of code:
with safetensors.safe_open(path, framework="pt", device="cpu") as f:
...

Directing requests from the same user to the same worker

Guys, thank you for your work. We are enjoying your platform. I have the following workflow: on the first request from a user, the worker does some hard stuff for about 15-20s and caches it, and all subsequent requests are very fast (~150ms). But if one of the subsequent requests goes to another worker, it has to repeat the hard stuff (15-20s). Is there any way to direct all subsequent calls from the same user to the same worker?...
Solution:
Just a summary so I can mark this solution: 1) Use network storage to persist data between runs. 2) Use an outside file storage / object storage provider. 3) If using Google Cloud / an S3 bucket, large files can use parallel downloads / uploads; there should be existing tooling out there, or you can obviously make your own...
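
A hedged sketch of option 1 from that summary: rather than pinning a user to a worker, persist the expensive per-user artifact on the shared network volume so any worker can reuse it (the /runpod-volume mount path and pickling of the cached state are assumptions):

```python
import os
import pickle

# Assumed serverless mount point of the shared network volume.
CACHE_DIR = "/runpod-volume/user_cache"

def get_user_state(user_id, build_fn):
    """Load the expensive per-user state from the shared volume, or build and cache it."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f"{user_id}.pkl")

    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)   # fast path on any worker: roughly a disk read

    state = build_fn()              # the 15-20s "hard stuff", done once per user
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return state
```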

Serverless webhook for executionTimeout

Hi, we've just added an executionTimeout for our serverless jobs. I understand that when you supply a webhook, a request is sent when a job is completed. Is it possible to send a webhook request when the executionTimeout is hit as well? Ideally we want to update our DB when a job has completed or has failed (due to taking too long)...
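
For context, a minimal sketch of how the webhook and executionTimeout are attached to a /run request (the endpoint ID, API key, and the policy field name are assumptions based on the public serverless API); whether the webhook also fires when the timeout kills the job is exactly the open question here:

```python
import requests

ENDPOINT_ID = "ENDPOINT_ID"   # placeholder
API_KEY = "YOUR_API_KEY"      # placeholder

payload = {
    "input": {"prompt": "hello"},
    "webhook": "https://example.com/runpod-callback",  # called when the job finishes
    "policy": {"executionTimeout": 120000},             # ms; field name assumed
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(resp.json())  # expected: {"id": ..., "status": "IN_QUEUE"}
```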

Is there any way to do dynamic batching?

Say I have a vision model deployed and I send 5 images within x time; is there a way to actually stack the images, pass them through the model together, and return the 5 responses? I was able to find concurrent handlers etc., but nothing about actual batching (other than sending them all in the same request, of course).
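
Nothing in this thread points to a built-in for that, so here is a hand-rolled sketch under assumptions: an async concurrent handler parks each request in an asyncio queue, and a background task flushes the queue every x milliseconds (or once it reaches a maximum batch size) through a single batched model call. The model call is a placeholder, and the concurrency_modifier option name is assumed.

```python
import asyncio

import runpod

MAX_BATCH = 5       # flush once this many requests are waiting...
MAX_WAIT_S = 0.05   # ...or after this much time has passed

_queue = None  # created lazily on the worker's event loop

def run_model_batch(images):
    # Placeholder for the real batched forward pass over a stack of images.
    return [f"prediction for {image}" for image in images]

async def _batcher():
    loop = asyncio.get_running_loop()
    while True:
        batch = [await _queue.get()]              # block until at least one request arrives
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and loop.time() < deadline:
            try:
                batch.append(await asyncio.wait_for(_queue.get(), deadline - loop.time()))
            except asyncio.TimeoutError:
                break
        results = run_model_batch([image for image, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

async def handler(job):
    global _queue
    loop = asyncio.get_running_loop()
    if _queue is None:                            # start the batcher lazily on this loop
        _queue = asyncio.Queue()
        loop.create_task(_batcher())
    future = loop.create_future()
    await _queue.put((job["input"]["image"], future))
    return await future                           # resolved when the batch is flushed

runpod.serverless.start({
    "handler": handler,
    # Assumed option: lets the worker accept several jobs at once so batching can happen.
    "concurrency_modifier": lambda current: MAX_BATCH,
})
```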

Started getting a lot of these "Failed to return job results" errors. Outage?

```json { "dt": "2024-02-15 08:20:07.490148", "endpointid": "1o6zoaofipeyuh", "level": "error",...