Runpod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!



$0 balance in my account

Hi, I had about $25 USD in my account last night. This morning I received a message to replenish my account, as it was empty. I would like to understand what happened, as I do not have a running pod or serverless instance. Thanks....
Solution:
It is likely that your worker continuously started up but failed; that still results in your account being charged. If you DM me, I can provide you a small credit so we can verify whether this was the case.

vllm + Ray issue: Stuck on "Started a local Ray instance."

Trying to run TheBloke/goliath-120b-AWQ on vLLM + RunPod with 2x48GB GPUs:
```
2024-02-03T12:36:44.148649796Z The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
2024-02-03T12:36:44.149745508Z 0it [00:00, ?it/s]
```
...

Similar speed of workers on different GPUs

Hi, I am trying to launch the codeformer model on a serverless GPU. However, during testing I've noticed that it doesn't matter which GPU I choose; the speed stays the same. An A100 runs at almost the same speed as an A4500. How can I fix that?
Solution:
The main benefit of the A100 is VRAM, not performance. If you want performance, select the 24 GB PRO tier, which is a 4090.

Docker daemon is not started by default?

In the template I specify a docker run command, but the worker cannot execute the container because the daemon is not running: `docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?` I added a systemctl start docker command before docker run, but systemctl is not recognised. How can I make this container start?
Solution:
You can't run Docker in Docker on RunPod; serverless workers are themselves containers, so there is no Docker daemon available inside them.

VLLM Worker Error that doesn't time out.

```
2024-02-01T18:08:19.928745487Z {"requestId": null, "message": "Traceback: Traceback (most recent call last):\n File \"/usr/local/lib/python3.11/dist-packages/runpod/serverless/modules/rp_job.py\", line 55, in get_job\n async with session.get(_job_get_url()) as response:\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client.py\", line 1187, in __aenter__\n self._resp = await self._coro\n ^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client.py\", line 601, in _request\n await resp.start(conn)\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client_reqrep.py\", line 965, in start\n message, payload = await protocol.read() # type: ignore[union-attr]\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py\", line 622, in read\n await self._waiter\naiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer\n", "level": "ERROR"}
2024-02-01T18:08:19.929440753Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer", "level": "ERROR"}
```
...
Solution:
refresh_worker does that, but I don't think it works for the RunPod internal stuff; it's more for when your handler raises an exception. @Justin Merrell will have to confirm. I assume this is the latest version of the SDK?
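For reference, a minimal sketch of the refresh_worker pattern described above; `do_work` is a hypothetical placeholder, and the exact return shape the platform expects may differ by SDK version:

```python
# Sketch only: `do_work` stands in for your real inference code, and the
# return shape assumes the documented `refresh_worker` flag behavior.
def do_work(payload):
    return payload

def handler(job):
    try:
        return {"output": do_work(job["input"])}
    except Exception as err:
        # Returning refresh_worker=True asks the platform to stop this
        # worker after the job, so the next job starts on a fresh one.
        return {"refresh_worker": True, "error": str(err)}
```

Note that this only covers exceptions your handler can catch; failures inside the SDK's own job loop, as in the logs above, happen before your handler runs.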

quick python vLLM endpoint example please?

…I’ve been on this for 2 hours and the best I can get so far is have a bunch of stuff endlessly ‘queued’. I’m getting responses from the test thing on the ‘my endpoints’ page but my python script isn’t working… 😅...
Solution:
Here's the answer btw:
```python
import requests
import json
```
...
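For anyone else stuck here, a minimal stdlib-only sketch of calling a serverless endpoint synchronously; `YOUR_API_KEY` and `YOUR_ENDPOINT_ID` are placeholders, and the URL and body shape assume the documented /runsync API:

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"          # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # placeholder

def build_request(prompt):
    """Build the URL, headers, and JSON body for a /runsync call."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"input": {"prompt": prompt}}).encode()
    return url, headers, body

def run_sync(prompt, timeout=600):
    """POST the job and block until the worker returns a result."""
    url, headers, body = build_request(prompt)
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

If requests keep piling up as "queued", it usually means no worker is picking them up; check the endpoint's worker logs rather than the client.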

Best way to deploy a new LLM serverless, where I don't want to build large docker images

The infrastructure I have come across at RunPod doesn't have much serverless support for fast copying of model weights from a local data centre. Can I get some suggestions on how I should plan my deployment? Building large Docker images, pushing them to a registry, and then having the server download them at cold start takes a massive amount of time. Help will be appreciated.

Pause on the yield in async handler

I have written an async handler. Messages are really small, a few kilobytes:
```python
async for msg in search.run_search_generator(request):
    start_time = time.perf_counter()
    yield msg
```
...

worker-vllm cannot download private model

I built my model successfully, and it was able to download the weights during the build. However, when I deploy it on RunPod Serverless, it fails to start up on request because it cannot download the model.
```bash
export DOCKER_BUILDKIT=1
export HF_TOKEN="your_token"
```
...

How do I select a custom template without creating a new Endpoint?

Hi, right now I need to create a new template to do a new release. The issue is that the platform, as is, doesn't let you pick a different template for an endpoint, only modify the one attached to it. That is problematic if I have different endpoints that share the same template. Do I need to create a new endpoint every time I just want to select a template?...

Slow initialization, even with flashboot, counted as execution time

I am running a serverless Fooocus API endpoint from this code base https://github.com/davefojtik/RunPod-Fooocus-API. It takes a long time to initialize, even with Flashboot, and the initialization counts as execution time. In subsequent runs with Flashboot, the time is dramatically lower, until Flashboot's cache clears. The issue is raised and discussed here https://github.com/davefojtik/RunPod-Fooocus-API/issues/5...

worker vllm 'build docker image with model inside' fails

from the page https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file
Option 2: Build Docker Image with Model Inside To build an image with the model baked in, you must specify the following docker arguments when building the image. ...

Getting TypeError: Failed to fetch when uploading video

I've been able to upload videos just fine before, up to 1080p quality, but this morning it isn't working either in ComfyUI or Jupyter. I can't even upload videos that I could previously - is there a length limit that I'm not aware of, or something else going on? I am using a network volume, and there is plenty of space

SSLCertVerificationError from custom api

I am trying to create my own API using Python, but I am getting an error when a job is submitted: `Error Type: ClientConnectorCertificateError | Error Message: Cannot connect to host api.runpod.ai:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')]` Even this basic sample throws the same error ...
Solution:
This is fixed. Although the docs mention that Python 3.8 is the minimum for the runpod Python module, under Python 3.11 I was getting the SSL error, and even upgrading the certificate store did not help.

Does async generator allow a worker to take off multiple jobs? Concurrency Modifier?

I was reading the RunPod docs and saw the below. Does an async generator_handler mean that, if I send 5 jobs for example, one worker will just keep picking up new jobs? I also tried to add:
```
"concurrency_modifier": 4
```
...
Solution:
```python
import runpod
import asyncio
import random
```
...
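As a stdlib-only illustration of what the concurrency modifier is doing, here is a sketch that mocks the job loop with asyncio; in the real SDK the handler is wired in via `runpod.serverless.start` and the worker enforces the limit for you, so the semaphore here is only a stand-in:

```python
import asyncio

async def generator_handler(job):
    # Async generator handler: yields partial results for one job.
    for chunk in job["input"]["prompt"].split():
        await asyncio.sleep(0)  # simulate async work
        yield chunk

def concurrency_modifier(current):
    # Tell the worker how many jobs it may run at once.
    return 4

async def main():
    jobs = [{"input": {"prompt": f"job {i} done"}} for i in range(5)]
    sem = asyncio.Semaphore(concurrency_modifier(0))

    async def run(job):
        async with sem:  # at most 4 jobs in flight on this worker
            return [c async for c in generator_handler(job)]

    return await asyncio.gather(*(run(j) for j in jobs))

results = asyncio.run(main())
print(results[0])  # ['job', '0', 'done']
```

So yes: with a concurrency modifier, one worker can hold several jobs at once, and each job still streams its own yielded chunks independently.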

Does Runpod provide startup free computes grant?

I am building a Stable Diffusion app and would love to get assistance. Modal seems to be offering one, and it would be great if RunPod offered it too....

Custom Checkpoint Model like DreamShaper

How can we use Custom SD models like DreamShaper on Civitai for endpoints? Thanks!

How to force Runpod to pull latest docker image?

I have built the Docker image that my RunPod template relies on, and I want RunPod to use the latest image I just built. How can I do it? I saw there is a button called New Release, but it's asking me to use a new tag. I don't want to use a new tag; my image uses the same old tag....

Endpoint creation can't have envs variables

After creating a template with some env variables, I went into endpoint creation and the Add variables button is grayed out, so I can't click on it. Just in case, I already refreshed the UI and checked in DevTools to see if the API was the cause, but no....

How to get around the 10/20 MB payload limit?

For use cases such as training LoRAs with Stable Diffusion, where a user could upload tens of photos, 10/20MB is quite small. This is especially true because you have to convert an image to base64 before sending it to the endpoint, which will increase the size of each photo. My app requires the user to upload photos of themselves for training purposes. And if I can't find a way around the 10 MB payload limit, I just realized I can't use runpod's serverless GPUs. Are there any clever ways of getting around this payload limit?...
Solution:
Upload your photos to cloud storage, and your serverless workers can download them from a link. The limits are fixed and there is no way around them; you must use a link to download the resources instead.
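A minimal sketch of that link-based pattern, assuming the client uploads its photos first and then sends presigned URLs in the job input (the `image_urls` key is hypothetical, not part of any RunPod schema):

```python
import urllib.request

def handler(job):
    """Download each photo from a URL instead of receiving base64 blobs."""
    paths = []
    for i, url in enumerate(job["input"]["image_urls"]):
        path = f"/tmp/photo_{i}.jpg"  # scratch space inside the worker
        with urllib.request.urlopen(url) as resp, open(path, "wb") as f:
            f.write(resp.read())
        paths.append(path)
    return {"downloaded": paths}
```

Results can go the other way too: upload training outputs back to the bucket and return a link, which keeps the response under the payload limit as well.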