zkreutzjanz Posts - Answer Overflow

zkreutzjanz

RunPod

•Created by zkreutzjanz on 10/11/2024 in #⛅｜pods-clusters

Deploy pod without scheduled downtime

6 replies

RunPod

•Created by zkreutzjanz on 7/25/2024 in #⛅｜pods-clusters

3 pods inaccessible after network outage

There was a network outage in EU NO, and the pods show as up but cannot start:

error creating container: container: create: container create: Error response from daemon: layer does not exist

This is the second time an incident like this has occurred. I have >2 TB of storage I cannot access. Am I being billed for these pods? I have had no response from support.

5 replies

RunPod

•Created by zkreutzjanz on 7/12/2024 in #⛅｜pods-clusters

Multi Node training with torchrun/slurm

Has anyone here tried multi-node training on RunPod? I am thinking of setting this up, but if people have run into prohibitively slow network speeds, I do not see a reason to.

8 replies

RunPod

•Created by zkreutzjanz on 6/21/2024 in #⚡｜serverless

Slow IO speeds on serverless

An always-active A6000 worker takes twice as long to run my code as a normal A6000. I think it is IO speed. How can I see IO speeds?
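
One rough way to compare IO between the two environments is to time a sequential write and read of a scratch file in the directory the code actually uses. The sketch below is only an illustration: the function name `measure_io` and its default sizes are assumptions, not anything provided by RunPod, and the read figure may be inflated by the OS page cache.

```python
import os
import tempfile
import time


def measure_io(directory, size_mb=64, chunk_mb=4):
    """Return (write_MBps, read_MBps) for a sequential scratch file in `directory`.

    Hypothetical helper for a rough comparison, not a full benchmark.
    """
    chunk_bytes = chunk_mb * 1024 * 1024
    chunk = os.urandom(chunk_bytes)  # random data so compression cannot skew results
    n_chunks = size_mb // chunk_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push data to the device before stopping the clock
        write_s = time.perf_counter() - start

        # Note: this read may be served from the page cache rather than the disk.
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk_bytes):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file

    total_mb = n_chunks * chunk_mb
    return total_mb / write_s, total_mb / read_s
```

Running `measure_io("/workspace")` (or whichever directory the handler reads from) on both the serverless worker and the regular pod gives two pairs of numbers to compare; a large gap in the write figure would support the IO theory.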

10 replies

RunPod

•Created by zkreutzjanz on 6/16/2024 in #⛅｜pods-clusters

Pod Maintenance update days after

17 replies

RunPod

•Created by zkreutzjanz on 5/27/2024 in #⚡｜serverless

Clone endpoint failing in UI

{
    "errors": [
        {
            "message": "Something went wrong. Please try again later or contact support.",
            "locations": [
                {
                    "line": 1,
                    "column": 23
                }
            ],
            "extensions": {
                "code": "BAD_USER_INPUT"
            }
        },
        {
            "message": "Something went wrong. Please try again later or contact support.",
            "locations": [
                {
                    "line": 1,
                    "column": 23
                }
            ],
            "extensions": {
                "code": "BAD_USER_INPUT"
            }
        }
    ]
}

User input, sensitive information removed:

{"operationName":"saveEndpoint","variables":{"input":{"gpuIds":"AMPERE_48,ADA_48_PRO,-NVIDIA A40,-NVIDIA L40","gpuCount":1,"allowedCudaVersions":"","idleTimeout":5,"locations":null,"name":"myendpoint-dev (cloned)","networkVolumeId":null,"scalerType":"QUEUE_DELAY","scalerValue":4,"workersMax":1,"workersMin":0,"executionTimeoutMs":600000,"template":{"containerDiskInGb":25,"containerRegistryAuthId":"","dockerArgs":"","env":[{},{],"imageName":"my-imaage","name":"endpoint-dev (cloned)__template"}}},"query":"mutation saveEndpoint($input: EndpointInput!) {\n  saveEndpoint(input: $input) {\n    gpuIds\n    id\n    idleTimeout\n    locations\n    name\n    networkVolumeId\n    scalerType\n    scalerValue\n    templateId\n    userId\n    workersMax\n    workersMin\n    gpuCount\n    __typename\n  }\n}"}

29 replies

RunPod

•Created by zkreutzjanz on 5/3/2024 in #⛅｜pods-clusters

How to tell how much storage being used in pod? (including network drive)

I tried df -h, but it seems to report on the whole filesystem.

(base) root@f3165c77df52:/workspace# df -h
Filesystem                    Size  Used Avail Use% Mounted on
overlay                        30G  8.9G   22G  30% /
tmpfs                          64M     0   64M   0% /dev
tmpfs                         117G     0  117G   0% /sys/fs/cgroup
shm                            87G  106M   87G   1% /dev/shm
/dev/vda1                      39G  9.3G   30G  24% /usr/bin/nvidia-smi
mfs#eur-no-1.runpod.net:9421  175T   74T  102T  43% /workspace
/dev/vda2                     1.7T  735G  925G  45% /etc/hosts
udev                          117G     0  117G   0% /dev/tty
tmpfs                         117G   12K  117G   1% /proc/driver/nvidia
tmpfs                         117G  4.0K  117G   1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs                          24G  1.6M   24G   1% /run/nvidia-persistenced/socket
tmpfs                         117G     0  117G   0% /proc/acpi
tmpfs                         117G     0  117G   0% /proc/scsi
tmpfs                         117G     0  117G   0% /sys/firmware
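
Because `df` reports per filesystem and `/workspace` here is a shared 175T network mount, per-directory usage has to be summed from the files themselves; the usual shell route is `du -sh /workspace`. A minimal Python sketch of the same idea follows. The helper names `dir_usage_bytes` and `human` are hypothetical, and the total is apparent file size, which may differ from what the storage backend counts for quota or billing.

```python
import os
import stat


def dir_usage_bytes(root):
    """Sum the apparent size of regular files under `root` (symlinks are not followed).

    Hard-linked files are counted once per name, so the total can overcount.
    """
    total = 0
    # onerror swallows unreadable directories instead of aborting the walk
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total


def human(n):
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024
```

Calling `human(dir_usage_bytes("/workspace"))` prints a single figure for the volume contents. On a large network volume the walk can take a while, since every file has to be stat-ed.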

13 replies

RunPod

•Created by zkreutzjanz on 5/1/2024 in #⛅｜pods-clusters

How to get a general idea for max volume size on secure cloud?

I have been able to deploy 2 TB drives, but what is the standard here? How much storage is there generally per server, so I can estimate what I should expect to get?

37 replies

RunPod

•Created by zkreutzjanz on 4/22/2024 in #⛅｜pods-clusters

NVENC driver conflict

3 replies

RunPod

•Created by zkreutzjanz on 3/17/2024 in #⚡｜serverless

A6000 serverless worker is failing for an unknown reason.

In the last week, a few of our serverless workers have been failing on all requests. I am trying to narrow down a common denominator right now; it seems to be just an A6000 issue.

1 reply

RunPod

•Created by zkreutzjanz on 3/14/2024 in #⛅｜pods-clusters

GPU usage when pod initialized. Not able to clear.

Tried nvidia-smi -r, restarting, and resetting. There is still usage on one GPU in the pod.
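
To see whether any process visible from inside the pod is holding that memory, `nvidia-smi` can list compute processes in CSV form. The sketch below wraps that query; the helper names are hypothetical, and it assumes `nvidia-smi` is on the PATH. If the list comes back empty while usage is still reported, the memory is likely held by something outside the container, which this approach cannot see.

```python
import csv
import subprocess

# Real nvidia-smi query flags; the field list is one reasonable choice, not the only one.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse 'pid, name, used_memory' CSV rows into a list of dicts."""
    rows = []
    for rec in csv.reader(text.splitlines(), skipinitialspace=True):
        if len(rec) != 3:
            continue  # skip blank or malformed lines
        pid, name, mem = rec
        rows.append({
            "pid": int(pid),
            "name": name,
            # used_memory can be reported as e.g. "[N/A]"; keep None in that case
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the compute processes it reports."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```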

6 replies

RunPod

•Created by zkreutzjanz on 2/25/2024 in #⛅｜pods-clusters

Pod running but inaccessible

1 reply
