zkreutzjanz
Explore posts from servers
RunPod
Created by zkreutzjanz on 7/12/2024 in #⛅|pods
Multi Node training with torchrun/slurm
Has anyone here ever tried multi-node training on RunPod? I am thinking of setting this up, but if people have encountered prohibitively slow network speeds I do not see a reason to.
8 replies
RunPod
Created by zkreutzjanz on 6/21/2024 in #⚡|serverless
Slow IO speeds on serverless
An always-active A6000 worker takes twice as long to run my code as a normal A6000. I think it is IO speed. How can I see IO speeds?
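One rough way to compare IO speed between a serverless worker and a regular pod is to time a sequential write and read of a scratch file in the directory the code actually uses. The sketch below is only an illustration: `measure_io` is a hypothetical helper, not a RunPod tool, and the file and block sizes are arbitrary choices. The read figure can be inflated by the operating system's page cache, so the write number is usually the more telling one.

```python
import os
import tempfile
import time


def measure_io(directory, size_mb=64, block_kb=1024):
    """Time a sequential write and read of a scratch file in `directory`.

    Returns (write_MB_per_s, read_MB_per_s). Hypothetical helper for a
    quick comparison between machines, not a rigorous benchmark.
    """
    block = os.urandom(block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    total_mb = blocks * block_kb / 1024

    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Write phase: fsync so the timing includes flushing to the device.
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.perf_counter() - start

        # Read phase: may be served from the page cache, so treat the
        # result as an upper bound rather than raw disk speed.
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_kb * 1024):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)

    return total_mb / write_s, total_mb / read_s
```

Running `measure_io` against the same path on both machines (for example the model or dataset directory) gives two pairs of numbers to compare. A large gap in write throughput would support the IO theory.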
10 replies
RunPod
Created by zkreutzjanz on 6/16/2024 in #⛅|pods
Pod Maintenance update days after
No description
17 replies
RunPod
Created by zkreutzjanz on 5/27/2024 in #⚡|serverless
Clone endpoint failing in UI
{
"errors": [
{
"message": "Something went wrong. Please try again later or contact support.",
"locations": [
{
"line": 1,
"column": 23
}
],
"extensions": {
"code": "BAD_USER_INPUT"
}
},
{
"message": "Something went wrong. Please try again later or contact support.",
"locations": [
{
"line": 1,
"column": 23
}
],
"extensions": {
"code": "BAD_USER_INPUT"
}
}
]
}
User input, sensitive information removed:
{"operationName":"saveEndpoint","variables":{"input":{"gpuIds":"AMPERE_48,ADA_48_PRO,-NVIDIA A40,-NVIDIA L40","gpuCount":1,"allowedCudaVersions":"","idleTimeout":5,"locations":null,"name":"myendpoint-dev (cloned)","networkVolumeId":null,"scalerType":"QUEUE_DELAY","scalerValue":4,"workersMax":1,"workersMin":0,"executionTimeoutMs":600000,"template":{"containerDiskInGb":25,"containerRegistryAuthId":"","dockerArgs":"","env":[{},{],"imageName":"my-imaage","name":"endpoint-dev (cloned)__template"}}},"query":"mutation saveEndpoint($input: EndpointInput!) {\n saveEndpoint(input: $input) {\n gpuIds\n id\n idleTimeout\n locations\n name\n networkVolumeId\n scalerType\n scalerValue\n templateId\n userId\n workersMax\n workersMin\n gpuCount\n __typename\n }\n}"}
29 replies
RunPod
Created by zkreutzjanz on 5/3/2024 in #⛅|pods
How to tell how much storage being used in pod? (including network drive)
I tried df -h, but it seems to report on the whole filesystem.
(base) root@f3165c77df52:/workspace# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 30G 8.9G 22G 30% /
tmpfs 64M 0 64M 0% /dev
tmpfs 117G 0 117G 0% /sys/fs/cgroup
shm 87G 106M 87G 1% /dev/shm
/dev/vda1 39G 9.3G 30G 24% /usr/bin/nvidia-smi
mfs#eur-no-1.runpod.net:9421 175T 74T 102T 43% /workspace
/dev/vda2 1.7T 735G 925G 45% /etc/hosts
udev 117G 0 117G 0% /dev/tty
tmpfs 117G 12K 117G 1% /proc/driver/nvidia
tmpfs 117G 4.0K 117G 1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs 24G 1.6M 24G 1% /run/nvidia-persistenced/socket
tmpfs 117G 0 117G 0% /proc/acpi
tmpfs 117G 0 117G 0% /proc/scsi
tmpfs 117G 0 117G 0% /sys/firmware
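In the output above, the `/workspace` row reports 175T total, which looks like the capacity of the shared storage backend rather than this pod's own allocation. To see what a single directory tree actually holds, the usual shell answer is `du -sh /workspace`. The sketch below does roughly the same thing in Python by summing file sizes. The helper names `dir_usage_bytes` and `human` are made up for illustration, and the total is the sum of apparent file sizes, which may differ from how the provider counts usage against a quota.

```python
import os


def dir_usage_bytes(root):
    """Sum the sizes of regular files under `root` (symlinks are skipped).

    Roughly comparable to `du -s --apparent-size`; unreadable entries
    are ignored rather than raising.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda e: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.lstat(path).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


def human(n):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024
```

For example, `print(human(dir_usage_bytes("/workspace")))` would print the total for the network volume mount. On a large volume the walk can take a while, since every file has to be stat'ed over the network.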
13 replies
RunPod
Created by zkreutzjanz on 5/1/2024 in #⛅|pods
How to get a general idea for max volume size on secure cloud?
I have been able to deploy 2TB drives, but what is the standard here? How much storage is there generally per server, so I can estimate what I should expect to be able to get?
37 replies
RunPod
Created by zkreutzjanz on 4/22/2024 in #⛅|pods
NVENC driver conflict
No description
3 replies
RunPod
Created by zkreutzjanz on 3/17/2024 in #⚡|serverless
A6000 serverless worker is failing for an unknown reason.
In the last week, a few of our serverless workers have been failing on all requests. I am trying to narrow down a common denominator right now; it seems to be just an A6000 issue.
1 reply
RunPod
Created by zkreutzjanz on 3/14/2024 in #⛅|pods
GPU usage when pod initialized. Not able to clear.
Tried nvidia-smi -r, restarting, and resetting. There is still usage on one GPU in the pod.
5 replies
RunPod
Created by zkreutzjanz on 2/25/2024 in #⛅|pods
Pod running but inaccessible
No description
1 reply