[Urgent] One GPU suddenly went away

Hi, we have a prod issue right now: one of the GPUs in our pod suddenly disappeared.
41 Replies
Superintendent
Superintendent5mo ago
Did it fall off the bus?
xPaghkman
xPaghkman5mo ago
@Justin Can someone help and check our pod?
Superintendent
Superintendent5mo ago
lspci | grep VGA should spit out something about the GPU. Also, can you run nvidia-smi and show the output?
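For anyone who wants to automate the nvidia-smi check suggested above, here is a minimal sketch, not something posted in the thread. It assumes nvidia-smi is on the PATH and uses its CSV query mode. The function names and the expected count of 2 are illustrative.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse output of `nvidia-smi --query-gpu=index,name --format=csv,noheader`.

    Each non-empty line looks like "0, NVIDIA GeForce RTX 4090".
    Returns a list of (index, name) tuples.
    """
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        index, name = line.split(",", 1)
        gpus.append((int(index), name.strip()))
    return gpus


def visible_gpus():
    """Ask the driver which GPUs it can currently see."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_list(result.stdout)


def missing_count(gpus, expected):
    """How many of the expected GPUs are not visible (never negative)."""
    return max(expected - len(gpus), 0)


def report(expected=2):
    """Print a one-line summary; `expected` is an assumption about the pod size."""
    gpus = visible_gpus()
    print(f"{len(gpus)} of {expected} GPUs visible, {missing_count(gpus, expected)} missing")
    for index, name in gpus:
        print(f"  GPU {index}: {name}")
```

Calling report(expected=2) from a cron job or a health endpoint would flag a GPU that has dropped out before users notice.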
xPaghkman
xPaghkman5mo ago
Yes, here. The first GPU is missing.
[image attachment, no description]
Superintendent
Superintendent5mo ago
so what does lspci | grep VGA spit out?
xPaghkman
xPaghkman5mo ago
Can't run it: command not found. Not sure what to install?
Superintendent
Superintendent5mo ago
What do you mean, lspci isn't there?
xPaghkman
xPaghkman5mo ago
No
Superintendent
Superintendent5mo ago
run lspci without grep
xPaghkman
xPaghkman5mo ago
[image attachment, no description]
Superintendent
Superintendent5mo ago
lspci. You're missing an i. lspci
xPaghkman
xPaghkman5mo ago
[image attachment, no description]
Superintendent
Superintendent5mo ago
whar.
xPaghkman
xPaghkman5mo ago
Isn't this for AMD?
Superintendent
Superintendent5mo ago
What do you mean? You're on NVIDIA GPUs, so yeah, it should work.
xPaghkman
xPaghkman5mo ago
Yeah, that is what I am double-checking: whether this should work with NVIDIA.
Superintendent
Superintendent5mo ago
sudo apt-get update
sudo apt-get install pciutils
Try that. I think it's something to do with pciutils.
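Once pciutils is installed, the lspci check can be scripted. The sketch below is an illustration under assumptions, not part of the original exchange. It matches on the NVIDIA vendor string plus the device class, because data-center NVIDIA cards typically show up as "3D controller" rather than "VGA compatible controller", so grep VGA alone can miss them. Inside a container, the listing may not match what the driver exposes to the pod.

```python
import shutil
import subprocess

# PCI class names under which NVIDIA GPUs usually appear in lspci output.
GPU_CLASSES = ("VGA compatible controller", "3D controller")


def nvidia_pci_devices(lspci_text):
    """Return the lspci lines that describe NVIDIA GPUs.

    Lines for other NVIDIA functions (for example the HDMI audio device)
    and GPUs from other vendors are skipped.
    """
    devices = []
    for line in lspci_text.splitlines():
        if "NVIDIA" in line and any(cls in line for cls in GPU_CLASSES):
            devices.append(line.strip())
    return devices


def run_lspci():
    """Run lspci, with a clearer error than 'command not found'."""
    if shutil.which("lspci") is None:
        raise RuntimeError(
            "lspci not found; on Debian/Ubuntu images it comes from the pciutils package"
        )
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout
```

nvidia_pci_devices(run_lspci()) then gives one entry per GPU that is still present on the PCI bus, which helps tell a driver-level problem from a card that has actually dropped off the bus.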
xPaghkman
xPaghkman5mo ago
[image attachment, no description]
xPaghkman
xPaghkman5mo ago
worked now
Superintendent
Superintendent5mo ago
What does dmesg spit out?
xPaghkman
xPaghkman5mo ago
dmesg: read kernel buffer failed: Operation not permitted
Superintendent
Superintendent5mo ago
sudo !! (Wtf?)
xPaghkman
xPaghkman5mo ago
Can't run sudo on Pods.
Superintendent
Superintendent5mo ago
Oh right, Docker container, uhh.
xPaghkman
xPaghkman5mo ago
yea
Superintendent
Superintendent5mo ago
I have no friggin clue. Can you try to restart it?
xPaghkman
xPaghkman5mo ago
Will try, but I moved all production processes to the running GPU, so if I restart I need to run a bunch of things again. I am just waiting; maybe it comes back.
Superintendent
Superintendent5mo ago
Oof, is the 4090 OK with the load?
xPaghkman
xPaghkman5mo ago
We have an internal queue; I set it to 1 for now. Also, I just realized this is Community Cloud.
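The poster's queue is not shown, so the following is only a hypothetical sketch of what "setting the queue to 1" could look like: an asyncio semaphore that caps how many jobs run at once. The name run_limited and the job-factory interface are assumptions made for illustration.

```python
import asyncio


async def run_limited(job_factories, limit):
    """Run the given jobs with at most `limit` of them in flight at once.

    `job_factories` is a list of zero-argument callables that each return
    a coroutine. Results come back in the same order as the input.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        # Wait for a free slot before starting the job.
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in job_factories))
```

With limit=1, jobs run strictly one after another on the surviving GPU; raising the limit again after the second GPU returns restores the original throughput.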
Superintendent
Superintendent5mo ago
yes
xPaghkman
xPaghkman5mo ago
I thought this was on Secure Cloud.
Superintendent
Superintendent5mo ago
I personally haven't had any issues on Community, tbh. Is it SD that you're running?
xPaghkman
xPaghkman5mo ago
A bunch of things, including SD, LLMs, and more models.
Superintendent
Superintendent5mo ago
Oh nice, you can cram all that into 24 GB of VRAM? Or really 48.
xPaghkman
xPaghkman5mo ago
Yes, it is really enough to handle, e.g., 100 active users at a time.
Superintendent
Superintendent5mo ago
did the restart fix anything?
xPaghkman
xPaghkman5mo ago
I am working on it. Before that, I need to set up a few scripts to get production running after restarts.
Superintendent
Superintendent5mo ago
understood
xPaghkman
xPaghkman5mo ago
Also, so weird: RunPod keeps charging for 2 GPUs even though one is not even running.
Superintendent
Superintendent5mo ago
yes
xPaghkman
xPaghkman5mo ago
@Superintendent So weird, the GPU came back up while I was preparing things.