I think this specific GPU has an issue. A simple generation took 200 seconds, where a healthy GPU usually takes about 20 seconds.
I verified this by creating a new pod on a different GPU; the same generation took only 20 seconds.
All GPUs used are RTX 6000 Ada. I am collecting experiment results, so I need to keep all variables identical.
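For reference, this is roughly the kind of timing check I used (a minimal sketch; the model name and prompts here are placeholders, not my actual experiment setup):

```python
# Minimal sketch of the generation timing check; "gpt2" is a placeholder
# model, not the one from my actual experiments.
import time

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2", device=0)

# Warm-up run so CUDA init / weight loading doesn't pollute the measurement.
pipe("warm-up prompt", max_new_tokens=32)

torch.cuda.synchronize()
start = time.perf_counter()
pipe("The quick brown fox", max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"Generation took {elapsed:.1f} s on {torch.cuda.get_device_name(0)}")
```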
Additional confirmation (still kicking myself for wasting so much time debugging):
1. Reinstalled the Python environment and reran the experiment: still 200 seconds.
2. Reinstalled the Python environment and removed the Hugging Face cache, then reran the experiment: still 200 seconds.
3. Terminated the pod and immediately created a new one (RTX 6000 Ada is pretty low on stock, so I get the same GPU back; I am sure because the GPU becomes available the moment I terminate and becomes unavailable again as soon as I start the new pod; see the UUID check sketched after this list). Reran the experiment: still 200 seconds.
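To double-check that the recreated pod really landed on the same physical card, something like this works (assuming nvidia-smi is available inside the pod; the UUID is tied to the physical GPU, so it should match across pod recreations):

```python
# Query the physical GPU's name, UUID, and serial via nvidia-smi; if the UUID
# is the same before and after recreating the pod, it's the same card.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,uuid,serial", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())
```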
What I am worried about is:
1. Some experiments might have been running on this broken GPU. I noticed that the gradients suddenly become NaN and then the run just hangs. However, I don't know which experiments are affected. Pretty much a full day's worth of work has to be shelved (worse, I don't keep track of them, only their number; my best clue about each experiment is the date on its log file, thankfully provided by Python's logging package).
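To avoid silently losing runs like this again, I'm thinking of adding a guard along these lines so training crashes loudly on the first NaN instead of hanging (a sketch; `model`, `loss`, and `step` stand in for whatever the actual training loop uses):

```python
# Sketch of a NaN guard for the training loop: fail fast with a clear log
# message instead of hanging when gradients blow up on a bad GPU.
import logging

import torch

logger = logging.getLogger(__name__)

def check_finite(model, loss, step):
    """Raise immediately if the loss or any gradient is NaN/Inf."""
    if not torch.isfinite(loss):
        logger.error("Non-finite loss %s at step %d", loss.item(), step)
        raise RuntimeError(f"non-finite loss at step {step}")
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            logger.error("Non-finite gradient in %s at step %d", name, step)
            raise RuntimeError(f"non-finite gradient in {name} at step {step}")

# Inside the training loop, after loss.backward() and before optimizer.step():
#     check_finite(model, loss, step)
```

It probably also makes sense to log `torch.cuda.get_device_name(0)` (and the GPU UUID) at the start of each run, so the log file itself records which card an experiment ran on.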