Inconsistent Pod behavior
I'm running the exact same custom Docker image on the exact same instance type in two pods. One pod always fails with a CUDA memory error, and the other never does. Both pods use the same setup and the same inputs; as far as I can tell, everything is identical.
I have tried pods in several different regions, different volumes, etc. The behavior is still the same.
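To rule out hidden differences between the two pods, something like the following could be run on each one and the outputs compared. This is only a sketch: `gpu_snapshot` and `diff_snapshots` are hypothetical helper names, and it assumes `nvidia-smi` is available inside the image.

```python
import shutil
import subprocess

# Fields passed to `nvidia-smi --query-gpu`; chosen for illustration.
QUERY_FIELDS = ["name", "memory.total", "memory.used", "driver_version"]


def gpu_snapshot():
    """Return facts about the first GPU from nvidia-smi, or {} if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={','.join(QUERY_FIELDS)}",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    out = result.stdout.strip()
    if not out:
        return {}
    # Only the first GPU is inspected in this sketch.
    values = [v.strip() for v in out.splitlines()[0].split(",")]
    return dict(zip(QUERY_FIELDS, values))


def diff_snapshots(a, b):
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Run this on each pod, then compare the printed dicts.
    print(gpu_snapshot())
```

If the two snapshots differ (for example in `memory.used` before the workload starts, or in `driver_version`), that would suggest the pods are not as identical as they appear.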