Inconsistent Pod behavior
I'm running the exact same custom Docker image on the exact same instance types. One pod always fails with a CUDA memory error, and the other pod doesn't. Both pods use the exact same setup and the exact same inputs.
I have tried pods in multiple different regions, different volumes, etc. The behavior is still the same.
2 Replies
My own custom docker image app
Same exact code, same everything. The code has been running for weeks with no issues and has never crashed, so it's not the code. The second I tried to run the same image across multiple pods, one of the pods always crashes with the CUDA memory error.
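One way to check whether the two pods really look identical at the GPU level might be to log what `nvidia-smi` reports on each pod right at startup and compare the results. This is only a minimal sketch: it assumes `nvidia-smi` is available inside the image, and the names `QUERY_FIELDS`, `parse_gpu_report`, and `query_gpus` are hypothetical helpers, not part of any existing tool.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["name", "uuid", "driver_version", "memory.total", "memory.used"]


def parse_gpu_report(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that don't match the expected layout
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi on this pod and return one dict per visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_report(result.stdout)
```

Calling `print(query_gpus())` on each pod before the app starts would show the GPU model, driver version, and memory figures (in MiB) side by side. If the failing pod reported a different total or a noticeably higher `memory.used` before anything is loaded, that could point at the pod rather than the code, though that is a guess and not something confirmed here.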