Containers silently charge users while stuck in infinite CUDA compatibility failure loop
When containers fail to initialize because of CUDA version compatibility issues, they get stuck in an infinite loop of failed attempts, and these failures are charged to the user's account without proper visibility or error reporting in the UI. Additionally, requests queued for the failing containers remain in the queue indefinitely instead of being failed immediately.
Expected Behavior:
- Container initialization failures should not incur charges
- CUDA version compatibility errors should be clearly displayed in the UI, or at least in the logs (of the 5 workers in this state, I could see logs for only 1)
- Failed initialization attempts should be visible in spending/usage reports
- Requests queued for containers that fail to initialize should be immediately failed/rejected
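To make the expected behavior concrete, here is a minimal sketch of the fail-fast logic I have in mind. Everything in it is hypothetical: `start_with_cap`, `MAX_INIT_ATTEMPTS`, and the returned fields are illustrative names only, and I do not know how the platform's scheduler or billing actually works.

```python
from collections import deque

MAX_INIT_ATTEMPTS = 3  # hypothetical cap; the platform's real retry policy is unknown


def start_with_cap(init_container, queued_requests, max_attempts=MAX_INIT_ATTEMPTS):
    """Try to initialize a container a bounded number of times.

    If every attempt fails, drain the queue and fail each request with the
    last initialization error instead of leaving it queued indefinitely.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            init_container()
            return {"status": "running", "billable": True, "failed_requests": []}
        except RuntimeError as exc:
            last_error = str(exc)

    failed = []
    while queued_requests:
        failed.append({"request": queued_requests.popleft(), "error": last_error})
    # Assumption: initialization that never succeeded is not billed.
    return {"status": "init_failed", "billable": False, "failed_requests": failed}
```

With a cap like this, the loop ends after a fixed number of attempts, the queued requests get an error the user can see, and nothing is billed for a container that never started.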
Actual Behavior:
- Containers repeatedly fail to start with CUDA version requirement errors
- Account balance decreases while the containers keep failing to initialize
- The global spend rate in the UI shows $0 while money is actually being deducted (the per-worker spend rate shows the correct amount)
- Error messages are visible only in worker logs, and only for some workers, not in the main interface
- Requests for failing containers remain queued, showing huge delay times but no execution time and no errors
Steps to Reproduce:
1. Deploy a container with a CUDA requirement higher than what is available on the workers
2. Container fails to initialize with "nvidia-container-cli: requirement error: unsatisfied condition: cuda>=X.X"
3. Submit requests - they remain in queue indefinitely
4. Check account balance - money has been deducted for failed initialization attempts
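For anyone triaging the same problem, the error in step 2 can be picked out of worker log text with a small script. This is only a sketch: the function names are my own, and it assumes the log contains the message exactly as quoted above with a numeric `major.minor` version in place of `X.X`.

```python
import re

# Matches the error quoted in step 2, e.g. "... unsatisfied condition: cuda>=12.4".
CUDA_REQ_ERROR = re.compile(
    r"nvidia-container-cli: requirement error: "
    r"unsatisfied condition: cuda>=(\d+)\.(\d+)"
)


def find_cuda_requirement(log_text):
    """Return the required CUDA version as (major, minor), or None if the error is absent."""
    match = CUDA_REQ_ERROR.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_satisfied(required, available):
    """Compare (major, minor) tuples; the worker must offer at least the required version."""
    return available >= required
```

Running `find_cuda_requirement` over each worker's log would show which workers hit the CUDA requirement error, at least for the workers whose logs are visible.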
I was affected by this issue and was charged approximately 15x overhead due to repeated failed container initialization attempts while my requests sat in queue indefinitely. I request a refund.
@Augenbrauensenker
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #20174