Hi all! I've built a FLUX inference container on Runpod serverless. It works some of the time, but I get a lot of random failures and Runpod does not return any error logs.
For example, this is the response:
```
{
  "delayTime": 151019,
  "error": "job timed out after 1 retries",
  "executionTime": 102002,
  "id": "64de56ee-4af2-4c64-ab84-02d4a7e81593-u1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "1qjtmj861f1278"
}
```
But no error log is reported, either in the console or in the response, explaining what caused the job to be retried in the first place.
Also, the timeout should be one hour, but I get this message after only a few minutes. I have also added a Telegram bot for logging, but no exception shows up there either. Did the machine just crash?
Has anyone else experienced this?
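One way to rule out an unhandled Python exception is to wrap the handler so that any exception is both logged and returned in the job output. The following is only a minimal sketch, assuming a handler that takes a job dict: `with_error_reporting`, `notify`, and `flaky_handler` are hypothetical names for illustration (not part of the Runpod SDK), and `notify` stands in for whatever logging call is used, such as the Telegram bot.

```python
import traceback
from typing import Any, Callable, Dict

Job = Dict[str, Any]


def with_error_reporting(
    handler: Callable[[Job], Any],
    notify: Callable[[str], Any] = print,
) -> Callable[[Job], Any]:
    """Wrap a handler so exceptions are logged and returned instead of lost."""

    def wrapped(job: Job) -> Any:
        try:
            return handler(job)
        except Exception:
            # Capture the full traceback as text so it survives in the output.
            details = traceback.format_exc()
            notify(f"job {job.get('id')} failed:\n{details}")
            return {"error": details}

    return wrapped


def flaky_handler(job: Job) -> Any:
    # Simulated failure, standing in for the real inference code.
    raise RuntimeError("simulated inference failure")


# Collect notifications in a list here; a real setup would send them elsewhere.
messages = []
safe_handler = with_error_reporting(flaky_handler, notify=messages.append)
result = safe_handler({"id": "test-job", "input": {}})
```

With the Runpod Python SDK, the wrapped function would be the one passed in, e.g. `runpod.serverless.start({"handler": safe_handler})`. If nothing is ever caught by a wrapper like this, that would suggest the failure happens outside the Python handler (for example the worker process being killed), though that is a guess rather than something the response confirms.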