Job suddenly restarts and fails after one retry.

I am trying desperately to get our custom LoRA training using kohya_ss running on your serverless workers. After training for a few epochs, it suddenly stops and restarts. I already tried adjusting the timeout value, both via the UI and in the request. Here is some basic info about the request and response. I can provide further details and logs via DM if you need more insight.

Request:

{
  "input": {
    "task": "train_lora",
    "job_id": "dev-test-12",
    "animal_type": "dog"
  },
  "policy": {
    "executionTimeout": 3600000,
    "ttl": 86400000
  },
  "webhook": "https://webhook.site/xxx"
}

Responses:

{
  "delayTime": 4179,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "IN_PROGRESS",
  "workerId": "s8gg7p09azjtqr"
}

{
  "delayTime": 394386,
  "error": "job timed out after 1 retries",
  "executionTime": 61170,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "s8gg7p09azjtqr"
}
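For anyone comparing their own numbers: here is a rough Python sketch that puts the request's timeout next to the timings in the final response. It assumes `executionTimeout`, `executionTime`, and `delayTime` are all in milliseconds; the `summarize` helper is made up for illustration and is not part of any RunPod SDK.

```python
# Values copied from the request and the final (FAILED) response above.
request = {
    "input": {"task": "train_lora", "job_id": "dev-test-12", "animal_type": "dog"},
    "policy": {"executionTimeout": 3600000, "ttl": 86400000},
    "webhook": "https://webhook.site/xxx",
}

final_status = {
    "delayTime": 394386,
    "error": "job timed out after 1 retries",
    "executionTime": 61170,
    "retries": 1,
    "status": "FAILED",
}


def summarize(request, status):
    """Hypothetical helper: compare reported timings with the requested limit.

    Assumption: every timing field is expressed in milliseconds.
    """
    limit_ms = request["policy"]["executionTimeout"]
    ran_ms = status.get("executionTime", 0)
    return {
        "limit_s": limit_ms / 1000,
        "ran_s": ran_ms / 1000,
        "queued_s": status.get("delayTime", 0) / 1000,
        "hit_execution_timeout": ran_ms >= limit_ms,
    }


summary = summarize(request, final_status)
print(summary)
```

Under that assumption, the reported execution time is about 61 seconds against a one-hour limit, so the "timed out" error does not look like the configured `executionTimeout` expiring on the reported run. One possible reading is that `executionTime` only covers the last attempt, but the thread does not confirm this.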
9 Replies
Unknown User, 14mo ago
[Message not public]
landingpagelover24 (OP), 14mo ago
Hey @Jason! I cannot spot any errors in the attached logs. You can see how it suddenly restarts at epoch 4. The same request with fewer training images (for testing) runs fine. Testing it locally with curl (via /runsync, because /run doesn't work due to a known bug) also works fine.
Unknown User, 14mo ago
[Message not public]
landingpagelover24 (OP), 14mo ago
I think I can rule out OOM because it runs on my local machine with 12GB of VRAM (with exactly the same settings) while the worker GPU has 24GB. Also, RunPod UI shows that only a fraction of worker resources are being used (this is something I should definitely optimize after fixing the current issue lol).
Unknown User, 14mo ago
[Message not public]
taoufiqlotfi, 3mo ago
I am facing the same issue, and I think it's related to a timeout, because if I make the worker active the issue does not happen.
OVYRLORD, 3mo ago
I'm having a similar issue, but mine doesn't stop after one retry. Mine kills itself before the workflow gets injected, then loops infinitely until I actually terminate the worker. Cancelling the job does nothing. I found out the hard way when it ate all of my budget while unattended.
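To avoid leaving a looping job unattended, something like the watchdog below might help. It is only a sketch: `watch_job` and its parameters are hypothetical, `get_status` stands in for whatever call fetches the job's status JSON, and the `retries` field is assumed to look like the one in the responses earlier in the thread. It only decides when to stop; what to do next (for example terminating the worker, since cancelling reportedly did nothing here) is left to the caller.

```python
import time

# Job states assumed to be terminal for this sketch.
TERMINAL_STATES = ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT")


def watch_job(get_status, max_retries=1, max_seconds=3600,
              poll_seconds=10, sleep=time.sleep, clock=time.monotonic):
    """Poll a job and return a decision string instead of waiting forever.

    get_status: hypothetical callable returning the job's status dict.
    max_retries / max_seconds: illustrative limits, tune for your workload.
    """
    start = clock()
    while True:
        status = get_status()
        state = status.get("status")
        if state in TERMINAL_STATES:
            return "finished:" + state
        if status.get("retries", 0) > max_retries:
            return "stop:too_many_retries"
        if clock() - start > max_seconds:
            return "stop:over_budget"
        sleep(poll_seconds)


# Demo with canned statuses instead of real API calls.
fake_statuses = iter([
    {"status": "IN_PROGRESS", "retries": 0},
    {"status": "IN_PROGRESS", "retries": 1},
    {"status": "IN_PROGRESS", "retries": 2},
])
decision = watch_job(lambda: next(fake_statuses), sleep=lambda s: None)
print(decision)
```

Passing `sleep` and `clock` in as parameters keeps the loop easy to test without real waiting or network access.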
Unknown User, 3mo ago
[Message not public]