Job suddenly restarts and fails after one retry.
I am trying desperately to get our custom LoRA training using kohya_ss running on your serverless workers. After training for a few epochs it suddenly stops/restarts.
I already tried to adjust the timeout value via the UI and in the request. Here is some basic info about the request and response. I can provide further details and logs via DM if you need more insight.
Request:
{
  "input": {
    "task": "train_lora",
    "job_id": "dev-test-12",
    "animal_type": "dog"
  },
  "policy": {
    "executionTimeout": 3600000,
    "ttl": 86400000
  },
  "webhook": "https://webhook.site/xxx"
}
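For anyone reproducing this, here is a minimal sketch of how a request like the one above could be assembled in Python with only the standard library. The base URL, the `ENDPOINT_ID` and `API_KEY` placeholders, and the helper name `build_run_request` are assumptions for illustration, not values from this thread. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumed base URL for serverless endpoints; adjust if yours differs.
API_BASE = "https://api.runpod.ai/v2"


def build_run_request(endpoint_id, api_key, job_input, timeout_ms, ttl_ms, webhook=None):
    """Build (but do not send) a POST to /run carrying an execution policy.

    endpoint_id and api_key are placeholders supplied by the caller.
    """
    body = {
        "input": job_input,
        "policy": {"executionTimeout": timeout_ms, "ttl": ttl_ms},
    }
    if webhook:
        body["webhook"] = webhook
    return urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/run",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_run_request(
    "ENDPOINT_ID",  # hypothetical placeholder
    "API_KEY",      # hypothetical placeholder
    {"task": "train_lora", "job_id": "dev-test-12", "animal_type": "dog"},
    timeout_ms=3600000,
    ttl_ms=86400000,
    webhook="https://webhook.site/xxx",
)
# Sending would be: urllib.request.urlopen(req)
```

Building the body in one place makes it easy to print `req.data` and confirm that the policy values actually leave the client as intended.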
Responses:
{
  "delayTime": 4179,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "IN_PROGRESS",
  "workerId": "s8gg7p09azjtqr"
}
{
  "delayTime": 394386,
  "error": "job timed out after 1 retries",
  "executionTime": 61170,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "s8gg7p09azjtqr"
}
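A quick sanity check on the numbers in the failed status, as a rough sketch: it assumes that `executionTime` and `executionTimeout` are both in milliseconds and that `executionTime` covers only the final attempt. The helper name `summarize_timeout` is made up for illustration.

```python
def summarize_timeout(final_status, execution_timeout_ms):
    """Compare the reported execution time with the configured timeout.

    Assumes both values are in milliseconds.
    """
    used_ms = final_status.get("executionTime", 0)
    return {
        "execution_s": used_ms / 1000,
        "timeout_s": execution_timeout_ms / 1000,
        "fraction_of_timeout": used_ms / execution_timeout_ms,
        "hit_configured_timeout": used_ms >= execution_timeout_ms,
    }


failed = {
    "delayTime": 394386,
    "error": "job timed out after 1 retries",
    "executionTime": 61170,
    "retries": 1,
    "status": "FAILED",
}
summary = summarize_timeout(failed, execution_timeout_ms=3600000)
print(summary)
```

Under those assumptions the reported execution time is about 61 seconds against a configured limit of 3600 seconds, so the "timed out" error does not look like the job reaching the timeout set in `policy`.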
9 Replies
Hey @Jason!
I cannot spot any errors in the logs attached – you can see how it suddenly restarts at epoch 4.
The same request with fewer training images provided for testing runs just fine.
Testing it locally with curl (via /runsync – because /run doesn't work due to a known bug) works fine.
I think I can rule out OOM because it runs on my local machine with 12GB of VRAM (with exactly the same settings) while the worker GPU has 24GB. Also, RunPod UI shows that only a fraction of worker resources are being used (this is something I should definitely optimize after fixing the current issue lol).

I am facing the same issue, and I think it's related to the timeout, because if I make the worker active the issue does not happen.
I'm having a similar issue, but mine doesn't stop after one retry. Mine kills itself before the workflow gets injected, then loops infinitely until I actually terminate the worker. Cancelling the job does nothing. I found out the hard way when it ate all of my budget unattended.