Hey @avenceslau We have been encountering Workflow instances that occasionally terminate on their own while running, without throwing an error as expected. It seems to happen about once a day now, and it requires manual intervention for us to kick the workflow off again.
Essentially these Workflow instances appear to be receiving a "terminate" signal but we are not the ones initiating them. Any idea on what might be up?
Some examples:
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/77ede742-c001-4d0c-9813-7b6efabc1acf
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/e4f91796-31f8-42db-a693-4f6da6b51c46
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/f4bf7053-3cbc-42a1-821d-8ff2c0d33c81
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/ea4014ca-fa34-4531-b161-2739abad5ebc
Going to investigate, can you give me the time when it happens? (Ideally in UTC)
I think the best timestamp we have is the last end date in each of the workflows:
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/fe7fbb5e-5807-47e7-8db5-6def93dd70d6 2025-10-30T01:18:35.701Z
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/ea4014ca-fa34-4531-b161-2739abad5ebc 2025-10-29T15:02:39.630Z
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/f4bf7053-3cbc-42a1-821d-8ff2c0d33c81 2025-10-28T19:03:01.096Z
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/e4f91796-31f8-42db-a693-4f6da6b51c46 2025-10-24T11:40:31.139Z
https://dash.cloudflare.com/470d4729e23e8936fd2a8f6569770873/workers/workflows/poll-database-workflow/instance/77ede742-c001-4d0c-9813-7b6efabc1acf 2025-10-21T07:57:01.801Z
but it's hard to tell if that end time corresponds exactly with when it self-terminates
Yup thanks that helps.
Thanks for investigating!
What do you mean by "aren't throwing an error as expected"?
Feel free to share with me the workflow via DM
@ajay1495 @andrew | dreamlit.ai sorry for the ping, want to get this fixed for you guys asap if it's a bug
Thanks for the follow up, sharing shortly
Here's our Workflow wrapper. Essentially, it wraps all the code in a try/catch so we can gracefully handle any error that's thrown.
Meaning, if an error is thrown, it gets gracefully handled (we log the error and report it to Sentry) before terminating.
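For context, a minimal sketch of the kind of wrapper being described; the class and method names here are assumptions, not the actual code:

```ts
import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from "cloudflare:workers";

// Hypothetical base class: subclasses implement execute(), and run() wraps it
// in a try/catch so errors are logged and reported before the instance ends.
export abstract class BaseWorkflow<Env, Params> extends WorkflowEntrypoint<Env, Params> {
  abstract execute(event: WorkflowEvent<Params>, step: WorkflowStep): Promise<void>;

  async run(event: WorkflowEvent<Params>, step: WorkflowStep): Promise<void> {
    try {
      await this.execute(event, step);
    } catch (error) {
      // Graceful handling: log the error (and report it to Sentry here) ...
      console.error("Workflow failed", error);
      // ... then rethrow so the instance ends up errored rather than complete.
      throw error;
    }
  }
}
```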
What we noticed is that these ongoing polling jobs (which extend that base class) will just spontaneously terminate.
And we can't figure out why; they're not going through the exception handling there, so it can't be an error being thrown.
You call terminate on the workflow?
We only call terminate explicitly when we want to pause polling.
In those handful of instances above, there was no pausing polling initiated
That would mean we'd pause service for our customer
Also, when we call .terminate() on our end, we log that in our db.
So I checked all these instances and I see binding calls to terminate matching those timestamps.
Could this be a bug on your end?
Is this logging done inside of the workflow that you are terminating?
Terminating a workflow from within itself might lead to undefined behavior
"Meaning, if an error is thrown, it gets gracefully handled (we log the error and report it to Sentry) before terminating" is what I interpreted from here.
So we restarted these polling jobs; the restart calls .terminate() before starting again (because our system thinks the worker is still running, since it never went through the error fallback logic).
So maybe that's what you're seeing on your end?
Timeline is basically:
- PollWorkflow runs, we log the workerId
- spontaneously terminates (which doesn't clear out the workerId in our db since it doesn't go through the error fallback logic)
- we notice, and we restart polling
- restarting polling calls .terminate() because it thinks the workerId is still running based on the db, then kicks off a new PollWorkflow instance.
We don't terminate a workflow within itself.
Here's the worker code, and the only place we call .terminate() on this workflow.
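Roughly something along these lines (a simplified sketch, not the actual code; the POLL_WORKFLOW binding name and function names are placeholders):

```ts
interface Env {
  // Workflows binding from wrangler config; the name is a placeholder.
  POLL_WORKFLOW: Workflow;
}

// Simplified restart-polling path: the only place .terminate() gets called.
async function restartPolling(env: Env, existingInstanceId: string | null) {
  if (existingInstanceId) {
    const instance = await env.POLL_WORKFLOW.get(existingInstanceId);
    // Only terminate if the previous instance still looks like it's running.
    if ((await instance.status()).status === "running") {
      await instance.terminate();
    }
  }
  // Kick off a fresh PollWorkflow instance.
  return env.POLL_WORKFLOW.create();
}
```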
Hang on... we are seeing CANCEL logs for these. Maybe this is on our end.
I suspect this is the culprit
(await instance.status()).status === "running"
I don't know if your workflow sleeps or uses waitForEvent, but if it does, the state might change to "waiting"; you would still call terminate then, right? (Same for any other state.)
Actually, that would mean we would not call .terminate() based on the logic there if the workflow is sleeping (which is certainly a bug, but not the culprit here).
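For what it's worth, one way to also cover the sleeping case would be to treat any non-final status as terminable. A rough sketch (the status strings come from the Workflows InstanceStatus docs; the surrounding code is an assumption):

```ts
// Treat any non-final status as "still active" so a sleeping ("waiting")
// instance also gets terminated before polling restarts.
const ACTIVE_STATES = new Set(["queued", "running", "paused", "waiting"]);

async function terminateIfActive(instance: WorkflowInstance): Promise<boolean> {
  const { status } = await instance.status();
  if (ACTIVE_STATES.has(status)) {
    await instance.terminate();
    return true;
  }
  return false;
}
```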
Bad reading on my side. Please investigate a bit further; if you find out that this is on us, let me know and I will be happy to help.
Appreciate it. Checked our session replay for the user; it appears they actually initiated this flow based on the path they took.
So this is completely expected, no issue from CF Workflows side.
Many thanks for investigating regardless.
You are welcome 🫡