Hey @avenceslau | Workflows We have

Hey @avenceslau, we have been encountering Workflow instances that occasionally terminate on their own while running and don't throw an error as expected. It seems to occur about once a day now, and it requires manual intervention for us to kick the workflow off again. Essentially these Workflow instances appear to be receiving a "terminate" signal, but we are not the ones initiating it. Any idea what might be up?
20 Replies
avenceslau•4w ago
Going to investigate, can you give me the time when it happens? (Ideally in UTC)
avenceslau•4w ago
Yup thanks that helps.
ajay | dreamlit.aiOP•4w ago
Thanks for investigating!
avenceslau•4w ago
What do you mean by "aren't throwing an error as expected"? Feel free to share the workflow with me via DM. @ajay1495 @andrew | dreamlit.ai sorry for the ping, I want to get this fixed for you guys ASAP if it's a bug.
ajay | dreamlit.aiOP•4w ago
Thanks for the follow-up, sharing shortly.
ajay | dreamlit.aiOP•4w ago
Here's our Workflow wrapper. Essentially, it wraps all the code in a try/catch that lets us gracefully handle any error that is thrown.
ajay | dreamlit.aiOP•4w ago
Meaning, if an error is thrown, it gets gracefully handled (we log the error and report it to Sentry) before terminating. What we noticed is that these ongoing polling jobs (which extend that base class) will just spontaneously terminate, and we can't figure out why. They're not going through the exception handling there, so it must not be happening through an error being thrown.
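The wrapper itself is not reproduced in the thread, so the following is only a minimal sketch of the pattern described: a base class whose `run` wraps the subclass's work in a try/catch and reports before rethrowing. All names (`GuardedWorkflow`, `execute`, `reportError`, `cleanup`, `FailingPoll`) are hypothetical, and the reporter stands in for whatever logging or Sentry call the real wrapper makes.

```typescript
// Hypothetical sketch; none of these names come from the thread's actual wrapper.
type ErrorReporter = (error: unknown) => void;

abstract class GuardedWorkflow<Params, Result> {
  private readonly reportError: ErrorReporter;
  private readonly cleanup: () => Promise<void>;

  constructor(reportError: ErrorReporter, cleanup: () => Promise<void>) {
    this.reportError = reportError;
    this.cleanup = cleanup;
  }

  // Subclasses (for example a polling job) put their real work here.
  protected abstract execute(params: Params): Promise<Result>;

  async run(params: Params): Promise<Result> {
    try {
      return await this.execute(params);
    } catch (error) {
      // Reached only when execute() throws or rejects.
      this.reportError(error);
      await this.cleanup();
      throw error;
    }
  }
}

// Example subclass that always fails, to exercise the catch branch.
class FailingPoll extends GuardedWorkflow<void, void> {
  protected async execute(): Promise<void> {
    throw new Error("poll failed");
  }
}
```

In this shape the catch branch only runs when the body itself throws. A stop requested from outside the instance would presumably not raise inside `execute`, which would be consistent with the terminations described here never reaching the error handling.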
avenceslau•4w ago
Do you call terminate on the workflow?
ajay | dreamlit.aiOP•4w ago
We only call terminate explicitly when we want to pause polling. In the handful of instances above, no pause of polling was initiated. That would mean pausing service for our customer. Also, when we call .terminate() on our end, we log that in our db.
avenceslau•4w ago
So I checked all these instances and I see binding calls to terminate matching those timestamps. Could this be a bug on your end? Is this logging done inside the workflow that you are terminating? Terminating a workflow from within itself might lead to undefined behavior. That's what I interpreted from this: "Meaning, if an error is thrown, it gets gracefully handled (we log the error and throw in Sentry) before terminating".
ajay | dreamlit.aiOP•4w ago
So we restarted these polling jobs, which calls .terminate() before starting again (because our system thinks the worker is still running, since it never went through the error fallback logic). So maybe that's what you're seeing on your end? Timeline is basically:
- PollWorkflow runs, we log the workerId
- it spontaneously terminates (which doesn't clear out the workerId in our db, since it doesn't go through the error fallback logic)
- we notice, and we restart polling
- restarting polling calls .terminate() because it thinks the workerId is still running based on the db, then kicks off a new PollWorkflow instance
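The timeline above can be sketched as a small planning function. This is an illustration under assumptions, not the actual restart code: `ConnectionRecord`, `RestartAction`, and `planRestart` are hypothetical names, only `processingWorkerId` is borrowed from the log line quoted later in the thread, and the sketch deliberately ignores any status check done before terminating.

```typescript
// Hypothetical record shape; only processingWorkerId appears in the thread.
interface ConnectionRecord {
  id: string;
  processingWorkerId: string | null;
}

type RestartAction =
  | { kind: "terminate"; workerId: string }
  | { kind: "start" };

// Decide what a restart would do based purely on what the database says.
function planRestart(record: ConnectionRecord): RestartAction[] {
  const actions: RestartAction[] = [];
  if (record.processingWorkerId !== null) {
    // The db still lists a worker, so a terminate is sent first,
    // even if that instance already stopped without clearing its id.
    actions.push({ kind: "terminate", workerId: record.processingWorkerId });
  }
  actions.push({ kind: "start" });
  return actions;
}

// A stale workerId left behind by an instance that stopped on its own.
const stale: ConnectionRecord = { id: "conn-1", processingWorkerId: "worker-a" };
console.log(planRestart(stale).map((action) => action.kind));
```

Under this model, every manual restart of a job with a stale workerId produces a terminate call, which would show up on the platform side as a terminate issued shortly after the instance had already stopped.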
ajay | dreamlit.aiOP•4w ago
We don't terminate a workflow from within itself. Here's the worker code, and the only place we call .terminate() on this workflow.
ajay | dreamlit.aiOP•4w ago
Hang on... we are seeing CANCEL logs for these. Maybe this is on our end.
avenceslau•4w ago
I suspect this is the culprit: (await instance.status()).status === "running". I don't know if your workflow sleeps or waits for events, but if it does, its state might change to waiting, and you would call terminate, right? (The same is true for any other state.)
ajay | dreamlit.aiOP•4w ago
Actually, based on the logic there, that would mean we would not call .terminate() if the workflow is sleeping (which is certainly a bug, but not the culprit here):
if ((await instance.status()).status === "running") {
  console.log(
    `Client database connection ${clientDatabaseConnection.id} is being polled by worker ${clientDatabaseConnection.processingWorkerId}. Sending terminate...`,
  );

  try {
    await instance.terminate();
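One way to close the gap acknowledged above (a sleeping instance is skipped because its status is not "running") might be to terminate anything that has not already finished. This is a sketch under assumptions: the status names are the ones I understand the Cloudflare Workflows documentation to list and should be checked against the current reference, and `isStillActive`, `InstanceLike`, and `terminateIfActive` are hypothetical helpers. Only `status()` and `terminate()` come from the snippet in the thread.

```typescript
// Assumed status names; verify against the current Workflows reference.
type InstanceStatusName =
  | "queued" | "running" | "paused" | "waiting" | "waitingForPause"
  | "errored" | "terminated" | "complete" | "unknown";

// States in which the instance has already stopped for good.
const FINISHED = new Set<InstanceStatusName>(["errored", "terminated", "complete"]);

// Hypothetical helper: anything not finished counts as active.
// Treating "unknown" as active is a design choice, not a documented rule.
function isStillActive(status: InstanceStatusName): boolean {
  return !FINISHED.has(status);
}

// Minimal structural type covering only the two calls used in the thread.
interface InstanceLike {
  status(): Promise<{ status: InstanceStatusName }>;
  terminate(): Promise<void>;
}

// Returns true when a terminate was actually sent.
async function terminateIfActive(instance: InstanceLike): Promise<boolean> {
  const { status } = await instance.status();
  if (!isStillActive(status)) {
    return false;
  }
  await instance.terminate();
  return true;
}
```

Checking for finished states instead of a single active state means a newly added intermediate status would default to being terminated. Whether that default is right depends on how the polling job should behave while paused or queued.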
avenceslau•4w ago
Bad reading on my side. Please investigate a bit further, and if you find out that this is on us, let me know. I will be happy to help.
ajay | dreamlit.aiOP•4w ago
Appreciate it. We checked our session replay for the user, and it appears they actually initiated this flow, based on the path they took. So this is completely expected, no issue from the CF Workflows side. Many thanks for investigating regardless.
avenceslau•4w ago
You are welcome 🫡
