There was one case of a single DO getting properly stuck, persistently throwing internal errors with no general disturbance ongoing; I was in contact with support as well. It started out of the blue with no changes done by us, but eventually got fixed after I did a new deployment. We never received any official explanation.
That's the thing, there was nothing special about the case. We had average traffic, which distributes evenly across several chat-room-related DOs, and all the other instances were fine. My deployment had no migrations or anything that should have impacted the data the DO had already persisted, but somehow it magically still got fixed.
No, nothing in a loop; it's just that opening WebSocket connections from a Worker to multiple DOs (one per room) distributes fairly evenly across rooms, usually with just tens of connections per DO.
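Roughly the shape of it, just for context; the ROOMS binding and the /rooms/:roomId/ws route here are placeholders, not our actual names:

```ts
// Minimal sketch of routing a WebSocket upgrade from a Worker to a per-room DO.
// Binding and route names are illustrative.
export interface Env {
  ROOMS: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const match = url.pathname.match(/^\/rooms\/([^/]+)\/ws$/);
    if (!match) return new Response("Not found", { status: 404 });

    // One DO per room: the same room name always maps to the same instance,
    // so connections spread across instances roughly as rooms do.
    const id = env.ROOMS.idFromName(match[1]);
    const stub = env.ROOMS.get(id);

    // Forward the WebSocket upgrade to the room's Durable Object.
    return stub.fetch(request);
  },
};
```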
I log the duration of opening the WebSocket from the Worker's perspective, and there we also see a lot of variance. Usually it's nice and low. We do execute some logic in the DO when opening the socket, but so far I haven't seen any hiccups there, just that the duration spikes occasionally, sometimes in all rooms, sometimes only a few. But I have a strong gut feeling these are related to issues on Cloudflare's side.
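The measurement itself is nothing fancy, basically timing around the stub.fetch call; the log shape below is illustrative, not our exact logging setup:

```ts
// Sketch of timing the WebSocket open from the Worker side.
async function connectToRoom(
  request: Request,
  rooms: DurableObjectNamespace,
  roomId: string
): Promise<Response> {
  const stub = rooms.get(rooms.idFromName(roomId));

  const start = Date.now();
  // Resolves once the DO has accepted the upgrade (101 Switching Protocols).
  const response = await stub.fetch(request);
  // Date.now() advances across awaited I/O in Workers, so this captures real elapsed time.
  const durationMs = Date.now() - start;

  console.log(JSON.stringify({ event: "ws_open", roomId, durationMs, status: response.status }));
  return response;
}
```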
The KV-based chat history may be an issue. We migrated away from KV for chat, since our implementation had to parse mid-sized JSON fragments, which was causing some bottlenecks.
We only keep the last hour of messages in the DO, as that's what we show to users (for whatever internal reasons..); we offload elsewhere for persistent storage. So the amount of data is minimal in our case.
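For reference, the retention side is roughly this shape; the key format and alarm cadence are illustrative, and the offload to persistent storage plus the WebSocket handling are omitted:

```ts
// Hedged sketch of one-hour retention inside a room DO (types from @cloudflare/workers-types).
export class ChatRoom {
  constructor(private state: DurableObjectState) {}

  async addMessage(msg: { id: string; text: string; ts: number }): Promise<void> {
    // Zero-padded ms timestamp keeps keys in chronological order for range listing.
    const key = `msg:${String(msg.ts).padStart(15, "0")}:${msg.id}`;
    await this.state.storage.put(key, msg);

    // Schedule a cleanup alarm if one isn't already pending (every 5 minutes here).
    if ((await this.state.storage.getAlarm()) === null) {
      await this.state.storage.setAlarm(Date.now() + 5 * 60 * 1000);
    }
  }

  // Alarm handler: drop everything older than one hour, then reschedule.
  async alarm(): Promise<void> {
    const cutoff = Date.now() - 60 * 60 * 1000;
    const stale = await this.state.storage.list({
      start: "msg:",
      end: `msg:${String(cutoff).padStart(15, "0")}`,
    });
    const keys = [...stale.keys()];
    // storage.delete() takes up to 128 keys per call, so batch defensively.
    for (let i = 0; i < keys.length; i += 128) {
      await this.state.storage.delete(keys.slice(i, i + 128));
    }
    await this.state.storage.setAlarm(Date.now() + 5 * 60 * 1000);
  }
}
```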
We may end up migrating the core chat services to a cluster of dedicated EC2 instances or the like. The issue is cost; however, the cost of having unreliable services might become too high.