What does it look like for you when a DO gets "stuck"? I don't think we have witnessed this behavior yet.
For example, they start throwing an internal error code, or calls to DOs start to time out
"Stuck" is not the best word for it, but it's all the things many have surely noticed when Cloudflare has been having general issues
We've seen the timeouts happen, but only during batch fetch operations.
I checked Glassdoor for employee reviews; it was a bit of a mixed bag.

There was one case of a single DO getting properly stuck, persistently throwing an internal error with no general disturbance ongoing; I was in contact with support as well. It started out of the blue with no changes done by us, but eventually got fixed after I did a new deployment. Never received any official explanation.
Any ideas what caused this? Was there a large amount of data in the DO? High traffic? Many RPC calls? It would help us in case of a future event.
We are heavily invested in DOs right now
That's the thing: there was nothing special about the case. We had average traffic, it distributes evenly to several chat-room-related DOs, and all other instances were fine. My deployment had no migrations or anything that should impact the data the DO had already persisted, but somehow it magically still got fixed.
When you say "it distributes evenly", do you mean you were fetching DO instances from inside a DO using a loop?
Or from a worker, fetching DO instances in a loop?
I appreciate the insight here, thank you
No, nothing in a loop, just that opening WebSocket connections from a worker to multiple DOs (one per room) distributes fairly evenly to multiple rooms. Usually with just tens of connections per DO.
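For anyone following along, here is a minimal sketch of that pattern: a Worker forwarding WebSocket upgrades to one DO per room. The CHAT_ROOM binding name and the /rooms/<id> URL shape are illustrative assumptions, not the actual code from this thread.
```ts
export interface Env {
  // Hypothetical binding name for the per-room Durable Object namespace.
  CHAT_ROOM: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.headers.get("Upgrade") !== "websocket") {
      return new Response("Expected a WebSocket upgrade", { status: 426 });
    }

    // Assumed URL shape: /rooms/<roomId>
    const roomId = new URL(request.url).pathname.split("/").pop() ?? "lobby";

    // Each room name maps to its own DO instance, so tens of connections
    // per room naturally spread across many independent objects.
    const id = env.CHAT_ROOM.idFromName(roomId);
    const stub = env.CHAT_ROOM.get(id);
    return stub.fetch(request);
  },
};
```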
is the chat history stored in SQLite or KV?
we are also running chat rooms backed by DOs
I log the duration of opening the WebSocket from the worker's perspective, and there we also see a lot of variance. Usually it's nice and low. We do execute some logic in the DO when opening the socket, but so far I have not seen any hiccups there, just that the duration spikes occasionally. Sometimes all rooms, sometimes only a few. But I do have a strong gut feeling these are related to issues on Cloudflare's side.
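Roughly how that duration logging can look from the Worker side. This is a sketch, reusing the hypothetical Env/CHAT_ROOM binding from the snippet above; the JSON log shape is made up.
```ts
async function openRoomSocket(
  request: Request,
  env: Env,
  roomId: string,
): Promise<Response> {
  const stub = env.CHAT_ROOM.get(env.CHAT_ROOM.idFromName(roomId));

  const start = Date.now();
  const response = await stub.fetch(request); // 101 Switching Protocols on success
  const durationMs = Date.now() - start;

  // Usually low; the occasional spikes discussed in this thread show up here.
  console.log(JSON.stringify({ event: "ws_open", roomId, durationMs }));
  return response;
}
```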
Should be KV-based, as these have existed for a while
We've seen spikes in latency when dealing with the DOs, sometimes they respond slowly for no reason
exactly
The KV-based chat history may be an issue. We migrated away from KV chat, since our implementation had to parse mid-sized JSON fragments, and that was causing some bottlenecks.
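To make the contrast concrete, here is a rough sketch of the two storage styles, not anyone's actual implementation: the method names, key, and table are made up, and the SQLite variant assumes a SQLite-backed DO class with a messages table created elsewhere.
```ts
import { DurableObject } from "cloudflare:workers";

type ChatMessage = { user: string; text: string; ts: number };

export class ChatHistory extends DurableObject {
  // KV-style: the whole history lives under one key as a JSON string,
  // so every append re-parses and re-serializes a growing blob.
  async appendKv(message: ChatMessage): Promise<void> {
    const raw = (await this.ctx.storage.get<string>("history")) ?? "[]";
    const history: ChatMessage[] = JSON.parse(raw);
    history.push(message);
    await this.ctx.storage.put("history", JSON.stringify(history));
  }

  // SQLite-style: each message is a row, so appends and time-range reads
  // stay cheap as the history grows.
  appendSql(message: ChatMessage): void {
    this.ctx.storage.sql.exec(
      "INSERT INTO messages (user, text, ts) VALUES (?, ?, ?)",
      message.user,
      message.text,
      message.ts,
    );
  }
}
```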
We only keep the last hour of messages in the DO, as that's what we show to users (for whatever internal reasons...); we offload elsewhere for persistent storage. So the amount of data is minimal in our case.
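A sketch of how that "last hour only" window could be kept in a KV-backed DO with an alarm. The key layout (msg:<zero-padded timestamp>) and the hourly alarm cadence are assumptions, and the offload to persistent storage is left out.
```ts
import { DurableObject } from "cloudflare:workers";

const ONE_HOUR_MS = 60 * 60 * 1000;
// Zero-padded timestamps so lexicographic key order matches time order.
const tsKey = (ts: number) => `msg:${String(ts).padStart(15, "0")}`;

export class RecentChat extends DurableObject {
  async addMessage(text: string): Promise<void> {
    const now = Date.now();
    await this.ctx.storage.put(tsKey(now), text);

    // Make sure a pruning alarm is always scheduled.
    if ((await this.ctx.storage.getAlarm()) === null) {
      await this.ctx.storage.setAlarm(now + ONE_HOUR_MS);
    }
  }

  async alarm(): Promise<void> {
    const cutoff = Date.now() - ONE_HOUR_MS;
    const old = await this.ctx.storage.list<string>({
      prefix: "msg:",
      end: tsKey(cutoff),
    });

    // storage.delete() accepts at most 128 keys per call.
    const keys = [...old.keys()];
    for (let i = 0; i < keys.length; i += 128) {
      await this.ctx.storage.delete(keys.slice(i, i + 128));
    }

    await this.ctx.storage.setAlarm(Date.now() + ONE_HOUR_MS);
  }
}
```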
And the frustration for me is that there are no metrics or visibility to explain what is happening when this slowness hits.
Yeah, it's a bit frustrating to see an RPC call that usually completes in under 300 ms take 10-15 seconds for no discernible reason.
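Not a fix, but one way to at least surface those outliers from the Worker side while waiting for better visibility. A sketch only; it assumes the newer Workers RPC interface where DO stub methods are called directly, and getRecentMessages is a hypothetical method name.
```ts
// Logs a warning whenever a DO call exceeds the "usually sub-300ms" budget.
async function timedRpc<T>(label: string, call: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await call();
  } finally {
    const durationMs = Date.now() - start;
    if (durationMs > 300) {
      console.warn(JSON.stringify({ event: "slow_do_rpc", label, durationMs }));
    }
  }
}

// Usage with a hypothetical RPC method on a DO stub:
//   const messages = await timedRpc("getRecentMessages", () =>
//     stub.getRecentMessages(roomId));
```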
We may end up migrating core chat services to a cluster of dedicated EC2 instances or similar. The issue is cost; however, the cost of having unreliable services might become too high.
Previously we were burning a few hundred a month to keep the cluster up, versus the very cheap cost of the CF DOs.
I'm going to give CF the benefit of the doubt until the end of this year, or until we see a catastrophic error happen.
Yup, we are also holding on for a while still, hoping it gets better, but unfortunately it feels like things have been going in the opposite direction.