There was one case of a single DO getting properly stuck, persistently throwing internal errors with no general disturbance ongoing; I was in contact with support as well. It started out of the blue with no changes done by us, but eventually got fixed after I did a new deployment. We never received any official explanation.
That's the thing, there was nothing special about the case. We had average traffic, which distributes evenly across several chat-room-related DOs, and all the other instances were fine. My deployment had no migrations or anything that should have impacted the data the DO had already persisted, but somehow it magically still got fixed.
No, nothing in a loop; it's just that opening WebSocket connections from a Worker to multiple DOs (one per room) distributes fairly evenly across rooms, usually with just tens of connections per DO.
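Roughly the shape of it, just for context; the ROOMS binding and the /rooms/:roomId/ws route here are placeholders, not our actual names:

```ts
// Minimal sketch of routing a WebSocket upgrade from a Worker to a per-room DO.
// Binding and route names are illustrative.
export interface Env {
  ROOMS: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const match = url.pathname.match(/^\/rooms\/([^/]+)\/ws$/);
    if (!match) return new Response("Not found", { status: 404 });

    // One DO per room: the same room name always maps to the same instance,
    // so connections spread across instances roughly as rooms do.
    const id = env.ROOMS.idFromName(match[1]);
    const stub = env.ROOMS.get(id);

    // Forward the WebSocket upgrade to the room's Durable Object.
    return stub.fetch(request);
  },
};
```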
I log the duration of opening the WebSocket from the Worker's perspective, and there we also see a lot of variance. Usually it's nice and low. We do execute some logic in the DO when opening the socket, but so far I haven't seen any hiccups there, just that the duration spikes occasionally, sometimes in all rooms, sometimes only a few. But I have a strong gut feeling these are related to issues on Cloudflare's side.
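The measurement itself is nothing fancy, basically timing around the stub.fetch call; the log shape below is illustrative, not our exact logging setup:

```ts
// Sketch of timing the WebSocket open from the Worker side.
async function connectToRoom(
  request: Request,
  rooms: DurableObjectNamespace,
  roomId: string
): Promise<Response> {
  const stub = rooms.get(rooms.idFromName(roomId));

  const start = Date.now();
  // Resolves once the DO has accepted the upgrade (101 Switching Protocols).
  const response = await stub.fetch(request);
  // Date.now() advances across awaited I/O in Workers, so this captures real elapsed time.
  const durationMs = Date.now() - start;

  console.log(JSON.stringify({ event: "ws_open", roomId, durationMs, status: response.status }));
  return response;
}
```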
The KV-based chat history may be an issue. We migrated away from KV for chat, since our implementation had to parse mid-sized JSON fragments, which was causing some bottlenecks.
We only keep the last hour of messages in the DO, as that's what we show to users (for whatever internal reasons..); we offload elsewhere for persistent storage. So the amount of data is minimal in our case.
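For reference, the retention side is roughly this shape; the key format and alarm cadence are illustrative, and the offload to persistent storage plus the WebSocket handling are omitted:

```ts
// Hedged sketch of one-hour retention inside a room DO (types from @cloudflare/workers-types).
export class ChatRoom {
  constructor(private state: DurableObjectState) {}

  async addMessage(msg: { id: string; text: string; ts: number }): Promise<void> {
    // Zero-padded ms timestamp keeps keys in chronological order for range listing.
    const key = `msg:${String(msg.ts).padStart(15, "0")}:${msg.id}`;
    await this.state.storage.put(key, msg);

    // Schedule a cleanup alarm if one isn't already pending (every 5 minutes here).
    if ((await this.state.storage.getAlarm()) === null) {
      await this.state.storage.setAlarm(Date.now() + 5 * 60 * 1000);
    }
  }

  // Alarm handler: drop everything older than one hour, then reschedule.
  async alarm(): Promise<void> {
    const cutoff = Date.now() - 60 * 60 * 1000;
    const stale = await this.state.storage.list({
      start: "msg:",
      end: `msg:${String(cutoff).padStart(15, "0")}`,
    });
    const keys = [...stale.keys()];
    // storage.delete() takes up to 128 keys per call, so batch defensively.
    for (let i = 0; i < keys.length; i += 128) {
      await this.state.storage.delete(keys.slice(i, i + 128));
    }
    await this.state.storage.setAlarm(Date.now() + 5 * 60 * 1000);
  }
}
```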
We may end up migrating the core chat services to a cluster of dedicated EC2 instances or the like. The issue is cost; however, the cost of having unreliable services might become too high.