Wanna get some thoughts on how to handle this use case:
I need to synchronise a count-up timer between two types of clients. Currently, the clients poll an API that returns a datetime marking the timer start; when that field changes, each client ticks the timer up second by second independently. Over a 30 min period the two client types drift up to a couple of minutes apart, due to differences in how each works. If both clients were instead connected to a hibernated DO WS, what would be the best way to keep the timer synced? Call an alarm() every ~1000ms and send the current time value, or continue to let each client handle it on its own? Can I rely on a self-setting alarm to be 'roughly' accurate over a given period? (For my use case, within 10 seconds of drift over 30 mins would be acceptable.)
Not sure if this explanation makes any sense at all, I might try and make a diagram later lol
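Roughly what I have in mind for the alarm option, as a minimal sketch only (assuming the hibernatable WebSocket API; TimerDO and the message shapes are illustrative names, not something I've built):

```ts
// Sketch only: a hibernatable Durable Object that owns the timer and broadcasts
// the elapsed value from a self-rescheduling alarm. Names are illustrative.
import { DurableObject } from "cloudflare:workers";

const TICK_MS = 1000;

export class TimerDO extends DurableObject {
  async fetch(_request: Request): Promise<Response> {
    const { 0: client, 1: server } = new WebSocketPair();
    // Accept with hibernation so idle connections don't keep the DO in memory.
    this.ctx.acceptWebSocket(server);

    // Persist the start time once; it survives hibernation and restarts.
    if ((await this.ctx.storage.get("startedAt")) === undefined) {
      await this.ctx.storage.put("startedAt", Date.now());
      await this.ctx.storage.setAlarm(Date.now() + TICK_MS);
    }

    return new Response(null, { status: 101, webSocket: client });
  }

  async alarm() {
    const startedAt = (await this.ctx.storage.get<number>("startedAt"))!;
    // Derive elapsed time from the stored start timestamp rather than counting
    // ticks, so a late or skipped alarm re-anchors clients instead of accumulating error.
    const elapsedSec = Math.floor((Date.now() - startedAt) / 1000);
    const msg = JSON.stringify({ type: "tick", elapsedSec });
    for (const ws of this.ctx.getWebSockets()) ws.send(msg);

    // Re-arm for the next tick.
    await this.ctx.storage.setAlarm(Date.now() + TICK_MS);
  }
}
```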
3 Replies
Hi, we had something similar, with the interval between alarms being around 4 sec. Roughly 99.x% of the time it was really accurate (often down to the exact millisecond), but when the platform was under any "stress" we would occasionally see drifts of over 10 sec, which were not acceptable for our use case. If you run it 24/7 you will also experience every single hiccup that happens. We never got a detailed explanation, but the gut feeling was that they were related to storage acting up.
We ended up refactoring the solution away from alarms and instead moved the scheduling logic to AWS Step Functions, which were generating the data that determines the schedule for sending the WebSocket messages anyway. Step Functions only have up to 1 second accuracy (and even that does not always hold), but we can send the events to a Worker 2 sec early (which in turn calls a DO) and then use setTimeout() with ctx.waitUntil() to send at exactly the right time while unblocking the DO.
After the refactoring, reliability has been solid.
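The delayed-send part looks roughly like the sketch below (assumed names and payload shape, not our actual code): the event lands in the DO a couple of seconds early, the handler returns straight away, and the real send happens on a setTimeout kept alive by waitUntil.

```ts
// Sketch of the "arrive ~2s early, fire exactly on time" pattern.
// The payload shape ({ fireAt, payload }) and DelayedSendDO name are assumptions.
import { DurableObject } from "cloudflare:workers";

export class DelayedSendDO extends DurableObject {
  async fetch(request: Request): Promise<Response> {
    const { fireAt, payload } = (await request.json()) as {
      fireAt: number; // epoch ms at which the WebSocket message should go out
      payload: unknown;
    };

    const delay = Math.max(0, fireAt - Date.now());

    // Respond immediately so the DO isn't blocked; waitUntil keeps the delayed
    // send running in the background until the target timestamp.
    this.ctx.waitUntil(
      new Promise<void>((resolve) => setTimeout(resolve, delay)).then(() => {
        const msg = JSON.stringify(payload);
        for (const ws of this.ctx.getWebSockets()) ws.send(msg);
      })
    );

    return new Response("scheduled", { status: 202 });
  }
}
```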
Also bumped into this issue which was discussed a short while ago in the channel https://discord.com/channels/595317990191398933/773219443911819284/1404363990624243834
(some Discord funkiness with the link; it doesn't point to the correct message if I just click it, but it works if I open it in a new tab)
Just to be clear, my initial solution with the time drift is not based on WebSockets at all, and the drift specifically can't be blamed on DOs in any meaningful way. I'm mostly looking for guidance on how to manage a second-accurate timer between multiple clients over WebSockets.
Were you using those DO instances for any other operations besides the actual alarm handler while the handlers were running? I would assume that would cause some amount of drift over time. I would use a distinct DO namespace for the timer, separate from any other DO work I'm doing.
No, previously the DO would only receive a "full schedule" of 70 or so events to be delivered, store it in state, and then schedule alarms one by one for each event. The next full schedule would arrive only after the previous schedule had been fully executed, so there should not have been anything blocking the DO when it was time for an alarm to execute.
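That earlier shape was roughly the following (a sketch with assumed names and schedule format, not the actual implementation): receive the full schedule, persist it, and arm one alarm per event in sequence.

```ts
// Sketch of the earlier "full schedule, one alarm per event" approach.
// ScheduledEvent and the storage key are illustrative.
import { DurableObject } from "cloudflare:workers";

type ScheduledEvent = { fireAt: number; payload: unknown };

export class ScheduleDO extends DurableObject {
  // Receive the full schedule (~70 events), persist it, and arm the first alarm.
  async fetch(request: Request): Promise<Response> {
    const schedule = ((await request.json()) as ScheduledEvent[])
      .sort((a, b) => a.fireAt - b.fireAt);
    await this.ctx.storage.put("schedule", schedule);
    if (schedule.length > 0) await this.ctx.storage.setAlarm(schedule[0].fireAt);
    return new Response("ok");
  }

  // Deliver the due event, then arm the alarm for the next one.
  async alarm() {
    const schedule = (await this.ctx.storage.get<ScheduledEvent[]>("schedule")) ?? [];
    const [due, ...rest] = schedule;
    if (due) {
      const msg = JSON.stringify(due.payload);
      for (const ws of this.ctx.getWebSockets()) ws.send(msg);
    }
    await this.ctx.storage.put("schedule", rest);
    if (rest.length > 0) await this.ctx.storage.setAlarm(rest[0].fireAt);
  }
}
```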
Also, kind of confirming that this was not an application issue but more of an infrastructure issue: we were experiencing the problems across multiple Cloudflare accounts around the same time (we have our test and prod environments running in separate accounts).