websocket-upgrade fetch from worker to DO randomly delayed
We are using DOs for a registry that coordinates the running of multi-user web sessions. We have "synchronizer" nodes that are external to the registry, each maintaining long-lived websockets into the registry for its housekeeping tasks.
A synchronizer watches for any of its socket connections dropping or responding too sluggishly. In such cases, it automatically re-connects by sending a `wss` request to our ingress worker, whose `fetch` delegates to methods like this:
...where the "session runner" DO has a `fetch` that boils down to:
Although the reconnections usually take on the order of 50 ms, every few hours we hit periods when several synchronizers all detect a sluggish response and try to re-connect, and those reconnections are held up for a second or more before all completing at the same time. The worst cases have a delay of over 10 seconds.
The logs show that almost the entire delay occurs between the worker's console message and the subsequent GET log line for the DO.
For context:
- the delays only rarely coincide with eviction and reload of DOs; generally the DOs are already active (i.e., no cold start involved).
- there is no other significant traffic to our ingress or workers.
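
To make the gap described above easier to measure from the worker side, one option is to time the stub `fetch` call directly instead of comparing log timestamps across worker and DO. The following is only a minimal sketch and not our production code: `forwardWithTiming`, `NamespaceLike`, `StubLike` and the injectable `now` clock are hypothetical names introduced for illustration, and the structural interfaces are an assumption about the small subset of the Durable Object namespace API (`idFromName`, `get`, `fetch`) that the forwarding path needs.

```typescript
// Minimal structural types so the sketch compiles without Cloudflare's type
// package. Assumption: a real Durable Object namespace binding and stub
// provide at least these methods.
interface StubLike {
  fetch(request: Request): Promise<Response>;
}
interface NamespaceLike<Id> {
  idFromName(name: string): Id;
  get(id: Id): StubLike;
}

// Forward a request (for example a websocket upgrade) to the DO named
// `sessionId`, and report how long the stub.fetch() call took to resolve.
// `now` is injectable so that the timing logic can be tested with a fake clock.
async function forwardWithTiming<Id>(
  ns: NamespaceLike<Id>,
  sessionId: string,
  request: Request,
  now: () => number = Date.now,
): Promise<{ response: Response; elapsedMs: number }> {
  const stub = ns.get(ns.idFromName(sessionId));
  const start = now();
  console.log(`forwarding ${request.method} for session ${sessionId}`);
  // For a websocket upgrade, this resolves once the DO has returned its
  // upgrade response, so the elapsed time covers routing plus the DO's fetch.
  const response = await stub.fetch(request);
  const elapsedMs = now() - start;
  console.log(`DO responded for session ${sessionId} after ${elapsedMs}ms`);
  return { response, elapsedMs };
}
```

In the ingress worker this would be called from `fetch` with the DO namespace binding and the incoming request, returning `response` to the client unchanged. Logging `elapsedMs` alongside the DO's own first log line would show whether the stall sits before the DO's `fetch` is invoked or inside it.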