websocket-upgrade fetch from worker to DO randomly delayed

We are using DOs for a registry that coordinates the running of multi-user web sessions. We have "synchronizer" nodes that are external to the registry, each maintaining long-lived websockets into the registry for its housekeeping tasks.

A synchronizer watches for any of its socket connections dropping or responding too sluggishly. In such cases, it automatically reconnects by sending a wss request to our ingress worker, whose fetch delegates to methods like this:

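function synchToSession(request: Request, env: Env, colo: string) {
  if (request.headers.get("Upgrade") === "websocket") {
    // request.url is a string, so it has to be parsed before reading query params
    const sessionId = new URL(request.url).searchParams.get("session");
    if (sessionId === null) {
      return new Response("missing session parameter", { status: 400 });
    }
    const runnerId = env.SESSION_RUNNER.idFromName(sessionId);
    const sessionRunner = env.SESSION_RUNNER.get(runnerId);

    console.log(`worker@${colo}: forwarding websocket`);

    // hand the upgrade request to the session runner DO
    return sessionRunner.fetch(request);
  }
}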


...where the "session runner" DO has a fetch that boils down to:

async fetch(request: Request): Promise<Response> {
  // one end goes back to the caller, the other stays in the DO
  const { 0: clientSocket, 1: ourSocket } = new WebSocketPair();
  ourSocket.accept();
  // ...set up event handlers etc, then...
  return new Response(null, { status: 101, webSocket: clientSocket });
}


Although the reconnections usually take on the order of 50 ms, every few hours we hit periods when several synchronizers all detect a sluggish response and try to reconnect, and those reconnections are held up for a second or more before all completing at the same time. The worst cases have a delay of over 10 seconds.

The logs show that almost the entire delay occurs between the worker's console message and the subsequent GET log line for the DO.
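To put a number on that gap from the worker side, the forwarding call could be wrapped in a small timing helper. The following is only a sketch: `timedForward`, `FetchTarget` and `ForwardTiming` are hypothetical names, not part of the Workers API, and the stub is typed structurally so the helper does not depend on the runtime types. It assumes that the clock read after the awaited `fetch` reflects the time the response arrived, which should hold because the await is an I/O boundary.

```typescript
// Hypothetical minimal shape of anything with a fetch method, e.g. a DO stub.
interface FetchTarget<Req, Res> {
  fetch(request: Req): Promise<Res>;
}

// Hypothetical record emitted once per forwarded request.
interface ForwardTiming {
  label: string;
  startedAt: number; // epoch ms when the forward was issued
  elapsedMs: number; // ms until the target's fetch settled
}

async function timedForward<Req, Res>(
  target: FetchTarget<Req, Res>,
  request: Req,
  label: string,
  log: (timing: ForwardTiming) => void = (timing) =>
    console.log(JSON.stringify(timing)),
  now: () => number = Date.now,
): Promise<Res> {
  const startedAt = now();
  try {
    return await target.fetch(request);
  } finally {
    // Runs on success and on failure, so slow errors are recorded too.
    log({ label, startedAt, elapsedMs: now() - startedAt });
  }
}
```

In the worker this would replace the direct call, for example `return timedForward(sessionRunner, request, "worker@" + colo);`. The elapsed time covers the whole round trip up to the 101 response, so it would need to be compared with a timestamp logged on the first line of the DO's fetch to separate time spent before the DO starts handling the request from time spent inside it. Clocks on the two sides are not guaranteed to agree, so small differences should be read with caution.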

For context:
  • the delays only rarely coincide with eviction and reload of DOs; generally the DOs are already active (i.e., no cold start involved).
  • there is no other significant traffic to our ingress or workers.

How could we at least figure out where the time is going?