Cloudflare Developers
Created by thelunz on 11/28/2024 in #workers-help
Worker log files randomly include batches of entries with timestamps hours or days earlier
When fetching logs through the API, we often see hundreds of entries with timestamps earlier than the time period we requested.
On investigation, we found that the individual files in our log buckets (each nominally holding the items for about one minute) sometimes include batches of items stamped with times way outside the file's supposed range, sometimes 100 or more such items in a row. Notably, the timestamps of these items typically proceed backwards: for example, the first out-of-sequence item is 30 minutes before the file's nominal start, and the hundredth item in the batch is 120 minutes before.

On the scale of a whole day, with hundreds of thousands of log items in total, a thousand or fewer being out of sequence is usually of little consequence. But when fetching logs from a narrower time range to diagnose a user's problem, the fact that some items are filed away in entirely the wrong part of the bucket means we can never be sure that a fetch has retrieved every item falling within the requested range. This can make diagnosis extremely tricky.

Is this a known issue? A known feature of Logpush?
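In the meantime, a possible client-side mitigation is to over-fetch a padded window and filter by each item's own timestamp. A minimal sketch follows; fetchLogEntries stands in for whatever call retrieves raw log items, and the two-hour pad reflects the largest backward jump observed so far, not a guaranteed bound:

interface LogEntry {
  timestamp: number; // epoch ms
  [key: string]: unknown;
}

// Widest backward jump seen so far (~2 hours); an assumption, not a guarantee.
const PAD_MS = 2 * 60 * 60 * 1000;

async function fetchLogRange(
  fetchLogEntries: (startMs: number, endMs: number) => Promise<LogEntry[]>,
  startMs: number,
  endMs: number,
): Promise<LogEntry[]> {
  // Misfiled items carry timestamps *earlier* than their file's nominal range,
  // so items that belong to [startMs, endMs) can be sitting in later files.
  // Extend the fetch window forward, then filter on each item's own timestamp.
  const raw = await fetchLogEntries(startMs, endMs + PAD_MS);
  return raw.filter((e) => e.timestamp >= startMs && e.timestamp < endMs);
}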
2 replies
Cloudflare Developers
Created by thelunz on 7/16/2024 in #workers-help
websocket-upgrade fetch from worker to DO randomly delayed
We are using DOs for a registry that coordinates the running of multi-user web sessions. We have "synchronizer" nodes that are external to the registry, each maintaining long-lived websockets into the registry for its housekeeping tasks. A synchronizer watches for any of its socket connections dropping or responding too sluggishly; in such cases it automatically re-connects (see the watchdog sketch after the DO code below) by sending a wss request to our ingress worker, whose fetch delegates to methods like this:
function synchToSession(request: Request, env: Env, colo: string) {
  if (request.headers.get("Upgrade") === "websocket") {
    // request.url is a string, so parse it before reading query params
    const sessionId = new URL(request.url).searchParams.get('session');
    if (!sessionId) return new Response('missing session', { status: 400 });

    const runnerId = env.SESSION_RUNNER.idFromName(sessionId);
    const sessionRunner = env.SESSION_RUNNER.get(runnerId);

    console.log(`worker@${colo}: forwarding websocket`);

    // Hand the upgrade request straight to the Durable Object
    return sessionRunner.fetch(request);
  }
  return new Response('expected websocket upgrade', { status: 426 });
}
...where the "session runner" DO has a fetch that boils down to:
async fetch(request: Request): Promise<Response> {
  const { 0: clientSocket, 1: ourSocket } = new WebSocketPair();
  ourSocket.accept();
  // ...set up event handlers etc, then...
  return new Response(null, { status: 101, webSocket: clientSocket });
}
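To make the reconnection behaviour concrete, the synchronizer's watchdog loop is roughly the following. This is a sketch only; the URL, intervals, and ping/pong framing are illustrative placeholders, not our exact code:

const INGRESS_URL = "wss://ingress.example.com/synch"; // placeholder
const PING_INTERVAL_MS = 5_000;
const SLUGGISH_MS = 2_000;

function connect(sessionId: string): WebSocket {
  const ws = new WebSocket(`${INGRESS_URL}?session=${sessionId}`);
  let lastPong = Date.now();

  ws.addEventListener("message", (ev) => {
    if (ev.data === "pong") lastPong = Date.now();
    // ...housekeeping traffic handled here...
  });

  const timer = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) ws.send("ping");
    // Dropped or sluggish? Tear down and issue a fresh wss request.
    if (Date.now() - lastPong > PING_INTERVAL_MS + SLUGGISH_MS) {
      clearInterval(timer);
      ws.close();
      connect(sessionId);
    }
  }, PING_INTERVAL_MS);

  ws.addEventListener("close", () => clearInterval(timer));
  return ws;
}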
Although the reconnections usually take on the order of 50ms, every few hours we hit periods when several synchronizers all detect a sluggish response and try to re-connect, and those reconnections are held up for a second or more before all completing at the same time. The worst cases involve a delay of over 10 seconds. The logs show that almost the entire delay occurs between the worker's console message and the subsequent GET log line for the DO.
For context:
* the delays only rarely coincide with eviction and reload of DOs; generally the DOs are already active (i.e., no cold start involved).
* there is no other significant traffic to our ingress or workers.

How could we at least figure out where the time is going?
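One idea we could try (a sketch; the log wording and 426 fallback are just illustrative) is to time the worker-to-DO hop explicitly, so the stall shows up as a measured duration rather than as a gap between two log lines:

async function synchToSession(request: Request, env: Env, colo: string) {
  if (request.headers.get("Upgrade") === "websocket") {
    const sessionId = new URL(request.url).searchParams.get("session");
    if (!sessionId) return new Response("missing session", { status: 400 });

    const runnerId = env.SESSION_RUNNER.idFromName(sessionId);
    const sessionRunner = env.SESSION_RUNNER.get(runnerId);

    const started = Date.now();
    const response = await sessionRunner.fetch(request);
    // If the stall is in the worker -> DO hop, it appears here as elapsed ms.
    console.log(`worker@${colo}: DO fetch took ${Date.now() - started}ms`);
    return response;
  }
  return new Response("expected websocket upgrade", { status: 426 });
}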
9 replies