You need to provide us the database IDs to be able to look into anything for you.
Also, some details on volume, queries, how much data are you returning each time, etc are helpful.
This is from the HKG PoP and with a D1 instance in APAC. APAC is a huge region and you could unfortunately be 100-200ms away from your database. Having said that, anything more than a few hundred ms even in those cases is a bigger, unexpected issue, either with the responses getting back to your Worker or something else. Also, I assume you are using the Workers binding, right? Not the REST API.
Alright, so to answer those questions:
Yes, it's APAC, however this issue only occurred after the most recent data ingest, which leads me to believe something underlying in the D1 database is having an issue. This issue is also occurring across 6 separate databases, each in a separate region, which further points me towards it being an issue with D1, as the same behaviour is being exhibited across all of our databases, which all received this ingest in the same way.
Yes, we are using workers and worker bindings.
APAC D1 ID :
d8501ddc-adea-48b4-a577-074640b56493
Before that latest ingest, everything was working smoothly and none of our instances were experiencing any abnormal latency spikes. We were extremely happy with the performance.
@lambrospetrou Tagging just because I am not sure if discord automatically includes you in the thread, even if it was from your message 😅
As for volume of queries, we have around 18k queries in the last 24 hours; each query hits an index, so it only reads a single row. The P95 latency for those queries is 0.3ms. Here's an example of 10 entries from the database.

We simply return the entry row for every query, and the query is simply SELECT * FROM cache_entries_20241008 WHERE startTileId <= ? ORDER BY startTileId DESC LIMIT 1. Our primary key is startTileId, so it automatically receives an index.
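For context, the read path in the worker is roughly this (a simplified sketch; the DB binding name, the Env typing, and the tileId query parameter are illustrative, not our exact code):

```ts
// Simplified sketch of the lookup worker (names and request shape are
// illustrative). Types such as D1Database come from @cloudflare/workers-types.
export interface Env {
  DB: D1Database;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const tileId = Number(url.searchParams.get("tileId"));

    // Single indexed read: startTileId is the primary key, so this resolves
    // to one row via the index.
    const row = await env.DB
      .prepare(
        "SELECT * FROM cache_entries_20241008 WHERE startTileId <= ? ORDER BY startTileId DESC LIMIT 1"
      )
      .bind(tileId)
      .first();

    return Response.json(row);
  },
};
```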
This data ingest I believe was completed around 23:00 UTC on 14th March. You can see here: these are our metrics for D1 queries going from the worker -> D1 in HKG, and I can provide other regions as examples if you'd like. Immediately after this data import happened, the database started having these issues on an ongoing basis.

This graph is specifically the maximum duration of a query in an hour period just to illustrate the point. Before these changes, everything was fine, there were barely any spikes at all, latencies were sitting around 70ms average for HKG
Just for further evidence, this is from IAD -> ENAM D1

Hey @Zack sorry for the delayed response. I started looking into this issue today. I did find evidence of what you report, and we are trying to find out why it's happening just to your database, since this is not a widespread issue at all. Maybe there is some weird state your database got into.
Can you give me other database IDs that are impacted similarly, so that we can see if there is any correlation among them?
Also, did the other databases also start to be slow at the same time?
Yea, all at the exact same time, right after the import. If you want more detail about that process I can provide you the actual statements we run
Any information would be useful. If it's private things feel free to message me directly. This is very weird.
OC:
403f37a6-ebe8-4d05-9ffb-088aa73fa014
APAC: d8501ddc-adea-48b4-a577-074640b56493
EEUR: 538b5ef8-9a73-4bd8-a088-3a43c992cfd0
WEUR: ce5b6777-c533-4ff1-91a4-d6b5e94293da
ENAM: f63d8430-3e5b-42bd-a6a6-45fd3f552a9a
WNAM: 837aa5e4-f404-4613-b30c-34f3b8bca150
Some more detail on our process. I shared above what the data looks like; every row is in essentially the same format as this.
This is essentially a lookup table for data to retrieve from an archive in R2, it tells us where specific bits of data are in one huge file in R2, which we then read with range requests.
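So a single request is basically a D1 lookup followed by an R2 range read, roughly like this (a sketch; the offset/length column names, the ARCHIVE bucket binding, and the object key are assumptions for illustration):

```ts
// Sketch of the lookup-then-range-read flow. Column names, the bucket
// binding, and the object key are illustrative assumptions.
interface Env {
  DB: D1Database;
  ARCHIVE: R2Bucket;
}

async function readEntry(env: Env, tileId: number): Promise<ArrayBuffer | null> {
  // 1. D1 lookup: find the entry covering this tile ID.
  const entry = await env.DB
    .prepare(
      "SELECT * FROM cache_entries_20241008 WHERE startTileId <= ? ORDER BY startTileId DESC LIMIT 1"
    )
    .bind(tileId)
    .first<{ offset: number; length: number }>();
  if (!entry) return null;

  // 2. R2 range read: fetch only the bytes this entry points at inside the
  // one huge archive object.
  const object = await env.ARCHIVE.get("archive.bin", {
    range: { offset: entry.offset, length: entry.length },
  });
  return object ? await object.arrayBuffer() : null;
}
```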
When we need to update this lookup table, we do the following. This time specifically we had to update the data in the existing table due to a bug in the warming process, which ingests all this data into D1.
1. We create a tmp table with the same schema as the existing table
2. We ingest all this data via a "proxy worker" to avoid the API rate limits. We do this with a series of individual INSERT statements into the tmp table, combined into batches of up to 90,000 bytes to stay under the 100,000-byte limit for SQL statements (batching sketched below, after the list)
3. Wait for ingest to complete on all 6 tables in each region
4. In quick succession (the swap itself is also sketched below):
4.a. rename the existing table from cache_entries_20241008 to cache_entries_20241008_old
4.b. rename the tmp table to cache_entries_20241008
4.c. wait 10 seconds for things to settle
4.d. drop cache_entries_20241008_old (this step usually kills the database, which was a separate issue, presumably due to the quantity of data being dropped)
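For reference, the batching in step 2 looks roughly like this (a sketch; the row shape, column names, and tmp table name are assumptions for illustration):

```ts
// Sketch of packing rows into INSERT statements that stay under the
// statement size limit. Row shape and table/column names are illustrative.
interface CacheEntry {
  startTileId: number;
  offset: number;
  length: number;
}

const MAX_STATEMENT_BYTES = 90_000; // stay below the 100,000-byte limit

function buildInsertBatches(rows: CacheEntry[]): string[] {
  const prefix =
    "INSERT INTO cache_entries_20241008_tmp (startTileId, offset, length) VALUES ";
  const statements: string[] = [];
  let values: string[] = [];
  let size = prefix.length;

  for (const row of rows) {
    const tuple = `(${row.startTileId}, ${row.offset}, ${row.length})`;
    // +2 accounts for the ", " separator between value tuples.
    if (values.length > 0 && size + tuple.length + 2 > MAX_STATEMENT_BYTES) {
      statements.push(prefix + values.join(", "));
      values = [];
      size = prefix.length;
    }
    values.push(tuple);
    size += tuple.length + 2;
  }
  if (values.length > 0) statements.push(prefix + values.join(", "));
  return statements;
}
```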
The old table and new table are both 3,246,193 rows long and ~2GB in size
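And the swap in step 4 boils down to a handful of statements run through the binding, something like this (a sketch; the tmp table name and the use of exec() are assumptions, not necessarily exactly what our script does):

```ts
// Sketch of the table swap in step 4. The tmp table name is an assumption.
async function swapLookupTable(db: D1Database): Promise<void> {
  // 4.a. move the live table out of the way
  await db.exec(
    "ALTER TABLE cache_entries_20241008 RENAME TO cache_entries_20241008_old"
  );

  // 4.b. promote the freshly ingested tmp table
  await db.exec(
    "ALTER TABLE cache_entries_20241008_tmp RENAME TO cache_entries_20241008"
  );

  // 4.c. give things a moment to settle
  await new Promise((resolve) => setTimeout(resolve, 10_000));

  // 4.d. drop the old data (this is the step that tends to time out)
  await db.exec("DROP TABLE cache_entries_20241008_old");
}
```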
If you need any other information at all, let me know 🙂
Not sure if you can actually read the data in the table, but if you want me to provide the actual SQL exports so you can hopefully reproduce this yourself, I can do so
One more note on 4.d.
This query does kill the database and result in a timeout to the caller, but eventually it comes back and the old table has been dropped. However, I do wonder if this could have something to do with the behaviour we're seeing: potentially the DO crashed somewhere in this process, leaving the database in some strange state?
Thanks for all the info. I will see what we find based on this and ask for more details if we need to.
One thing, are you running these steps as separate migrations through wrangler, or is each step done through the Worker D1 Binding as normal queries?
We have a script that computes all the data, and then we have a separate worker which simply accepts SQL queries through HTTP requests and throws them at the D1 binding. This was to avoid using either the import API (which locks the entire database) or the d1 query or raw API endpoints, as it's very easy to hit rate limits there.
That's the "proxy worker"
Quite literally just a proxy for us to call D1 using bindings at a significantly higher rate than if we use the API directly
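Its shape is roughly this (a minimal sketch; the endpoint, request format, and missing auth are illustrative, not our actual code):

```ts
// Minimal sketch of the "proxy worker": accept a SQL statement over HTTP and
// run it against the D1 binding. Endpoint shape and (absent) auth are
// illustrative only.
interface Env {
  DB: D1Database;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("POST only", { status: 405 });
    }

    // Body is a single SQL statement (the batched INSERTs described above,
    // each kept under ~90,000 bytes by the caller).
    const sql = await request.text();
    const result = await env.DB.prepare(sql).run();

    return Response.json(result.meta);
  },
};
```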
Makes sense, thanks for the info.
d1Repository.query just calls
A potential fix/improvement was rolled out last night. I see that since last night (20:00 UTC) your high latencies disappeared and are now at normal levels. Can you confirm?
There will probably still be the occasional slow query due to how infrequent your queries are, and because your database sometimes has to be refetched from a snapshot depending on the datacenter's health, but you should be seeing much better latencies now.
Can confirm that it looks fixed, that's awesome. Out of interest, can you share what the problem was? 🙂

APAC D1 for the NRT colo
Max query times
Just wanted to ask one more time about this, won't bump again if you aren't able to answer 🙂
Because the query you were doing was taking too long, a timeout was hit, causing a reset of the underlying DO. We extended that timeout.
Huh, that's weird considering the query generally takes like 1ms 😅 Appreciate the reply
I don't remember what query it was that was causing the issue or which scenario. If I remember correctly, it was because your database was cold starting often, so it had to interact with our snapshot/cold storage as well.
Hmmm I see, I wouldn't expect that necessarily for our WEUR database, but again, whatever was changed appears to have resolved the problem, so I guess I will just quietly accept that 😅