You need to provide us the database IDs to be able to look into anything for you.
Also, some details on volume, queries, how much data are you returning each time, etc are helpful.
This is from the HKG PoP and with a D1 instance in APAC. APAC is a huge region and you could unfortunately be 100-200ms away from your database. Having said that, anything more than a few hundred ms even in those cases is a bigger, unexpected issue, either with the responses getting back to your Worker or something else. Also, I assume you are using the Workers binding, right? Not the REST API.
Alright, so to answer those questions:
Yes, it's APAC, however this issue only occurred after the most recent data ingest, which leads me to believe something underlying in the D1 database is having an issue. This issue is also occurring across 6 separate databases, each in a separate region, which further points me towards it being an issue with D1, as the same behaviour is being exhibited across all of our databases, which all received this ingest in the same way.
Yes, we are using workers and worker bindings.
APAC D1 ID :
d8501ddc-adea-48b4-a577-074640b56493
Before that latest ingest, everything was working smoothly and none of our instances were experiencing any abnormal latency spikes. We were extremely happy with the performance.
@lambrospetrou Tagging just because I am not sure if discord automatically includes you in the thread, even if it was from your message 😅
As for volume of queries, we have around 18k queries in the last 24 hours; each query hits an index, so it only reads a single row. The P95 latency for those queries is 0.3ms. Here's an example of 10 entries from the database.

We simply return the entry row for every query, and the query is simply SELECT * FROM cache_entries_20241008 WHERE startTileId <= ? ORDER BY startTileId DESC LIMIT 1. Our primary key is startTileId, so it automatically receives an index.
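For context, the read path in the worker is roughly this (a simplified sketch; the DB binding name, the Env typing, and the tileId query parameter are illustrative, not our exact code):

```ts
// Simplified sketch of the lookup worker (names and request shape are
// illustrative). Types such as D1Database come from @cloudflare/workers-types.
export interface Env {
  DB: D1Database;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const tileId = Number(url.searchParams.get("tileId"));

    // Single indexed read: startTileId is the primary key, so this resolves
    // to one row via the index.
    const row = await env.DB
      .prepare(
        "SELECT * FROM cache_entries_20241008 WHERE startTileId <= ? ORDER BY startTileId DESC LIMIT 1"
      )
      .bind(tileId)
      .first();

    return Response.json(row);
  },
};
```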
This data ingest I believe was completed around 23:00 UTC on 14th March. You can see here: these are our metrics for D1 queries going from the worker -> D1 in HKG, and I can provide other regions as examples if you'd like. Immediately after this data import happened, the database started having these issues on an ongoing basis.

This graph is specifically the maximum duration of a query in an hour period just to illustrate the point. Before these changes, everything was fine, there were barely any spikes at all, latencies were sitting around 70ms average for HKG
Just for further evidence, this is from IAD -> ENAM D1

Hey @Zack sorry for the delayed response. I started looking into this issue today. I did find evidence of what you report, and we are trying to find out why it's happening just to your database, since this is not a widespread issue at all. Maybe there is some weird state your database got into.
Can you give me other database IDs that are impacted similarly, so that we can see if there is any correlation among them?
Also, did the other databases also start to be slow at the same time?
Yea, all at the exact same time, right after the import. If you want more detail about that process I can provide you the actual statements we run
Any information would be useful. If it's private things feel free to message me directly. This is very weird.
OC:
403f37a6-ebe8-4d05-9ffb-088aa73fa014
APAC: d8501ddc-adea-48b4-a577-074640b56493
EEUR: 538b5ef8-9a73-4bd8-a088-3a43c992cfd0
WEUR: ce5b6777-c533-4ff1-91a4-d6b5e94293da
ENAM: f63d8430-3e5b-42bd-a6a6-45fd3f552a9a
WNAM: 837aa5e4-f404-4613-b30c-34f3b8bca150
Some more detail on our process. I shared above what the data looks like; every row is in essentially the same format as this.
This is essentially a lookup table for data to retrieve from an archive in R2, it tells us where specific bits of data are in one huge file in R2, which we then read with range requests.
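So a single request is basically a D1 lookup followed by an R2 range read, roughly like this (a sketch; the offset/length column names, the ARCHIVE bucket binding, and the object key are assumptions for illustration):

```ts
// Sketch of the lookup-then-range-read flow. Column names, the bucket
// binding, and the object key are illustrative assumptions.
interface Env {
  DB: D1Database;
  ARCHIVE: R2Bucket;
}

async function readEntry(env: Env, tileId: number): Promise<ArrayBuffer | null> {
  // 1. D1 lookup: find the entry covering this tile ID.
  const entry = await env.DB
    .prepare(
      "SELECT * FROM cache_entries_20241008 WHERE startTileId <= ? ORDER BY startTileId DESC LIMIT 1"
    )
    .bind(tileId)
    .first<{ offset: number; length: number }>();
  if (!entry) return null;

  // 2. R2 range read: fetch only the bytes this entry points at inside the
  // one huge archive object.
  const object = await env.ARCHIVE.get("archive.bin", {
    range: { offset: entry.offset, length: entry.length },
  });
  return object ? await object.arrayBuffer() : null;
}
```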
When we need to update this lookup table, we do the following. This time specifically we had to update the data in the existing table due to a bug in the warming process, which ingests all this data into D1.
1. We create a tmp table with the same schema as the existing table
2. We ingest all this data via a "proxy worker" to avoid the API rate limits. We do this with a series of individual INSERT statements into the tmp table, combined into batches of up to 90,000 bytes to stay under the 100,000-byte limit for SQL statements (batching sketched below, after the list)
3. Wait for ingest to complete on all 6 tables in each region
4. In quick succession (the swap itself is also sketched below):
4.a. rename the existing table from cache_entries_20241008 to cache_entries_20241008_old
4.b. rename the tmp table to cache_entries_20241008
4.c. wait 10 seconds for things to settle
4.d. drop cache_entries_20241008_old (this step usually kills the database, which was a separate issue, presumably due to the quantity of data being dropped)
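For reference, the batching in step 2 looks roughly like this (a sketch; the row shape, column names, and tmp table name are assumptions for illustration):

```ts
// Sketch of packing rows into INSERT statements that stay under the
// statement size limit. Row shape and table/column names are illustrative.
interface CacheEntry {
  startTileId: number;
  offset: number;
  length: number;
}

const MAX_STATEMENT_BYTES = 90_000; // stay below the 100,000-byte limit

function buildInsertBatches(rows: CacheEntry[]): string[] {
  const prefix =
    "INSERT INTO cache_entries_20241008_tmp (startTileId, offset, length) VALUES ";
  const statements: string[] = [];
  let values: string[] = [];
  let size = prefix.length;

  for (const row of rows) {
    const tuple = `(${row.startTileId}, ${row.offset}, ${row.length})`;
    // +2 accounts for the ", " separator between value tuples.
    if (values.length > 0 && size + tuple.length + 2 > MAX_STATEMENT_BYTES) {
      statements.push(prefix + values.join(", "));
      values = [];
      size = prefix.length;
    }
    values.push(tuple);
    size += tuple.length + 2;
  }
  if (values.length > 0) statements.push(prefix + values.join(", "));
  return statements;
}
```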
The old table and new table are both 3,246,193 rows long and ~2GB in size
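And the swap in step 4 boils down to a handful of statements run through the binding, something like this (a sketch; the tmp table name and the use of exec() are assumptions, not necessarily exactly what our script does):

```ts
// Sketch of the table swap in step 4. The tmp table name is an assumption.
async function swapLookupTable(db: D1Database): Promise<void> {
  // 4.a. move the live table out of the way
  await db.exec(
    "ALTER TABLE cache_entries_20241008 RENAME TO cache_entries_20241008_old"
  );

  // 4.b. promote the freshly ingested tmp table
  await db.exec(
    "ALTER TABLE cache_entries_20241008_tmp RENAME TO cache_entries_20241008"
  );

  // 4.c. give things a moment to settle
  await new Promise((resolve) => setTimeout(resolve, 10_000));

  // 4.d. drop the old data (this is the step that tends to time out)
  await db.exec("DROP TABLE cache_entries_20241008_old");
}
```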
If you need any other information at all, let me know 🙂
Not sure if you can actually read the data in the table, but if you want me to provide the actual SQL exports so you can hopefully reproduce this yourself, I can do so
One more note on 4.d.
This query does kill the database and result in a timeout to the caller, but eventually it comes back and the old table has been dropped. However, I do wonder if this could have something to do with the behaviour we're seeing: potentially the DO crashed somewhere in this process, leaving the database in some strange state?
Thanks for all the info. I will see what we find based on this and ask for more details if we need to.
One thing, are you running these steps as separate migrations through wrangler, or is each step done through the Worker D1 Binding as normal queries?
We have a script that computes all the data, and then we have a separate worker which simply accepts SQL queries through HTTP requests and throws them at the D1 binding. This was to avoid using either the import API (which locks the entire database) or the d1 query or raw API endpoints, as it's very easy to hit rate limits there.
That's the "proxy worker"
Quite literally just a proxy for us to call D1 using bindings at a significantly higher rate than if we use the API directly
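Its shape is roughly this (a minimal sketch; the endpoint, request format, and missing auth are illustrative, not our actual code):

```ts
// Minimal sketch of the "proxy worker": accept a SQL statement over HTTP and
// run it against the D1 binding. Endpoint shape and (absent) auth are
// illustrative only.
interface Env {
  DB: D1Database;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("POST only", { status: 405 });
    }

    // Body is a single SQL statement (the batched INSERTs described above,
    // each kept under ~90,000 bytes by the caller).
    const sql = await request.text();
    const result = await env.DB.prepare(sql).run();

    return Response.json(result.meta);
  },
};
```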
Makes sense, thanks for the info.
d1Repository.query just calls
A potential fix/improvement was rolled out last night. I see that since last night (20:00 UTC) your high latencies disappeared and are now at normal levels. Can you confirm?
There will probably still be the occasional slow query due to how infrequent your queries are, and because your database sometimes has to be refetched from a snapshot depending on the datacenter's health, but you should be seeing much better latencies now.
Can confirm that it looks fixed, that's awesome. Out of interest, can you share what the problem was? 🙂

APAC D1 for the NRT colo
Max query times
Just wanted to ask one more time about this, won't bump again if you aren't able to answer 🙂
Because the query you were doing was taking too long, a timeout was hit, causing a reset of the underlying DO. We extended that timeout.
Huh, that's weird considering the query generally takes like 1ms 😅 Appreciate the reply
I don't remember what query it was that was causing the issue or which scenario. If I remember correctly, it was because your database was cold starting often, so it had to interact with our snapshot/cold storage as well.
Hmmm I see, I wouldn't expect that necessarily for our WEUR database, but again, whatever was changed appears to have resolved the problem, so I guess I will just quietly accept that 😅