Let me give an example use case that demonstrates the issue then.
Let's say we are building a cloud storage product, where customers can upload files, and each customer is billed for the total size of all their uploads.
Let's build this with DO, where each customer has their own DO that stores the running total size in SQLite storage. On a request to upload a file, the DO would:
- Upload the file to R2.
- Increase the total size in storage.
- Return a success response.
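As a sketch, the naive handler described above might look like this, with in-memory stand-ins for the R2 bucket and the DO's SQLite storage (the real bindings would be something like `env.BUCKET` and `ctx.storage.sql`; all names here are illustrative, not anyone's actual code):

```typescript
// In-memory stand-ins for the R2 bucket and the DO's SQLite-backed storage.
const r2 = new Map<string, string>();   // object key -> file body
const storage = { totalSize: 0 };       // stands in for a row in SQLite

async function putToR2(key: string, body: string): Promise<void> {
  r2.set(key, body);                    // pretend network call to R2
}

// Naive upload handler, mirroring the three steps above.
async function handleUpload(key: string, body: string): Promise<string> {
  await putToR2(key, body);             // 1. upload the file to R2
  storage.totalSize += body.length;     // 2. increment the stored total (write is buffered)
  return "ok";                          // 3. success response (held by the output gate)
}
```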
Now here's an example where this would fail:
- The upload to R2 goes through.
- The total size increment goes through.
- The success response is returned.
- DO holds the response until the increment write gets confirmed (output gate).
- The write confirmation fails.
- SQLite storage never gets changed and an error response is returned instead, yet the R2 upload has already gone through, and at this point the DO code has no opportunity to undo it.
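That failure ordering can be modeled end to end. The following is a toy simulation, not Cloudflare's actual runtime: buffered SQL writes only reach durable state when a made-up `confirmWrites` step (standing in for the output gate) succeeds, while the R2 upload commits immediately:

```typescript
const r2 = new Map<string, string>();  // external side effect: committed immediately
let durableTotal = 0;                  // what has actually reached disk
let bufferedTotal = 0;                 // what the DO code sees in flight

async function handleUpload(key: string, body: string): Promise<void> {
  r2.set(key, body);                          // upload commits to R2 right away
  bufferedTotal = durableTotal + body.length; // SQLite write, buffered until the gate
  // handler returns here; the DO code gets no further chance to react
}

// The output gate: flush buffered writes, releasing the response only on success.
function confirmWrites(diskFails: boolean): "ok" | "error" {
  if (diskFails) return "error";       // write lost; the caller sees an error instead
  durableTotal = bufferedTotal;
  return "ok";
}
```

Running `handleUpload` and then failing the confirmation leaves the durable total unchanged while the object sits in R2 — exactly the inconsistent state described in the steps above.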
I should have started the thread a few back, but...
This indeed is an example where you could end up in an inconsistent state, but it's not a problem with DOs; it's true of any system you'd design on any platform. The typical workaround is to roll back the R2 upload. An alternative might be a 2-phase commit, but again, that reduces the chances of an inconsistent state (often enough to live with) without eliminating it.
That's the problem, there's no way to rollback the upload, because the output gate happens after your DO returns the final response.
I mean, manually roll it back when your increment write fails.
The increment write doesn't fail, the output gate fails at confirming the write.
Then await the write inside a transaction in the DO.
It doesn't work like that, not according to the blog post. Any SQLite writes (other than failures like SQL syntax errors) succeed and return to the DO code immediately; they do not actually write to durable storage yet.
Only at the output gate, which happens after all of your code has run to completion, do the writes actually get flushed and confirmed.
So if the output gate fails to confirm your writes to storage, your code has already finished running and there's no try/catch or any other error handling you can do.
I'm willing to be shown that this is wrong, but if you wrap your code in a transaction, it will give you a chance to do something after a write fails.
What I'm saying is to not rely upon output gates
I think you are missing the point, it's not the SQLite execution that's failing, it's the actual write to the storage that's failing. The write to the storage does not happen there, it happens at the output gate after your DO code has returned.
Unless I'm missing something, transactionSync creates an output gate and doesn't return until said output gate confirms the write. Now I understand the difference in your perspective, and I don't know which is correct, but I thought that transactionSync is slower but waits on a successful write.
I don't think it does; at least I'm under the impression that it only covers SQLite-level failures, not storage-write-level failures.
I disagree, but even if I'm wrong, there is sync(), which "Synchronizes any pending writes to disk." https://developers.cloudflare.com/durable-objects/api/storage-api/#sync
I'm actually not sure how that interacts with the SQLite backend, would be nice if someone can confirm.
But if that does do what we think it does, then yeah that would solve the issue nicely.
Even then, I would say most people don't realize this is even a problem, because I have not seen people bring this up despite how potentially big of a problem it is.
I definitely don't see it being said anywhere that "if you interact with the outside world and need to roll back in case of a write failure, make sure you do await sync() and catch and then handle the failure case."
"most people don't realize this is even a problem, because I have not seen people bring this up despite how potentially big of a problem it is."
Yes, exactly!!! That was the point of my blog post three years ago. Today, I do something similar to the example in that blog post using transactionSync(), but I've been assuming it did the equivalent of sync() inside of the transaction. I may just add that to be absolutely sure.
And I mean, this seems like a potentially big footgun that should be discussed in depth at least somewhere in the docs.
Agreed
If I hadn't read the blog post and thought it over thoroughly, and had simply gone by what the docs say, I would never have discovered this, and seemingly no one else in the community talks about it.
Or if it was talked about, the visibility is clearly not enough.
When I wrote the post 3 years ago, there was a pretty big discussion and others brought up similar worries, but it doesn't happen as often nowadays.
The impact does hinge on how often the output gate fails to confirm writes.
I haven't shipped anything to prod yet on the SQLite engine, but when I do, I'll do another blog post about "maintaining consistent state"
Yeah, that's a factor in the tradeoff analysis. Another is the cost of transactions. For my old solution it was up to a 30% hit on latency under peak load. At normal load though it was only a few percent hit on latency. I haven't benchmarked my new solution (or published it yet) but will benchmark before I do.
I'm also not sure that my counterexample does achieve inconsistent state with the SQLite backed storage. I will confirm that also before publishing code and blog.
Unknown User•3mo ago
Thanks!!!
Hmm, in @Burrito's scenario, the upload to R2 was initiated before the SQL write, so it would not be blocked by the output gate. @Burrito is saying that when using a traditional database, it would be possible to react to the transaction failing by deleting the object in R2, thus "rolling back" the upload. In DO, there is no way to run code that reacts to a failure; the DO simply disappears.
However, there is actually a problem even on traditional architectures: the application could crash (or suffer a machine failure, network failure, etc.) after uploading to R2 but before updating the database. This would lead to the same problem: the upload stays in R2 despite the database never having been updated. So in fact this problem exists either way.
The solution is as @Frederik says: First store into the database some indication that you intend to upload to R2, including the object name you will be uploading. Then upload to R2. Then update the database to say the upload succeeded. If a crash happens somewhere in between, then the next time the app starts up, it can see that there was an incomplete upload intent stored in the database. It can then go check if the blob landed in R2 and, if so, delete it. This approach works both on traditional databases and with output gates.
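A sketch of that intent-log pattern, again with in-memory stand-ins for the database and R2 (the table layout, helper names, and the `crashAfterUpload` flag are all invented for illustration):

```typescript
// In-memory stand-ins for durable storage and the R2 bucket.
const db = { intents: new Map<string, "pending" | "done">(), totalSize: 0 };
const r2 = new Map<string, string>();

// 1) durably record the intent, 2) upload, 3) durably mark success.
// `crashAfterUpload` simulates losing the process between steps 2 and 3.
async function upload(key: string, body: string, crashAfterUpload = false): Promise<void> {
  db.intents.set(key, "pending");  // 1. intent written before any side effect
  r2.set(key, body);               // 2. upload to R2
  if (crashAfterUpload) return;    //    crash: blob exists but intent still pending
  db.intents.set(key, "done");     // 3. confirm and bill for the upload
  db.totalSize += body.length;
}

// On startup: any still-pending intent may have left an orphaned blob; delete it.
function recover(): void {
  for (const [key, state] of db.intents) {
    if (state === "pending") {
      r2.delete(key);              // roll back the half-finished upload
      db.intents.delete(key);
    }
  }
}
```

The key design point is ordering: the intent row is durable before the external side effect happens, so a crash at any step leaves enough information behind to reconcile later.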
Thank you both for the detailed responses, I think I have a pretty good understanding of it now:
- There is no explicit API to trigger an output gate, only implicitly via outgoing messages.
- Output gate failures are not catchable, instead they should be treated the same as other failures such as power outage in the middle of execution.
- The example I gave, even when run on traditional infrastructure, would fail the same way in the case of a power outage.
- The solution to properly handle it is the same in both traditional infrastructure and in DO.