Long-standing websockets & writing to external DB subrequest limit?

Is it not possible to make a proper real-time app in Workers wherein the server / durable object might stay alive for a couple of hours? We have an app to support students in a virtual classroom. Basically, students submit questions for help during an office hours session. Other students can upvote questions. Teaching assistants can select a question to answer live. That should send a notification to all of the students who upvoted the question. The students hop into a separate video call platform to hear the answer. So I was hoping to do this with websockets (or some kind of long-polling) so that students can receive a notification when their question is being answered. If we were going to do websockets, it seemed like it would make sense to keep the connection open for student upvoting / posting new questions as well. But there seems to be a limit on the number of external subrequests a worker can make (e.g. writing to a DB).
DaniFoldi•8mo ago
So, you're right, there is a subrequest limit, and calling into a DB counts as one. However, DOs also have their own limit, which is reset to 1000 whenever a new message arrives. So unless you need to fan out to 1k+ clients, you can use a single DO (memory limit permitting) to hold 1k hibernated websocket connections (so you're not paying for idle time) and iterate through every one of them to forward the message.
If you're over 1000 (or lower if you also read from the DB, send logs to another service, etc.), you'll need a multi-level coordinator: a "root" DO that connects to other DOs (I recommend a regular HTTP call through the DO binding, as client websocket connections can't be hibernated), with each "leaf" DO connecting to up to 1k clients to relay messages further.
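A rough sketch of the sizing arithmetic described above, taking this message's assumption that every forwarded message and every extra call (DB write, logging) counts against the 1000-per-message budget. `planFanOut` and `leafIndexFor` are hypothetical helper names, not part of any Cloudflare API.

```typescript
// Assumed budget per incoming message, as described in the message above.
const SUBREQUEST_BUDGET = 1000;

interface FanOutPlan {
  leafCount: number;      // 0 means a single DO can serve every client directly
  clientsPerLeaf: number; // upper bound on connections held by one DO
}

// Decide whether one DO is enough or a root + leaf layout is needed.
function planFanOut(clients: number, otherCallsPerMessage: number): FanOutPlan {
  const perDo = SUBREQUEST_BUDGET - otherCallsPerMessage;
  if (perDo <= 0) throw new Error("no budget left for fan-out");
  if (clients <= perDo) return { leafCount: 0, clientsPerLeaf: clients };
  const leafCount = Math.ceil(clients / perDo);
  return { leafCount, clientsPerLeaf: Math.ceil(clients / leafCount) };
}

// Stable assignment of a client to a leaf, e.g. to derive the leaf DO's name.
function leafIndexFor(clientId: string, leafCount: number): number {
  let hash = 0;
  for (let i = 0; i < clientId.length; i++) {
    hash = (hash * 31 + clientId.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash % leafCount;
}
```

For a class of 60 with one DB write per message, `planFanOut(60, 1)` reports `leafCount: 0`, i.e. a single DO is enough. Note that the root's calls to its leaves also come out of the root's own budget.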
Nikil•8mo ago
Thanks for the response! The durable object also times out at some point, right? Let's say I have 10 different courses. I could use 1 DO per course. So 1k clients per course simultaneously, right? Sounds like I won't need the orchestrator for a while since right now we're at ~60 students + TAs
DaniFoldi•8mo ago
why don't you have 1 DO / course? then you can scale horizontally to an infinite number of courses
it is evicted from memory when there are no requests to handle for at least 30 seconds - but with hibernated websockets, the connections will revive the DO when necessary
you just lose in-memory state
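A minimal sketch of the "1 DO per course" routing, assuming a URL shape of `/course/<id>/ws` and a `course:` name prefix (both made up for illustration). `NamespaceLike` is a local stand-in for the Durable Object namespace binding, which exposes `idFromName()` and `get()`.

```typescript
// Local stand-in for the DO namespace binding, so the sketch is self-contained.
interface NamespaceLike<Id, Stub> {
  idFromName(name: string): Id;
  get(id: Id): Stub;
}

// Hypothetical URL scheme: /course/<courseId>/ws
function courseIdFromPath(pathname: string): string | null {
  const match = /^\/course\/([A-Za-z0-9_-]+)\/ws$/.exec(pathname);
  return match?.[1] ?? null;
}

// The same course name always maps to the same DO instance,
// so every student in a course lands on one object.
function stubForCourse<Id, Stub>(
  ns: NamespaceLike<Id, Stub>,
  pathname: string,
): Stub | null {
  const courseId = courseIdFromPath(pathname);
  if (courseId === null) return null;
  return ns.get(ns.idFromName(`course:${courseId}`));
}
```

In a Worker's fetch handler, the returned stub would then receive the upgrade request; adding an eleventh course needs no extra configuration because the name alone selects the object.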
Nikil•8mo ago
ok, so clients don't need to re-do handshake all over again
DaniFoldi•8mo ago
nope, they can stay connected and you can still send messages to them
Nikil•8mo ago
so is it possible that the hibernation happens before I can finish a DB write? I guess 30s is a long time for that purpose
can you help me think thru the subrequest limit? So if 100 questions get asked in a given 30 seconds, and I want to write all of them to a db (that's not D1) as soon as they're asked, will I hit the subrequest limit?
DaniFoldi•8mo ago
that'll be fine, since the limit is reset to 1000 whenever a new message arrives, so for each message, you have a fresh subrequest limit that you can use for DB & notifying other clients
milan•8mo ago
Hibernation only happens if you are not handling a request, and handling a request includes waiting on IO. If your DO is completely idle (i.e. all requests are done and nothing is happening) then we hibernate.
Also, the hibernation timeout is 10 seconds. The 30 second timeout is the CPU limit per request; I think we actually evict your DO and drop your connections if you surpass this?
DaniFoldi•8mo ago
do you have plans to introduce DO-to-DO hibernation? or in general for outgoing websockets?
milan•8mo ago
Not to stray too far from this thread, but our current focus is on JS RPC for DOs (and getting hibernation GA). Once that's done, there's a bit of open space to work on other stuff and I'll see if we can get some work in this/the next quarter. I would personally like to see general outbound websocket hibernation for DOs, but we aren't sure how much work it will be just yet.
We have some hibernatable websocket improvements coming soon. Hopefully as of the next wrangler/miniflare release, local dev will actually hibernate DOs.
Also, there's an open PR for the getTags() stuff you asked for a while back; it's just stalled because we've been prioritizing other work :/
DaniFoldi•8mo ago
Thanks for the insight, looking forward to the rpc system 👀
eeswar•8mo ago
Thanks for the advice milan! Sorry if I'm repeating something you already answered, but we're running into the issue of making too many subrequests to our external db to constantly refresh the questions rendered to users. The current code is set up so that as long as there is one active websocket connection we will continue to fetch questions from our db. I can't seem to think of a solution other than simply using a dedicated server to constantly fetch and distribute information across all active connections. Can I make this work using Cloudflare?
Nikil•8mo ago
^my dev to be clear
milan•8mo ago
@Matt may know better, but afaik you get 1000 subrequests per incoming request to your DO. If you're going over 1k, you either need more requests to your DO or another way to talk to your DB from your DO (such as having an outbound websocket to another service, and the other service talks to your DB).
@Dani Foldi do you happen to know if you get 1000 subrequests per inbound websocket message?
DaniFoldi•8mo ago
I believe that is the case, yes; they act the same as HTTP messages, waiting for the input gate and resetting the subrequest limit.
That sounds very odd to me, it should work unless you send to 1k+ other users - do you have a code example to share?
elithrar•8mo ago
We have folks with thousands of active WS clients connected to a single DO. I'm a little confused as to the problem here?
eeswar•8mo ago
Yep, active WS clients aren't the problem. We're just making fetch requests every 2 seconds to an external db in an internal function, which ends up exceeding the limit pretty quickly.
I'm new to workers and durable objects so my code logic might not be sound, let me share the file
milan•8mo ago
I'm kind of guessing here but I think you can get around the subrequest limit by:
1. Make a new DO, DB_DO
2. Have your current DO (client_DO) connect to DB_DO over websocket
3. Every time you want to hit your DB, send a ws message from client_DO -> DB_DO
4. This should circumvent subrequest issues because each time DB_DO receives a websocket message, its subrequest count is reset and you get 1k
The downside is that your client_DO will have an outbound WS connection, so it will be pinned to memory. But judging by the fact y'all are already doing a fetch every 2 seconds I don't think it matters much
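A sketch of what the client_DO side of steps 2 and 3 could look like, assuming a simple JSON envelope `{ id, op, args }` with replies shaped `{ id, result }`. The class name, the envelope, and the `SocketLike` stand-in are all invented for illustration; DB_DO would parse each message, make the actual DB call, and reply with the same `id`.

```typescript
// Stand-in for the outbound WebSocket held by client_DO.
interface SocketLike {
  send(data: string): void;
}

// Correlates requests sent to DB_DO with the replies that come back later.
class DbRelayClient {
  private nextId = 1;
  private pending = new Map<number, (result: unknown) => void>();
  private socket: SocketLike;

  constructor(socket: SocketLike) {
    this.socket = socket;
  }

  // Step 3: one ws message per DB operation instead of a direct fetch.
  request(op: string, args: unknown): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve) => {
      this.pending.set(id, resolve);
      this.socket.send(JSON.stringify({ id, op, args }));
    });
  }

  // Call this from the socket's message listener with the raw reply text.
  handleReply(raw: string): void {
    const reply = JSON.parse(raw) as { id: number; result: unknown };
    const resolve = this.pending.get(reply.id);
    if (resolve) {
      this.pending.delete(reply.id);
      resolve(reply.result);
    }
  }
}
```

A real version would also want timeouts and error replies, so that a dropped connection does not leave promises pending forever.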
DaniFoldi•8mo ago
I'd actually approach from the other side:
do you actually need to fetch every 2 seconds, if you can guarantee that you are always notified of every update (by websocket messages)?
eeswar•8mo ago
This is what I was curious about. Right now we make updates to our db through direct HTTP POST calls to backend API routes. I was considering having the websocket connection and the worker handle any POST calls, or even just sending a plain message over the websocket connection any time we hit a backend API route, but our issue is that our db has roughly 6 seconds of latency between the time we update and the time we're able to fetch the updated information.
Oh this is a pretty nifty solution. Would this be expensive to do?
DaniFoldi•8mo ago
to me it sounds like the ideal pattern for you is to
- have 1 DO per course/session/class
- have every user connect over a hibernated websocket
- whenever a user sends a message, you can
  - save to db (1 call)
  - broadcast to everyone else (n calls)
no polling needed, and I don't think you'd go over 1k subrequests
or is there any part of your system that I missed?
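The pattern above could be sketched roughly like this. `SocketLike`, `Question`, and `handleNewQuestion` are illustrative names, and the DB write is passed in as a function because the thread doesn't say which database is used. In a hibernatable DO, this logic would run inside the `webSocketMessage` handler, with the socket list coming from `state.getWebSockets()`.

```typescript
// Stand-in for a connected client's WebSocket.
interface SocketLike {
  send(data: string): void;
}

interface Question {
  id: string;
  text: string;
}

// Save once, then forward to everyone except the sender.
// Returns how many clients the message was delivered to.
async function handleNewQuestion(
  sender: SocketLike,
  sockets: SocketLike[],
  question: Question,
  saveToDb: (q: Question) => Promise<void>,
): Promise<number> {
  await saveToDb(question); // the single outbound DB call
  const payload = JSON.stringify({ type: "question", question });
  let delivered = 0;
  for (const ws of sockets) {
    if (ws === sender) continue;
    try {
      ws.send(payload);
      delivered++;
    } catch {
      // Socket already closed; skip it rather than failing the whole broadcast.
    }
  }
  return delivered;
}
```

Saving before broadcasting means clients never see a question that failed to persist, at the cost of the DB round trip delaying the broadcast.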
eeswar•8mo ago
This sounds like a pretty great solution, but I am worried about there being a difference between the information in the database and the information broadcast to everyone. Should I be comparing the two every few minutes or so?
Nikil•8mo ago
Does #4 mean that the cost will balloon?
milan•8mo ago
The DO would stay active all the time, so if it isn't already, your duration charges would increase. Another solution would be to use Alarms to make your outbound DB call every n seconds.
Alarms · Cloudflare Durable Objects docs
Durable Objects alarms allow you to schedule the Durable Object to be woken up at a time in the future. When the alarm’s scheduled time comes, the …
milan•8mo ago
"Teaching assistants can select a question to answer live. That should send a notification to all of the students who upvoted the question."
If the TA picks a question, does that go through your DO too, or does that only modify your DB (which is why you're reading from the DB so frequently)? You have a bunch of students connected to your DO with websockets, how does the TA picking a question get propagated to them?
Nikil•8mo ago
@eeswar so this is basically a cron job, right? Is this essentially a different worker that fetches from the db and submits to the websocket worker & DO so that it can propagate to all attached clients? (Basically the db worker is another client)
milan•8mo ago
Kind of, it just runs in your DO but should refill your subrequest quota each time the alarm runs. Inside the alarm handler you can do your fetch to the DB, etc. I still think Dani's idea is what you want though, without having a better understanding of your system.
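A sketch of the alarm-driven polling described above, with the DO's dependencies passed in so it can be read on its own. `PollDeps`, `onAlarm`, and the 2-second interval are assumptions for illustration; in a real DO this would be the body of the `alarm()` handler, with `setAlarm` backed by `state.storage.setAlarm()`.

```typescript
// Everything the alarm handler needs, injected for illustration.
interface PollDeps {
  fetchQuestions: () => Promise<string[]>;      // the outbound DB call
  broadcast: (questions: string[]) => void;     // push to connected clients
  connectionCount: () => number;                // currently connected clients
  setAlarm: (scheduledTimeMs: number) => Promise<void>;
}

const POLL_INTERVAL_MS = 2000;

// One poll tick. Returns true if another alarm was scheduled.
async function onAlarm(deps: PollDeps, now: number): Promise<boolean> {
  try {
    deps.broadcast(await deps.fetchQuestions());
  } catch {
    // DB hiccup: skip this tick but still reschedule below.
  }
  // Stop polling once nobody is connected, so the DO can go idle.
  if (deps.connectionCount() === 0) return false;
  await deps.setAlarm(now + POLL_INTERVAL_MS);
  return true;
}
```

The first alarm would need to be set when the first client connects; after that each tick schedules the next one.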
eeswar•8mo ago
Haven't built out notifications yet, but yeah it would send a notification over the websocket connection to any students linked to a question when a TA assigns themselves to it. I like Dani's solution, I'm just unsure how to make sure the visual information for the clients and the information in the db are always aligned; it feels like there'd be a fair number of edge cases to be wary about.
aarhus•8mo ago
We hit the limit in the past. There are two quick fixes. One is to split the jobs into batches and then send each batch to the DO itself via a websocket connection - you can tweak the batch size appropriately. The other option is queues - this has the benefit that some of the processing can be done outside of the DO, and the result of the processing then updates the state in the DO and triggers anything further
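The batching half of that could be as small as the helper below; `toBatches` is a hypothetical name, and the right batch size depends on how many subrequests each job makes.

```typescript
// Split a list of jobs into batches of at most `batchSize` items,
// so each batch can be sent as its own message with a fresh budget.
function toBatches<T>(jobs: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < jobs.length; i += batchSize) {
    batches.push(jobs.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch would then be serialized and sent as one websocket message to the DO, which processes it within that message's own limit.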
magicgregz•8mo ago
Just reading this thread - this is interesting. If I understand correctly, a DO with an outbound connection will remain pinned to memory, bypassing the 30sec limitation? I was under the impression that the only solution to keep a DO running more than 30sec in mem was to use the alarm trick.
DaniFoldi•8mo ago
The documentation sometimes says it should; however, speaking from experience, that's not something I'd rely on. It did get evicted every now and then, without any warning and without following any pattern, so I don't even have a workaround for that, unfortunately :/ (other than keeping the DO alive with periodic alarms)
magicgregz•8mo ago
thanks for clarification!