> We did some rough maths and it’s gonna take 37 days to do so
What is the math you did? We cannot help otherwise.
We have a table, X, with 4 million rows, and for each row we need to create a Durable Object (DO). The data needed for these DOs lives in another table, Y, which has 6 million rows and is around 10GB in size. Each DO uses SQLite storage, where we save all the rows from Y that correspond to that X ID.
The challenge is that if we process everything concurrently, our Worker (which only has 128MB of memory) runs out of memory. If we do it sequentially, it would take about 4 million seconds (around 46 days), which is too slow.
Our solution is to use two types of workers: one invoker worker and one main worker. The invoker worker calls the main worker concurrently and assigns tasks based on dates. It sends a specific date to the main worker, which processes only that date’s data. The main worker fetches the relevant rows from X and Y and creates DOs accordingly. By splitting the workload by date, each main worker instance only handles data for that specific date, keeping memory usage within the 128MB limit and allowing for concurrent processing. We will probably make it even more granular by using hours instead of dates.
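Roughly, the invoker side could look something like this (the MAIN_WORKER service binding, date range, URL, and concurrency number below are placeholders, not our real setup):
```ts
// Very rough sketch of the invoker worker. MAIN_WORKER is a service binding to
// the main worker; the date range and concurrency cap are placeholders.
interface Env {
  MAIN_WORKER: Fetcher;
}

// Build the list of dates to fan out over (could be hours instead).
function datesBetween(start: string, end: string): string[] {
  const out: string[] = [];
  const last = new Date(end);
  for (let d = new Date(start); d <= last; d.setDate(d.getDate() + 1)) {
    out.push(d.toISOString().slice(0, 10)); // YYYY-MM-DD
  }
  return out;
}

export default {
  async fetch(_req: Request, env: Env): Promise<Response> {
    const dates = datesBetween("2023-01-01", "2024-12-31"); // placeholder range

    // Cap how many main-worker calls run at once so each instance stays
    // comfortably under the 128MB limit.
    const CONCURRENCY = 10;
    for (let i = 0; i < dates.length; i += CONCURRENCY) {
      await Promise.all(
        dates.slice(i, i + CONCURRENCY).map((d) =>
          env.MAIN_WORKER.fetch(`https://main/process?date=${encodeURIComponent(d)}`)
        )
      );
    }
    return new Response("invoked all dates");
  },
};
```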
Let me know if you have any suggestions to improve this approach!
Queues might be worth looking into. They handle concurrency automatically (up to 250 concurrent consumers), but the one downside is that concurrency usually scales up slowly (it won’t scale until a batch finishes). More info about how concurrency is scaled in Queues: https://developers.cloudflare.com/queues/configuration/consumer-concurrency/#how-concurrency-works
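Just to show the shape of it, a queue-based version could look roughly like this (the MIGRATION_QUEUE binding, the { date } message body, and processDate are made-up names for the example):
```ts
// Rough shape of a queue-based version; binding and helper names are placeholders.
interface Env {
  MIGRATION_QUEUE: Queue<{ date: string }>;
}

export default {
  // Producer: run once to seed the queue with one message per date.
  async fetch(_req: Request, env: Env): Promise<Response> {
    const dates = ["2024-01-01", "2024-01-02"]; // build the real date list here
    // sendBatch is capped at 100 messages per call, so send in chunks.
    for (let i = 0; i < dates.length; i += 100) {
      await env.MIGRATION_QUEUE.sendBatch(
        dates.slice(i, i + 100).map((date) => ({ body: { date } }))
      );
    }
    return new Response("enqueued");
  },

  // Consumer: Queues invokes this with batches and scales concurrency for you.
  async queue(batch: MessageBatch<{ date: string }>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      // processDate would fetch the X/Y rows for this date and create the DOs.
      await processDate(msg.body.date, env);
      msg.ack(); // ack per message so a failure only retries the unacked ones
    }
  },
};

async function processDate(date: string, env: Env): Promise<void> {
  // placeholder for the actual copy logic
}
```
Batch size, max concurrency, and retries for the consumer go in your wrangler config.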
I assume this is a one-time migration, right, and there are no writes to tables X and Y while you are migrating to DOs.
Parallelism is the way to do this, and there are many ways to go about it.
You can write a local script that calls a Worker with "start/end" IDs from your X table, and that Worker then processes those rows in parallel as long as it stays within memory. Your local script invokes as many Worker requests as you need to get the parallelism you want.
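For example, the local driver could be as simple as this (the Worker URL, chunk size, and query params are placeholders):
```ts
// Rough local driver (run with Node 18+ / tsx). Everything here is a placeholder,
// not a drop-in implementation.
const WORKER_URL = "https://migrate.example.workers.dev/process";
const TOTAL_ROWS = 4_000_000;
const CHUNK = 10_000;   // rows of X per Worker invocation
const CONCURRENCY = 20; // Worker calls in flight at once

async function processRange(start: number, end: number): Promise<void> {
  const res = await fetch(`${WORKER_URL}?start=${start}&end=${end}`);
  if (!res.ok) throw new Error(`range ${start}-${end} failed: ${res.status}`);
}

async function main(): Promise<void> {
  // Build all [start, end) ranges, then run them with bounded concurrency.
  const ranges: Array<[number, number]> = [];
  for (let s = 0; s < TOTAL_ROWS; s += CHUNK) {
    ranges.push([s, Math.min(s + CHUNK, TOTAL_ROWS)]);
  }
  for (let i = 0; i < ranges.length; i += CONCURRENCY) {
    await Promise.all(ranges.slice(i, i + CONCURRENCY).map(([s, e]) => processRange(s, e)));
    console.log(`${Math.min(i + CONCURRENCY, ranges.length)} / ${ranges.length} chunks done`);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```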
Another approach would be to use Workflows, where each step processes a chunk of the rows in X and again invokes your Worker in parallel to do the actual copying of the data.
Using Workflows has the benefit that if your workflow crashes or some step fails, the retry will continue from the last step that succeeded.
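A rough sketch of what that could look like, assuming a PROCESS_WORKER service binding to the Worker that copies one ID range (the class name, params, and chunk size are all placeholders):
```ts
// Rough sketch of the Workflows approach; names and sizes are placeholders.
import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from "cloudflare:workers";

type Params = { totalRows: number; chunkSize: number };

interface Env {
  PROCESS_WORKER: Fetcher; // service binding to the Worker that copies one ID range
}

export class MigrateWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    const { totalRows, chunkSize } = event.payload;

    // Each step covers one chunk of X. On retry, steps that already succeeded
    // are skipped and the workflow resumes from the first failed step.
    for (let start = 0; start < totalRows; start += chunkSize) {
      const end = Math.min(start + chunkSize, totalRows);
      await step.do(`copy-${start}-${end}`, async () => {
        const res = await this.env.PROCESS_WORKER.fetch(
          `https://process/copy?start=${start}&end=${end}`
        );
        if (!res.ok) throw new Error(`chunk ${start}-${end} failed`);
        return { start, end };
      });
    }
  }
}
```
Keep in mind there is a cap on how many steps a single workflow instance can run, so size the chunks (or split the work across multiple instances) accordingly.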
The idea is that you chunk the X table and then process as many of those chunks in parallel as you can, with whichever approach you prefer.