Best way to ingest millions of records into Supabase?
Hey all,
I’m working on a project where we ingest U.S. Federal Procurement Data System (FPDS) contract data into Supabase. The source is the FPDS Atom feed (XML), which we parse and store into a table.
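For context, each pull is roughly this shape (a simplified sketch, not our exact code; the feed URL handling, table name, and field mapping are just illustrative):

```ts
// Simplified sketch of one ingestion slice inside a Supabase Edge Function (Deno).
// Table name and field mapping are illustrative, not our real schema.
import { createClient } from "npm:@supabase/supabase-js@2";
import { XMLParser } from "npm:fast-xml-parser@4";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Fetch one page of the FPDS Atom feed and store its entries.
async function ingestSlice(feedUrl: string): Promise<number> {
  const res = await fetch(feedUrl);
  if (!res.ok) throw new Error(`FPDS returned ${res.status}`);

  const parsed = new XMLParser({ ignoreAttributes: false }).parse(await res.text());
  const entry = parsed?.feed?.entry;
  // Atom entries can come back as an array, a single object, or nothing.
  const entries = Array.isArray(entry) ? entry : entry ? [entry] : [];

  const rows = entries.map((e: any) => ({
    title: e.title,      // illustrative columns only
    modified: e.modified,
    raw: e,              // keep the raw entry (jsonb) for later reprocessing
  }));

  if (rows.length > 0) {
    const { error } = await supabase.from("fpds_contracts").insert(rows);
    if (error) throw error;
  }
  return rows.length;
}
```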
Right now we already have ~2.8M records ingested successfully, but the process is hitting limits as we scale up to larger departments. The main challenges are:
• Supabase Edge Function runtime limit (23s cap). It works fine for small slices, but long-running pulls time out.
• Unstable FPDS API. It often times out or drops the connection, which breaks ingestion.
• No automatic resume. If a run fails, we either lose progress or need to re-trigger manually.
What I’m trying to figure out is the most efficient and scalable way to handle this kind of workload in Supabase. Specifically:
• Is it realistic to manage a dataset of this scale (millions of records) fully within Supabase Edge Functions?
• How do people usually approach resuming and recovery when the upstream API is flaky and fails for many reasons (timeouts, gateway errors, etc.)?
• For very large backfills, is the common practice to stay entirely within Supabase, or to offload the ingestion to an external worker service and only use Supabase for storage?
I’d love to hear from others who’ve dealt with big/unstable external data sources and bulk ingestion in Supabase. What patterns have worked well for you?
Thanks in advance!
1 Reply
A bit of a side note: I don't know the FPDS API or the system being processed here, so my thoughts may not be helpful/viable.
1. Realistic? - While Supabase Edge Functions are quite helpful and could probably push through millions of records, is it worth it? Probably not. If the API is unstable and you're already hitting timeouts and runtime caps, I personally would not see this as a viable option unless I started with over 95% of the data already in Supabase.
2. Unstable API? - I've used two main approaches. The first is many much smaller tasks, where each task verifies at the end that it actually got its data. Set it up so you can say "okay, I have 1 million records to pull; let's pull 10k at a time, keyed on something like a date." Pull Oct 1st's data, then check: did it complete, yes or no? If no, wipe Oct 1st and try again. Meanwhile Oct 2nd and 3rd aren't held up by Oct 1st's data and can run (and fail) on their own. A rough sketch of this is below.
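To make that concrete, here's roughly what I mean, assuming a small bookkeeping table. The `ingest_chunks` table, the column names, and the `pullDay` callback are all made up for illustration:

```ts
// Rough sketch of "small tasks that verify themselves" (all names are illustrative).
// Assumes a bookkeeping table like:
//   ingest_chunks (chunk_date date primary key, status text, record_count int)
import { createClient } from "npm:@supabase/supabase-js@2";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Process one day's slice end-to-end; safe to re-run if it failed last time.
async function processDay(day: string, pullDay: (d: string) => Promise<number>) {
  // Skip days that already completed.
  const { data: existing } = await supabase
    .from("ingest_chunks")
    .select("status")
    .eq("chunk_date", day)
    .maybeSingle();
  if (existing?.status === "done") return;

  // Wipe any partial data from a previous failed attempt, then pull again.
  await supabase.from("fpds_contracts").delete().eq("signed_date", day);

  try {
    const count = await pullDay(day); // your actual FPDS pull for one day
    await supabase
      .from("ingest_chunks")
      .upsert({ chunk_date: day, status: "done", record_count: count });
  } catch (err) {
    await supabase
      .from("ingest_chunks")
      .upsert({ chunk_date: day, status: "failed", record_count: 0 });
    throw err; // a later run (or a retry loop) picks this day up again
  }
}
```

Each day succeeds or fails independently, so a flaky hour with the API only costs you the days that were in flight.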
The second is a bit over the top, but it adds a sort of middleman: have the API data land in some NoSQL store like DynamoDB (or whatever you like) and just store the raw payloads, or only lightly parse them. Set it up so the pull can run multiple times without creating duplicates, that sort of deal. Then send the data from the NoSQL store into Supabase and parse it properly however you want.
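The important bit is the "no duplicates even if the pull re-runs" part. Here's one way that could look with DynamoDB's conditional put; the table name, key, and fields are made up, and any store with a write-once-per-key primitive works the same way:

```ts
// Sketch of the "dump raw payloads into a NoSQL middleman, no duplicates" idea.
// Table name, key, and fields are illustrative.
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({ region: "us-east-1" });

// Stage one raw FPDS entry keyed by its unique id; re-running the pull is harmless
// because the conditional put refuses to overwrite an existing key.
async function stageRawRecord(recordId: string, rawXml: string) {
  try {
    await ddb.send(
      new PutItemCommand({
        TableName: "fpds_raw_staging",
        Item: {
          record_id: { S: recordId },
          raw_xml: { S: rawXml },
          staged_at: { S: new Date().toISOString() },
        },
        ConditionExpression: "attribute_not_exists(record_id)",
      }),
    );
  } catch (err: any) {
    // Already staged on a previous run, which is exactly what we want.
    if (err.name !== "ConditionalCheckFailedException") throw err;
  }
}
```

The second pass is then just reading batches out of staging and upserting parsed rows into Supabase keyed on the same id, so that step is re-runnable too.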
3. Large data? - Personally, if I had to handle this, I'd tackle it a different way, especially with an unstable API. I'd build something I could run locally and repeatedly until I had the main bulk of the data, store it in my own local PostgreSQL, then export the data to a .sql file and import that directly into Supabase. Once I had ~98% of the data, I'd let the Edge Functions, with a few extra checks, handle it from there.
Now, option 3 depends on a few things, like how much actual data is being moved size-wise and whether you have a computer/server that can handle the amount of data being stored and processed. A rough sketch of the local loader is below.
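Something like this is what I have in mind for the local side; it's a sketch, not a drop-in script, and the table and column names are illustrative. Because of ON CONFLICT DO NOTHING you can just keep re-running it until the flaky API finally gives you everything:

```ts
// Local, re-runnable bulk loader sketch (Node + node-postgres); names are illustrative.
import { Pool } from "pg";

const pool = new Pool({ connectionString: "postgres://localhost:5432/fpds_local" });

type ContractRow = { recordId: string; signedDate: string; raw: string };

// Insert a batch; ON CONFLICT DO NOTHING makes re-runs after a crash harmless.
async function insertBatch(rows: ContractRow[]) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const r of rows) {
      await client.query(
        `INSERT INTO fpds_contracts_local (record_id, signed_date, raw_xml)
         VALUES ($1, $2, $3)
         ON CONFLICT (record_id) DO NOTHING`,
        [r.recordId, r.signedDate, r.raw],
      );
    }
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```

Once the local table looks complete, a plain pg_dump of it imported into Supabase (or a CSV export/import) covers the bulk load, and the Edge Functions only have to keep up with the daily delta.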
Good luck, and hopefully this helps in some tiny way. Maybe someone else has a better idea too.