Parsing damn big CSV or XLSX files and bulk inserting them into SQL Server
Hellooo friends, I'm currently trying to write an application that receives a file through form-data (.csv or .xlsx), expected to be 600 MB - 2 GB in size. I'm trying to bulk copy this into my SQL Server DB. Right now I'm using ExcelDataReader to cast either CSV or XLSX to a DataTable, and then bulk copying everything to the DB. It's a working solution, but I want it faster. Do you know how I could improve this flow?
Also, probably because of the DataTable casting, my local RAM is frying my notebook every time I upload a 600 MB CSV.
Not sure about Excel, but with a CSV you can read it line by line instead of loading the whole file at once. That way you can process the file without frying your memory.
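For example, a minimal sketch (filename and delimiter are placeholders) using File.ReadLines, which lazily yields one line at a time:

// File.ReadLines streams lines lazily; ReadAllLines/ReadAllText would load the whole file.
foreach (var line in File.ReadLines("data.csv"))
{
    // Naive split for illustration only; quoted fields need a real CSV parser.
    var fields = line.Split(',');
    // ... process fields ...
}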
From my experience, working with Excel files from code is always a pain...
Maybe the 1BRC challenge solutions can help you with reading large files: https://github.com/gunnarmorling/1brc
Definitely going to take a look.
Like, the problem itself isn't the row size, because there are only 3 damn columns in the CSV / XLSX. But there are, like, infinite rows...
Doesn't really matter, still a ton of data to read. Try focusing on CSV for now.
Use methods for iterating the file's contents line by line instead of loading everything at once into a string
There's a safe C# 1BRC solution that runs in 5.5 s on an i5-1240P: https://github.com/yurvon-screamo/1brc (gist: https://gist.github.com/yurvon-screamo/a2623ce4ae5664078279a7dbb5328b2c)
SqlBulkCopy.WriteToServer accepts an IDataReader/DbDataReader, which allows you to "stream" the data. You should use this rather than a DataTable, which requires you to load the entire dataset into memory. You should be able to pass the ExcelDataReader directly to this method, since it implements IDataReader. Whether it "works" or not is another thing, as WriteToServer expects certain behaviors from the IDataReader which ExcelDataReader might not correctly support.
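Roughly like this (an untested sketch; the connection string, file name, and table name are placeholders):

// Stream rows from ExcelDataReader straight into SqlBulkCopy, with no DataTable in between.
// Encoding registration may be needed on .NET Core for legacy code pages.
System.Text.Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

using var stream = File.OpenRead("upload.xlsx");
using var reader = ExcelReaderFactory.CreateReader(stream); // CreateCsvReader(stream) for .csv
using var conn = new SqlConnection(connectionString);
conn.Open();
using var bulkCopy = new SqlBulkCopy(conn)
{
    DestinationTableName = "dbo.MyTable",
    BatchSize = 10_000,
    BulkCopyTimeout = 0,
    EnableStreaming = true, // pull from the IDataReader instead of buffering
};
bulkCopy.WriteToServer(reader); // rows are consumed one at a time via Read()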
Honestly, I think it's an API design mistake that WriteToServer has an overload accepting DataTable, since it's trivial to create a DbDataReader over a DataTable:
DataSet.CreateDataReader Method (System.Data): returns a DataTableReader with one result set per DataTable, in the same sequence as the tables appear in the Tables collection.
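i.e. something like this (a tiny sketch):

// Wrap an existing DataTable in a DbDataReader and stream that instead.
using var tableReader = dataTable.CreateDataReader(); // returns a DataTableReader
bulkCopy.WriteToServer(tableReader);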
You can also consider alternative libraries that might provide better performance. I know a couple, because I wrote them:
https://github.com/markpflug/benchmarks
GitHub
GitHub - MarkPflug/Benchmarks: Benchmarks for various .NET libraries.
Benchmarks for various .NET libraries. Contribute to MarkPflug/Benchmarks development by creating an account on GitHub.
@MarkPflug I'm still trying to implement the version involving IDbDataReader. Would you know if it's possible to inject a new column with a DefaultValue? I was going through a DataTable mostly because I needed to inject an Id column into the bulk copy, and I was doing it inside the DataTable, like this:
// Add a GUID column whose DefaultValue is applied to every row
DataColumn dataframeIdColumn = new DataColumn("DataframeId", typeof(Guid))
{
    DefaultValue = dataframe.Id
};
dataTable.Columns.Add(dataframeIdColumn);

// Map each source column to the destination column of the same name
foreach (DataColumn column in dataTable.Columns)
{
    bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
}

bulkCopy.BatchSize = 10000;
bulkCopy.BulkCopyTimeout = 0; // disable the timeout for large uploads
await bulkCopy.WriteToServerAsync(dataTable);
Do you know how I could do this with an IDbDataReader?
I have some code that I'll point you at... give me a few.
Damn, you're good at this programming thing, tysm!
Here is a complete example using my libraries (Sylvan.Data and Sylvan.Data.Csv; it would also work with Sylvan.Data.Excel):
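A sketch along these lines, assuming Sylvan.Data's WithColumns and CustomDataColumn extensions (file, table, and connection names below are placeholders):

using Sylvan.Data;
using Sylvan.Data.Csv;
using Microsoft.Data.SqlClient;

// Open the CSV as a streaming DbDataReader.
using var csv = CsvDataReader.Create("data.csv");

// Append a constant-valued GUID column to every row as it streams past.
var data = csv.WithColumns(
    new CustomDataColumn<Guid>("DataframeId", r => dataframe.Id)
);

using var conn = new SqlConnection(connectionString);
conn.Open();
using var bulkCopy = new SqlBulkCopy(conn)
{
    DestinationTableName = "dbo.Dataframe",
    BatchSize = 10_000,
    BulkCopyTimeout = 0,
    EnableStreaming = true,
};
await bulkCopy.WriteToServerAsync(data);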
And if you want to change the column order, you can use the Select extension method, also from Sylvan.Data:
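Something like this (a sketch; the column names are placeholders, and I'm assuming Select takes the column names in the desired order):

// Project/reorder the streamed columns before handing them to SqlBulkCopy.
var reordered = data.Select("DataframeId", "ColumnA", "ColumnB");
await bulkCopy.WriteToServerAsync(reordered);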
Does the end user need all of that CSV/Excel file, or only the results of a query over those tables?
Sorry, I don't understand what you're asking.
User requirements