C# · 3mo ago
NagaPeace

Parsing damn big CSV or XLSX files and bulk insert them into sql server

Hellooo friends, I'm currently trying to write an application that receives a file through form-data (.csv or .xlsx); files are expected to be 600 MB to 2 GB in size. I'm trying to bulk copy this into my SQL Server DB. Right now I'm using ExcelDataReader to turn either CSV or XLSX into a DataTable, and then bulk copying everything to the DB. It works, but I'd like it to be faster. Do you know how I could improve this flow?
16 Replies
NagaPeace
NagaPeaceOP3mo ago
Also, probably because of the DataTable casting, my local RAM is frying my notebook every time I upload a 600 MB CSV.
Denis
Denis3mo ago
Not sure about Excel, but with a CSV you can load it line by line instead of reading the whole file at once. This way you can process the file without frying your memory. From my experience, working with Excel files from code is always a pain...
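A minimal sketch of the line-by-line approach, assuming a simple comma-delimited file with no quoted fields ("data.csv" is a hypothetical path):

```csharp
using System;
using System.IO;

// Stream a CSV line by line instead of loading the whole file.
// Assumes no quoted fields containing embedded commas.
using var reader = new StreamReader("data.csv");
string? header = reader.ReadLine(); // skip the header row

string? line;
while ((line = reader.ReadLine()) != null)
{
    // Only one line is held in memory at a time.
    string[] fields = line.Split(',');
    // ... process fields here ...
}
```

For real-world CSV (quoting, escaping), a proper CSV parser library is safer than `Split(',')`.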
Denis
Denis3mo ago
Maybe the 1brc (One Billion Row Challenge) solutions can help with reading large files: https://github.com/gunnarmorling/1brc
GitHub
GitHub - gunnarmorling/1brc: 1️⃣🐝🏎️ The One Billion Row...
1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java - gunnarmorling/1brc
NagaPeace
NagaPeaceOP3mo ago
Definitely going to take a look. The problem itself isn't the row size, because there are only 3 damn columns in the CSV/XLSX. But there are like infinite rows...
Denis
Denis3mo ago
Doesn't really matter, it's still a ton of data to read. Try focusing on CSV for now. Use methods that iterate the file's contents line by line instead of loading everything at once into a string.
MarkPflug
MarkPflug3mo ago
SqlBulkCopy.WriteToServer accepts an IDataReader/DbDataReader, which allows you to "stream" the data. You should use this rather than a DataTable, which requires you to load the entire dataset into memory. You should be able to pass the ExcelDataReader directly to this method, since it implements IDataReader. Whether it actually works is another thing, as WriteToServer expects certain behaviors from the IDataReader which ExcelDataReader might not correctly support.
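A sketch of that streaming path, passing the reader straight to SqlBulkCopy without a DataTable in between. The connection string, table name, and file path are placeholders, and as noted above, whether ExcelDataReader behaves exactly as WriteToServer expects is not guaranteed:

```csharp
using System.IO;
using ExcelDataReader;
using Microsoft.Data.SqlClient;

using var stream = File.Open("upload.csv", FileMode.Open, FileAccess.Read);
using IExcelDataReader reader = ExcelReaderFactory.CreateCsvReader(stream);
// for .xlsx, use ExcelReaderFactory.CreateReader(stream) instead

using var connection = new SqlConnection("<connection-string>");
connection.Open();

using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.MyTable",
    BatchSize = 10_000,
    BulkCopyTimeout = 0,
    EnableStreaming = true // don't buffer the whole source in memory
};

// WriteToServer consumes the IDataReader row by row.
bulkCopy.WriteToServer(reader);
```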
MarkPflug
MarkPflug3mo ago
Honestly, I think it is an API design mistake that WriteToServer has an overload accepting DataTable, since it is trivial to create a DbDataReader over a DataTable
DataSet.CreateDataReader Method (System.Data)
Returns a DataTableReader with one result set per DataTable, in the same sequence as the tables appear in the Tables collection.
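To illustrate the point, here is a small sketch: even when you start from a DataTable, you can hand SqlBulkCopy a reader over it rather than the table itself (the bulkCopy call is commented out since it needs a live connection):

```csharp
using System;
using System.Data;

var dataTable = new DataTable();
dataTable.Columns.Add("a", typeof(int));
dataTable.Rows.Add(1);

// DataTable.CreateDataReader returns a DataTableReader,
// which is a DbDataReader over the in-memory rows.
using DataTableReader tableReader = dataTable.CreateDataReader();
// bulkCopy.WriteToServer(tableReader); // same overload as any other reader
```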
MarkPflug
MarkPflug3mo ago
You can also consider alternative libraries that might provide better performance. I know a couple, because I wrote them: https://github.com/markpflug/benchmarks
GitHub
GitHub - MarkPflug/Benchmarks: Benchmarks for various .NET libraries.
Benchmarks for various .NET libraries. Contribute to MarkPflug/Benchmarks development by creating an account on GitHub.
NagaPeace
NagaPeaceOP3mo ago
@MarkPflug I'm still trying to implement this version involving IDbDataReader. Would you know if it is possible to inject a new column with a DefaultValue? I was going via DataTable mostly because I needed to inject an Id column into the bulkCopy, and I was doing it inside the DataTable, like this:

DataColumn dataframeIdColumn = new DataColumn("DataframeId", typeof(Guid))
{
    DefaultValue = dataframe.Id
};
dataTable.Columns.Add(dataframeIdColumn);

foreach (DataColumn column in dataTable.Columns)
{
    bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
}

bulkCopy.BatchSize = 10000;
bulkCopy.BulkCopyTimeout = 0;
await bulkCopy.WriteToServerAsync(dataTable);

Do you know how I could do this with the IDbDataReader?
MarkPflug
MarkPflug3mo ago
I have some code that I'll point you at... give me a few.
NagaPeace
NagaPeaceOP3mo ago
Damn you are good in this programming thing, tysm!
MarkPflug
MarkPflug3mo ago
Here is a complete example using my libraries (Sylvan.Data and Sylvan.Data.Csv, and would also work with Sylvan.Data.Excel):
using Sylvan.Data;
using Sylvan.Data.Csv;
using System.Data.Common;

var data =
"""
a,b,c
1,2,3
4,5,6
""";

var csv = CsvDataReader.Create(new StringReader(data));

// WithColumns is an extension method from the Sylvan.Data nupkg;
// it should work with any (conformant) DbDataReader
DbDataReader dr = csv.WithColumns(
    new CustomDataColumn<int>("Row", r => csv.RowNumber),
    new CustomDataColumn<Guid>("UniqueId", r => Guid.NewGuid())
);

var sw = new StringWriter();

using (var w = CsvDataWriter.Create(sw))
{
    w.Write(dr);
}

var str = sw.ToString();

Console.WriteLine(str);
// OUTPUT:
// a,b,c,Row,UniqueId
// 1,2,3,1,156a0452-a5c5-4c7c-bf17-a1cdcd377b42
// 4,5,6,2,6f6ebb1d-3d78-4812-b0ea-44700c943f98
And if you want to change the column order, you can use the Select extension method, also from Sylvan.Data:
dr = dr.Select("Row", "a", "b", "c", "UniqueId");
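Bridging this back to the original bulk-insert question, a hedged sketch (untested) of feeding the augmented reader straight into SqlBulkCopy; the file path, connection string, and table name are placeholders, and `dataframeId` stands in for the `dataframe.Id` value from the earlier snippet:

```csharp
using System;
using System.Data.Common;
using Microsoft.Data.SqlClient;
using Sylvan.Data;
using Sylvan.Data.Csv;

var dataframeId = Guid.NewGuid(); // stands in for dataframe.Id

using var csv = CsvDataReader.Create("upload.csv");

// Append the constant DataframeId column to every row as it streams by.
DbDataReader dr = csv.WithColumns(
    new CustomDataColumn<Guid>("DataframeId", r => dataframeId)
);

using var connection = new SqlConnection("<connection-string>");
connection.Open();

using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.Dataframes",
    BatchSize = 10_000,
    BulkCopyTimeout = 0,
    EnableStreaming = true
};
bulkCopy.WriteToServer(dr);
```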
Michel
Michel3mo ago
Does the end user need all of that CSV/Excel file, or only a query over those tables you have?
MarkPflug
MarkPflug3mo ago
Sorry, I don't understand what you're asking.
Michel
Michel3mo ago
User requirements
