Parsing damn big CSV or XLSX files and bulk inserting them into SQL Server
Hellooo friends, I'm currently trying to write an application that receives a file through form-data (.csv or .xlsx), expected to be 600 MB - 2 GB in size. I'm trying to bulk copy this into my SQL Server DB. Right now I'm using ExcelDataReader to cast either CSV or XLSX to a DataTable, and then bulk copying everything to the DB. It's a working solution, but I want it faster. Do you know how I could improve this flow?
Also, probably because of the DataTable casting, my local RAM is frying my notebook every time I upload a 600 MB CSV.
Not sure about Excel, but with a CSV you can read it line by line instead of loading the whole file at once. That way you can process the file without frying your memory.
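For example, a minimal sketch (filename and delimiter are placeholders) using File.ReadLines, which lazily yields one line at a time:

// File.ReadLines streams lines lazily; ReadAllLines/ReadAllText would load the whole file.
foreach (var line in File.ReadLines("data.csv"))
{
    // Naive split for illustration only; quoted fields need a real CSV parser.
    var fields = line.Split(',');
    // ... process fields ...
}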
From my experience, working with Excel files from code is always a pain...
Maybe the 1BRC challenge solutions can help you with reading large files: https://github.com/gunnarmorling/1brc
Definitely going to take a look.
Like, the problem itself isn't the row size, because there are only 3 damn columns in the CSV / XLSX. But there are, like, infinite rows...
Doesn't really matter, still a ton of data to read. Try focusing on CSV for now.
Use methods for iterating the file's contents line by line instead of loading everything at once into a string
There's a safe C# 1BRC solution that runs in 5.5 s on an i5-1240P: https://github.com/yurvon-screamo/1brc (gist: https://gist.github.com/yurvon-screamo/a2623ce4ae5664078279a7dbb5328b2c)
SqlBulkCopy.WriteToServer accepts an IDataReader/DbDataReader, which allows you to "stream" the data. You should use this rather than a DataTable, which requires you to load the entire dataset into memory. You should be able to pass the ExcelDataReader directly to this method, since it implements IDataReader. Whether it "works" or not is another thing, as WriteToServer expects certain behaviors from the IDataReader which ExcelDataReader might not correctly support.
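Roughly like this (an untested sketch; the connection string, file name, and table name are placeholders):

// Stream rows from ExcelDataReader straight into SqlBulkCopy, with no DataTable in between.
// Encoding registration may be needed on .NET Core for legacy code pages.
System.Text.Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

using var stream = File.OpenRead("upload.xlsx");
using var reader = ExcelReaderFactory.CreateReader(stream); // CreateCsvReader(stream) for .csv
using var conn = new SqlConnection(connectionString);
conn.Open();
using var bulkCopy = new SqlBulkCopy(conn)
{
    DestinationTableName = "dbo.MyTable",
    BatchSize = 10_000,
    BulkCopyTimeout = 0,
    EnableStreaming = true, // pull from the IDataReader instead of buffering
};
bulkCopy.WriteToServer(reader); // rows are consumed one at a time via Read()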
Honestly, I think it's an API design mistake that WriteToServer has an overload accepting DataTable, since it's trivial to create a DbDataReader over a DataTable:
DataSet.CreateDataReader Method (System.Data): returns a DataTableReader with one result set per DataTable, in the same sequence as the tables appear in the Tables collection.
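i.e. something like this (a tiny sketch):

// Wrap an existing DataTable in a DbDataReader and stream that instead.
using var tableReader = dataTable.CreateDataReader(); // returns a DataTableReader
bulkCopy.WriteToServer(tableReader);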
You can also consider alternative libraries that might provide better performance. I know a couple, because I wrote them:
https://github.com/markpflug/benchmarks
GitHub
GitHub - MarkPflug/Benchmarks: Benchmarks for various .NET libraries.
Benchmarks for various .NET libraries. Contribute to MarkPflug/Benchmarks development by creating an account on GitHub.
@MarkPflug I'm still trying to implement the version involving IDbDataReader. Would you know if it's possible to inject a new column with a DefaultValue? I was going through a DataTable mostly because I needed to inject an Id column into the bulk copy, and I was doing it inside the DataTable, like this:
// Add a GUID column whose DefaultValue is applied to every row
DataColumn dataframeIdColumn = new DataColumn("DataframeId", typeof(Guid))
{
    DefaultValue = dataframe.Id
};
dataTable.Columns.Add(dataframeIdColumn);

// Map each source column to the destination column of the same name
foreach (DataColumn column in dataTable.Columns)
{
    bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
}

bulkCopy.BatchSize = 10000;
bulkCopy.BulkCopyTimeout = 0; // disable the timeout for large uploads
await bulkCopy.WriteToServerAsync(dataTable);
Do you know how I could do this with an IDbDataReader?
I have some code that I'll point you at... give me a few.
Damn, you're good at this programming thing, tysm!
Here is a complete example using my libraries (Sylvan.Data and Sylvan.Data.Csv; it would also work with Sylvan.Data.Excel):
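A sketch along these lines, assuming Sylvan.Data's WithColumns and CustomDataColumn extensions (file, table, and connection names below are placeholders):

using Sylvan.Data;
using Sylvan.Data.Csv;
using Microsoft.Data.SqlClient;

// Open the CSV as a streaming DbDataReader.
using var csv = CsvDataReader.Create("data.csv");

// Append a constant-valued GUID column to every row as it streams past.
var data = csv.WithColumns(
    new CustomDataColumn<Guid>("DataframeId", r => dataframe.Id)
);

using var conn = new SqlConnection(connectionString);
conn.Open();
using var bulkCopy = new SqlBulkCopy(conn)
{
    DestinationTableName = "dbo.Dataframe",
    BatchSize = 10_000,
    BulkCopyTimeout = 0,
    EnableStreaming = true,
};
await bulkCopy.WriteToServerAsync(data);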
And if you want to change the column order, you can use the Select extension method, also from Sylvan.Data:
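Something like this (a sketch; the column names are placeholders, and I'm assuming Select takes the column names in the desired order):

// Project/reorder the streamed columns before handing them to SqlBulkCopy.
var reordered = data.Select("DataframeId", "ColumnA", "ColumnB");
await bulkCopy.WriteToServerAsync(reordered);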
Does the end user need all of that CSV/Excel file, or only the results of a query over those tables?
Sorry, I don't understand what you're asking.
User requirements