C#
kurumi

Reading a large XML file from an archive using XmlReader in parallel

Hello 👋. I am looking for a way to read data from an archived XML file in parallel. I have an archive, someFiles.zip, that contains the data I need in a file called largeXmlFile.xml. The file is 40 GB. It looks roughly like this (but with thousands of objects :Ok:):
<root>
<OBJECT data1="123" data2="456" />
<OBJECT data1="321" data2="654" />
</root>
Right now I open this file from the archive and get a Stream:
using var zipFile = ZipFile.OpenRead(@"someFiles.zip");
var myFile = zipFile.Entries.FirstOrDefault(file => file.Name is "largeXmlFile.xml");
var myFileStream = myFile.Open();
Then I pass this Stream into an XmlReader:
using var xmlReader = XmlReader.Create(myFileStream, new() { Async = true });
And I am simply reading it:
var objects = new List<MyObject>();
while (await xmlReader.ReadAsync())
{
    if (xmlReader is { NodeType: XmlNodeType.Element, Name: "OBJECT" })
    {
        objects.Add(ReadMyObject(xmlReader));
    }
}
It takes ages to read this file, so my question is: how can I change my code to read this XML in parallel?
csharpella 41d ago
holy crap, 40 GB of XML. couldn't you consider keeping a "cache" in an alternative format? especially if it's that simple
kurumi 41d ago
Hah, yeah. It's painful :harold: I have no other alternatives
csharpella 41d ago
why not? you could have a batch job that translates the XML to minified JSON and use the JSON instead of the XML. or rather, if it's just <a b=c d=e />, you could try rolling your own parser, or just benchmark it
kurumi 41d ago
I got this archive from the government, and they only provide it in XML format. So I would need an extra step to parse it into JSON, which is another task to do
csharpella 41d ago
but it's a small one
kurumi 41d ago
yeah, I was thinking of writing my own parser and somehow splitting the Stream into multiple streams. But I have no idea how lol
csharpella 41d ago
having a single reader from disk will be faster than having multiple readers. to me it makes no sense to parallelize it, at least at that stage
kurumi 41d ago
hmm, so the best I can do is move this file onto a fast SSD?
csharpella 41d ago
it's not on an SSD already?! really?
kurumi 41d ago
it is, but... I have 5 large files inside this trojan horse ZIP bomb, haha
csharpella 41d ago
how much would you want to improve the performance of this deserialization?
canton7 41d ago
Have you actually profiled this to see what/where the bottlenecks are? That's step 0 in any optimisation problem
kurumi 41d ago
As much as possible with safe C# (or unsafe if it is not painful). Also, I need to insert these objects into a local database
canton7 41d ago
I rather suspect it's one of:
1. Reading that much data from disk
2. Zip decompression
3. Creating a list with 40 GB of elements in it
None of those are the actual XML parsing, and "Parallel mode" won't help with any of them
csharpella 41d ago
also, do you have 40 GB of RAM? because if not... it's all swapping. like, how much RAM does this process take?
canton7 41d ago
XmlReader doesn't load the whole lot into RAM at one time. That's the point. But a List with 40 GB of elements in it will
csharpella 41d ago
no but how is the zip managed?
canton7 41d ago
Pretty sure that's streamed too?
kurumi 41d ago
Yeah, I realized that I gave you the wrong code. In the real task, the list is limited to 1k elements, and then it goes into the local database. After the query completes, I Clear the list and fill it again, until I reach EOF
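A minimal sketch of the batching loop described in this message, assuming the two attributes from the sample XML. The `MyObject` record, the `BatchedXmlImport` class, and the `flushAsync` callback are hypothetical names; the callback stands in for whatever database insert is actually used, which the thread does not show.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using System.Xml;

// Hypothetical shape, based only on the data1/data2 attributes in the sample.
public sealed record MyObject(string Data1, string Data2);

public static class BatchedXmlImport
{
    // Reads OBJECT elements from the stream and hands them to flushAsync in
    // batches of at most batchSize. Returns the total number of objects read.
    // The same list instance is reused for every batch, so the callback must
    // finish with it (or copy it) before returning.
    public static async Task<long> ImportAsync(
        Stream xml, int batchSize, Func<IReadOnlyList<MyObject>, Task> flushAsync)
    {
        using var reader = XmlReader.Create(xml, new XmlReaderSettings { Async = true });
        var batch = new List<MyObject>(batchSize);
        long total = 0;

        while (await reader.ReadAsync())
        {
            if (reader is { NodeType: XmlNodeType.Element, Name: "OBJECT" })
            {
                batch.Add(new MyObject(
                    reader.GetAttribute("data1") ?? "",
                    reader.GetAttribute("data2") ?? ""));
                total++;

                if (batch.Count == batchSize)
                {
                    await flushAsync(batch); // e.g. one bulk insert
                    batch.Clear();
                }
            }
        }

        // Flush the final, partially filled batch at EOF.
        if (batch.Count > 0)
        {
            await flushAsync(batch);
        }

        return total;
    }
}
```

Because `batchSize` is a parameter, trying 500, 1000, or 2000 items is a one-line change, and memory use stays bounded by one batch regardless of the file size.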
canton7 41d ago
Still, you need to profile this before trying to optimise it. As a very crude first pass: if you open task manager, is your CPU maxed out, or your disk I/O?
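Another crude first pass, sketched here under assumptions and not a substitute for a real profiler: time one pass that only reads and decompresses the zip entry, then one pass that also runs XmlReader over it. If the two times are close, the XML parsing itself is not the bottleneck. The `CrudeTiming` class and its method names are hypothetical; the second pass may also benefit from the OS file cache, so treat the numbers as rough.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Xml;

public static class CrudeTiming
{
    // Reads the stream to the end and discards the bytes. For a zip entry
    // this costs disk I/O plus decompression only.
    public static (long Bytes, TimeSpan Elapsed) Drain(Stream stream)
    {
        var buffer = new byte[1 << 20];
        long bytes = 0;
        var sw = Stopwatch.StartNew();
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            bytes += read;
        }
        sw.Stop();
        return (bytes, sw.Elapsed);
    }

    // Same input plus XML tokenizing, but no object creation and no database.
    public static (long Elements, TimeSpan Elapsed) Parse(Stream stream)
    {
        using var reader = XmlReader.Create(stream);
        long elements = 0;
        var sw = Stopwatch.StartNew();
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "OBJECT")
            {
                elements++;
            }
        }
        sw.Stop();
        return (elements, sw.Elapsed);
    }

    // Runs both passes over one entry of a zip file and prints the results.
    public static void Report(string zipPath, string entryName)
    {
        using var zip = ZipFile.OpenRead(zipPath);
        var entry = zip.GetEntry(entryName)
            ?? throw new FileNotFoundException($"No entry named {entryName}");

        using (var s = entry.Open()) Console.WriteLine($"drain: {Drain(s)}");
        using (var s = entry.Open()) Console.WriteLine($"parse: {Parse(s)}");
    }
}
```

With the names from the question, this would be called as `CrudeTiming.Report(@"someFiles.zip", "largeXmlFile.xml")`.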
kurumi 41d ago
Alright, I will benchmark it and reply later. But if you were me, what steps would you take? And is reading 1k objects and passing them into the DB a good idea or not? I am looking for some good advice now :heartowo:
csharpella 41d ago
how big is a single object? i would still benchmark this, maybe optimal is 500 items, maybe 2000, who knows
canton7 41d ago
Feels vaguely sensible, but you really need to have a profiler up. The no. 1 rule of optimization is that the slow-downs are never where you think they are. So you can spend an awful lot of time trying things which are never going to make any difference, while missing the real problem entirely. (And I mean actual profiling, not benchmarking. A profiler looks at your code as it's running and tells you where it's spending the most time.)