C#•3mo ago
kurumi

Reading a large XML file from an archive with XmlReader in parallel mode

Hello 👋. I am looking for a way to read data from an XML file inside an archive in parallel. I have an archive someFiles.zip with the data I need, and it has a largeXmlFile.xml file inside. This file is 40 GB. It looks kind of like this (but has thousands of objects :Ok:):
<root>
<OBJECT data1="123" data2="456" />
<OBJECT data1="321" data2="654" />
</root>
Now I am opening this file from the archive and getting a Stream:
using var zipFile = ZipFile.OpenRead(@"someFiles.zip");
var myFile = zipFile.Entries.FirstOrDefault(file => file.Name is "largeXmlFile.xml");
var myFileStream = myFile.Open();
Then I am putting this Stream into an XmlReader:
using var xmlReader = XmlReader.Create(myFileStream, new() { Async = true });
And I am simply reading it:
var objects = new List<MyObject>();
while (await xmlReader.ReadAsync())
{
if (xmlReader is { NodeType: XmlNodeType.Element, Name: "OBJECT" })
{
objects.Add(ReadMyObject(xmlReader));
}
}
It takes ages to read this file, so my question is: how can I change my code so that I read this XML in parallel?
23 Replies
WEIRD FLEX•3mo ago
holy crap, 40 GB of XML? couldn't you consider keeping a "cache" in an alternative format? especially if it's that simple
kurumi•3mo ago
Hah, yeah. It's painful :harold: I have no other alternatives
WEIRD FLEX•3mo ago
why not? you could have a batch job that translates the XML to minimized JSON and use the JSON instead of the XML. or rather, since it's just <a b=c d=e />, you could try rolling your own parser, or just benchmark it
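For what it's worth, a minimal sketch of that one-off conversion, assuming the file names and the flat <OBJECT ... /> shape from the question; the output name objects.ndjson and the attribute dictionary are only illustrative:

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text.Json;
using System.Xml;

// One-off pass: turn every <OBJECT ... /> element into one line of minimized JSON.
using var zip = ZipFile.OpenRead("someFiles.zip");
var entry = zip.Entries.First(e => e.Name == "largeXmlFile.xml");
using var stream = entry.Open();
using var xml = XmlReader.Create(stream);
using var output = new StreamWriter("objects.ndjson");

while (xml.Read())
{
    if (xml is { NodeType: XmlNodeType.Element, Name: "OBJECT" })
    {
        // Copy the element's attributes into a small dictionary...
        var attrs = new Dictionary<string, string>();
        while (xml.MoveToNextAttribute())
            attrs[xml.Name] = xml.Value;
        xml.MoveToElement();

        // ...and emit it as a single JSON line.
        output.WriteLine(JsonSerializer.Serialize(attrs));
    }
}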
kurumi•3mo ago
I got this archive from the government and they only provide the XML format. So I would need to additionally parse this into JSON, which is actually another task to do
WEIRD FLEX•3mo ago
but it's a small one
kurumi•3mo ago
yeah, I was thinking of creating my own and somehow splitting the Stream into multiple ones. But I have no idea how lol
WEIRD FLEX•3mo ago
having a single reader from disk will be faster than having multiple readers. to me it makes no sense to parallelize it, at least at that stage
kurumi•3mo ago
hmm, so the best I can do is move this file onto a fast SSD?
WEIRD FLEX•3mo ago
it's not on an SSD already?! really?
kurumi•3mo ago
it is, but... I have 5 large files inside of this Trojan horse of a ZIP bomb, haha
WEIRD FLEX•3mo ago
how much would you want to improve the performance of this deserialization?
canton7•3mo ago
Have you actually profiled this to see what/where the bottlenecks are? That's step 0 in any optimisation problem
kurumi•3mo ago
As much as possible with safe C# (or unsafe if it is not too painful). Also I need to add these into a local database
canton7•3mo ago
I rather suspect it's one of:
1. Reading that much data from disk
2. Zip decompression
3. Creating a list with 40 GB of elements in it
None of those are the actual XML parsing, and "Parallel mode" won't help with any of them
WEIRD FLEX•3mo ago
also, do you have 40 GB of RAM? because if not... it's all swapping. like, how much RAM does this process take?
canton7•3mo ago
XmlReader doesn't load the whole lot into RAM at one time. That's the point. But a List with 40 GB of elements in it will
WEIRD FLEX•3mo ago
no but how is the zip managed?
canton7•3mo ago
Pretty sure that's streamed too?
kurumi•3mo ago
Yeah, I realized that I gave you the wrong code. Actually, the limit of this list in the real task is 1k elements, and then it goes into a local database. After the query completes, the list is Cleared and I fill it again until I see EOF
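A minimal sketch of that batching pattern, reusing MyObject and ReadMyObject from the question; SaveBatchAsync is a hypothetical stand-in for the actual database insert, and batchSize = 1000 just mirrors the 1k limit mentioned above:

using System.Collections.Generic;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;
using System.Xml;

const int batchSize = 1000;
var batch = new List<MyObject>(batchSize);

using var zip = ZipFile.OpenRead("someFiles.zip");
var entry = zip.Entries.First(e => e.Name == "largeXmlFile.xml");
using var stream = entry.Open();
using var reader = XmlReader.Create(stream, new XmlReaderSettings { Async = true });

while (await reader.ReadAsync())
{
    if (reader is { NodeType: XmlNodeType.Element, Name: "OBJECT" })
    {
        batch.Add(ReadMyObject(reader));

        if (batch.Count >= batchSize)
        {
            await SaveBatchAsync(batch); // hypothetical: one bulk insert per batch
            batch.Clear();               // reuse the same list so memory stays flat
        }
    }
}

if (batch.Count > 0)
    await SaveBatchAsync(batch); // flush the final partial batch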
canton7•3mo ago
Still, you need to profile this before trying to optimise it. As a very crude first pass: if you open Task Manager, is your CPU maxed out, or your disk I/O?
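As an even cruder split, a sketch that times decompression alone with no XML parsing at all; if this copy already takes most of the wall-clock time, the bottleneck is the zip/disk side rather than the parser (file names taken from the question):

using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Linq;

using var zip = ZipFile.OpenRead("someFiles.zip");
var entry = zip.Entries.First(e => e.Name == "largeXmlFile.xml");

var sw = Stopwatch.StartNew();
using (var stream = entry.Open())
{
    // Decompress the whole entry and throw the bytes away.
    stream.CopyTo(Stream.Null);
}
sw.Stop();

// entry.Length is the uncompressed size, so this is a rough MB/s of pure unzip + read.
Console.WriteLine($"Decompression only: {sw.Elapsed}, ~{entry.Length / sw.Elapsed.TotalSeconds / (1024 * 1024):F0} MB/s");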
kurumi•3mo ago
Alright, I will bench it and reply later. But if you were me, what steps would you take? And is reading 1k objects and passing 'em into the DB a good idea or not? I am looking for some good advice now :heartowo:
WEIRD FLEX•3mo ago
how big is a single object? i would still benchmark this, maybe the optimum is 500 items, maybe 2000, who knows
canton7•3mo ago
Feels vaguely sensible, but you really need to have a profiler up. The no. 1 rule of optimization is that the slow-downs are never where you think they are. So you can spend an awful lot of time trying things which are never going to make any difference, while missing the real problem entirely (and I mean actual profiling, not benchmarking. A profiler looks at your code as it's running and tells you where it's spending the most time)