Processing files as fast as possible

jotalanusse 11/17/2023
Hi everyone! I have a question that I’m sure you can help me with. I’m making a library that needs to parse files as fast as possible. The files in question are binary files containing a sort of protobuf but not quite protobuf. When parsing a file I only go forward through the file one time, and the data is not modified at all. I want to know what is the fastest way to go through the file. At the moment I’m using MemoryStreams, but when I want to process a part of what I just read I have to allocate a new byte[] and copy the data to it, and I think this can be avoided. 99% of the files I handle won't ever exceed 150MB, maybe even 200MB, so memory isn’t a problem. But the less memory I need to allocate apart from the file I’m reading, the better. The ideal scenario would be for me to only have one copy of the data in memory, and then be able to reference said data by creating read-only slices to process it. Is this even possible? Is there a performance overhead for using MemoryStream instead of byte[]?
Angius 11/17/2023
Maybe you could peek into the MemoryStream with Memory<T> or Span<T> to avoid allocating a new byte[]?
jotalanusse 11/17/2023
What do you mean by peek? I know that I could use Memory<byte> or ReadOnlyMemory<byte> as a way of storing the file data, so I can create references to it without copying it. But can this be done as well with MemoryStreams?
Angius 11/17/2023
I'm not sure if it can be done directly on MemoryStreams, but Memory and Span are the only ways to take a peek into the, well, memory without allocating a new byte[] or anything else. At least as far as I know. You can always try asking in #advanced, they might have some ways of utilizing unsafe blocks and what have you.
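For reference, `MemoryStream` does expose its backing array through `TryGetBuffer`, so a slice can be taken without copying. The following is a minimal sketch, not anyone's actual code from the thread; the class and method names (`MemoryStreamPeek`, `Peek`) are hypothetical. Note that `TryGetBuffer` returns false for streams built with `new MemoryStream(byte[])`, because that constructor does not expose the buffer.

```csharp
using System;
using System.IO;

public static class MemoryStreamPeek
{
    // Returns a view over bytes already written to the stream, without copying.
    // Hypothetical helper: offset and count are relative to the start of the stream's data.
    public static ReadOnlySpan<byte> Peek(MemoryStream stream, int offset, int count)
    {
        // TryGetBuffer succeeds only when the underlying buffer is publicly visible,
        // e.g. for streams created with the parameterless constructor.
        if (!stream.TryGetBuffer(out ArraySegment<byte> buffer))
            throw new InvalidOperationException("The underlying buffer is not exposed.");

        // AsSpan creates a view over the same array; nothing is allocated or copied.
        return buffer.AsSpan(offset, count);
    }
}
```

The returned span is only valid as long as the stream does not grow, since growing may replace the backing array.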
jotalanusse 11/17/2023
Okay, I'll look further into what you said, thanks! I'm going to ask for some help on #advanced as well then
JenyaRostov 11/17/2023
you can either load the entire file with File.ReadAllBytes and then process it one chunk at a time with Memory<T>, or you can use FileStream and read n bytes at a time using the Read method with the Span<byte> overload. that span can be allocated on the stack with stackalloc if you want to, that's practically free to do with a constant size parameter
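The first option can be sketched roughly as follows. The record layout here (a 4-byte little-endian length followed by the payload) is an assumption made purely for illustration, since the real file format is not described in the thread; `WholeFileSlicer`, `SliceRecords` and `SliceFile` are hypothetical names.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;

public static class WholeFileSlicer
{
    // Walks forward through the data once and returns one read-only view per record.
    // ASSUMED layout: 4-byte little-endian length, then that many payload bytes.
    public static List<ReadOnlyMemory<byte>> SliceRecords(ReadOnlyMemory<byte> data)
    {
        var records = new List<ReadOnlyMemory<byte>>();
        int position = 0;

        // Stops when fewer than 4 bytes remain; trailing bytes are ignored in this sketch.
        while (data.Length - position >= 4)
        {
            int length = BinaryPrimitives.ReadInt32LittleEndian(data.Span.Slice(position, 4));
            position += 4;

            if (length < 0 || length > data.Length - position)
                throw new InvalidDataException("Record length exceeds the remaining data.");

            // Slice returns a view over the same array; the payload is not copied.
            records.Add(data.Slice(position, length));
            position += length;
        }

        return records;
    }

    // Loads the whole file once; the byte[] converts implicitly to ReadOnlyMemory<byte>.
    public static List<ReadOnlyMemory<byte>> SliceFile(string path)
        => SliceRecords(File.ReadAllBytes(path));
}
```

With this shape there is exactly one copy of the file contents in memory, and every record is a `ReadOnlyMemory<byte>` pointing into it.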
viceroypenguin 11/17/2023
Generally you don't need MemoryStream at all. Just use the original FileStream, and Read()/ReadAsync() into a buffer that you get from ArrayPool.Shared or a MemoryPool. Once you've read it, it's yours to parse as you will. I would avoid stackalloc for this type of read, due to the size. stackalloc is best for <1kb arrays, and the best size for reading from FileStream varies between 4kb and 64kb.
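A rough sketch of that approach, assuming a 64 KB buffer and a caller-supplied callback: `ChunkHandler`, `PooledReader` and `ReadInChunks` are hypothetical names, and the sketch does nothing about records that straddle two chunks, which a real parser would have to handle.

```csharp
using System;
using System.Buffers;
using System.IO;

// Custom delegate, since a ReadOnlySpan<byte> cannot be used with Action<T> in older C# versions.
public delegate void ChunkHandler(ReadOnlySpan<byte> chunk);

public static class PooledReader
{
    // Reads the file through a single rented buffer and passes each chunk to the handler.
    // Returns the total number of bytes read.
    public static long ReadInChunks(string path, ChunkHandler handler, int bufferSize = 64 * 1024)
    {
        // Rent may return an array larger than requested; it is reused across calls.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(bufferSize);
        try
        {
            using FileStream stream = File.OpenRead(path);
            long total = 0;
            int read;

            // Read returns 0 at end of file and may return fewer bytes than requested.
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                handler(buffer.AsSpan(0, read));
                total += read;
            }

            return total;
        }
        finally
        {
            // Always hand the array back so later Rent calls can reuse it.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```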
JenyaRostov 11/17/2023
icic
jotalanusse 11/17/2023
I'm not familiar with ArrayPool.Shared, but doesn't it allocate new memory every time I pass an array?
viceroypenguin 11/17/2023
Depends on how you use it. If you ask for 256MB arrays each time, then yes. If you ask for a 64kb array, then no - it will keep the reference around for a while and so when you ask for it again you'll get the same one back. The biggest question I guess is whether you need the entire file in memory at once or not.
jotalanusse 11/17/2023
But wouldn't using Memory<byte> be better as it allows me to have just one copy of the data at all times?
viceroypenguin 11/17/2023
if you need the entire file in memory, then yeah, just File.ReadAllBytes(Async)(). that's... orthogonal
jotalanusse 11/17/2023
?
viceroypenguin 11/17/2023
regardless of how you get the array, if you keep it for the length of processing, then you'll have "just one copy of the data at all times"
jotalanusse 11/17/2023
I benefit more by having a 100MB file in memory than having to wait for the I/O when I decide to do FileStream.Read()
viceroypenguin 11/17/2023
eh... $itdepends
viceroypenguin 11/17/2023
I've seen many algorithms improve by kicking off a read, and doing some processing while the read is executing in the background. again, depending on how much you need at a time
jotalanusse 11/17/2023
Pretty small, the chunks of data I read are on average 10KB (I have not tested this, it's just a guess). But I know I never exceed 100KB, for example.
viceroypenguin 11/17/2023
ok, then I would, as a way to test, rent a 128kb array. await reading the first 128kb. then start the task to read the second 128kb. while the task is running, process the 128kb. when you get done processing it, await the reading task, and repeat, until you get to the end. then you're reading in parallel to processing.
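The read-while-processing loop described above could look roughly like this. One assumption differs from the message: the sketch rents two buffers rather than one, so that the pending read never writes into the block currently being processed. `BlockHandler`, `OverlappedReader` and `ProcessAsync` are hypothetical names, and cancellation and failure handling while a read is in flight are left out.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Custom delegate so the handler can take a span.
public delegate void BlockHandler(ReadOnlySpan<byte> block);

public static class OverlappedReader
{
    // Processes one block while the next one is being read in the background.
    // Returns the total number of bytes processed.
    public static async Task<long> ProcessAsync(string path, BlockHandler handler, int blockSize = 128 * 1024)
    {
        // ASSUMPTION: two buffers, one being processed and one being filled.
        byte[] current = ArrayPool<byte>.Shared.Rent(blockSize);
        byte[] next = ArrayPool<byte>.Shared.Rent(blockSize);
        try
        {
            await using var stream = new FileStream(
                path, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, useAsync: true);

            long total = 0;
            int read = await stream.ReadAsync(current.AsMemory(0, blockSize));

            while (read > 0)
            {
                // Kick off the next read before touching the current block.
                ValueTask<int> pending = stream.ReadAsync(next.AsMemory(0, blockSize));

                // Process the current block while the read runs.
                handler(current.AsSpan(0, read));
                total += read;

                // Wait for the next block, then swap the roles of the two buffers.
                read = await pending;
                (current, next) = (next, current);
            }

            return total;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(current);
            ArrayPool<byte>.Shared.Return(next);
        }
    }
}
```

Whether this beats a single `File.ReadAllBytes` call depends on the disk and on how expensive the processing is, so it is worth benchmarking both.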
jotalanusse 11/17/2023
From what I have been testing, right now using FileStream to directly read the data is out of the question, it is by far the slowest method. My problem is that arrays are constantly being sub-divided into smaller ones.
viceroypenguin 11/17/2023
oh, then that's your problem. once you have the array, create an ArraySegment from it. the ArraySegment is just a pointer to the array, but it can be divided up without allocating new smaller arrays.
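A small sketch of that idea, with hypothetical names (`SegmentDemo`, `Split`): every segment refers to the same underlying array, so dividing it up copies no bytes. The only allocation here is the small array of segment structs used to return the result.

```csharp
using System;

public static class SegmentDemo
{
    // Splits one array into views of at most `size` bytes each.
    // Each ArraySegment stores only the array reference, an offset and a count.
    public static ArraySegment<byte>[] Split(byte[] data, int size)
    {
        var whole = new ArraySegment<byte>(data);
        int count = (data.Length + size - 1) / size; // round up
        var parts = new ArraySegment<byte>[count];

        for (int i = 0; i < count; i++)
        {
            int offset = i * size;
            // Slice creates another view over the same array; the last one may be shorter.
            parts[i] = whole.Slice(offset, Math.Min(size, data.Length - offset));
        }

        return parts;
    }
}
```

Because the segments share the array, a write to `data` is visible through every segment that covers that position.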
jotalanusse 11/17/2023
Got it, well the only thing left to do is to try all these things and find the one that best fits my needs. Thanks!
cap5lut 11/18/2023
hey, i know im coming late to the party, but what would u think about something like this? https://paste.mod.gg/zwkfpbybcqil/0 its just a draft implementation, so stuff like clean up and the like isnt shown yet (its also currently using spin waits). the basic idea is to rent an array that can fit the whole file, then wrap it in a memory stream and start copying from the file to the memory stream asynchronously without awaiting it. as core functionality u have 2 methods: one that waits until N bytes are available, and another to get a span and advance the internal position. i added a sync and an async example usage method at the end, which read a length prefixed byte array and decode it as a utf8 encoded string. the file sized backing buffer array is mainly to avoid having to care about (i think its called) tearing the data.
jotalanusse 11/18/2023
Not late at all! Thank you very much for the example implementation, I will check it out
viceroypenguin 11/18/2023
I'd be concerned with await Task.Yield();, as in a low-contention environment, this could essentially just turn into a cpu busy-loop, as the task scheduler just reschedules the current task while the reading task is doing its thing. if length is large, this could take some time.
cap5lut 11/18/2023
yeah i would add some kind of behaviour how to await/wait to have some more fine grained control or maybe even with events, but for that ya would have to wrap the stream again which is sorta outside this little draft
Mayor McCheese 11/18/2023
Is there a data interleaving concern with leasing arrays where doing things "wrong" works okay, but at scale produces interleaving problems? Certain performance python libs have interleaving issues at the higher end of scale.
