Parsing/Deserializing a Structured File’s Bytes
Hi all. I have a file that I need to break down and turn into data that I can visualize. I know how the file is structured and know how to convert each field’s bytes into its corresponding data type (str, int32/16/8, uint32/16/8, etc).
Where I’m struggling is figuring out the best way to actually work through the data. So far my script is a jumbled mess that goes byte by byte and decodes each one. My current plan is to store the file header and the data (everything after the header is compressed) in an object as raw bytes, then decompress, then create methods that convert each chunk of data (each stored with its own header) into attributes on the file’s object.
Does it sound like I’m on the right track, or is there a better way to handle the parsing of structured files like this?
And forgive me if I’m using terminology incorrectly; I took one OOP class in college maybe 5 years ago, and it was in Python. So I remember the gist of OOP, but I don’t remember the terminology as well.
for sure first thing i would do is make models and parsing methods as 1:1 to the raw structure as they can be, no logic, just parsing, and build all the rest above that
so it kinda depends how this format is made: whether it's linear or recursive, whether there's inheritance or composition. i could think of some ways but can't tell for sure without more details 🤷
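something like this rough C# sketch, just for the shape of it (the record fields, offsets, sizes, and endianness are all made up, not your actual format):
```cs
using System;
using System.Text;

// Models that mirror the raw layout 1:1; they hold parsed fields and nothing else.
public sealed record FileHeader(string Source, ushort Version, uint CollectionDate);
public sealed record DataHeader(string Name, uint Size);

public static class RawParser
{
    // Pure parsing, no logic: fixed offsets straight out of the bytes.
    // BitConverter reads in the machine's byte order, so this assumes the fields match it.
    public static FileHeader ParseFileHeader(ReadOnlySpan<byte> b) =>
        new(Encoding.ASCII.GetString(b[..4]),
            BitConverter.ToUInt16(b[4..6]),
            BitConverter.ToUInt32(b[6..10]));

    public static DataHeader ParseDataHeader(ReadOnlySpan<byte> b) =>
        new(Encoding.ASCII.GetString(b[..4]),
            BitConverter.ToUInt32(b[4..8]));
}
```
everything higher level (lookups, conversions, validation) then works on those models instead of on raw bytes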
I'd recommend looking into the SequenceReader API as a basis for implementing a binary parser. BinaryPrimitives may also be of help.
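For example, an untested sketch of walking decompressed [name][size][payload] records with SequenceReader (the 4-byte ASCII name and big-endian int32 size are assumptions on my part):
```cs
using System;
using System.Buffers;
using System.Text;

static class ChunkWalker
{
    // Assumed chunk layout: 4-byte ASCII name, big-endian int32 payload size, then the payload.
    public static void WalkChunks(ReadOnlySequence<byte> decompressed)
    {
        var reader = new SequenceReader<byte>(decompressed);

        Span<byte> nameBytes = stackalloc byte[4];
        while (reader.TryCopyTo(nameBytes))          // peek the name without advancing
        {
            reader.Advance(4);

            if (!reader.TryReadBigEndian(out int size))
                break;                               // truncated chunk header

            string name = Encoding.ASCII.GetString(nameBytes);
            ReadOnlySequence<byte> payload = reader.UnreadSequence.Slice(0, size);
            reader.Advance(size);

            Console.WriteLine($"{name}: {size} bytes");  // hand payload off to a per-message parser here
        }
    }
}
```
If your decompressed data is just a byte[], you can wrap it with `new ReadOnlySequence<byte>(decompressedBytes)` before constructing the reader.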
Is there a benefit to doing that over BitConverter? Because that’s what I’ve been using, and it’s been working fine
Just looked through that, that’s super helpful, thanks!
I would assume it’s linear, but not entirely sure what you mean by that 😅
The structure of the file is:
File Header
Compression (everything below is within the compression)
    Data Header (contains data name and size of data)
    Data
    Data Header (contains data name and size of data)
    Data
    etc.
So my thought was to read the file as a byte array
Separate the header from the compressed portion in an object
Write a method for decompressing the compressed portion
Write a method for searching for the name of each data type in the file and storing it within the object
Then write methods that decode each header/data type and store the results within the object
So calling the class with the byte array would save attributes within the object that look something like the following (note: the data names are messages in the file; a rough sketch is below the list):
File.hdr
File.decompData
File.msgTypes
File.msgArray
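Roughly this shape, as a sketch (the compression scheme, header size, and chunk layout here are placeholders, not the real format):
```cs
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

public sealed partial class MessageFile
{
    public byte[] Hdr { get; }                               // raw file header bytes
    public byte[] DecompData { get; }                        // decompressed everything-after-the-header
    public IReadOnlyList<string> MsgTypes { get; }           // distinct data names found in the data headers
    public IReadOnlyList<(string Name, byte[] Payload)> MsgArray { get; }

    public MessageFile(byte[] raw, int headerSize)
    {
        Hdr = raw[..headerSize];
        DecompData = Decompress(raw[headerSize..]);
        MsgArray = SplitIntoChunks(DecompData);
        MsgTypes = MsgArray.Select(m => m.Name).Distinct().ToList();
    }

    // Placeholder: assumes zlib/deflate; swap in whatever the format actually uses.
    private static byte[] Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var zlib = new ZLibStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        zlib.CopyTo(output);
        return output.ToArray();
    }

    // Walks the assumed [4-byte name][int32 size][payload] records in the decompressed data.
    private static List<(string Name, byte[] Payload)> SplitIntoChunks(byte[] data)
    {
        var chunks = new List<(string Name, byte[] Payload)>();
        int pos = 0;
        while (pos + 8 <= data.Length)
        {
            string name = Encoding.ASCII.GetString(data, pos, 4);
            int size = BitConverter.ToInt32(data, pos + 4);
            if (size < 0 || pos + 8 + size > data.Length) break;   // bail on a malformed header
            chunks.Add((name, data[(pos + 8)..(pos + 8 + size)]));
            pos += 8 + size;
        }
        return chunks;
    }
}
```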
Then there would be methods that check to make sure the messages exist in File.msgTypes and, if they exist, convert them over. The structure of each message is a little different, so I’d have to have different methods for each (sketched after the list below).
File.getHdr: converts the header into readable data (data source, version number, collection date, etc)
File.getMsg15: checks to make sure Msg15 exists, converts the header into readable data, then converts the data as specified
File.getMsg31: checks to make sure Msg31 exists, converts the header into readable data, then converts the data as specified, etc, etc
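For example, continuing the sketch above (Msg31’s fields here are invented just to show the check-then-convert pattern):
```cs
using System;
using System.Linq;

// Invented example message; the real Msg31 fields/layout would replace these.
public sealed record Msg31(uint Timestamp, ushort Elevation)
{
    public static Msg31 Parse(ReadOnlySpan<byte> b) =>
        new(BitConverter.ToUInt32(b[..4]), BitConverter.ToUInt16(b[4..6]));
}

public sealed partial class MessageFile
{
    // Returns null if Msg31 isn't present; otherwise decodes the first matching payload.
    public Msg31? GetMsg31() =>
        MsgTypes.Contains("Msg31")
            ? Msg31.Parse(MsgArray.First(m => m.Name == "Msg31").Payload)
            : null;
}
```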
Does that sound like a sensible way to handle it?
more explicit (endianness for example), more byte[] oriented
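e.g., assuming a big-endian uint32 field (values and offsets made up):
```cs
using System;
using System.Buffers.Binary;

byte[] bytes = { 0x00, 0x00, 0x01, 0x2C };   // a big-endian uint32 (300), just as an example

// BitConverter reads in the machine's byte order (little-endian on most PCs),
// so a big-endian field needs a copy + reverse first:
byte[] tmp = bytes[..4];
if (BitConverter.IsLittleEndian) Array.Reverse(tmp);
uint viaBitConverter = BitConverter.ToUInt32(tmp, 0);

// BinaryPrimitives names the byte order in the call and reads straight off the span, no copy:
uint viaPrimitives = BinaryPrimitives.ReadUInt32BigEndian(bytes);

Console.WriteLine($"{viaBitConverter} {viaPrimitives}");   // both print 300
```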
Gotcha, I’ve got a function already that handles the endianness, so if that’s the only thing, I’m probably gonna keep it as is. There’s no real performance benefit, is there? I would assume it just goes through and flips the byte order if it needs to?
I’m more asking about structure to make sure it is fast and elegant to execute/read/edit
How big is the file?
if it's all sequential, or some of the data is flags/logic that changes how other data gets parsed
"Then there would be methods that check to make sure the messages exist in File.msgTypes"
so this is a dynamic format? there could be arbitrary data?