C#

help


Azure 9/6/2022
File comparison

I want to compare two files very fast. I've implemented a full comparison (comparing sizes, then 1024-byte spans),
but I feel like there might be some unique info in the file headers that I can use for my comparison.
What do you think?
toddlahakbar 9/6/2022
If you want a full comparison, then do a full comparison
toddlahakbar 9/6/2022
Also, checking the last and first blocks first is usually a really quick way to search for a difference
mtreit 9/6/2022
@Azure what do you mean by unique info in file headers? What kind of files are we talking about?
Susko3 9/6/2022
if you want to do it really fast, then look into vectorization, e.g. with the Vector256 class
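A minimal sketch of what that vectorized comparison could look like, assuming .NET 7 or later (where the cross-platform `Vector256` helpers are available). `VectorCompare` and `BlocksEqual` are hypothetical names used only for illustration. In practice `Span<T>.SequenceEqual` is already vectorized by the runtime, so a hand-written loop like this is mostly useful for seeing the idea.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class VectorCompare
{
    // Hypothetical helper: compares two in-memory blocks 32 bytes at a time.
    public static bool BlocksEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        if (a.Length != b.Length) return false;

        int i = 0;
        if (Vector256.IsHardwareAccelerated)
        {
            int width = Vector256<byte>.Count; // 32 bytes per vector
            for (; i <= a.Length - width; i += width)
            {
                Vector256<byte> va = Vector256.Create(a.Slice(i, width));
                Vector256<byte> vb = Vector256.Create(b.Slice(i, width));
                if (!Vector256.EqualsAll(va, vb)) return false; // stop at first mismatch
            }
        }

        // Scalar tail (or the whole input when no hardware acceleration is available).
        for (; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }
}
```

This only speeds up the in-memory part; the bytes still have to be read from disk first.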
Susko3 9/6/2022
I suppose you're comparing for (in)equality?
Jester 9/7/2022
the bottleneck is probably just reading the files
Azure 9/7/2022
any, I guess. there must be some info that describes the file
Azure 9/7/2022
to the OS
Azure 9/7/2022
in some of these bytes
Azure 9/7/2022
yes
Kouhai 9/7/2022
what do you mean by info that describes the file?
Azure 9/7/2022
Headers that the OS uses to identify it.
I am just guessing
Kouhai 9/7/2022
If you're doing full file comparison, why would the file headers matter 😅 ?
Azure 9/7/2022
I don't really want to do a full comparison, I want to do it as fast as possible
Azure 9/7/2022
I need to compare them, but comparing the full contents is more expensive
Azure 9/7/2022
so I decided to find some array of bytes that can be compared
Azure 9/7/2022
instead of the entire file
Susko3 9/7/2022
compare sections and stop when you find a non-matching section, makes sense
Susko3 9/7/2022
the important question for you may be: "where do the files usually differ?"
Susko3 9/7/2022
maybe in the beginning, maybe in the end?
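If the files do tend to differ near the beginning or the end, a cheap pre-check could look roughly like this. It is only a sketch: `QuickReject`, `DefinitelyDifferent` and the 1024-byte default are illustrative assumptions, and a `false` result means "not yet known", so a full comparison is still needed afterwards.

```csharp
using System;
using System.IO;

public static class QuickReject
{
    // Returns true when the files are certainly different.
    // Returns false when size, first block and last block all match (still inconclusive).
    public static bool DefinitelyDifferent(string pathA, string pathB, int blockSize = 1024)
    {
        using var a = File.OpenRead(pathA);
        using var b = File.OpenRead(pathB);
        if (a.Length != b.Length) return true;

        long length = a.Length;
        if (!BlockEquals(a, b, 0, (int)Math.Min(blockSize, length))) return true;

        long tailStart = Math.Max(0, length - blockSize);
        return !BlockEquals(a, b, tailStart, (int)(length - tailStart));
    }

    private static bool BlockEquals(Stream a, Stream b, long offset, int count)
    {
        var bufA = new byte[count];
        var bufB = new byte[count];
        int readA = ReadAt(a, offset, bufA);
        int readB = ReadAt(b, offset, bufB);
        return readA == readB && bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB));
    }

    // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
    private static int ReadAt(Stream s, long offset, byte[] buffer)
    {
        s.Seek(offset, SeekOrigin.Begin);
        int total = 0, n;
        while (total < buffer.Length && (n = s.Read(buffer, total, buffer.Length - total)) > 0)
            total += n;
        return total;
    }
}
```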
Azure 9/7/2022
yes! i want to find out
Kouhai 9/7/2022
The header for two files can be the same while the rest is different, even for a simple bmp file.
Azure 9/7/2022
hmm, got it
Susko3 9/7/2022
well.... you're the one with the files
Jester 9/7/2022
it's possible to listen for file changes in a folder. maybe make a hash for every changed file so you can compare hashes?
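A rough sketch of that idea, assuming .NET 5 or later. `HashCache` and its members are hypothetical names, SHA-256 is just one possible choice of hash, and rename events are ignored to keep the example short. `FileSystemWatcher` events can fire while the writer still has the file open, so failed reads are simply skipped here.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;

public sealed class HashCache : IDisposable
{
    private readonly ConcurrentDictionary<string, string> _hashes = new();
    private readonly FileSystemWatcher _watcher;

    public HashCache(string folder)
    {
        _watcher = new FileSystemWatcher(folder);
        _watcher.Created += (sender, e) => TryRefresh(e.FullPath);
        _watcher.Changed += (sender, e) => TryRefresh(e.FullPath);
        _watcher.Deleted += (sender, e) => _hashes.TryRemove(e.FullPath, out _);
        _watcher.EnableRaisingEvents = true;
    }

    // Recomputes and stores the hash of one file.
    public string Refresh(string path)
    {
        using var stream = File.OpenRead(path);
        using var sha = SHA256.Create();
        string hash = Convert.ToHexString(sha.ComputeHash(stream));
        _hashes[path] = hash;
        return hash;
    }

    private void TryRefresh(string path)
    {
        try { Refresh(path); }
        catch (IOException) { /* file may still be locked by the writer */ }
        catch (UnauthorizedAccessException) { /* e.g. the path is a directory */ }
    }

    // True only when both files have a cached hash and the hashes match.
    public bool SameContent(string pathA, string pathB) =>
        _hashes.TryGetValue(pathA, out var a) &&
        _hashes.TryGetValue(pathB, out var b) &&
        a == b;

    public void Dispose() => _watcher.Dispose();
}
```

The hashing cost is paid when a file changes rather than at comparison time, which is the only case where hashing is likely to beat a direct byte comparison.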
Jester 9/7/2022
oh yea you can also compare the file name first and then the last modified date
mtreit 9/7/2022
Files are completely arbitrary sequences of bytes. Unless you are restricted to a very specific file format you can't rely on headers - many file formats have no headers. (Think of a text file)

File names and things like the modified time, which are file system metadata and not part of the file itself, tell you nothing about the file contents and should never be relied upon.

If you have two files and want to compare them for equality, hashing both of them is much of the time far more expensive than doing a byte-by-byte comparison. (If you pre-hash and store the hashes for later comparison it can be useful.)

Comparing file sizes first will immediately tell you that the files are different whenever the sizes differ, without reading any file contents.

If the file sizes are the same you can proceed with a byte-wise comparison.
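The size check followed by a byte-wise comparison might be sketched like this. `FileComparer`, `FilesAreEqual` and the 64 KB buffer size are assumptions made for illustration, not something specified above. The comparison stops at the first chunk that differs.

```csharp
using System;
using System.IO;

public static class FileComparer
{
    public static bool FilesAreEqual(string pathA, string pathB, int bufferSize = 64 * 1024)
    {
        // The size comes from file system metadata; no file contents are read here.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;

        using var a = File.OpenRead(pathA);
        using var b = File.OpenRead(pathB);
        var bufA = new byte[bufferSize];
        var bufB = new byte[bufferSize];

        while (true)
        {
            int readA = ReadFull(a, bufA);
            int readB = ReadFull(b, bufB);
            if (readA != readB) return false; // a file changed size while being read
            if (readA == 0) return true;      // both streams ended without a mismatch
            if (!bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB)))
                return false;                 // early exit on the first differing chunk
        }
    }

    // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
    private static int ReadFull(Stream s, byte[] buffer)
    {
        int total = 0, n;
        while (total < buffer.Length && (n = s.Read(buffer, total, buffer.Length - total)) > 0)
            total += n;
        return total;
    }
}
```

Usage would be something like `bool same = FileComparer.FilesAreEqual(pathA, pathB);`.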

If you do precompute hashes consider using a very fast hash (crc32, murmurhash, etc) on portions of the file (first 1K, last 1K, etc) and store that for fast invalidation of files that don't match.
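As a sketch of that kind of precomputed fingerprint: the code below stores the file length plus a CRC-32 over the first and last 1K. `QuickFingerprint` and `Compute` are hypothetical names, and the CRC-32 is hand-rolled only to keep the example self-contained (a library implementation would work just as well). Differing fingerprints prove the files differ; matching fingerprints prove nothing, because the middle of the file is never read.

```csharp
using System;
using System.IO;

public static class QuickFingerprint
{
    private static readonly uint[] Table = BuildTable();

    // Standard reflected CRC-32 table (polynomial 0xEDB88320).
    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Pass a previous result as `crc` to continue over more data.
    public static uint Crc32(ReadOnlySpan<byte> data, uint crc = 0)
    {
        crc = ~crc;
        foreach (byte b in data)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return ~crc;
    }

    // Fingerprint = file length + CRC-32 of the first and last `blockSize` bytes.
    public static (long Length, uint Crc) Compute(string path, int blockSize = 1024)
    {
        using var s = File.OpenRead(path);
        long length = s.Length;
        var buffer = new byte[blockSize];

        int head = ReadAt(s, 0, buffer);
        uint crc = Crc32(buffer.AsSpan(0, head));

        if (length > blockSize)
        {
            // Start the tail after the head so short files are not hashed twice.
            int tail = ReadAt(s, Math.Max(blockSize, length - blockSize), buffer);
            crc = Crc32(buffer.AsSpan(0, tail), crc);
        }
        return (length, crc);
    }

    private static int ReadAt(Stream s, long offset, byte[] buffer)
    {
        s.Seek(offset, SeekOrigin.Begin);
        int total = 0, n;
        while (total < buffer.Length && (n = s.Read(buffer, total, buffer.Length - total)) > 0)
            total += n;
        return total;
    }
}
```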

Note that I can easily pad files with arbitrary extra bytes to make your diff check fail - this is why detection by exact match for things like malware is fragile and not generally a great approach.
Azure 9/13/2022
Thank you