Find file duplicates | Optimize
bookuha (9/13/2022)
How can I optimize this?
https://gist.github.com/bookuha/707027336b40663b14c3cb64b039e9ec
Ero (9/13/2022)
Avoid LINQ, perhaps
Ero (9/13/2022)
Use pointers, stackallocs
Ero (9/13/2022)
perhaps
Ero (9/13/2022)
LINQ will create and go through an enumerator. If you can avoid that, you've already made up some time
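A minimal sketch of that point (hypothetical helper names and a hypothetical 1 MB threshold, not from the gist): both methods collect large files, but the LINQ version allocates a Where iterator and a delegate, while the plain loop only uses the source enumerator itself.

using System.Collections.Generic;
using System.IO;
using System.Linq;

// LINQ version: Where() allocates an iterator object and a delegate
// that are stepped through on enumeration.
static List<FileInfo> LargeFilesLinq(IEnumerable<FileInfo> files) =>
    files.Where(f => f.Length > 1_000_000).ToList();

// Loop version: same result; the foreach still uses the source
// enumerator, but skips the Where iterator and delegate allocations.
static List<FileInfo> LargeFilesLoop(IEnumerable<FileInfo> files)
{
    var result = new List<FileInfo>();
    foreach (var f in files)
        if (f.Length > 1_000_000)
            result.Add(f);
    return result;
}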
Ero (9/13/2022)
I don't really see how the file length matters
Ero (9/13/2022)
2 files can have the same length but have different content
Ero (9/13/2022)
That's a given
qqdev (9/13/2022)
There are even cases with .NET 7 now where an explicit implementation would be slower

Ero (9/13/2022)
void FindDuplicateFiles(string directory)
{
    var dirInfo = new DirectoryInfo(directory);
    var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);
    var skip = 1;
    using var eOuter = fileInfos.GetEnumerator();
    while (eOuter.MoveNext())
    {
        var fiOuter = eOuter.Current;
        var skipped = 0;
        // Restart the inner enumeration, skipping the pairs already compared.
        using var eInner = fileInfos.GetEnumerator();
        while (eInner.MoveNext())
        {
            if (skipped++ < skip)
                continue;
            var fiInner = eInner.Current;
            if (FileContentsEqual(fiOuter, fiInner))
                Console.WriteLine($"{fiOuter.FullName} and {fiInner.FullName} are equal.");
        }
        skip++;
    }
}
bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
    // Files of different lengths can never be equal, so bail out before reading.
    if (fi1.Length != fi2.Length)
        return false;
    using var br1 = new BinaryReader(fi1.OpenRead());
    using var br2 = new BinaryReader(fi2.OpenRead());
    while (br1.BaseStream.Position != br1.BaseStream.Length)
    {
        var b1 = br1.ReadByte();
        var b2 = br2.ReadByte();
        if (b1 != b2)
            return false;
    }
    return true;
}
Ero (9/13/2022)
I dunno
Ero (9/13/2022)
I don't see how making this async has anything to do with the perf of it
Ero (9/13/2022)
Hm, not a bad idea
Ero (9/13/2022)
How does that work? You'd obviously not wanna read the entire file at once
Ero (9/13/2022)
My code isn't great either, going byte by byte
Ero (9/13/2022)
Ideally you'd read in chunks
Ero (9/13/2022)
Then fix those byte arrays and for-loop over the pointers
Ero (9/13/2022)
That'd be the most performant, I think
Ero (9/13/2022)
Actually the hashing approach is pretty decent
Ero (9/13/2022)
I'm not sure where that bottlenecks
Ero (9/13/2022)
Would you even need to read in chunks when you use the hash?
Ero (9/13/2022)
Like if you use SHA-512
Ero (9/13/2022)
Wouldn't that mean the files are equal?
Ero (9/13/2022)
Hm
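A sketch of the hashing idea under discussion (hypothetical names, not from the gist): hash each file once and group by digest. SHA512.ComputeHash(Stream) already reads the stream in buffered chunks internally, so no manual chunking is needed; treating a SHA-512 match as equality assumes no collisions, which is exactly the trade-off being debated here.

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static void FindDuplicatesByHash(string directory)
{
    var files = new DirectoryInfo(directory)
        .EnumerateFiles("*", SearchOption.AllDirectories);
    // Group files by content digest; every file is read exactly once.
    var groups = files.GroupBy(f =>
    {
        using var stream = f.OpenRead();
        using var sha512 = SHA512.Create();
        return Convert.ToBase64String(sha512.ComputeHash(stream));
    });
    // Any bucket with more than one entry holds (hash-)identical files.
    foreach (var group in groups.Where(g => g.Count() > 1))
        Console.WriteLine(string.Join(" and ", group.Select(f => f.FullName)) + " are equal.");
}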
qqdev (9/13/2022)
The sequence-equal approach can probably also be improved
qqdev (9/13/2022)
For files that are almost similar. But probably not worth the time. Maybe there are already optimizations happening, though
qqdev (9/13/2022)
Looks pretty good already
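One way that sequence-equal idea could look (a sketch with hypothetical names): read both files into fixed-size buffers and compare each pair of chunks with Span.SequenceEqual, which is vectorized internally, so no unsafe code is needed.

using System;
using System.IO;

static bool FileContentsEqualChunked(FileInfo fi1, FileInfo fi2)
{
    if (fi1.Length != fi2.Length)
        return false;
    using var s1 = fi1.OpenRead();
    using var s2 = fi2.OpenRead();
    var b1 = new byte[4096];
    var b2 = new byte[4096];
    while (true)
    {
        int n1 = ReadFully(s1, b1);
        int n2 = ReadFully(s2, b2);
        if (n1 == 0 && n2 == 0)
            return true; // both streams exhausted without a mismatch
        if (!b1.AsSpan(0, n1).SequenceEqual(b2.AsSpan(0, n2)))
            return false;
    }
}

// Stream.Read may return fewer bytes than requested; loop until the
// buffer is full or the stream ends.
static int ReadFully(Stream stream, byte[] buffer)
{
    int total = 0, n;
    while (total < buffer.Length &&
           (n = stream.Read(buffer, total, buffer.Length - total)) > 0)
        total += n;
    return total;
}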
Ero (9/13/2022)
void FindDuplicateFiles(string directory)
{
    var dirInfo = new DirectoryInfo(directory);
    var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);
    var skip = 1;
    using var eOuter = fileInfos.GetEnumerator();
    while (eOuter.MoveNext())
    {
        var fiOuter = eOuter.Current;
        var skipped = 0;
        using var eInner = fileInfos.GetEnumerator();
        while (eInner.MoveNext())
        {
            if (skipped++ < skip)
                continue;
            var fiInner = eInner.Current;
            if (FileContentsEqual(fiOuter, fiInner))
                Console.WriteLine($"{fiOuter.FullName} and {fiInner.FullName} are equal.");
        }
        skip++;
    }
}
unsafe bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
    if (fi1.Length != fi2.Length)
        return false;
    using var s1 = fi1.OpenRead();
    using var s2 = fi2.OpenRead();
    // Matching hashes are taken as equality; on a mismatch, fall through
    // and compare the raw bytes just to make sure.
    if (ComputeHash(s1) == ComputeHash(s2))
        return true;
    s1.Position = s2.Position = 0;
    using var br1 = new BinaryReader(s1);
    using var br2 = new BinaryReader(s2);
    const int CHUNK_SIZE = 256;
    while (br1.BaseStream.Position != br1.BaseStream.Length)
    {
        var b1 = br1.ReadBytes(CHUNK_SIZE);
        var b2 = br2.ReadBytes(CHUNK_SIZE);
        fixed (byte* pB1 = b1, pB2 = b2)
        {
            for (int i = 0; i < b1.Length; i++)
            {
                if (pB1[i] != pB2[i])
                    return false;
            }
        }
    }
    return true;
}
string ComputeHash(Stream stream)
{
    using var sha512 = SHA512.Create();
    var hash = sha512.ComputeHash(stream);
    return Convert.ToBase64String(hash);
}
Ero (9/13/2022)
Pepega-ass code
Ero (9/13/2022)
Don't know if that one using declaration is legal
Ero (9/13/2022)
On my phone lol
Ero (9/13/2022)
Ah yeah, that's fair
Ero (9/13/2022)
Obviously
Ero (9/13/2022)
Since it's smaller
Ero (9/13/2022)
But just to make sure
bookuha (9/13/2022)
I can’t really dive into the solutions right now, driving home. Thank you
bookuha (9/13/2022)
Will check a bit later
I am currently grouping things up by their size
bookuha (9/13/2022)
And then comparing files in the resulting buckets
bookuha (9/13/2022)
But my comparison approach is not good
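A sketch of that bucketing approach (hypothetical names; FileContentsEqual is the comparer from above): files with different lengths can never match, so grouping by length first means most files are never opened at all.

using System;
using System.IO;
using System.Linq;

static void FindDuplicatesBySize(string directory)
{
    var files = new DirectoryInfo(directory)
        .EnumerateFiles("*", SearchOption.AllDirectories);
    // Only buckets holding at least two files of the same length need a content check.
    foreach (var bucket in files.GroupBy(f => f.Length).Where(g => g.Count() > 1))
    {
        var candidates = bucket.ToArray();
        for (int i = 0; i < candidates.Length; i++)
            for (int j = i + 1; j < candidates.Length; j++)
                if (FileContentsEqual(candidates[i], candidates[j]))
                    Console.WriteLine(
                        $"{candidates[i].FullName} and {candidates[j].FullName} are equal.");
    }
}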
Ero (9/13/2022)
Try to make my approaches as async as possible. I don't use async programming enough to know what's possible and good. Perhaps WhenAll the BinaryReader reads? And cache the hashes
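A rough sketch of that suggestion (hypothetical names; whether it actually helps depends on the disk): hash the two files concurrently with Task.WhenAll and memoize digests by path so no file is hashed twice. As in the code above, equal hashes are treated as equality.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

static readonly ConcurrentDictionary<string, string> HashCache = new();

static async Task<bool> FileContentsEqualAsync(FileInfo fi1, FileInfo fi2)
{
    if (fi1.Length != fi2.Length)
        return false;
    // Hash both files concurrently; each path is hashed at most once overall.
    var hashes = await Task.WhenAll(HashFileAsync(fi1), HashFileAsync(fi2));
    return hashes[0] == hashes[1];
}

static async Task<string> HashFileAsync(FileInfo fi)
{
    if (HashCache.TryGetValue(fi.FullName, out var cached))
        return cached;
    await using var stream = fi.OpenRead();
    using var sha512 = SHA512.Create();
    var hash = await sha512.ComputeHashAsync(stream);
    return HashCache.GetOrAdd(fi.FullName, Convert.ToBase64String(hash));
}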
Ero (9/13/2022)
Rare
bookuha (9/13/2022)
Oh, yes. Thank you
Ero (9/13/2022)
Hah, yeah
Ero (9/13/2022)
I won't be home for another 5 hours so it's difficult for me
Ero (9/13/2022)
Ah, I was assuming the Steam directory
Ero (9/13/2022)
Also different for Linux, I imagine
Ero (9/13/2022)
You guys have like a billion empty config files
bookuha (9/14/2022)
Thank you guys!