Find file duplicates | Optimize

EEro9/13/2022
Avoid LINQ, perhaps
EEro9/13/2022
Use pointers, stackallocs
EEro9/13/2022
perhaps
EEro9/13/2022
LINQ will create and go through an enumerator. If you can avoid that, you've already made up some time
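A minimal sketch of the difference (made-up names, purely illustrative): Skip() wraps the source in an extra iterator, while a counter in the loop does the same skipping with the one enumerator the foreach already creates.

using System;
using System.Collections.Generic;
using System.Linq;

// Same skipping behaviour two ways.
static void PrintAllButFirst(IEnumerable<string> items, int skips)
{
  // LINQ: Enumerable.Skip allocates another iterator object around the source.
  foreach (var item in items.Skip(skips))
    Console.WriteLine(item);

  // Manual: one enumerator, one counter, no extra allocation.
  var skipped = 0;
  foreach (var item in items)
  {
    if (skipped++ < skips)
      continue;
    Console.WriteLine(item);
  }
}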
EEro9/13/2022
I don't really see how the file length matters
EEro9/13/2022
2 files can have the same length but have different content
EEro9/13/2022
That's a given
Qqqdev9/13/2022
There are even cases with .NET 7 now where an explicit implementation would be slower :kekw:
EEro9/13/2022
using System;
using System.IO;

void FindDuplicateFiles(string directory)
{
  var dirInfo = new DirectoryInfo(directory);
  var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

  var skips = 1;
  using var eOuter = fileInfos.GetEnumerator();

  while (eOuter.MoveNext())
  {
    var fiOuter = eOuter.Current;

    var skipped = 0;
    using var eInner = fileInfos.GetEnumerator();

    while (eInner.MoveNext())
    {
      // Skip the files already handled by earlier outer iterations (and the file itself),
      // so every pair gets compared exactly once.
      if (skipped++ < skips)
        continue;

      var fiInner = eInner.Current;

      if (FileContentsEqual(fiOuter, fiInner))
        Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
    }

    skips++;
  }
}

bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
  // Different lengths -> can't be identical, no need to read anything.
  if (fi1.Length != fi2.Length)
    return false;

  using var br1 = new BinaryReader(fi1.OpenRead());
  using var br2 = new BinaryReader(fi2.OpenRead());

  // Both files have the same length, so checking one stream's position is enough.
  while (br1.BaseStream.Position != br1.BaseStream.Length)
  {
    var b1 = br1.ReadByte();
    var b2 = br2.ReadByte();

    if (b1 != b2)
      return false;
  }

  return true;
}
EEro9/13/2022
I dunno
EEro9/13/2022
I don't see how making this async has anything to do with the perf of it
EEro9/13/2022
Hm, not a bad idea
EEro9/13/2022
How does that work? You'd obviously not wanna read the entire file at once
EEro9/13/2022
My code isn't great either, going byte by byte
EEro9/13/2022
Ideally you'd read in chunks
EEro9/13/2022
Then fix those byte arrays and for-loop over the pointers
EEro9/13/2022
That'd be the most performant i think
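A chunked version along those lines, sketched with Span<byte>.SequenceEqual instead of raw pointers (SequenceEqual is vectorized internally, so it usually beats a hand-rolled pointer loop); the 64 KB buffer size is just a guess.

using System;
using System.IO;

// Sketch: compare two streams of equal length chunk by chunk.
static bool StreamsEqual(Stream s1, Stream s2, int bufferSize = 64 * 1024)
{
  var b1 = new byte[bufferSize];
  var b2 = new byte[bufferSize];

  while (true)
  {
    var n1 = ReadFull(s1, b1);
    var n2 = ReadFull(s2, b2);

    if (n1 != n2)
      return false;           // shouldn't happen for equal-length files
    if (n1 == 0)
      return true;            // both streams exhausted, no difference found

    if (!b1.AsSpan(0, n1).SequenceEqual(b2.AsSpan(0, n2)))
      return false;
  }
}

// Read() may return fewer bytes than requested, so keep reading until the buffer is full or EOF.
static int ReadFull(Stream s, byte[] buffer)
{
  var total = 0;
  int read;
  while (total < buffer.Length && (read = s.Read(buffer, total, buffer.Length - total)) > 0)
    total += read;
  return total;
}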
EEro9/13/2022
Actually the hashing approach is pretty decent
EEro9/13/2022
I'm not sure where that bottlenecks
EEro9/13/2022
Would you even need to read in chunks when you use the hash?
EEro9/13/2022
Like if you use sha512
EEro9/13/2022
Wouldn't that mean the files are equal?
EEro9/13/2022
Hm
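For reference, a sketch of the pure hashing idea (made-up names). ComputeHash(Stream) already reads the file in buffered chunks internally, so there's no need to pull the whole file into memory yourself.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Sketch: bucket files by SHA-512 of their contents; any bucket with more than one path holds duplicates.
static Dictionary<string, List<string>> GroupByContentHash(string directory)
{
  var buckets = new Dictionary<string, List<string>>();
  using var sha512 = SHA512.Create();

  foreach (var path in Directory.EnumerateFiles(directory, "*", SearchOption.AllDirectories))
  {
    using var stream = File.OpenRead(path);
    var key = Convert.ToBase64String(sha512.ComputeHash(stream)); // streams the file, no full read into memory

    if (!buckets.TryGetValue(key, out var list))
      buckets[key] = list = new List<string>();
    list.Add(path);
  }

  return buckets;
}

A matching SHA-512 is as good as equality in practice; a byte-for-byte pass only matters if you want to rule out a collision outright.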
Qqqdev9/13/2022
The sequence equal approach can probably also be improved
Qqqdev9/13/2022
For files that are almost similar. But probably not worth the time. Maybe there are already optimizations happening tho
Qqqdev9/13/2022
Looks already pretty good
EEro9/13/2022
using System;
using System.IO;
using System.Security.Cryptography;

void FindDuplicateFiles(string directory)
{
  var dirInfo = new DirectoryInfo(directory);
  var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

  var skips = 1;
  using var eOuter = fileInfos.GetEnumerator();

  while (eOuter.MoveNext())
  {
    var fiOuter = eOuter.Current;

    var skipped = 0;
    using var eInner = fileInfos.GetEnumerator();

    while (eInner.MoveNext())
    {
      // Skip the pairs already covered by earlier outer iterations (and the file itself).
      if (skipped++ < skips)
        continue;

      var fiInner = eInner.Current;

      if (FileContentsEqual(fiOuter, fiInner))
        Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
    }

    skips++;
  }
}

unsafe bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
  // Different lengths -> can't be identical.
  if (fi1.Length != fi2.Length)
    return false;

  using var s1 = fi1.OpenRead();
  using var s2 = fi2.OpenRead();

  // Different hashes -> the contents differ, no need to touch the bytes again.
  if (ComputeHash(s1) != ComputeHash(s2))
    return false;

  // Hashes match, so the files are equal for all practical purposes (SHA-512 collisions aside);
  // rewind and confirm byte-for-byte in chunks just to make sure.
  s1.Position = s2.Position = 0;

  using var br1 = new BinaryReader(s1);
  using var br2 = new BinaryReader(s2);

  const int CHUNK_SIZE = 256;

  while (br1.BaseStream.Position != br1.BaseStream.Length)
  {
    var b1 = br1.ReadBytes(CHUNK_SIZE);
    var b2 = br2.ReadBytes(CHUNK_SIZE);

    fixed (byte* pB1 = b1, pB2 = b2)
    {
      for (int i = 0; i < b1.Length; i++)
      {
        if (pB1[i] != pB2[i])
          return false;
      }
    }
  }

  return true;
}

string ComputeHash(Stream stream)
{
  using var sha512 = SHA512.Create();

  var hash = sha512.ComputeHash(stream);
  return Convert.ToBase64String(hash);
}
EEro9/13/2022
Pepega ass code
EEro9/13/2022
Don't know if that one using directive is legal
EEro9/13/2022
On my phone lol
EEro9/13/2022
Ah yeah that's fair
EEro9/13/2022
Obviously
EEro9/13/2022
Since it's smaller
EEro9/13/2022
But just to make sure
Bbookuha9/13/2022
I can’t really dive into the solutions right now, driving home. Thank you
Bbookuha9/13/2022
Will check a bit later
I am currently grouping things up by their size
Bbookuha9/13/2022
And then comparing files in the resulting buckets
Bbookuha9/13/2022
But my comparison approach is not good
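A sketch of that size-bucket step (hypothetical names): only files with the same Length can possibly match, so everything else never even gets opened.

using System.Collections.Generic;
using System.IO;

// Sketch: group files by length; only buckets with 2+ entries need a content comparison.
static Dictionary<long, List<FileInfo>> GroupBySize(string directory)
{
  var buckets = new Dictionary<long, List<FileInfo>>();

  foreach (var fi in new DirectoryInfo(directory).EnumerateFiles("*", SearchOption.AllDirectories))
  {
    if (!buckets.TryGetValue(fi.Length, out var list))
      buckets[fi.Length] = list = new List<FileInfo>();
    list.Add(fi);
  }

  return buckets;
}

The buckets then feed whatever comparison approach you settle on, so most files are ruled out without any I/O.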
EEro9/13/2022
Try to make my approaches async as far as possible. I don't use async programming enough to know what's possible and good. Perhaps WhenAll the BinaryReader reads? And cache the hashes
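A rough sketch of both ideas, with assumed names and buffer size (partial reads and cache eviction glossed over): issue both chunk reads and await them together, and remember each file's hash after the first time it's computed.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hash cache: each file is hashed at most once, no matter how many pairs it appears in.
static readonly Dictionary<string, string> HashCache = new();

static string GetCachedHash(FileInfo fi)
{
  if (!HashCache.TryGetValue(fi.FullName, out var hash))
  {
    using var sha512 = SHA512.Create();
    using var stream = fi.OpenRead();
    HashCache[fi.FullName] = hash = Convert.ToBase64String(sha512.ComputeHash(stream));
  }

  return hash;
}

// "WhenAll the reads": start both chunk reads, then await them together instead of one after the other.
static async Task<bool> FileContentsEqualAsync(FileInfo fi1, FileInfo fi2, int bufferSize = 64 * 1024)
{
  if (fi1.Length != fi2.Length)
    return false;

  if (GetCachedHash(fi1) != GetCachedHash(fi2))
    return false;

  await using var s1 = fi1.OpenRead();
  await using var s2 = fi2.OpenRead();

  var b1 = new byte[bufferSize];
  var b2 = new byte[bufferSize];

  while (true)
  {
    // Note: ReadAsync may return fewer bytes than requested; a full version would refill the buffers.
    var reads = await Task.WhenAll(s1.ReadAsync(b1, 0, bufferSize), s2.ReadAsync(b2, 0, bufferSize));

    if (reads[0] != reads[1])
      return false;
    if (reads[0] == 0)
      return true;

    if (!b1.AsSpan(0, reads[0]).SequenceEqual(b2.AsSpan(0, reads[1])))
      return false;
  }
}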
EEro9/13/2022
Rare
Bbookuha9/13/2022
Oh, yes. Thank you
EEro9/13/2022
Hah, yeah
EEro9/13/2022
I won't be home for another 5 hours so it's difficult for me
EEro9/13/2022
Ah i was assuming the steam directory
EEro9/13/2022
Also different for Linux, I imagine
EEro9/13/2022
You guys have like a billion empty config files
Bbookuha9/14/2022
Thank you guys!