Find file duplicates | Optimize

Eero 9/13/2022
Avoid LINQ, perhaps. Use pointers and stackallocs, perhaps.
LINQ will create and go through an enumerator. If you can avoid that, you've already made up some time.
I don't really see how the file length matters. Two files can have the same length but different content. That's a given.
Qqqdev 9/13/2022
There are even cases with .NET 7 now where an explicit implementation would be slower than LINQ. kekw
Eero 9/13/2022
using System;
using System.IO;

void FindDuplicateFiles(string directory)
{
    var dirInfo = new DirectoryInfo(directory);
    var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

    // The second enumerator skips this many files so each pair is compared only once.
    var skips = 1;
    using var eInner = fileInfos.GetEnumerator();

    while (eInner.MoveNext())
    {
        var fiInner = eInner.Current;

        var skipped = 0;
        using var eOuter = fileInfos.GetEnumerator();

        while (eOuter.MoveNext())
        {
            if (skipped++ < skips)
                continue;

            var fiOuter = eOuter.Current;

            if (FileContentsEqual(fiInner, fiOuter))
                Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
        }

        skips++;
    }
}

bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
    // Files of different length can never have the same content.
    if (fi1.Length != fi2.Length)
        return false;

    using var br1 = new BinaryReader(fi1.OpenRead());
    using var br2 = new BinaryReader(fi2.OpenRead());

    // Both files have the same length, so checking one position is enough.
    while (br1.BaseStream.Position != br1.BaseStream.Length)
    {
        var b1 = br1.ReadByte();
        var b2 = br2.ReadByte();

        if (b1 != b2)
            return false;
    }

    return true;
}
I dunno, I don't see how making this async has anything to do with the perf of it.
Hm, not a bad idea. How does that work?
You'd obviously not wanna read the entire file at once.
My code isn't great either, going byte by byte. Ideally you'd read in chunks, then fix those byte arrays and loop over the pointers. That'd be the most performant, I think.
Actually, the hashing approach is pretty decent. I'm not sure where that bottlenecks.
Would you even need to read in chunks when you use the hash? Like, if you use SHA-512, wouldn't that mean the files are equal?
Hm.
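On the hash question: different SHA-512 digests prove that two files differ, while equal digests make equal content overwhelmingly likely without strictly guaranteeing it. A minimal sketch of that check follows. The helper names `HashFile` and `HashesEqual` are made up for illustration; only `SHA512` and `FileInfo` come from the thread. `ComputeHash(Stream)` reads the stream incrementally, so the file is not loaded into memory in one go.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper (not from the thread): SHA-512 digest of a file's contents.
byte[] HashFile(FileInfo file)
{
    using var stream = file.OpenRead();
    using var sha512 = SHA512.Create();
    return sha512.ComputeHash(stream);
}

// Different digests prove the files differ. Equal digests make a match
// overwhelmingly likely; a byte-wise check is only needed for certainty.
bool HashesEqual(FileInfo a, FileInfo b)
{
    // Cheap pre-check: different lengths can never match.
    if (a.Length != b.Length)
        return false;

    return HashFile(a).AsSpan().SequenceEqual(HashFile(b));
}
```

Comparing the raw digest bytes avoids the Base64 string allocation, though that cost is tiny next to reading the files.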
Qqqdev 9/13/2022
The SequenceEqual approach can probably also be improved for files that are nearly identical, but it's probably not worth the time.
Maybe there are already optimizations happening, though: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs,998a36a55f580ab1
Looks pretty good already.
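Combining the chunked-read idea with the span-based `SequenceEqual` linked above, a comparison could look like the sketch below. The names `ChunkedContentsEqual` and `FillBuffer` and the 4096-byte chunk size are assumptions for illustration, not something from the thread.

```csharp
using System;
using System.IO;

// Hypothetical helper: reads until the buffer is full or the stream ends,
// because a single Stream.Read call may return fewer bytes than requested.
int FillBuffer(Stream stream, Span<byte> buffer)
{
    var total = 0;
    while (total < buffer.Length)
    {
        var read = stream.Read(buffer.Slice(total));
        if (read == 0)
            break;
        total += read;
    }
    return total;
}

bool ChunkedContentsEqual(FileInfo fi1, FileInfo fi2)
{
    if (fi1.Length != fi2.Length)
        return false;

    using var s1 = fi1.OpenRead();
    using var s2 = fi2.OpenRead();

    // Assumed chunk size; stackalloc keeps both buffers off the heap.
    const int ChunkSize = 4096;
    Span<byte> b1 = stackalloc byte[ChunkSize];
    Span<byte> b2 = stackalloc byte[ChunkSize];

    while (true)
    {
        var n1 = FillBuffer(s1, b1);
        var n2 = FillBuffer(s2, b2);

        if (n1 != n2)
            return false; // one file changed size while being read
        if (n1 == 0)
            return true;  // both streams ended without a mismatch

        // Compare only the bytes that were actually read in this chunk.
        if (!b1.Slice(0, n1).SequenceEqual(b2.Slice(0, n2)))
            return false;
    }
}
```

This stays in safe code: no `unsafe`, `fixed`, or manual pointer loop is needed, and the comparison stops at the first chunk that differs.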
Eero 9/13/2022
using System;
using System.IO;
using System.Security.Cryptography;

void FindDuplicateFiles(string directory)
{
    var dirInfo = new DirectoryInfo(directory);
    var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

    // The inner loop skips this many files so each pair is compared only once.
    var skips = 1;
    using var eOuter = fileInfos.GetEnumerator();

    while (eOuter.MoveNext())
    {
        var fiOuter = eOuter.Current;

        var skipped = 0;
        using var eInner = fileInfos.GetEnumerator();

        while (eInner.MoveNext())
        {
            if (skipped++ < skips)
                continue;

            var fiInner = eInner.Current;

            if (FileContentsEqual(fiOuter, fiInner))
                Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
        }

        skips++;
    }
}

unsafe bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
    if (fi1.Length != fi2.Length)
        return false;

    using var s1 = fi1.OpenRead();
    using var s2 = fi2.OpenRead();

    // Different hashes prove the contents differ.
    if (ComputeHash(s1) != ComputeHash(s2))
        return false;

    // Equal hashes almost certainly mean equal files; rewind and compare the bytes to make sure.
    s1.Position = s2.Position = 0;

    // The tuple-style using from the original post,
    // `using (BinaryReader br1, BinaryReader br2) = (new(s1), new(s2));`,
    // does not compile, so each reader gets its own using declaration.
    using var br1 = new BinaryReader(s1);
    using var br2 = new BinaryReader(s2);

    const int CHUNK_SIZE = 256;

    while (br1.BaseStream.Position != br1.BaseStream.Length)
    {
        var b1 = br1.ReadBytes(CHUNK_SIZE);
        var b2 = br2.ReadBytes(CHUNK_SIZE);

        // Pin both chunks and compare them through raw pointers.
        fixed (byte* pB1 = b1, pB2 = b2)
        {
            for (int i = 0; i < b1.Length; i++)
            {
                if (pB1[i] != pB2[i])
                    return false;
            }
        }
    }

    return true;
}

string ComputeHash(Stream stream)
{
    using var sha512 = SHA512.Create();

    var hash = sha512.ComputeHash(stream);
    return Convert.ToBase64String(hash);
}
Pepega-ass code.
Don't know if that one using declaration is legal. On my phone lol.
Ah yeah, that's fair. Obviously, since it's smaller. But just to make sure.
Bbookuha 9/13/2022
I can't really dive into the solutions right now, driving home. Thank you. Will check a bit later.
I am currently grouping things up by their size and then comparing files in the resulting buckets, but my comparison approach is not good.
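The bucket-by-size step described here might look like the sketch below. `GroupBySize` is a hypothetical name, and the function only produces candidate buckets; the files inside each bucket still need a content comparison such as the `FileContentsEqual` functions posted earlier in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: buckets files by length and keeps only the buckets
// with at least two files, since a file with a unique length has no duplicate.
List<List<FileInfo>> GroupBySize(string directory)
{
    var buckets = new Dictionary<long, List<FileInfo>>();
    var dirInfo = new DirectoryInfo(directory);

    foreach (var fi in dirInfo.EnumerateFiles("*", SearchOption.AllDirectories))
    {
        if (!buckets.TryGetValue(fi.Length, out var sameSize))
        {
            sameSize = new List<FileInfo>();
            buckets[fi.Length] = sameSize;
        }
        sameSize.Add(fi);
    }

    var candidates = new List<List<FileInfo>>();
    foreach (var bucket in buckets.Values)
    {
        if (bucket.Count > 1)
            candidates.Add(bucket);
    }
    return candidates;
}
```

Pairwise comparison then only runs inside each bucket, which avoids opening files whose length already rules out a match.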
Eero 9/13/2022
Try to make my approaches async as far as possible. I don't use async programming enough to know what's possible and good.
Perhaps WhenAll the BinaryReader reads? And cache the hashes.
Rare
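One way the "async plus cached hashes" suggestion could be sketched: hash each file asynchronously, remember the result by path, and await both hashes of a pair with `Task.WhenAll`. The names `HashFileAsync`, `GetHashAsync`, `HashesMatchAsync`, and `hashCache` are made up for this example, and `ComputeHashAsync` requires .NET 5 or later. Whether this is actually faster than the synchronous version depends on the disk and the file sizes, so it is worth measuring.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hypothetical cache: full path -> in-flight or finished hash task.
var hashCache = new ConcurrentDictionary<string, Task<string>>();

async Task<string> HashFileAsync(string path)
{
    // useAsync: true requests asynchronous file I/O from the OS.
    await using var stream = new FileStream(
        path, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 4096, useAsync: true);
    using var sha512 = SHA512.Create();

    var hash = await sha512.ComputeHashAsync(stream);
    return Convert.ToBase64String(hash);
}

// Later lookups for the same path reuse the cached task instead of rehashing.
Task<string> GetHashAsync(FileInfo fi) =>
    hashCache.GetOrAdd(fi.FullName, key => HashFileAsync(key));

async Task<bool> HashesMatchAsync(FileInfo fi1, FileInfo fi2)
{
    if (fi1.Length != fi2.Length)
        return false;

    // Wait for both hashes together rather than one after the other.
    var hashes = await Task.WhenAll(GetHashAsync(fi1), GetHashAsync(fi2));
    return hashes[0] == hashes[1];
}
```

Because the cache stores tasks rather than finished strings, a file that shows up in many pairs is normally read only once.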
Bbookuha 9/13/2022
Oh, yes. Thank you
Eero 9/13/2022
Hah, yeah. I won't be home for another 5 hours, so it's difficult for me.
Ah, I was assuming the Steam directory. Also different for Linux, imagine.
You guys have like a billion empty config files.
Bbookuha 9/14/2022
Thank you guys!
