Immich2y ago
nkdf.

Duplicate detection not working as expected

I am running v1.60.0 on Docker. I recently uploaded ~40,000 photos from my Google Photos using the bulk upload CLI, expecting that duplicate detection would prevent the lower-quality Google Photos copies from being uploaded where I already had an original-quality version of the same photo. The filenames also match. Did I misunderstand the duplicate detection feature? Is there a way to tell Immich to keep only the larger file of the same name? Thanks
25 Replies
bo0tzz
bo0tzz2y ago
Duplicate detection only operates on identical files, not (yet) on different versions of one
nkdf.
nkdf.OP2y ago
Oh crap.. I screwed up then. If I remove the duplicates from the filesystem, I assume I would have to clean up the database and artifacts somehow?
bo0tzz
bo0tzz2y ago
Yeah, there's not really an easy way to clean up this situation unfortunately. If you've only just gotten started with Immich, the easiest is to just wipe it and begin again.
nkdf.
nkdf.OP2y ago
unfortunately I'm somewhat OG... been on it since the project started and was going to go full commit lol
iriche
iriche2y ago
You can always remove them from the filesystem, then write a script that fetches all assets in the DB, checks whether each file exists, and deletes the asset if it doesn't.
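A minimal sketch of that existence check, not the script from the thread. It assumes the asset list has already been fetched from the API and that each record carries an `originalPath` field (an assumption about the response shape). The function name `find_missing_assets` is made up for illustration. Paths stored by Immich are container paths, so this would need to run somewhere those paths resolve.

```python
import os


def find_missing_assets(assets, exists=os.path.exists):
    """Return ids of assets whose original file is no longer on disk.

    `assets` is a list of dicts as returned by the API. The `originalPath`
    key is an assumption about the response shape, not confirmed here.
    `exists` is injectable so the check can be tested without real files.
    """
    missing = []
    for asset in assets:
        path = asset.get("originalPath")
        # Skip records without a path rather than guessing at one.
        if path and not exists(path):
            missing.append(asset["id"])
    return missing
```

The returned ids could then be passed to the delete endpoint. How Immich reacts to deleting an asset whose file is already gone is not settled in this thread.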
nkdf.
nkdf.OP2y ago
is the db schema somewhere I can reference? although I don't know if I have the scripting skills to attempt.. but may be worth a read
iriche
iriche2y ago
You don't need the DB schema, just use the API @nkdf. https://discord.com/channels/979116623879368755/1117535406259449956 Here is a small script I wrote to remove stuff that came from partner sharing. You can probably reuse some of it and swap in another check for the one I do with JSON. The question, though, is how Immich will handle a delete action on a non-existent file.
bo0tzz
bo0tzz2y ago
There is no SQL schema; the entity definitions are probably what you want: https://github.com/immich-app/immich/tree/main/server/src/infra/entities But indeed, you'd probably be better off scripting against the API. That way all the other bits are handled for you.
nkdf.
nkdf.OP2y ago
interesting.. ok I'll read up
bo0tzz
bo0tzz2y ago
The autogenerated api docs are at https://immich.app/docs/api
iriche
iriche2y ago
But as I pointed out, the biggest issue you will have there is how Immich handles the error when a file doesn't exist on deletion: whether it fails gracefully and still does everything else apart from deleting the original file.
nkdf.
nkdf.OP2y ago
Maybe I'll try to enumerate a list of files to be deleted and pass them into the API that way, instead of cleaning out the filesystem first. Is using 'searchAsset' the only way to return an asset id by searching the filename?
iriche
iriche2y ago
Just do getallassets and use that data
nkdf.
nkdf.OP2y ago
40k assets? It's a small server
iriche
iriche2y ago
Yes, I had around 60k assets and did getall. Took a few seconds.
nkdf.
nkdf.OP2y ago
I just realized I can't compare the results of getallassets, since the 'resolution' returned is all based on EXIF data and there is no information on actual file size. I might need to somehow take the assets and query the filesystem or something... Never mind.. filesizeinbytes. Task completed. Happy to share the script if anyone else runs into the issue.
iriche
iriche2y ago
You could share it here, perfect reference for people
nkdf.
nkdf.OP2y ago
it's too long.. discord truncates it.. do we have a preferred place or will a pastebin do?
iriche
iriche2y ago
You can upload it as a file
bo0tzz
bo0tzz2y ago
Or just put it on github
nkdf.
nkdf.OP2y ago
Here is the Python script I used to detect duplicate images that resulted from uploading the contents of a Google Takeout, where the images were stored in 'storage saver quality', when I already had original-quality images from my device. I have tested the script on my installation of 40k assets; approximately 2800 were duplicates. Usual disclaimers apply, no guarantees are made. I have no coding skills and cannot even write hello world; I made this with the assistance of ChatGPT.
It requires the output of getAllAssets from the Immich API (https://immich.app/docs/api/get-all-assets) stored in a file called output.json. The script first compares originalFileName to identify duplicates. It then checks that the duplicate filenames share the same exifInfo properties make, model, and dateTimeOriginal (in case you have multiple devices that produce the same filenames). Finally it compares fileSizeInBytes and returns the smaller files. The output is a comma-delimited list of ids that you can feed into the deleteAsset API endpoint to remove these files from Immich.
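The attached script itself is not reproduced in this thread. The following is only a rough sketch of the steps described above, not the original code. It assumes output.json holds a JSON array of asset objects with `id`, `originalFileName`, and an `exifInfo` object. The size field is written as `fileSizeInBytes` in this thread, but the API may spell it `fileSizeInByte`, so the sketch accepts either. The function names are made up for illustration.

```python
import json
from collections import defaultdict


def file_size(asset):
    """Return the file size from exifInfo, or None if it is absent.

    Accepts both spellings of the key, since the exact name is an assumption.
    """
    exif = asset.get("exifInfo") or {}
    return exif.get("fileSizeInBytes", exif.get("fileSizeInByte"))


def find_smaller_duplicates(assets):
    """Return ids of assets that are smaller copies of another asset.

    Assets count as the same photo when originalFileName, make, model and
    dateTimeOriginal all match. Within each such group, every asset strictly
    smaller than the largest one is reported.
    """
    groups = defaultdict(list)
    for asset in assets:
        exif = asset.get("exifInfo") or {}
        name = asset.get("originalFileName")
        # Records without a name or size cannot be compared, so skip them.
        if name is None or file_size(asset) is None:
            continue
        key = (name, exif.get("make"), exif.get("model"),
               exif.get("dateTimeOriginal"))
        groups[key].append(asset)

    to_delete = []
    for group in groups.values():
        if len(group) < 2:
            continue
        largest = max(file_size(a) for a in group)
        # Equal-size copies are left alone; only smaller files are listed.
        to_delete.extend(a["id"] for a in group if file_size(a) < largest)
    return to_delete


def load_assets(path):
    """Load the saved getAllAssets response from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `print(",".join(find_smaller_duplicates(load_assets("output.json"))))` would produce the comma-delimited id list. It is worth reviewing that list by hand before passing anything to a delete endpoint.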
leaks-repaired
I was thinking of writing a similar script, except with a focus on pre-exclusion (move the duplicate photos from the takeout folder to some other folder, before pointing Immich's upload utility to it). Would you be interested in starting an immich-scripts git repo or something?
nkdf.
nkdf.OP2y ago
I don't know anything about managing a GitHub repo; maybe one of the regular contributors would be a better person to ask. How were you planning to identify duplicate photos? Part of the reason my script ended up so big was that the iPhone XR and my Canon EOS wrote the same filenames.
leaks-repaired
I guess I'll do the creating-a-new-repo bit myself. Mind if I base my script off of yours?
nkdf.
nkdf.OP2y ago
I don't mind, utilize it however you wish
