Duplicate detection not working as expected
I am running v1.60.0 on Docker. I recently uploaded ~40,000 photos from my Google Photos using the bulk upload / CLI, expecting that duplicate detection would prevent the lower-quality Google Photos copies from being uploaded where I already had an original-quality version of the same photo. The filenames also match. Did I misunderstand the duplicate detection feature? Is there a way to tell Immich to only keep the larger of two files with the same name?
Thanks
Duplicate detection only operates on identical files, not (yet) on different versions of the same photo.
oh crap.. I screwed up then
If I remove the duplicate from the filesystem, I assume I would have to clean up the database and artifacts somehow?
Yeah, there's not really an easy way to clean up this situation unfortunately. If you've only just gotten started with Immich, the easiest is to just wipe it and begin again.
unfortunately I'm somewhat OG... been on it since the project started
and was going to go full commit lol
You can always remove them from the filesystem, then do a script that fetches all assets from the db, checks if each file exists, and deletes the asset if not.
is the db schema somewhere I can reference?
although I don't know if I have the scripting skills to attempt.. but may be worth a read
You don't need the DB schema, just use the API.
@nkdf https://discord.com/channels/979116623879368755/1117535406259449956 here is a small script I did to remove stuff that came from partner sharing; you can probably reuse some of it and swap in a different check instead of the one I do with the JSON.
The question, though, is how Immich will handle a delete action on a non-existent file.
There is no SQL schema; the entity definitions are probably what you want: https://github.com/immich-app/immich/tree/main/server/src/infra/entities
But indeed, you'd probably be better off scripting against the API
That way all the other bits are handled for you
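Roughly like this sketch, for example. It is only an illustration of the "script against the API" idea: the endpoint paths, the x-api-key header, and the delete payload are assumptions based on the linked API docs, so verify them against your Immich version, and note that originalPath is the path as the server container sees it.
```python
# Rough sketch of the "script against the API" idea above, not a tested tool.
# Assumptions: getAllAssets at GET /api/asset and deleteAsset at DELETE /api/asset
# with a JSON body of ids, authenticated via an x-api-key header, as described in
# the Immich API docs. Double-check these against your version.
import os
import requests

IMMICH_URL = "http://localhost:2283"  # your instance
API_KEY = "your-api-key"              # Account Settings -> API Keys
HEADERS = {"x-api-key": API_KEY, "Accept": "application/json"}

assets = requests.get(f"{IMMICH_URL}/api/asset", headers=HEADERS).json()

# originalPath is the path as the Immich server sees it, so run this somewhere
# that path is visible (inside the container, or with the same volume mounted).
missing = [a["id"] for a in assets if not os.path.exists(a["originalPath"])]
print(f"{len(missing)} assets point at files that no longer exist")

if missing:
    r = requests.delete(f"{IMMICH_URL}/api/asset", headers=HEADERS, json={"ids": missing})
    print(r.status_code)
```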
interesting.. ok I'll read up
The autogenerated api docs are at https://immich.app/docs/api
But as I pointed out, the biggest issue you will have there is how Immich handles the errors when a file doesn't exist on deletion. Ideally it fails gracefully and does everything else except deleting the original file.
Maybe I'll try to enumerate a list of files to be deleted and pass them into the api that way
instead of cleaning out the filesystem first
is using 'searchAsset' the only way to return an asset id by searching the filename?
Just do getAllAssets and use that data.
40k assets? That's a small server
Yes, I had around 60k assets and did getall
Took a few seconds
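For reference, dumping the whole getAllAssets response to a file is only a few lines. This is a sketch assuming GET /api/asset with an x-api-key header, per the linked docs:
```python
# Minimal sketch: save the full getAllAssets response to output.json so it can
# be inspected or compared offline. Endpoint path and header are assumptions
# taken from https://immich.app/docs/api/get-all-assets.
import json
import requests

IMMICH_URL = "http://localhost:2283"
API_KEY = "your-api-key"

resp = requests.get(f"{IMMICH_URL}/api/asset", headers={"x-api-key": API_KEY})
resp.raise_for_status()
assets = resp.json()

with open("output.json", "w") as f:
    json.dump(assets, f)

print(f"saved {len(assets)} assets to output.json")
```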
I just realized I can't compare the results of getAllAssets, since the 'resolution' returned is all based on EXIF data and there is no information on actual file size
I might need to somehow take the assets and query the filesystem or something...
nevermind.. fileSizeInBytes
task completed. happy to share the script if anyone else runs into the issue
You could share it here, perfect reference for people
it's too long.. discord truncates it.. do we have a preferred place or will a pastebin do?
You can upload it as a file
Or just put it on github
Here is my Python script that I used to detect duplicate images that resulted from uploading the contents of a Google Takeout where the images were stored in 'storage saver' quality when I already had original-quality images from my device. I have tested the script on my installation of 40k assets; approximately 2,800 were duplicates. Usual disclaimers apply, no guarantees are made. I have no coding skills and I cannot hello world; I made this with the assistance of ChatGPT.
It requires the output of getAllAssets from the Immich API (https://immich.app/docs/api/get-all-assets) stored in a file called output.json. The script then compares the originalFileName to identify duplicates. It further ensures the duplicate filenames share the same exifInfo properties make, model, and dateTimeOriginal (in case you have multiple devices that produce duplicate filenames). Then it compares fileSizeInBytes and returns the smaller files. The final output is a comma-delimited list of ids that you can feed into the deleteAsset API endpoint to remove those files from Immich.
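For anyone reading later, here is a condensed sketch of that logic. It is not the attached script itself; field names like originalFileName, exifInfo, and fileSizeInBytes follow the getAllAssets response described above, so check them against your own output.json.
```python
# Condensed sketch of the dedup logic described above, not the attached script.
# It groups assets by originalFileName plus a few exifInfo fields, keeps the
# largest file in each group, and prints the ids of the smaller copies as a
# comma-delimited list for the deleteAsset endpoint.
import json
from collections import defaultdict

with open("output.json") as f:
    assets = json.load(f)

def size_of(asset):
    # fileSizeInBytes is reported alongside the other exifInfo properties;
    # fall back to 0 if it is missing so the sort still works
    return (asset.get("exifInfo") or {}).get("fileSizeInBytes") or 0

groups = defaultdict(list)
for asset in assets:
    exif = asset.get("exifInfo") or {}
    key = (
        asset.get("originalFileName"),
        exif.get("make"),
        exif.get("model"),
        exif.get("dateTimeOriginal"),
    )
    groups[key].append(asset)

duplicate_ids = []
for key, items in groups.items():
    if len(items) < 2:
        continue
    items.sort(key=size_of, reverse=True)  # largest file first; keep it
    duplicate_ids.extend(a["id"] for a in items[1:])

print(f"{len(duplicate_ids)} duplicates found")
print(",".join(duplicate_ids))
```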
I was thinking of writing a similar script, except with a focus on pre-exclusion (move the duplicate photos from the takeout folder to some other folder before pointing Immich's upload utility at it). Would you be interested in starting an immich-scripts git repo or something?
I don't know anything about managing a GitHub repo, maybe one of the regular contributors would be a better person to ask
How were you planning to identify the duplicate photos? Part of the reason my script ended up so big was that the iPhone XR and my Canon EOS wrote the same filenames
I guess I'll do the creating-a-new-repo bit myself then.
Mind if I base my script off of yours?
I don't mind, utilize it however you wish