Are checksums persisted for similar image searches?

English support for the software AllDup
JR147
Posts: 14
Joined: 24 May 2019, 06:58

Post by JR147 »

Since "Find similar pictures" searches can be very slow when there are many files, I was wondering whether the checksums created for the compared files are persisted (in a database file, etc.). If the same files are included in a future scan, even after AllDup has been closed and relaunched on a later date, the stored checksums could then be looked up instead of being re-created from the source files, which would shorten the total scanning time of any subsequent scan that covers the same files.

As an example, when I do a similar image search using "Compare only files between different source folders", with the following two source folders:
-folder A: where there are no duplicates because they were already removed in previous AllDup sessions, therefore checksums had already been created for all of its images that were previously scanned using "Find similar pictures".
-folder B: where none of its images have been scanned previously.

In my example, it seems that the checksums in both folder A and folder B are being re-created because it takes just as long or longer than it took during the previous scan. In this case, new checksums should only need to be created for folder B, and the only reason any new checksums should be created from files in folder A is if folder A contains new files and/or subfolders that weren't previously scanned, i.e. checksums don't exist for the image file paths in the database.

Persisting checksums for other search methods that look for exact duplicates (such as MD5) would also be useful, but more so for similar picture searches, since they can take a lot longer to complete.
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Are checksums persisted for similar image searches?

Post by Administrator »

No, checksums will not be stored in a database, because there is no foolproof method to determine whether the content of a file has been altered.
JR147
Posts: 14
Joined: 24 May 2019, 06:58

Re: Are checksums persisted for similar image searches?

Post by JR147 »

How about if there was a section on the options screen for users that want to use a database to store past checksums, despite a risk of the checksums being stale? If it was optional to use it (unchecked by default), users could opt to use it only when they are scanning folders where they know the files wouldn't have changed since the last scan.

Just some things to think about if you ever decide to implement a checksum database:
  • If it was enabled in the options, the database could also have some forms of auto-maintenance, such as: at the start of each scan, AllDup could automatically remove checksums for any filepath entries in the database that no longer exist on the user's computer, so those checksums could be re-created during a future scan.
  • Each filepath entry in the database could use multiple columns for each type of checksum that could potentially be created (dHash, pHash, MD5, etc). This way a scan for similar images wouldn't necessarily overwrite any existing checksums that were from a previous scan for exact duplicates, or vice versa.
  • There could also be a Purge button on the options screen for users to manually clear the database entries.
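
The three suggestions above can be sketched as follows. This is only an illustration of the idea, not how AllDup stores anything: the table name `checksums`, its column names, and the helper functions are all hypothetical, and SQLite is used simply because it ships with Python.

```python
import os
import sqlite3

# Hypothetical checksum types, one column each, so that a similar-picture
# scan does not overwrite a value stored by an exact-duplicate scan.
ALLOWED = {"md5", "dhash", "phash"}

def open_db(path=":memory:"):
    """Open the database and create the (assumed) table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS checksums (
               filepath TEXT PRIMARY KEY,
               md5      TEXT,
               dhash    TEXT,
               phash    TEXT
           )"""
    )
    return conn

def store(conn, filepath, kind, value):
    """Set one checksum column and leave the other columns untouched."""
    if kind not in ALLOWED:
        raise ValueError(f"unknown checksum type: {kind}")
    conn.execute("INSERT OR IGNORE INTO checksums (filepath) VALUES (?)", (filepath,))
    conn.execute(f"UPDATE checksums SET {kind} = ? WHERE filepath = ?", (value, filepath))

def lookup(conn, filepath, kind):
    """Return the stored checksum, or None if it has not been created yet."""
    if kind not in ALLOWED:
        raise ValueError(f"unknown checksum type: {kind}")
    row = conn.execute(
        f"SELECT {kind} FROM checksums WHERE filepath = ?", (filepath,)
    ).fetchone()
    return row[0] if row else None

def cleanup(conn, exists=os.path.exists):
    """Auto-maintenance: drop entries whose file no longer exists."""
    paths = [row[0] for row in conn.execute("SELECT filepath FROM checksums")]
    missing = [(p,) for p in paths if not exists(p)]
    conn.executemany("DELETE FROM checksums WHERE filepath = ?", missing)
    return len(missing)

def purge(conn):
    """Manual 'Purge' button: remove every entry."""
    conn.execute("DELETE FROM checksums")
```

With this layout, `store(conn, path, "dhash", ...)` after an earlier `store(conn, path, "md5", ...)` keeps both values, and `cleanup(conn)` could run at the start of each scan.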
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Are checksums persisted for similar image searches?

Post by Administrator »

OK, I will add this to the to-do list.

Another option will be:

"Check file size and file modified date": if either one does not match, a new checksum will be created.
JR147
Posts: 14
Joined: 24 May 2019, 06:58

Re: Are checksums persisted for similar image searches?

Post by JR147 »

Thanks! Yes, if a file still exists at its path in the database, automatically checking its file size and modified date is also a good idea to detect if its checksums should be re-created.
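
A minimal sketch of that check, assuming the cache is a plain dictionary that maps a path to `(size, modified time, checksum)`. The names `get_checksum`, `fingerprint` and `md5_of` are made up for this example and say nothing about how AllDup does it.

```python
import hashlib
import os

def fingerprint(path):
    """Return the file size and the modified time in nanoseconds."""
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns

def md5_of(path):
    """Compute the MD5 checksum of the whole file content."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def get_checksum(path, cache, compute=md5_of):
    """Reuse the cached checksum only if size and modified date still match."""
    size, mtime = fingerprint(path)
    entry = cache.get(path)
    if entry is not None and entry[0] == size and entry[1] == mtime:
        return entry[2]  # cache hit: the file is not read again
    checksum = compute(path)  # new or changed file: re-create the checksum
    cache[path] = (size, mtime, checksum)
    return checksum
```

As noted earlier in the thread, this is not foolproof: a file rewritten with the same size and the same modified date would still return the stale checksum.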
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Are checksums persisted for similar image searches?

Post by Administrator »

Coming soon:
[Attached image: Screenshot - 04.07.2021 , 13_32_03.png]
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Are checksums persisted for similar image searches?

Post by Administrator »

A new version with the database functionality is now available for download:
NEW: Now you can store the checksums calculated during a scan for duplicate files in a database. The advantage is that further file scans require considerably less time because the stored checksums are reused.
NEW: Main Window / Toolbar: Added the button 'Database'.
NEW: Database: The storage of the checksums can be individually activated for the search methods 'File Content', 'Similar Pictures' and 'Similar Audio Files'.
NEW: Database: Option 'Recalculate checksum if the file size was changed'.
NEW: Database: Option 'Recalculate checksum if the file modified date was changed'.
NEW: Database / Tools: The button 'Cleanup' removes all checksums of files that no longer exist.
NEW: Database / Tools: The button 'Reset' removes all checksums from the database.
NEW: Database / Tools: The button 'New' deletes the current database file and creates a new database.
NEW: Search / Progress Details: Added a button to enlarge & reduce the size of the log window.
NEW: Main Window / Filter Lists / Context Menu: Added the command 'Add standard filters'.
NEW: Main Window / Archive Files: Now you can check/uncheck all archive types via a right mouse click on the option 'Scan the following archive types'.
UPD: The Portuguese translation of the user interface has been updated.
UPD: The Chinese translation of the user interface has been updated.
FIX: Various optimizations have been introduced in various sections of AllDup.
Evds
Posts: 10
Joined: 06 Sep 2015, 16:37

Re: Are checksums persisted for similar image searches?

Post by Evds »

Thx for adding the database functionality!

Search method: file size + file content
Comparison method: MD5

If I run a comparison with these settings, the first run takes a long time (I see in the statistics that a lot of checksums are being created).
The second time I run this comparison it takes only a few seconds (no file has changed, so the checksums in the database can be used and no file has to be read?).

Search method: file size + file content
Comparison method: MD5 + ignore metadata of JPEG

When I run a comparison with these settings, the second run takes as long as the first.
I noticed in the statistics that almost no checksums are being created, which explains why the second run takes as long as the first run.
What could be the reason for this?
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Are checksums persisted for similar image searches?

Post by Administrator »

If you ignore the metadata of a file, a new checksum has to be created for each file that contains metadata...
Evds
Posts: 10
Joined: 06 Sep 2015, 16:37

Re: Are checksums persisted for similar image searches?

Post by Evds »

Can't we assume that the data hasn't changed if the file size and the file modified timestamp are the same?
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Are checksums persisted for similar image searches?

Post by Administrator »

Evds wrote: 06 Aug 2021, 01:09 Can't we assume that the data hasn't changed if the file size and the file modified timestamp are the same?
Creating a checksum of a file and creating a checksum of the same file while ignoring its metadata results in two different checksums...
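
To illustrate the point, here is a simplified sketch that hashes a JPEG once as-is and once with its metadata segments dropped. It is an assumption-laden toy, not AllDup's method: `strip_jpeg_metadata` is a hypothetical helper that only skips APP1 to APP15 and COM segments located before the start of the scan data, and it ignores rarer JPEG layouts.

```python
import hashlib

def strip_jpeg_metadata(data: bytes) -> bytes:
    """Drop APP1..APP15 and COM segments that precede the scan data."""
    if data[:2] != b"\xff\xd8":
        return data  # no JPEG start-of-image marker: leave the bytes untouched
    out = bytearray(data[:2])
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            break
        # The 2-byte big-endian length counts itself but not the marker.
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        is_metadata = 0xE1 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            out += data[pos:pos + 2 + length]
        pos += 2 + length
    out += data[pos:]  # copy the scan data and everything after it unchanged
    return bytes(out)

def md5_full(data: bytes) -> str:
    """Checksum of the complete file content, metadata included."""
    return hashlib.md5(data).hexdigest()

def md5_ignoring_metadata(data: bytes) -> str:
    """Checksum of the content with the metadata segments removed."""
    return hashlib.md5(strip_jpeg_metadata(data)).hexdigest()
```

For any file that actually contains such metadata, `md5_full` and `md5_ignoring_metadata` differ, so a database would have to keep the two values as separate entries rather than reuse one for the other.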