Are checksums persisted for similar image searches?
Since "Find similar pictures" searches can be very slow with a large number of files, I was wondering whether the checksums created for the compared files are persisted (in a database file, etc.). If they were, then when the same files are included in a future scan (even after AllDup has been closed and relaunched at a later date), the checksums could be looked up instead of being re-created from the source files, which would shorten the total scan time for any subsequent scan that covers the same files.
As an example, when I do a similar image search using "Compare only files between different source folders", with the following two source folders:
-folder A: where there are no duplicates because they were already removed in previous AllDup sessions, therefore checksums had already been created for all of its images that were previously scanned using "Find similar pictures".
-folder B: where all of its images haven't been scanned previously.
In my example, it seems that the checksums in both folder A and folder B are being re-created because it takes just as long or longer than it took during the previous scan. In this case, new checksums should only need to be created for folder B, and the only reason any new checksums should be created from files in folder A is if folder A contains new files and/or subfolders that weren't previously scanned, i.e. checksums don't exist for the image file paths in the database.
Persisting checksums for other search methods that look for exact duplicates (such as MD5) would also be useful, but more so for similar picture searches, since they can take a lot longer to complete.
-
- Site Admin
- Posts: 4050
- Joined: 04 Oct 2004, 18:38
- Location: Thailand
- Contact:
Re: Are checksums persisted for similar image searches?
No, checksums will not be stored in a database, because there is no foolproof way to tell whether the content of a file has been altered.
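To illustrate the concern, here is a minimal Python sketch. It assumes a cache would decide whether a file changed by looking at cheap metadata such as file size and modification time (the post does not say which checks were considered), and the helper names `looks_unchanged` and `demo` are made up for this example. It shows that a file can be rewritten with different content while both values stay the same:

```python
import os
import tempfile


def looks_unchanged(path, size, mtime_ns):
    """Hypothetical cache check: trust the file if size and mtime match."""
    st = os.stat(path)
    return st.st_size == size and st.st_mtime_ns == mtime_ns


def demo():
    # Create a small file and remember its metadata.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"AAAA")
        path = f.name
    try:
        before = os.stat(path)
        # Overwrite with different content of the same length ...
        with open(path, "wb") as f:
            f.write(b"BBBB")
        # ... and put the old timestamps back.
        os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
        with open(path, "rb") as f:
            content = f.read()
        return content, looks_unchanged(path, before.st_size, before.st_mtime_ns)
    finally:
        os.unlink(path)
```

Only re-reading the file would reveal the change, which is exactly the work a checksum database is meant to avoid.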
Re: Are checksums persisted for similar image searches?
How about if there was a section on the options screen for users that want to use a database to store past checksums, despite a risk of the checksums being stale? If it was optional to use it (unchecked by default), users could opt to use it only when they are scanning folders where they know the files wouldn't have changed since the last scan.
Just some things to think about if you ever decide to implement a checksum database:
- If it was enabled in the options, the database could also have some forms of auto-maintenance, such as: at the start of each scan, AllDup could automatically remove checksums for any filepath entries in the database that no longer exist on the user's computer, so those checksums could be re-created during a future scan.
- Each filepath entry in the database could use multiple columns for each type of checksum that could potentially be created (dHash, pHash, MD5, etc). This way a scan for similar images wouldn't necessarily overwrite any existing checksums that were from a previous scan for exact duplicates, or vice versa.
- There could also be a Purge button on the options screen for users to manually clear the database entries.
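The three suggestions above could be sketched roughly as follows with Python's built-in `sqlite3` module. This is only an illustration of the idea, not how AllDup stores anything: the table name `checksums`, the column names, and the functions `open_db`, `store`, `cleanup` and `purge` are all assumptions made for this example.

```python
import os
import sqlite3

# One column per checksum type (names are assumptions for this sketch).
COLUMNS = ("md5", "dhash", "phash")


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS checksums (
               filepath TEXT PRIMARY KEY,
               md5      TEXT,
               dhash    TEXT,
               phash    TEXT
           )"""
    )
    return conn


def store(conn, filepath, column, value):
    """Set one checksum type without touching the other columns."""
    if column not in COLUMNS:
        raise ValueError(f"unknown checksum type: {column}")
    conn.execute("INSERT OR IGNORE INTO checksums (filepath) VALUES (?)", (filepath,))
    conn.execute(f"UPDATE checksums SET {column} = ? WHERE filepath = ?", (value, filepath))
    conn.commit()


def cleanup(conn):
    """Auto-maintenance: drop entries whose file no longer exists."""
    rows = conn.execute("SELECT filepath FROM checksums").fetchall()
    missing = [(p,) for (p,) in rows if not os.path.exists(p)]
    conn.executemany("DELETE FROM checksums WHERE filepath = ?", missing)
    conn.commit()
    return len(missing)


def purge(conn):
    """Manual 'Purge' button: clear every entry."""
    conn.execute("DELETE FROM checksums")
    conn.commit()
```

Because each checksum type has its own column, storing a dHash for a path leaves an MD5 stored earlier for the same path untouched.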
Re: Are checksums persisted for similar image searches?
OK, I will add this to the to-do list.
Another option will be:
"check file size and file modified date": if either one does not match, a new checksum will be created.
Re: Are checksums persisted for similar image searches?
Thanks! Yes, if a file still exists at its path in the database, automatically checking its file size and modified date is also a good idea to detect if its checksums should be re-created.
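As a rough illustration of that check, here is a minimal Python sketch. It assumes the cache is a plain dictionary keyed by file path and uses MD5 as the checksum; the names `md5_of` and `cached_md5` and the entry layout are hypothetical, not AllDup's actual implementation.

```python
import hashlib
import os


def md5_of(path):
    """Compute the MD5 of a file by reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_md5(cache, path):
    """Return (digest, recomputed).

    The stored digest is reused only if both the file size and the
    modified time still match the values recorded with it.
    """
    st = os.stat(path)
    entry = cache.get(path)
    if entry and entry["size"] == st.st_size and entry["mtime_ns"] == st.st_mtime_ns:
        return entry["md5"], False
    digest = md5_of(path)
    cache[path] = {"size": st.st_size, "mtime_ns": st.st_mtime_ns, "md5": digest}
    return digest, True
```

On a second scan of an unmodified file, `cached_md5` returns the stored digest without opening the file; a change in size or modified date triggers a fresh read.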
Re: Are checksums persisted for similar image searches?
A new version with the database functionality is now available for download:
NEW: Now you can store the checksums calculated during a scan for duplicate files in a database. The advantage is that further file scans require considerably less time because the stored checksums are reused.
NEW: Main Window / Toolbar: Added the button 'Database'.
NEW: Database: The storage of the checksums can be individually activated for the search methods 'File Content', 'Similar Pictures' and 'Similar Audio Files'.
NEW: Database: Option 'Recalculate checksum if the file size was changed'.
NEW: Database: Option 'Recalculate checksum if the file modified date was changed'.
NEW: Database / Tools: The button 'Cleanup' removes all checksums of files that no longer exist.
NEW: Database / Tools: The button 'Reset' removes all checksums from the database.
NEW: Database / Tools: The button 'New' deletes the current database file and creates a new database.
NEW: Search / Progress Details: Added a button to enlarge & reduce the size of the log window.
NEW: Main Window / Filter Lists / Context Menu: Added the command 'Add standard filters'.
NEW: Main Window / Archive Files: Now you can check/uncheck all archive types via a right mouse click on the option 'Scan the following archive types'.
UPD: The Portuguese translation of the user interface has been updated.
UPD: The Chinese translation of the user interface has been updated.
FIX: Various optimizations have been introduced in various sections of AllDup.
Re: Are checksums persisted for similar image searches?
Thx for adding the database functionality!
Search method: file size + file content
Comparison method: MD5
If I run a comparison with these settings, the first run takes a long time (I see in the statistics that a lot of checksums are being created).
The second time I run this comparison, it takes only a few seconds (no file has changed, so the checksums in the database can be used and no file has to be read?).
Search method: file size + file content
Comparison method: MD5 + ignore metadata of jpeg
When I run a comparison with these settings, the second run takes as long as the first.
I noticed that in the statistics almost no checksums are being created, which would explain why the second run takes as long as the first run.
What could be the reason for this?
Re: Are checksums persisted for similar image searches?
If you ignore the metadata of a file, a new checksum has to be created for each file with metadata...
Re: Are checksums persisted for similar image searches?
Can't we assume that the data hasn't changed if the file size and the file modified timestamp are the same?