Suggestion: Idea to improve scanning time when finding exact duplicates
Posted: 06 Jun 2021, 07:58
I ran AllDup looking for exact duplicates in a source folder containing around 70,000 mostly small files, averaging 5-10 KB each. A "File content" search using the MD5 method completed in around 40 minutes. As a test, I then ran a "File size" search, and it completed in about 3 seconds. When I combined the two methods into one search, however, it still took around 40 minutes, so I'm guessing that MD5 checksums are still being created for all of the files in this case.
My idea is this: a File content search is much slower than a File size search, so when both "File size" and "File content" are selected as the search methods, AllDup could create checksums only for files whose size matches that of at least one other file. If two files differ in size, we already know they aren't exact duplicates, so there is no need to compare their contents. In other words, the size check works as a fast fail: check the sizes first, and proceed with the File content search only among files of equal size. For source folders that don't contain a lot of duplicates, I think this could significantly reduce the scanning time of the File content portion.
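To illustrate the idea, here is a minimal Python sketch of a size-first duplicate search. It is not how AllDup is implemented (I don't know its internals); the function names `file_md5` and `find_exact_duplicates` and the 64 KB read chunk are my own assumptions, chosen only for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under root by size first, then hash only the size matches."""
    # Pass 1: cheap metadata lookup, no file contents are read.
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: checksums are created only for files that share a size.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # fast fail: a unique size cannot have an exact duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_md5(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_exact_duplicates` on a folder returns a list of groups, each group being the paths of files with identical size and MD5 checksum. In a folder with few duplicates, most files have a unique size and are never opened at all, which is where the time saving would come from.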
Note: I'm not sure if this improvement has already been made in the newer AllDup versions. I'm on AllDup version 4.4.44, running Windows 10 64-bit (I had previously upgraded to AllDup v4.4.47, but I had an issue with hard links not being created in that version so I downgraded back to version 4.4.44.)