Speed Comparison


Post by therube »

Speed Comparison - Lots of tiny files

Just a FWIW...

(Tests were run with AllDup 3.9.15b.)


These are approximate figures:

120 top-level directories
200 MB directory tree
80,000 files
6,500 folders

each directory is 1% or less of the total size
an average of 70 directories in each of the 120 top-level directories
an average of 777 files within each top-level directory structure
all files are tiny, < 400,000 bytes
the huge majority, 79,000 files, are < 16,000 bytes

While the files are on an HDD, disk access was not a factor.

So, in this case, all comparison methods (byte-by-byte vs. the various hashes) took essentially the same time to complete (a sketch after the logs below illustrates why).

That said, none were particularly efficient in this instance. (Likewise, file deletion seemed a bit inefficient too, taking 2:39 to delete 24,000 files?)


Attachment: 0fav.ARJ


Byte-by-byte:

Code:

01/22/2016 01:09:30 PM - --------------------------------------------------
01/22/2016 01:09:30 PM - Search: File name + File size + File content Byte by Byte
01/22/2016 01:09:30 PM - Determine the file count of all source folders...
01/22/2016 01:09:31 PM - File Count: 80506
01/22/2016 01:09:31 PM - Scan: C:\out
01/22/2016 01:20:55 PM - Found 79429 duplicates with 111,901,790 Bytes in source folder 'C:\out'
01/22/2016 01:20:56 PM - Groups: 1,113
01/22/2016 01:20:56 PM - File Comparison Count: 79,479
01/22/2016 01:20:56 PM - Duplicates: 79429 (98%) (106.72 MB)
01/22/2016 01:20:56 PM - Elapsed time: 00:11:26
MD5:

Code:

01/22/2016 12:38:35 PM - --------------------------------------------------
01/22/2016 12:38:35 PM - Search: File name + File size + File content MD5 (128-Bit)
01/22/2016 12:38:35 PM - Determine the file count of all source folders...
01/22/2016 12:38:36 PM - File Count: 80506
01/22/2016 12:38:36 PM - Scan: C:\out
01/22/2016 12:50:05 PM - Found 79429 duplicates with 111,901,790 Bytes in source folder 'C:\out'
01/22/2016 12:50:06 PM - Groups: 1,113
01/22/2016 12:50:06 PM - File Comparison Count: 79,479
01/22/2016 12:50:06 PM - Duplicates: 79429 (98%) (106.72 MB)
01/22/2016 12:50:06 PM - Elapsed time: 00:11:31
SHA-1:

Code:

01/22/2016 12:19:58 PM - --------------------------------------------------
01/22/2016 12:19:58 PM - Search: File name + File size + File content SHA-1 (160-Bit)
01/22/2016 12:19:58 PM - Determine the file count of all source folders...
01/22/2016 12:19:59 PM - File Count: 80505
01/22/2016 12:19:59 PM - Scan: C:\out
01/22/2016 12:31:24 PM - Found 79429 duplicates with 111,901,790 Bytes in source folder 'C:\out'
01/22/2016 12:31:25 PM - Groups: 1,113
01/22/2016 12:31:25 PM - File Comparison Count: 79,479
01/22/2016 12:31:25 PM - Duplicates: 79429 (98%) (106.72 MB)
01/22/2016 12:31:25 PM - Elapsed time: 00:11:27
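
For a rough intuition of why the method barely matters here: with 80,000 files in 200 MB, the average file is only ~2.5 KB, so per-file overhead (directory traversal, open, one small read) dominates the total and the cost of the hash itself is negligible. A minimal timing sketch of that idea (my own illustration, not AllDup's code; C:\out is just the test tree from the logs above):

Code:

import hashlib
import time
from pathlib import Path

ROOT = Path(r"C:\out")  # assumed: the test tree from the logs above

def timed_pass(algo):
    """One full pass over the tree; hash each file if algo is given."""
    start = time.perf_counter()
    for path in ROOT.rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()          # one small read per tiny file
        if algo:
            hashlib.new(algo, data).hexdigest()
    return time.perf_counter() - start

for algo in (None, "md5", "sha1"):
    print(algo or "read only", "%.1fs" % timed_pass(algo))

On a tree like this, all three passes should come out close together, matching the near-identical ~11:30 elapsed times above.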

Re: Speed Comparison

Post by therube »

Speed Comparison - Lots of bigger files.

200 duplicates
2.5 GB
100 files @ 4 MB
50 files @ 17 MB
63 files @ 19 MB

In this instance, using a hash versus byte-by-byte is night and day: the hash wins.

I'll just quote what the book says:
As expected, the compare method SHA-1 wins the rally. Comparing only the checksums stored inside the RAM reduced the file read access and saved a lot of time.
With byte-by-byte, the disk read activity was just horrendous.
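
What the quote describes is, in essence, the standard group-by-size, then group-by-digest strategy: every candidate file is read from disk exactly once to compute its checksum, and all further comparisons happen against digests held in RAM, while pairwise byte-by-byte comparison keeps going back to the disk. A minimal sketch of that strategy (my own illustration under those assumptions, not AllDup's actual implementation):

Code:

import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    # Pass 1: bucket by file size; only equal-size files can be duplicates.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Pass 2: hash each candidate once, streaming big files in chunks,
    # then compare only the in-RAM digests.
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for path in same_size:
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups

for group in find_duplicates(Path(r"C:\out")):
    print(group)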

Re: Speed Comparison

Post by sdfgdhfgh »

Too bad you don't have the logs for the "bigger" files; also, really big files are missing from the test.

therube wrote: While the files are on an HDD, disk access was not a factor.

Why? With many tiny files on an HDD, disk access is a factor, as seek time is much larger than transfer time.
If it was not a factor, maybe all your files were already sitting in the RAM cache, so you basically compared the speed of reading small files from RAM against reading hash values from RAM.
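
One way to test that theory (my own sketch, not something from the thread): read the whole tree twice and compare the pass times. If the first, potentially cold, pass is about as fast as the second, certainly warm, pass, the data was already in the RAM cache and seek time never came into play.

Code:

import time
from pathlib import Path

def read_all(root):
    """Read every file once and return the total byte count."""
    total = 0
    for path in root.rglob("*"):
        if path.is_file():
            total += len(path.read_bytes())
    return total

root = Path(r"C:\out")  # assumed: the test tree from the logs above
for attempt in ("first pass", "second pass"):
    start = time.perf_counter()
    nbytes = read_all(root)
    print("%s: %d bytes in %.1fs" % (attempt, nbytes, time.perf_counter() - start))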