
Speed Comparison

Posted: 23 Jan 2016, 02:30
by therube
Speed Comparison - Lots of tiny files

Just a FWIW...

(Tests were run with 3.9.15b.)


These are approximate figures...

120 top level directories
200 MB directory tree
80,000 files
6,500 folders

each directory is 1% or less of total size
average of 70 directories in each of the 120 top level directories
average of 777 files within each top level directory structure
all files are tiny, < 400,000 bytes
the huge majority, ~79,000 files, are < 16,000 bytes

While the files are on an HDD, disk access was not a factor.

so, in this case, all comparison methods (byte-by-byte vs. the various hashes)
took essentially the same time to complete

that said, none were particularly efficient in this instance
(likewise, file deletion seemed a bit inefficient too, taking 2:39 to delete 24,000 files)
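
For anyone curious what the two content checks amount to, here is a minimal sketch in Python (my own illustration, not the program's actual code). Either way, every candidate file has to be read in full at least once; with files this tiny, that single read dominates, which would explain why the three methods finish within seconds of each other.

Code: Select all

import hashlib
from collections import defaultdict

def group_by_digest(paths):
    # Hash every candidate file once; identical digests -> duplicate group.
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        groups[digest].append(path)
    return [g for g in groups.values() if len(g) > 1]

def same_content(path_a, path_b, chunk_size=64 * 1024):
    # Byte-by-byte: read both files in parallel, bail out on the first mismatch.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(chunk_size), fb.read(chunk_size)
            if a != b:
                return False
            if not a:          # both files exhausted, contents identical
                return True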



byte-by-byte:

Code: Select all

01/22/2016 01:09:30 PM - --------------------------------------------------
01/22/2016 01:09:30 PM - Search: File name + File size + File content Byte by Byte
01/22/2016 01:09:30 PM - Determine the file count of all source folders...
01/22/2016 01:09:31 PM - File Count: 80506
01/22/2016 01:09:31 PM - Scan: C:\out
01/22/2016 01:20:55 PM - Found 79429 duplicates with 111,901,790 Bytes in source folder 'C:\out'
01/22/2016 01:20:56 PM - Groups: 1,113
01/22/2016 01:20:56 PM - File Comparison Count: 79,479
01/22/2016 01:20:56 PM - Duplicates: 79429 (98%) (106.72 MB)
01/22/2016 01:20:56 PM - Elapsed time: 00:11:26

md5:

Code: Select all

01/22/2016 12:38:35 PM - --------------------------------------------------
01/22/2016 12:38:35 PM - Search: File name + File size + File content MD5 (128-Bit)
01/22/2016 12:38:35 PM - Determine the file count of all source folders...
01/22/2016 12:38:36 PM - File Count: 80506
01/22/2016 12:38:36 PM - Scan: C:\out
01/22/2016 12:50:05 PM - Found 79429 duplicates with 111,901,790 Bytes in source folder 'C:\out'
01/22/2016 12:50:06 PM - Groups: 1,113
01/22/2016 12:50:06 PM - File Comparison Count: 79,479
01/22/2016 12:50:06 PM - Duplicates: 79429 (98%) (106.72 MB)
01/22/2016 12:50:06 PM - Elapsed time: 00:11:31

sha1:

Code: Select all

01/22/2016 12:19:58 PM - --------------------------------------------------
01/22/2016 12:19:58 PM - Search: File name + File size + File content SHA-1 (160-Bit)
01/22/2016 12:19:58 PM - Determine the file count of all source folders...
01/22/2016 12:19:59 PM - File Count: 80505
01/22/2016 12:19:59 PM - Scan: C:\out
01/22/2016 12:31:24 PM - Found 79429 duplicates with 111,901,790 Bytes in source folder 'C:\out'
01/22/2016 12:31:25 PM - Groups: 1,113
01/22/2016 12:31:25 PM - File Comparison Count: 79,479
01/22/2016 12:31:25 PM - Duplicates: 79429 (98%) (106.72 MB)
01/22/2016 12:31:25 PM - Elapsed time: 00:11:27

Re: Speed Comparison

Posted: 27 Jan 2016, 00:36
by therube
Speed Comparison - Lots of bigger files.

200 duplicates
2.5 GB
100 files @ 4 MB
50 files @ 17 MB
63 files @ 19 MB

In this instance, using a hash compared to byte-by-byte is night & day - hash wins.

I'll just quote what the book says:
"As expected, the compare method SHA-1 wins the rally. Comparing only the checksums stored inside the RAM reduced the file read access and saved a lot of time."
With byte-by-byte, the disk file reads were just horrendous.
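
A rough back-of-envelope (my own estimate, not anything measured by the program) of why the gap opens up with big files: hashing reads each file from disk once and then compares short digests in RAM, while a naive byte-by-byte pass over a group of same-size candidates has to re-read file data for every pairwise comparison.

Code: Select all

# Assumed naive pairwise model, purely for illustration; real tools are smarter,
# but still re-read file data instead of comparing cached digests.
n, size = 50, 17 * 1024**2                   # e.g. the 50 files @ 17 MB above
hash_io = n * size                           # each file read once, then RAM-only digest compares
pairwise_io = (n * (n - 1) // 2) * 2 * size  # worst case: every pair read in full
print(f"hash        : ~{hash_io / 1024**2:,.0f} MB read")
print(f"byte-by-byte: ~{pairwise_io / 1024**2:,.0f} MB read (worst case)")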

Re: Speed Comparison

Posted: 03 Aug 2016, 16:24
by sdfgdhfgh
Too bad you don't have the logs for the "bigger" files - and really big files are missing altogether.
"While the files are on an HDD, disk access was not a factor."
Why? With many tiny files on an HDD, disk access is a factor, since seek time is much larger than transfer time...
If it really was not a factor, maybe all your files were already in the drive's RAM cache, so you basically compared the speed of reading small files from RAM against reading hash values from RAM.
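
Rough numbers behind that point (assumed typical HDD figures, not measurements from this test): around 10 ms of seek and rotational latency per file adds up to minutes over 80,000 tiny files, while streaming the 200 MB of actual data takes only seconds at ~100 MB/s.

Code: Select all

# Assumed typical HDD figures, purely illustrative
files = 80_000
data_bytes = 200 * 1024**2        # ~200 MB directory tree
seek_s = 0.010                    # ~10 ms average seek + rotational latency per file
throughput = 100 * 1024**2        # ~100 MB/s sequential transfer

print(f"seek cost    : ~{files * seek_s / 60:.0f} min")   # ~13 min
print(f"transfer cost: ~{data_bytes / throughput:.0f} s") # ~2 s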