Slow hash-based duplicate detection

sdfgdhfgh
Posts: 38
Joined: 02 Feb 2014, 17:36

Slow hash-based duplicate detection

Post by sdfgdhfgh »

Hi,

I have a folder of 1.1 million files totalling 2.4 TB. I ran AllDup to check for content duplicates using SHA-1; after 48 hours it is at 44%. It now adds only 1 or 2 new hashes per second and has done 27 billion comparisons.
I paused AllDup and started creating checksums of all those files, which took less than 9 hours.
I sorted the file with the checksums. The lines are now ordered by hash value, and if two (now consecutive) lines have the same hash, the corresponding files are identical.
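
For reference, the hash-then-sort approach described above can be sketched roughly as follows. This is only an illustration in Python, not the tool that was actually used; the names `sha1_of_file` and `find_duplicates` are made up for the example, and it assumes that equal SHA-1 values are treated as equal content.

```python
import hashlib
from itertools import groupby
from operator import itemgetter
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Hash every file under root, sort once, then group equal neighbours."""
    # One (hash, path) pair per file, sorted so equal hashes become adjacent.
    entries = sorted(
        (sha1_of_file(p), str(p)) for p in Path(root).rglob("*") if p.is_file()
    )
    duplicates = []
    for _, group in groupby(entries, key=itemgetter(0)):
        names = [name for _, name in group]
        if len(names) > 1:
            duplicates.append(names)
    return duplicates
```

Calling `find_duplicates("some/folder")` returns a list of groups, each group being the paths that share one hash. The single sort at the end replaces any per-file searching.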

Why does AllDup take so much longer for this task?

thx
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Slow hash-based duplicate detection

Post by Administrator »

Because searching through large lists makes everything slower with every new item added...
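
To put rough numbers on that effect: the sketch below compares the worst-case number of comparisons when every new item is checked against all earlier ones with a rough estimate for sorting the whole list once. It says nothing about how AllDup is actually implemented; the two helper functions are hypothetical and only do the arithmetic.

```python
import math


def linear_scan_comparisons(n):
    """Worst case: item k is compared against the k - 1 items before it."""
    return n * (n - 1) // 2


def sort_comparisons(n):
    """Rough n * log2(n) estimate for one comparison sort of n items."""
    return int(n * math.log2(n))


n = 1_100_000  # number of files mentioned in the first post
print(linear_scan_comparisons(n))  # 604999450000
print(sort_comparisons(n))         # roughly 22 million
```

The 27 billion comparisons reported at 44% are well below that worst case, but still about three orders of magnitude above the estimate for a single sort.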
sdfgdhfgh
Posts: 38
Joined: 02 Feb 2014, 17:36

Re: Slow hash-based duplicate detection

Post by sdfgdhfgh »

Wouldn't it be useful to not sort/compare the list for every new hash added?
Maybe it would be useful to have an option to sort/compare the lists only every X minutes (or at the end, or on user request/abort)?
(not every x files, because that may be quite a long time depending on the file sizes)
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Slow hash-based duplicate detection

Post by Administrator »

No, that is not possible, because we don't work with a flat list like you did.
sdfgdhfgh
Posts: 38
Joined: 02 Feb 2014, 17:36

Re: Slow hash-based duplicate detection

Post by sdfgdhfgh »

So why not collect new hashes in a flat list and update the more complex internal data structure only every x minutes?
This way
a) reading/hashing and internal data organization can overlap in time.
b) the complex data structure is reorganized less often, which should save a lot of computing time.
c) only about 10 MB of additional RAM per 100,000 files is required (assuming there are even hundreds of thousands of new hashes within x minutes).
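
A minimal sketch of this buffering idea, written in Python purely as an illustration: new hashes go into a flat list, and the list is merged into the main structure only after a time interval has passed. The class `BufferedHashIndex`, its parameters, and the use of a dictionary as the "complex" structure are all assumptions made for the example; AllDup's real data structure is not known here.

```python
import time
from collections import defaultdict


class BufferedHashIndex:
    """Hypothetical index: cheap appends, periodic merge into the main structure."""

    def __init__(self, flush_interval_seconds=300.0, clock=time.monotonic):
        self.index = defaultdict(list)  # hash -> paths (stand-in for the complex structure)
        self.pending = []               # flat list of (hash, path) pairs not yet merged
        self.flush_interval_seconds = flush_interval_seconds
        self.clock = clock
        self.last_flush = clock()

    def add(self, file_hash, path):
        """Record a new hash; merge only when the interval has elapsed."""
        self.pending.append((file_hash, path))
        if self.clock() - self.last_flush >= self.flush_interval_seconds:
            self.flush()

    def flush(self):
        """Merge all pending entries into the main index in one pass."""
        for file_hash, path in self.pending:
            self.index[file_hash].append(path)
        self.pending.clear()
        self.last_flush = self.clock()

    def duplicates(self):
        """Flush, then return the groups of paths that share a hash."""
        self.flush()
        return [paths for paths in self.index.values() if len(paths) > 1]
```

The sketch covers points b) and c). Point a), overlapping hashing with the merge, would additionally need the merge to run on a separate thread, which is left out here.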
eulfnb
Posts: 18
Joined: 01 Apr 2023, 17:32

Re: Slow hash-based duplicate detection

Post by eulfnb »

I may have the same problem, although I only have 855,000 files. When I look at the Task Manager, neither the SSD (0 MB/s) nor the CPU (20%) nor the RAM (41%) is being fully utilized. There seems to be some other reason why it is so slow. One problem seems to be that AllDup is not able to use CPU multithreading.

I have seen that it has been on the to-do list since 2005:
https://www.allsync.biz/phpBB/viewtopic.php?p=927#p927
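
To illustrate what multithreaded hashing could look like in general, here is a small Python sketch that hashes several files on a thread pool. It is not AllDup code; `hash_files_parallel` and `max_workers` are names chosen for the example. In CPython, `hashlib` releases the GIL while hashing buffers larger than 2047 bytes, so the threads can run on several cores. Whether this helps in practice depends on the storage device.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files_parallel(paths, max_workers=4):
    """Hash files on a thread pool and return {path: hex digest}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest.
        return dict(zip(paths, pool.map(sha1_of_file, paths)))
```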
therube
Posts: 322
Joined: 07 Nov 2012, 00:28

Re: Slow hash-based duplicate detection

Post by therube »

From what I can make of things, it has more to do with the number of files than with "multithreading".
eulfnb
Posts: 18
Joined: 01 Apr 2023, 17:32

Re: Slow hash-based duplicate detection

Post by eulfnb »

But what is the limitation if the computer hardware is nearly idle?