Problem with some Unicode characters and Case sensitivity

English support for the software AllDup
apsen
Posts: 11
Joined: 21 Jul 2020, 22:13

Problem with some Unicode characters and Case sensitivity

Post by apsen »

I have found two problems:

1: AllDup is unable to open/process files/paths with some Unicode characters
2: AllDup has trouble processing some folders that have case sensitivity enabled

I'm attaching an archive containing the command file I used to create the test case, the log from running it, the output of "dir /b /s" on the folder holding the created files and folders, and the result of running AllDup on that folder.

Note that the folder in which you run the command file needs to have case sensitivity enabled with "fsutil.exe file setCaseSensitiveInfo C:\Workarea\AllDupTest enable". That requires the Windows Subsystem for Linux to be installed, which can be done from PowerShell by running "Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux". Also, depending on the folder's permissions, the fsutil command may need to be run with administrative privileges.

If the test case is used unmodified, it creates 4 identical files for each tested character: two in the immediate directory and two in a subdirectory named after the tested character. In each pair, one file is named using the tested character and the other using digits.
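The layout described above can be sketched in Python. This is only an illustration, not the attached command file: the function name `create_test_case` is hypothetical, and the numbered top-level directory and quoted file content are assumptions modelled on the "mkdir 4 && mkdir 4\¡ ..." lines shown later in the thread.

```python
import tempfile
from pathlib import Path


def create_test_case(root: Path, index: int, char: str) -> list[Path]:
    """Create the four identical files for one tested character."""
    base = root / str(index)   # numbered directory, e.g. 4
    sub = base / char          # subdirectory named after the character
    sub.mkdir(parents=True, exist_ok=True)

    # Assumed content: the tested character in quotes, as UTF-8 bytes.
    content = f'"{char}"\n'.encode("utf-8")
    paths = [
        base / f"{char}.txt",   # named using the tested character
        base / f"{index}.txt",  # named using digits
        sub / f"{char}.txt",
        sub / f"{index}.txt",
    ]
    for path in paths:
        path.write_bytes(content)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for created in create_test_case(Path(tmp), 4, "¡"):
            print(created.relative_to(tmp))
```

Writing the files from Python avoids any dependence on the console code page, which may help separate console problems from problems in the tool under test.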

In total that creates 394 sets of identical files, but AllDup finds only 189 sets. One of those is incorrect: it contains 10 entries, some of which are duplicate entries (i.e. not files with identical content, but entries with an identical path/filename).
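For an independent count of the expected sets, a minimal sketch like the following could be used. It groups files by a SHA-1 digest of their content; the hash choice and the helper names `duplicate_sets` and `has_repeated_paths` are assumptions for illustration and say nothing about how AllDup compares files internally.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_sets(root: Path) -> list[list[Path]]:
    """Group files under root by SHA-1 of their content; keep groups of 2+."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]


def has_repeated_paths(group: list[Path]) -> bool:
    """True if the same path is listed more than once in a reported group."""
    return len(set(group)) != len(group)
```

Calling `len(duplicate_sets(Path(r"C:\Workarea\AllDupTest")))` on the test folder should give the number of sets to compare against the tool's result, and `has_repeated_paths` can be applied to a group exported from the tool to check for the repeated-entry symptom.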

PS. Just in case: the command file is UTF-8 encoded, so the command prompt needs the corresponding code page selected.
Attachments
AllDupTestResults.zip
(41.29 KiB) Downloaded 589 times
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Problem with some Unicode characters and Case sensitivity

Post by Administrator »

AllDup doesn't differentiate file or folder names by case.
Windows and AllDup don't support the following characters in file or folder names:

/ forward slash
\ backslash
< less than
> greater than
: colon
" double quote
| vertical bar or pipe
? question mark
* asterisk
. period or a space at the end of a file or folder name

Here you will find all the Windows naming conventions for files and folders supported by Windows apps: https://docs.microsoft.com/de-de/window ... ing-a-file

If you have some files with Unicode characters in their names that can't be processed by AllDup, please put them inside a RAR archive and post the file here.
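The restrictions listed above can be expressed as a small check. This is a rough sketch covering only the characters and the trailing period/space rule named in the list; the function name `windows_name_problems` is hypothetical, and other Windows rules (such as reserved device names) are deliberately left out.

```python
# The nine ASCII characters from the list above.
RESERVED = set('/\\<>:"|?*')


def windows_name_problems(name: str) -> list[str]:
    """Return the listed restrictions that a single file or folder name violates."""
    problems = [f"reserved character {ch!r}" for ch in name if ch in RESERVED]
    if name.endswith((".", " ")):
        problems.append("ends with a period or space")
    return problems


if __name__ == "__main__":
    for candidate in ["a:b.txt", "¡.txt", "name."]:
        print(candidate, windows_name_problems(candidate))
```

Note that a name such as "¡.txt" passes this check, since the list only covers ASCII characters.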
therube
Posts: 322
Joined: 07 Nov 2012, 00:28

Re: Problem with some Unicode characters and Case sensitivity

Post by therube »

@apsen, are you also quuzuu?

2. What is the use case for using case sensitivity (other than that Linux does it)?
the command file I used to create test case
Heh. I didn't pick up on that the 1st or 2nd look through.
I "looked" & "saw" what basically looked like gibberish & I was like, OK.

Only after the fact did I realize it was a .cmd file I was looking at & it was using echo to create the files.

Prior to realizing that, I took dir-b-s.txt & mangled that (with Vim), & used echo to create (a batch file) to create the files.

Code:

%s/C:\\Workarea\\AllDupTest\\/echo hi > /
I say "mangled" because I'm not sure if Vim is reading the file correctly (or writing its changes correctly), & my echo commands failed to create some of the file names (those that had what looked like backslashes), much less knowing what (where & how) chcp should be set.


Running your .cmd similarly failed (partially) on my end (presumably because I didn't set chcp properly?).

Code:

cp: cannot stat `n+ƒ.txt': No such file or directory
therube
Posts: 322
Joined: 07 Nov 2012, 00:28

Re: Problem with some Unicode characters and Case sensitivity

Post by therube »

. period or a space at the end of a file or folder name
Not quite sure that is "illegal". (OK, so they recommend against it.)
Might be awkward to work with (like maybe needing fsutil, was it?, or using other creativity to create the file), might be that many utilities will fail when running into them, but don't think they're "illegal".

Kind of like long file names. Easy enough to create, but once you've done so, many utilities, and much of Windows itself (even the Recycle Bin), fail to handle them properly.


Illegals.
(Oops, meant to post that in the trump-builds-a-wall forum ;-).)
apsen
Posts: 11
Joined: 21 Jul 2020, 22:13

Re: Problem with some Unicode characters and Case sensitivity

Post by apsen »

Administrator wrote: 23 Jul 2020, 07:31 AllDup doesn't differentiate file or folder names by case.
Windows and AllDup don't support the following characters in file or folder names:

/ forward slash
\ backslash
< less than
> greater than
: colon
" double quote
| vertical bar or pipe
? question mark
* asterisk
. period or a space at the end of a file or folder name

Here you will find all the Windows naming conventions for files and folders supported by Windows apps: https://docs.microsoft.com/de-de/window ... ing-a-file

If you have some files with Unicode characters in their names that can't be processed by AllDup, please put them inside a RAR archive and post the file here.
Those are not the ASCII characters but different Unicode characters that can be used in file names. As a matter of fact, all those characters come from actual file names on my current Windows system. I have provided a command file that can be run to create those files, but if you prefer to have them in a RAR file, I can run that command file again on my system and provide you with an archive of the resulting directory.
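The distinction between the reserved ASCII characters and other Unicode characters that merely resemble them can be made visible with Python's `unicodedata` module. The sample characters below are assumptions chosen for illustration (only "¡" appears elsewhere in this thread); the actual set tested is in the attached command file.

```python
import unicodedata

ASCII_RESERVED = set('/\\<>:"|?*')

# Hypothetical samples: three look-alikes plus the "¡" seen in this thread.
SAMPLES = ["\uff1a", "\uff1f", "\u2215", "\u00a1"]


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


if __name__ == "__main__":
    for ch in SAMPLES:
        status = "reserved" if ch in ASCII_RESERVED else "not reserved"
        print(describe(ch), "->", status)
```

None of the samples is in the reserved set, which is why such characters can legitimately occur in file names even when they look like a colon, question mark, or slash.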
apsen
Posts: 11
Joined: 21 Jul 2020, 22:13

Re: Problem with some Unicode characters and Case sensitivity

Post by apsen »

therube wrote: 23 Jul 2020, 17:42 @apsen, are you also quuzuu?
No. I searched the forum for that string to see why you would think that, but apparently the search does not cover user names. I'll see if I can find his posts some other way...
therube wrote: 23 Jul 2020, 17:42 2. What is the use case for using case sensitivity (other than that Linux does it)?
the command file I used to create test case
Heh. I didn't pick up on that the 1st or 2nd look through.
I "looked" & "saw" what basically looked like gibberish & I was like, OK.

Only after the fact did I realize it was a .cmd file I was looking at & it was using echo to create the files.

Prior to realizing that, I took dir-b-s.txt & mangled that (with Vim), & used echo to create (a batch file) to create the files.

Code:

%s/C:\\Workarea\\AllDupTest\\/echo hi > /
I say "mangled" because I'm not sure if Vim is reading the file correctly (or writing its changes correctly), & my echo commands failed to create some of the file names (those that had what looked like backslashes), much less knowing what (where & how) chcp should be set.


Running your .cmd similarly failed (partially) on my end (presumably because I didn't set chcp properly?).

Code:

cp: cannot stat `n+ƒ.txt': No such file or directory
The command file is UTF-8 encoded, and you need to run 'chcp 65001' for it to work correctly. I'm pretty sure I mentioned that. Plus, it currently needs case sensitivity enabled to run in its entirety. I'll modify it so it does not need case sensitivity and upload the modified version later.
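Why the code page matters can be illustrated with a short sketch: the bytes of a UTF-8 file are reinterpreted under a different code page. The choice of cp437 as the "wrong" code page is an assumption (it is a common OEM default on en-US systems), and the helper name `misread` is hypothetical.

```python
def misread(text: str, wrong_codepage: str = "cp437") -> str:
    """Encode text as UTF-8, then decode the bytes as a legacy code page."""
    return text.encode("utf-8").decode(wrong_codepage, errors="replace")


if __name__ == "__main__":
    name = "¡.txt"
    print("intended:", name)
    print("misread: ", misread(name))
```

Plain ASCII names survive this round trip unchanged, while each non-ASCII character turns into two or more unrelated characters, so a command file run under the wrong code page may create or look for files under garbled names.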

As for the use case for case sensitivity: I sometimes need to unpack archives created on Unix, and those can contain names that differ only by case. I would like to be able to use AllDup on those as well. Besides, as far as I understand, as long as you do not modify the character case of the file/path, it should work transparently to the program. At least the programs I use have no trouble with those.
Last edited by apsen on 24 Jul 2020, 00:25, edited 1 time in total.
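To check whether an archive would run into this, the names that differ only by case can be listed before unpacking. The sketch below uses Python's `str.casefold` as an approximation; NTFS uses its own case-mapping table, so results may differ for unusual characters, and the function name `case_collisions` is hypothetical.

```python
from collections import defaultdict


def case_collisions(names: list[str]) -> list[list[str]]:
    """Group names that are equal once case differences are ignored."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(case_collisions(["README", "readme", "Makefile"]))
```

Any group returned here can only coexist in one folder when that folder has case sensitivity enabled.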
apsen
Posts: 11
Joined: 21 Jul 2020, 22:13

Re: Problem with some Unicode characters and Case sensitivity

Post by apsen »

Updated test in a RAR archive. This test case does not require case sensitivity and will not expose the related problems.
Attachments
AllDupTest.rar
(141.46 KiB) Downloaded 614 times
therube
Posts: 322
Joined: 07 Nov 2012, 00:28

Re: Problem with some Unicode characters and Case sensitivity

Post by therube »

I still must be doing something wrong, not understanding enough.
With .v2 I get 1..394 directories, 1512 items (so including subdirectories) in total, but still I'm coming up with errors.
(I'm not really familiar with such things, so I'll say my system is plain-jane en-US, with whatever is defaulted with that, if that matters.)

Code:

C:\TMP\SEA\alldup\4>mkdir 394   && mkdirThe system cannot write to the specified device.
 && echoThe system cannot write to the specified device.
1>The system cannot write to the specified device.
 && cpThe system cannot write to the specified device.
 && cpThe system cannot write to the specified device.
 && cpThe system cannot write to the specified device.

cp: accessing `394\\?\\?.txt': Invalid argument

C:\TMP\SEA\alldup\4>chcp
Active code page: 65001

@apsen: quuzuu is a user of a different utility (not here) who asked a very similar question (in concept) either the day before or the same day as you, so I thought there might be some connection, but obviously not.

Code:

07/23/2020 08:57:49 PM - INFO: Unable to detect the VLC Media Player 32-bit version 3 on your system
07/23/2020 08:57:49 PM - --------------------------------------------------
07/23/2020 08:57:49 PM - AllDup 4.4.34 PE
07/23/2020 08:57:49 PM - Search method: File content
07/23/2020 08:57:49 PM - Comparison method: SHA-1 (160-Bit)
07/23/2020 08:57:49 PM - 1.Source folder: C:\TMP\SEA\alldup\4
07/23/2020 08:57:49 PM - Option: Compare files from all source folders
07/23/2020 08:57:49 PM - Folder filter activated: 7
07/23/2020 08:57:49 PM - Filter type: Exclusive
07/23/2020 08:57:49 PM - 1.folder filter: e:\windows
07/23/2020 08:57:49 PM - 2.folder filter: e:\program files (x86)
07/23/2020 08:57:49 PM - 3.folder filter: e:\program files
07/23/2020 08:57:49 PM - 4.folder filter: ?:\system volume information
07/23/2020 08:57:49 PM - 5.folder filter: ?:\recycled
07/23/2020 08:57:49 PM - 6.folder filter: ?:\recycler
07/23/2020 08:57:49 PM - 7.folder filter: ?:\$recycle.bin
07/23/2020 08:57:49 PM - Determine file count of all source folders...
07/23/2020 08:57:50 PM - File count: 725
07/23/2020 08:57:50 PM - Scan: C:\TMP\SEA\alldup\4
07/23/2020 08:57:50 PM - Found 440 duplicates with a total of 3.50 KB inside folder 'C:\TMP\SEA\alldup\4'
07/23/2020 08:57:50 PM - Scanned files: 725
07/23/2020 08:57:50 PM - Groups: 110
07/23/2020 08:57:50 PM - File comparison count: 74,076
07/23/2020 08:57:50 PM - Checksums created: 724
07/23/2020 08:57:50 PM - Duplicates: 440 (60%) (3.50 KB)
07/23/2020 08:57:50 PM - Elapsed time: 00:00:01
apsen
Posts: 11
Joined: 21 Jul 2020, 22:13

Re: Problem with some Unicode characters and Case sensitivity

Post by apsen »

therube wrote: 24 Jul 2020, 07:54 I still must be doing something wrong, not understanding enough.
With .v2 I get 1..394 directories, 1512 items (so including subdirectories) in total, but still I'm coming up with errors.
(I'm not really familiar with such things, so I'll say my system is plain-jane en-US, with whatever is defaulted with that, if that matters.)

Code:

C:\TMP\SEA\alldup\4>mkdir 394   && mkdirThe system cannot write to the specified device.
 && echoThe system cannot write to the specified device.
1>The system cannot write to the specified device.
 && cpThe system cannot write to the specified device.
 && cpThe system cannot write to the specified device.
 && cpThe system cannot write to the specified device.

cp: accessing `394\\?\\?.txt': Invalid argument

C:\TMP\SEA\alldup\4>chcp
Active code page: 65001
I have no idea what may be going wrong on your side... If you like, I would be open to some kind of screen sharing via Skype or Discord... Your filesystem is NTFS, right? Are you doing it on Windows 10?

I opened the cmd file with Notepad and copy/pasted a couple of lines to test on my side, and here is what I got:

Code:

Microsoft Windows [Version 10.0.19041.388]
(c) 2020 Microsoft Corporation. All rights reserved.

C:\Workarea\AllDupTest>mkdir test

C:\Workarea\AllDupTest>cd test

C:\Workarea\AllDupTest\test>mkdir 4 && mkdir 4\¡ && echo "¡" > 4\¡.txt && cp 4\¡.txt 4\¡\¡.txt && cp 4\¡.txt 4\4.txt && cp 4\¡.txt 4\¡\4.txt

C:\Workarea\AllDupTest\test>mkdir 1 && mkdir 1\ && echo "" > 1\.txt && cp 1\.txt 1\\.txt && cp 1\.txt 1\1.txt && cp 1\.txt 1\\1.txt

C:\Workarea\AllDupTest\test>chcp
Active code page: 65001

C:\Workarea\AllDupTest\test>
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Problem with some Unicode characters and Case sensitivity

Post by Administrator »

My search result with your files from AllDupTest.rar:
24.07.2020 13:09:17 - AllDup 4.4.34 PE
24.07.2020 13:09:17 - Search method: File size + File content
24.07.2020 13:09:17 - Comparison method: Byte by Byte
24.07.2020 13:09:17 - Match: 100%
24.07.2020 13:09:17 - 1.Source folder: D:\AllDupTest
24.07.2020 13:09:17 - Option: Compare files from all source folders
24.07.2020 13:09:17 - Determine file count of all source folders...
24.07.2020 13:09:17 - File count: 1.581
24.07.2020 13:09:17 - Scan: D:\AllDupTest
24.07.2020 13:09:48 - Found 1.576 duplicates with a total of 13,16 KB inside folder 'D:\AllDupTest'
24.07.2020 13:09:48 - Scanned files: 1.581
24.07.2020 13:09:48 - Groups: 394
24.07.2020 13:09:48 - File comparison count: 157.230
24.07.2020 13:09:48 - Duplicates: 1.576 (99%) (13,16 KB)
24.07.2020 13:09:48 - Elapsed time: 00:00:31
apsen
Posts: 11
Joined: 21 Jul 2020, 22:13

Re: Problem with some Unicode characters and Case sensitivity

Post by apsen »

Administrator wrote: 24 Jul 2020, 13:12 my search result with your files from the AllDupTest.rar:
24.07.2020 13:09:17 - AllDup 4.4.34 PE
24.07.2020 13:09:48 - Groups: 394
Could this be related to the "Beta: Use Unicode UTF-8 for worldwide language support" setting? This should not matter if only the Unicode (W) version of the Windows API is used, but perhaps the "ANSI" (A) version is used somewhere in the app...

And most likely fixing this problem will also fix the case sensitivity problem, unless there's an explicit case conversion in the app.
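The concern about the "ANSI" (A) API variants can be illustrated without calling Windows at all: converting a name to a single-byte code page loses every character that the code page cannot represent. The sketch below uses Python's cp1252 codec as a stand-in for the system ANSI code page; that choice, and the helper name `to_ansi_and_back`, are assumptions, and the real Windows conversion may map some characters differently.

```python
def to_ansi_and_back(name: str, codepage: str = "cp1252") -> str:
    """Round-trip a name through a single-byte code page, replacing misfits."""
    return name.encode(codepage, errors="replace").decode(codepage)


if __name__ == "__main__":
    for name in ["¡.txt", "Ω.txt"]:
        print(name, "->", to_ansi_and_back(name))
```

A character that does not fit comes back as "?", which is itself not allowed in Windows file names, so a path converted this way can no longer be opened. This resembles the `394\\?\\?.txt` error quoted earlier in the thread, though that output alone does not prove the same cause.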
Administrator
Site Admin
Posts: 4046
Joined: 04 Oct 2004, 18:38
Location: Thailand

Re: Problem with some Unicode characters and Case sensitivity

Post by Administrator »

apsen wrote: 24 Jul 2020, 20:11 Could this be related to "Beta: Use Unicode UTF-8 for worldwide language support" setting?
Turn it off and you will see...
apsen
Posts: 11
Joined: 21 Jul 2020, 22:13

Re: Problem with some Unicode characters and Case sensitivity

Post by apsen »

Administrator wrote: 24 Jul 2020, 20:39
apsen wrote: 24 Jul 2020, 20:11 Could this be related to "Beta: Use Unicode UTF-8 for worldwide language support" setting?
Turn it off and you will see...
Can't do it right now, as I have a long-running process that I'd rather not interrupt... But that could be tested on your side too.