Issue 75769

Summary: normalize strings for comparison, sorting, filenames, etc.
Product: General Reporter: moyogo <moyogo>
Component: ui Assignee: AOO issues mailing list <issues>
Status: CONFIRMED --- QA Contact:
Severity: Minor    
Priority: P3 CC: allez_les_lions, elish, issues, olivier.noreply
Version: OOo 1.0.0   
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: ENHANCEMENT Latest Confirmation in: 4.1.0-dev
Developer Difficulty: ---
Attachments:
Description                                                    Flags
sample file with precomposed and composed equivalent strings   none
sample file with precomposed and composed equivalent strings   none

Description moyogo 2007-03-27 08:43:24 UTC
Unicode strings should be normalized.

Unicode defines canonically equivalent sequences of characters.
For example these are equivalent:
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0323 COMBINING DOT BELOW + U+0301
COMBINING ACUTE ACCENT>
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT + U+0323
COMBINING DOT BELOW>
ẹ́ <U+1EB9 LATIN SMALL LETTER E WITH DOT BELOW + U+0301 COMBINING ACUTE
ACCENT>

These should be considered the same in searches. If one is in a file and the
user has typed an equivalent one as a query, they should match.
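
As a rough illustration of the equivalence described above (a sketch using Python's standard unicodedata module, not OpenOffice code; the variable names are made up for this example), the three sequences differ as raw code point strings but collapse to a single string once normalized:

```python
import unicodedata

# The three spellings from the description, written as explicit escapes.
dot_then_acute = "\u0065\u0323\u0301"  # e + dot below + acute
acute_then_dot = "\u0065\u0301\u0323"  # e + acute + dot below
precomposed = "\u1EB9\u0301"           # e with dot below + acute

forms = [dot_then_acute, acute_then_dot, precomposed]

# Number of distinct strings before and after normalization.
raw_distinct = len(set(forms))
nfc_distinct = len({unicodedata.normalize("NFC", s) for s in forms})
nfd_distinct = len({unicodedata.normalize("NFD", s) for s in forms})

print(raw_distinct, nfc_distinct, nfd_distinct)
```

Either form (NFC or NFD) works for comparison, as long as both sides of the comparison use the same one.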

They should be considered the same when files are created/saved. So if a file
with one sequence already exists, saving a file with an equivalent name should
behave as if the two names were bitwise identical.

Filenames should probably be normalized with NFC for legacy systems.
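
A minimal sketch of what such a filename check could look like, assuming NFC is the chosen form. The helper names (nfc, find_equivalent_name) and the sample filenames are hypothetical and only illustrate the idea; they do not reflect how OpenOffice handles file names:

```python
import unicodedata


def nfc(name):
    """Return the NFC-normalized form of a file name."""
    return unicodedata.normalize("NFC", name)


def find_equivalent_name(new_name, existing_names):
    """Return the existing name canonically equivalent to new_name, or None."""
    target = nfc(new_name)
    for existing in existing_names:
        if nfc(existing) == target:
            return existing
    return None


# Hypothetical directory listing: the first name uses precomposed U+00E9.
existing = ["\u00e9cole.odt", "notes.odt"]
# The user types the same name with e + U+0301 COMBINING ACUTE ACCENT.
requested = "e\u0301cole.odt"

clash = find_equivalent_name(requested, existing)
print(clash is not None)
```

With a check like this, the save dialog could treat the requested name as an overwrite of the existing file instead of silently creating a second, visually identical one.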
Comment 1 moyogo 2007-03-27 09:15:44 UTC
Created attachment 43957 [details]
sample file with precomposed and composed equivalent strings
Comment 2 moyogo 2007-03-27 09:16:18 UTC
Created attachment 43958 [details]
sample file with precomposed and composed equivalent strings
Comment 3 thorsten.martens 2007-03-27 09:38:15 UTC
Not a defect but might be a wish for an enhancement.

TM->requirements: please have a look.
Comment 4 allez_les_lions 2010-03-20 14:29:31 UTC
As per Unicode standards (see
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G29705) programs should
normalize Unicode data so pre-composed and de-composed forms are canonically
equivalent. In addition to what moyogo stated about search (and search and
replace) and filenames, a failure to normalize also affects spell-checking
(users can use pre-composed or de-composed forms but the spell-checker may only
be equipped to handle one or the other).
Comment 5 Edwin Sharp 2014-02-24 13:52:59 UTC
attachment has only word ecole twice... ?
Comment 6 allez_les_lions 2014-02-24 16:01:30 UTC
(In reply to Edwin Sharp from comment #5)
> attachment has only word ecole twice... ?

Yes. However, one occurrence uses a precomposed character, the other uses combining diacritics (two Unicode characters that are canonically equivalent to the precomposed character). As it currently stands, a search will only ever find one of the occurrences (either the precomposed or the composed version, depending on whether you use a precomposed character in your search string or compose it with combining diacritics). According to Unicode standards, strings should be normalized so that canonically equivalent precomposed and composed strings compare as equal. So searching with either the precomposed or the composed character should find both occurrences in the sample document. This isn't the case in OpenOffice, though.
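
The behaviour described here can be reproduced outside OpenOffice with a short Python sketch. The document string below is only a stand-in for the attachment (assumed to contain the word once with precomposed U+00E9 and once with e + U+0301), and count_matches is a hypothetical helper, not an existing API:

```python
import unicodedata


def count_matches(text, query, normalize=True):
    """Count occurrences of query in text, optionally normalizing both to NFC."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
        query = unicodedata.normalize("NFC", query)
    return text.count(query)


# Stand-in for the sample document: precomposed form, then combining form.
document = "\u00e9cole e\u0301cole"
precomposed_query = "\u00e9cole"
combining_query = "e\u0301cole"

for query in (precomposed_query, combining_query):
    raw = count_matches(document, query, normalize=False)
    normalized = count_matches(document, query)
    print(raw, normalized)
```

Normalizing both the text and the query before matching is the simplest approach; a real implementation would also need to map match positions back to the original, unnormalized text.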
Comment 7 Edwin Sharp 2014-02-24 16:51:52 UTC
Thank you allez_les_lions