Apache OpenOffice (AOO) Bugzilla – Issue 75769
normalize strings for comparison, sorting, filenames, etc.
Last modified: 2014-02-24 16:51:52 UTC
Unicode strings should be normalized. Unicode defines canonically equivalent sequences of characters. For example, these three are equivalent:
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0323 COMBINING DOT BELOW + U+0301 COMBINING ACUTE ACCENT>
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT + U+0323 COMBINING DOT BELOW>
ẹ́ <U+1EB9 LATIN SMALL LETTER E WITH DOT BELOW + U+0301 COMBINING ACUTE ACCENT>
These should be considered the same in searches: if one is in a file and the user has typed an equivalent one as a query, they should match. They should also be considered the same when files are created or saved: if a file named with one sequence already exists, saving a file with an equivalent name should behave as if the two names were bitwise identical. Filenames should probably be normalized to NFC for legacy systems.
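To illustrate the request, here is a minimal Python sketch using the standard `unicodedata` module. It is not OpenOffice code. The helper `name_exists` and the `.odt` file names are hypothetical, introduced only to show how a save dialog could detect an equivalent existing name.

```python
import unicodedata

# The three spellings from the report, written with explicit escapes.
a = "e\u0323\u0301"  # e + COMBINING DOT BELOW + COMBINING ACUTE ACCENT
b = "e\u0301\u0323"  # e + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
c = "\u1eb9\u0301"   # E WITH DOT BELOW (precomposed) + COMBINING ACUTE ACCENT


def nfc(s: str) -> str:
    """Return the NFC normalization of s."""
    return unicodedata.normalize("NFC", s)


# Compared code point by code point, the three strings are all different.
raw_distinct = len({a, b, c})

# After normalization they collapse to a single string.
nfc_distinct = len({nfc(a), nfc(b), nfc(c)})


def name_exists(new_name: str, existing_names: list[str]) -> bool:
    """Hypothetical helper: True if an equivalent name is already present.

    Both sides are normalized, so the check does not depend on how the
    existing names happen to be stored.
    """
    target = nfc(new_name)
    return any(nfc(name) == target for name in existing_names)
```

With this check, `name_exists(b + ".odt", [c + ".odt"])` reports a collision even though the two names differ at the code point level.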
Created attachment 43957 [details] sample file with precomposed and composed equivalent strings
Created attachment 43958 [details] sample file with precomposed and composed equivalent strings
Not a defect but might be a wish for an enhancement. TM->requirements: please have a look.
As per Unicode standards (see http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G29705), programs should normalize Unicode data so that pre-composed and de-composed forms, which are canonically equivalent, are treated as identical. In addition to what moyogo stated about search (and search and replace) and filenames, a failure to normalize also affects spell-checking: users can type pre-composed or de-composed forms, but the spell-checker may only be equipped to handle one or the other.
attachment has only word ecole twice... ?
(In reply to Edwin Sharp from comment #5)
> attachment has only word ecole twice... ?

Yes. However, one occurrence uses a precomposed character, while the other uses a combining diacritic (two Unicode characters that are canonically equivalent to the precomposed character). As it currently stands, a search will only ever find one of the occurrences: the precomposed one if the search string uses a precomposed character, or the composed one if the search string uses a combining diacritic. According to Unicode standards, strings should be normalized so that canonically equivalent precomposed and composed strings compare as equal. A search using either a precomposed or a composed character should therefore find both occurrences in the sample document. This isn't the case in OpenOffice, though.
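The behaviour described above can be reproduced outside OpenOffice with a small Python sketch. The `sample` string is an assumed stand-in for the attachment (which is not reproduced here), and `normalized_count` is a hypothetical helper, not an existing API.

```python
import unicodedata


def normalized_count(text: str, query: str, form: str = "NFC") -> int:
    """Count occurrences of query in text after normalizing both sides."""
    text_n = unicodedata.normalize(form, text)
    query_n = unicodedata.normalize(form, query)
    return text_n.count(query_n)


# Assumed stand-in for the attachment: the same word twice, once with a
# precomposed U+00E9 and once as "e" followed by U+0301.
sample = "\u00e9cole e\u0301cole"

# A plain code point search for the precomposed spelling finds one hit.
naive_hits = sample.count("\u00e9cole")

# Normalizing both text and query finds both occurrences, whichever
# spelling the user typed.
normalized_hits = normalized_count(sample, "e\u0301cole")
```

Note that normalizing the text changes character offsets, so a real search implementation would also need to map match positions back to the original document.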
Thank you allez_les_lions