Issue 75769

Summary:

normalize strings for comparison, sorting, filenames, etc.

Product:

General

Reporter:

moyogo <moyogo>

Component:

Assignee:

AOO issues mailing list <issues>

Status:

CONFIRMED ---

QA Contact:

Severity:

Minor

Priority:

CC:

allez_les_lions, elish, issues, olivier.noreply

Version:

OOo 1.0.0

Target Milestone:

---

Hardware:

All

OS:

All

Issue Type:

ENHANCEMENT

Latest Confirmation in:

4.1.0-dev

Developer Difficulty:

---

Attachments:

Description	Flags
sample file with precomposed and composed equivalent strings	none
sample file with precomposed and composed equivalent strings	none

Description moyogo 2007-03-27 08:43:24 UTC

Unicode strings should be normalized.

Unicode defines canonically equivalent sequences of characters.
For example these are equivalent:
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0323 COMBINING DOT BELOW + U+0301
COMBINING ACUTE ACCENT>
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT + U+0323
COMBINING DOT BELOW>
ẹ́ <U+1EB9 LATIN SMALL LETTER E WITH DOT BELOW + U+0301 COMBINING ACUTE
ACCENT>

These should be considered the same in searches. If one is in a file and the
user has typed an equivalent one as a query, they should match.
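
A minimal sketch of the comparison being asked for, in Python rather than in
OpenOffice's own code. It uses the standard library's unicodedata.normalize;
the helper name canonically_equal is made up for illustration and is not an
existing OpenOffice function.

```python
import unicodedata

# The three sequences from the description, written as escapes so that
# the code points are unambiguous.
seq_a = "\u0065\u0323\u0301"  # e + COMBINING DOT BELOW + COMBINING ACUTE ACCENT
seq_b = "\u0065\u0301\u0323"  # e + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
seq_c = "\u1EB9\u0301"        # E WITH DOT BELOW (precomposed) + COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFD.

    NFD decomposes precomposed characters and puts combining marks into
    canonical order, so canonically equivalent strings become identical.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(seq_a == seq_b)                   # False: different code point order
print(canonically_equal(seq_a, seq_b))  # True
print(canonically_equal(seq_b, seq_c))  # True
```

Normalizing both sides to NFC would work equally well here; what matters is
that both strings go through the same form before they are compared.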

They should be considered the same when files are created/saved. So if a file
named with one sequence already exists, saving a file with an equivalent name
should behave as if the two names were bitwise identical.

Filenames should probably be normalized with NFC for legacy systems.
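
One way the filename behaviour could be sketched, again in Python and only as
an illustration. The function names nfc_name and find_equivalent and the
sample filenames are assumptions, not part of any existing save dialog code.

```python
import unicodedata
from typing import Iterable, Optional


def nfc_name(name: str) -> str:
    """Return the filename normalized to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


def find_equivalent(name: str, existing_names: Iterable[str]) -> Optional[str]:
    """Return an existing filename that is canonically equivalent to `name`.

    Returns None if no existing name matches after NFC normalization.
    """
    target = nfc_name(name)
    for existing in existing_names:
        if nfc_name(existing) == target:
            return existing
    return None


# A directory listing holding a name stored in decomposed form (made-up data).
existing = ["e\u0323\u0301cole.odt"]
# The user saves under an equivalent name typed with a precomposed character.
new_name = "\u1EB9\u0301cole.odt"

clash = find_equivalent(new_name, existing)
if clash is not None:
    print("would overwrite an existing file with an equivalent name")
```

A save routine could run a check like this before writing, then store the name
as nfc_name(new_name) so that only one form ever reaches the file system.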

Comment 1 moyogo 2007-03-27 09:15:44 UTC

Created attachment 43957 [details]
sample file with precomposed and composed equivalent strings

Comment 2 moyogo 2007-03-27 09:16:18 UTC

Created attachment 43958 [details]
sample file with precomposed and composed equivalent strings

Comment 3 thorsten.martens 2007-03-27 09:38:15 UTC

Not a defect but might be a wish for an enhancement.

TM->requirements: please have a look.

Comment 4 allez_les_lions 2010-03-20 14:29:31 UTC

As per the Unicode Standard (see
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G29705), programs should
treat canonically equivalent pre-composed and de-composed forms as identical,
which in practice means normalizing Unicode data before comparing it. In
addition to what moyogo stated about search (and search and replace) and
filenames, a failure to normalize also affects spell-checking: users can type
either pre-composed or de-composed forms, but the spell-checker may only be
equipped to handle one or the other.
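
To illustrate the spell-checking point, here is a rough Python sketch that
normalizes both the word list and the looked-up word to NFC. The names
build_dictionary and is_known and the one-word dictionary are assumptions made
for this example; it does not reflect how the OpenOffice spell-checker is
actually implemented.

```python
import unicodedata
from typing import Iterable, Set


def build_dictionary(words: Iterable[str]) -> Set[str]:
    """Store every dictionary word in NFC (hypothetical helper)."""
    return {unicodedata.normalize("NFC", word) for word in words}


def is_known(word: str, dictionary: Set[str]) -> bool:
    """Normalize the user's word to NFC before looking it up."""
    return unicodedata.normalize("NFC", word) in dictionary


# Dictionary entry given in precomposed form (U+00E9).
dictionary = build_dictionary(["\u00e9cole"])

# The user typed e + COMBINING ACUTE ACCENT instead.
print(is_known("e\u0301cole", dictionary))  # True
print("e\u0301cole" in dictionary)          # False without normalization
```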

Comment 5 Edwin Sharp 2014-02-24 13:52:59 UTC

attachment has only word ecole twice... ?

Comment 6 allez_les_lions 2014-02-24 16:01:30 UTC

(In reply to Edwin Sharp from comment #5)
> attachment has only word ecole twice... ?

Yes. However, one occurrence uses a precomposed character, while the other is composed with a combining diacritic (a base letter plus a combining mark, two Unicode characters that together are canonically equivalent to the precomposed character). As it currently stands, a search only ever finds one of the occurrences: the precomposed one if the search string uses the precomposed character, or the composed one if the search string uses the combining diacritic. According to the Unicode Standard, canonically equivalent strings should be treated as the same string, which in practice means normalizing before comparing. So searching with either the precomposed or the composed form should find both occurrences in the sample document. This isn't the case in OpenOffice, though.
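
The expected search behaviour can be sketched in Python as follows. This
assumes the word in the attachment is "école" written once with U+00E9 and once
with e + U+0301; the function name count_matches is made up for the example.

```python
import unicodedata


def count_matches(text: str, query: str) -> int:
    """Count occurrences of `query` in `text` after normalizing both to NFD."""
    norm_text = unicodedata.normalize("NFD", text)
    norm_query = unicodedata.normalize("NFD", query)
    return norm_text.count(norm_query)


# Stand-in for the sample document: precomposed form, then combining form.
document = "\u00e9cole e\u0301cole"

print(document.count("\u00e9cole"))            # 1: raw search misses one form
print(count_matches(document, "\u00e9cole"))   # 2
print(count_matches(document, "e\u0301cole"))  # 2
```

This is only a sketch: plain substring matching on NFD text can also match a
base letter that is followed by further combining marks, and the match
positions refer to the normalized text rather than the original document, so a
real search would need to handle both.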

Comment 7 Edwin Sharp 2014-02-24 16:51:52 UTC

Thank you allez_les_lions