Issue 41246 - Unicode U+034F (Combining Grapheme Joiner) not interpreted correctly
Alias: None
Product: Internationalization
Classification: Code
Component: code
Version: 680m71
Hardware: All All
Importance: P3 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
Keywords: oooqa
Depends on:
Reported: 2005-01-25 03:21 UTC by david
Modified: 2013-08-07 15:00 UTC
CC List: 4 users

See Also:
Latest Confirmation in: ---
Developer Difficulty: ---

plain text file containing CGJ U+034F (in UTF-8) (25 bytes, text/plain)
2005-05-31 22:28 UTC, david

Description david 2005-01-25 03:21:08 UTC
U+034F is a control character which should be invisible, and by default should
be ignored when searching and sorting.  For example, the text "C\u034Fhester"
should render identically to "Chester", and a search for "Chest" should find it.

But at the moment, the presence of U+034F corrupts the rendering of a line of
text, and the search described above fails.  In certain languages, this prevents
the searching and algorithmic sorting of texts.

[*] From Section 15.2 of the Unicode Standard 4.0: "In language-sensitive
collation and searching, the combining grapheme joiner should be ignored [...]
in the default collation.  [...]  For rendering, the combining grapheme joiner
is invisible."
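
The expected search behaviour can be sketched as follows. This is only a minimal illustration in Python, assuming that "ignoring" the CGJ can be approximated by stripping it from both strings before comparing; the helper names `strip_cgj` and `find_ignoring_cgj` are hypothetical and are not part of any OOo API.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def strip_cgj(text: str) -> str:
    """Return text with every CGJ removed (hypothetical helper)."""
    return text.replace(CGJ, "")


def find_ignoring_cgj(haystack: str, needle: str) -> int:
    """Search with CGJ stripped from both strings.

    The returned index refers to the stripped haystack, or is -1 if
    the needle is not found.
    """
    return strip_cgj(haystack).find(strip_cgj(needle))


text = "C\u034fhester"
print(text.find("Chest"))               # naive search, CGJ gets in the way
print(find_ignoring_cgj(text, "Chest"))  # search that ignores the CGJ
```

A real implementation would have to map the match position back to the original text, which this sketch does not attempt.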
Comment 1 jack.warchold 2005-02-16 14:03:35 UTC
reassigned to us

can you please take a look at this?
Comment 2 flibby05 2005-03-12 20:07:46 UTC
set to NEW
Comment 3 ulf.stroehler 2005-05-30 13:14:39 UTC
@submitter/maxweber: could you pls. evaluate whether the problem still persists
in a current milestone. Thx.
Comment 4 flibby05 2005-05-30 13:33:58 UTC
us -> max
Comment 5 ulf.stroehler 2005-05-30 14:48:03 UTC
Comment 6 flibby05 2005-05-30 18:19:31 UTC
>>@submitter/maxweber: could you pls. evaluate whether the problem still persists
>>in a current milestone. Thx.

i reassigned the issue to me to express that i take it as my task to reproduce it.
Comment 7 flibby05 2005-05-30 18:22:49 UTC
reassign max -> us
Comment 8 ulf.stroehler 2005-05-31 18:19:25 UTC
us@maxweber: to avoid further confusion, could you pls. let me know what your
findings are. Thanks.
Comment 9 flibby05 2005-05-31 19:16:42 UTC
us, divec:
i cannot find CGJ via insert -> special character on my SuSE 9.3 with ttf 'symbol'.
any other ttf which would be helpful for reproducing this issue?
Comment 10 david 2005-05-31 22:28:59 UTC
Created attachment 26776
plain text file containing CGJ U+034F (in UTF-8)
Comment 11 david 2005-05-31 22:55:49 UTC
Thanks for looking at this!  I've attached a text file containing the CGJ.  You
have to load it in OOo with filetype "encoded text" and then choose the "UTF-8"
character set.  If you see a capital I with an acute accent, it has loaded with
the wrong character set.

The file contains the text "Ban\u034Fgor -> Ban<CGJ>gor" (i.e. the first
backslash is a real backslash, and the actual CGJ after "->" - sorry, I
should've made the example simpler).
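
The byte-level picture behind this can be sketched in Python. The string below is a reconstruction of the attachment from the description above, not the attachment itself; the trailing newline is an assumption made so that the size matches the 25 bytes listed for the attachment.

```python
# Assumed reconstruction of the attached file: a literal backslash escape,
# " -> ", then the same word with a real CGJ, plus an assumed final newline.
content = "Ban\\u034Fgor -> Ban\u034fgor\n"
data = content.encode("utf-8")
print(len(data))  # size of the reconstructed file in bytes

# U+034F takes two bytes in UTF-8.
cgj_bytes = "\u034f".encode("utf-8")
print(cgj_bytes.hex())

# Decoding with a single-byte character set (Latin-1 here) turns the first
# of those two bytes into a capital I with acute accent.
offset = data.index(cgj_bytes)
wrong = data.decode("latin-1")
print(repr(wrong[offset]))
```

This is why a capital I with an acute accent indicates that the file was loaded with a single-byte character set instead of UTF-8.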

When loaded, "Ban<CGJ>gor" *should* render as "Bangor" (i.e. the CGJ should be
invisible).  Searching for "Bangor" should succeed.

I've just tried it with m103 on Linux, it renders as "Ban[]gor" (i.e. with a
square in place of the CGJ), and searching for "Bangor" fails.  In other words,
the CGJ is being treated like a "normal" printable character which is not in the
font, instead of as a control character.

Actually, I've just noticed that's not quite true, because if you move the
cursor over "Ban<CGJ>gor", it will not fall between the "n" and the CGJ.  So the
CGJ is presumably being recognised as modifying the "n", but without the correct
behaviour and semantics.
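
The cursor behaviour is consistent with how the Unicode Character Database classifies U+034F: as a nonspacing combining mark (general category Mn) with canonical combining class 0, not as a Cc or Cf character. A quick way to check this, sketched with Python's standard `unicodedata` module (the variable names are just for illustration):

```python
import unicodedata

CGJ = "\u034f"

name = unicodedata.name(CGJ)                  # official character name
category = unicodedata.category(CGJ)          # general category
combining_class = unicodedata.combining(CGJ)  # canonical combining class

print(name)
print(category)
print(combining_class)
```

A text engine that treats every Mn character as attaching to the preceding base character would skip the position between "n" and the CGJ, as observed above.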

By the way, when I tried this originally, using m65 on Windows, the CGJ did not
display but the text of the whole line became corrupted.  Should this be tried
on Windows with a more recent milestone?
Comment 12 ulf.stroehler 2005-06-01 10:11:26 UTC
@divec: thanks for the explanation and example document.
Your evaluation still holds true for e.g. a m106, which makes me think that we
simply do not support this control character (at least not in Writer). Other
control chars, e.g. the BOM (Byte Order Mark), work though.
Additionally, could you provide a typical use scenario for this control char, e.g.
in a wordprocessor app, or is it just to be compliant with the Unicode spec? Thx.

US->HDU/SSA: something we want to support in vcl or do we need a decision from
UserEx group first?
Comment 13 2005-06-02 08:25:18 UTC
Since we try to support the latest Unicode standard, supporting U+034F (new
since Unicode 4?) doesn't need special approval by UX, but they need to work on
it. This issue should get split up into three sub-issues:
- displaying U+034F with "show non-printable characters" enabled needs to be
defined => UX
- sorting/searching of text containing U+034F => ER
- not showing U+034F as a "NotDef" box => HDU
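
For the sorting sub-issue, the effect of ignoring the CGJ can be sketched in Python. This is a simplified illustration that sorts by code point after stripping the CGJ, not real language-sensitive collation; `cgj_ignoring_key` and the word list are made up for the example.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def cgj_ignoring_key(word: str) -> str:
    """Sort key that drops the CGJ (hypothetical helper, code point order)."""
    return word.replace(CGJ, "")


words = ["Chichester", "C\u034fhester", "Banham", "Ban\u034fgor"]

# Plain sort: U+034F compares higher than any ASCII letter, so words
# containing it are pushed after their neighbours.
naive = sorted(words)

# Sort that ignores the CGJ: "Ban<CGJ>gor" sorts as "Bangor" and
# "C<CGJ>hester" sorts as "Chester".
ignoring = sorted(words, key=cgj_ignoring_key)

print([w.replace(CGJ, "<CGJ>") for w in naive])
print([w.replace(CGJ, "<CGJ>") for w in ignoring])
```

The two orderings differ, which is the kind of discrepancy the sorting sub-issue would need to address.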
Comment 14 ulf.stroehler 2006-04-04 17:19:00 UTC
have to reassign issue.
Comment 15 eric.savary 2006-08-29 14:54:14 UTC
Feature design overrides other issues.

ES->Requirements: Please consider splitting this enhancement into 3 parts, as HDU
stated in the comment from Thu Jun 2 00:25:18 -0700 2005.