Issue 42171

Summary: Display of invalid Thai combining character sequences broken on Windows
Product: gsl Reporter: samphan
Component: code Assignee: AOO issues mailing list <issues>
Status: CONFIRMED --- QA Contact:
Severity: Trivial    
Priority: P3 CC: arthit, hin.stone, issues, jjc, khirano, markpeak, nusorn
Version: 680m74   
Target Milestone: AOO PleaseHelp   
Hardware: PC   
OS: Windows XP   
Issue Type: ENHANCEMENT Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on:    
Issue Blocks: 41707    
Attachments (Description / Flags):
Text document with invalid Thai combining character sequences / none
Screenshot of the document displayed correctly on Linux / none
Screenshot of the document displayed on Windows / none
Screenshot of the document displayed on Windows, reformat to use Angsana / none

Description samphan 2005-02-07 07:26:30 UTC
A combining character sequence such as gor gai+mai ek+sara ii
(0e01+0e48+0e35) is not displayed properly on Windows.  It should be
displayed as gor gai with the mai ek, followed by a dotted circle
carrying the sara ii.  It *is* displayed this way on Linux. On Windows,
with the old Windows Thai fonts such as Angsana and Browallia, an ugly
black box is shown, and it is not clear that there is a sara ii there.
Much more seriously, with more recent fonts such as Tahoma, the sara ii
does not show up at all.

The combining character sequences that are not displayed properly are
sequences that Windows cannot display in a single cell.  Such sequences
never occur in correct Thai.  Conventionally, most applications on
Windows prevent the input of such invalid sequences.  However, OOo does
not always do this and it is anyway possible for such sequences to occur
in imported data.  It is important that such sequences be highly visible
to the user so that the user can correct them.
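As a rough illustration of what "invalid" means here, the following Python sketch flags combining marks that cannot join the current display cell. The character classes and the rule "one base letter, then at most one above/below vowel, then at most one top mark" are simplifying assumptions made for this example, not the actual rules applied by Uniscribe or ICU, and `find_invalid_marks` is a hypothetical helper, not OOo code.

```python
# Simplified Thai character classes (an assumption for illustration only).
VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39\u0e3a")     # above/below vowels
TOP_MARKS = set("\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4d\u0e4e")  # tone marks and other top signs


def is_base(ch):
    """Treat U+0E01..U+0E2E as base letters (simplification)."""
    return "\u0e01" <= ch <= "\u0e2e"


def find_invalid_marks(text):
    """Return the indices of combining marks that cannot join the current cell."""
    invalid = []
    # What the current cell can still accept:
    # 2 = a vowel or a top mark, 1 = a top mark only, 0 = nothing.
    slots = 0
    for i, ch in enumerate(text):
        if is_base(ch):
            slots = 2
        elif ch in VOWELS:
            if slots != 2:
                invalid.append(i)
            slots = 1  # a top mark may still follow this vowel
        elif ch in TOP_MARKS:
            if slots == 0:
                invalid.append(i)
            slots = 0
        else:
            slots = 0  # any other character ends the cell
    return invalid


# The sequence from this report: gor gai + mai ek + sara ii.
print(find_invalid_marks("\u0e01\u0e48\u0e35"))
```

Under these assumptions the sara ii at index 2 is reported, because the cell already carries a tone mark; the correctly ordered sequence gor gai + sara ii + mai ek is not reported.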

Test case:
1) Load the attached document (with invalid combining character sequences) on
Linux. The display uses dotted circles to ensure that all combining characters in
invalid combining character sequences are clearly displayed. See the first
screenshot attached.

2) Load the same document on Windows. You will not see any dotted circles (see
the second screenshot), so you will not know that this document has errors in it.

3) Reformat the document to use the font Angsana (or Browallia or other Windows
Thai fonts). You'll see black boxes where there are invalid combining character
sequences. See the third screenshot. This lets you know that there are errors, but
you can't tell what the errors are. With Tahoma, Microsoft Sans Serif or Lucida
Sans Unicode (which have a glyph for the dotted circle) instead, there are no
black boxes, but there are no dotted circles either.
Comment 1 samphan 2005-02-07 07:27:57 UTC
Created attachment 22278 [details]
Text document with invalid Thai combining character sequences
Comment 2 samphan 2005-02-07 07:30:40 UTC
Created attachment 22279 [details]
Screenshot of the document displayed correctly on Linux
Comment 3 samphan 2005-02-07 07:31:57 UTC
Created attachment 22280 [details]
Screenshot of the document displayed on Windows
Comment 4 samphan 2005-02-07 07:33:13 UTC
Created attachment 22281 [details]
Screenshot of the document displayed on Windows, reformat to use Angsana
Comment 5 falko.tesch 2005-02-09 16:36:39 UTC
Hi Karl, it seems that for some reason the iterator is broken (only under Windows?).
Can you please check whether this can be fixed or whether this is a font-specific
matter (just a wild guess, though)? Thx in advance.
Comment 6 karl.hong 2005-02-10 22:57:48 UTC
Karl: This is not a breakiterator issue but a layout engine issue. Linux and
Windows use different engines: Windows uses the native Uniscribe while Linux uses
the ICU layout engine.

To prevent invalid sequences from being entered, we do have input sequence
checking, but it is broken.

I will create a new issue to fix the broken input sequence checking and transfer
this one to Herbert to fix the layout engine.
Comment 7 hdu@apache.org 2005-02-15 15:56:59 UTC
Can reproduce.
Comment 8 hdu@apache.org 2005-02-15 16:49:30 UTC
Unfortunately we are 100% compatible here with an important legacy application
from a major competitor, because we use the same layout engine... so the problem
is in the Uniscribe library which is outside OOo's scope.

Thanks for the great bugdocs and the excellent bug report which made reproducing
the problem easy.
Comment 9 jjc 2005-02-15 17:14:24 UTC
Thanks for looking into this issue.  So if I understand correctly, the situation
is that:

a) Uniscribe has a bug/limitation in that it displays invalid combining character
sequences poorly

b) OOo sometimes gives Uniscribe invalid combining character sequences to display

I don't think it follows from this that nothing needs changing in OOo.

For example, if the document contains 0e01+0e48+0e35, which Uniscribe cannot
display properly, the OOo display engine might transform that to
0e01+0e48+25cc+0e35 before giving it to Uniscribe to display.
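The transformation suggested above can be sketched in Python as follows. This is only a minimal illustration under simplified assumptions: the character classes and the cell rule (one base letter, at most one above/below vowel, then at most one top mark) are invented for the example, and `insert_dotted_circles` is a hypothetical name, not an existing OOo or Uniscribe function.

```python
DOTTED_CIRCLE = "\u25cc"

# Simplified Thai character classes (an assumption for illustration only).
VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39\u0e3a")     # above/below vowels
TOP_MARKS = set("\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4d\u0e4e")  # tone marks and other top signs


def insert_dotted_circles(text):
    """Return a copy of text with U+25CC placed before each orphaned mark."""
    out = []
    # What the current cell can still accept:
    # 2 = a vowel or a top mark, 1 = a top mark only, 0 = nothing.
    slots = 0
    for ch in text:
        if "\u0e01" <= ch <= "\u0e2e":   # base letter starts a new cell
            slots = 2
        elif ch in VOWELS:
            if slots != 2:               # no free vowel slot: give it its own base
                out.append(DOTTED_CIRCLE)
            slots = 1
        elif ch in TOP_MARKS:
            if slots == 0:               # no free slot at all
                out.append(DOTTED_CIRCLE)
            slots = 0
        else:
            slots = 0
        out.append(ch)
    return "".join(out)


fixed = insert_dotted_circles("\u0e01\u0e48\u0e35")
print(" ".join(f"{ord(c):04x}" for c in fixed))
```

For the sequence in question this yields 0e01 0e48 25cc 0e35, while a valid sequence such as 0e01+0e35+0e48 passes through unchanged.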

Alternatively the Sequence Input Checking could be made more rigorous on Windows
so that it is impossible for the user to enter such invalid sequences (which I
believe is the case with some competitor products).

The current situation may well be Uniscribe's fault, but it is not an acceptable
situation for OOo Thai users on Windows, and I find it hard to believe that
there is nothing OOo can do to improve the situation.
Comment 10 hdu@apache.org 2005-02-21 18:21:32 UTC
Ok, it is possible to work around the issue by changing invalid sequences to
valid ones.
Comment 11 hdu@apache.org 2005-02-21 18:22:37 UTC
HDU->FME: please work with Karl to convert invalid character sequences into
valid ones...
Comment 12 frank.meies 2005-02-22 07:49:24 UTC
FME->FT: And finally back to you. I think this means we should implement a "type
and replace" feature for sequence input checking, as known from a competitor. In
this case we need a more detailed description of the functionality of this feature.
Comment 13 frank.meies 2005-02-22 07:50:10 UTC
.
Comment 14 jjc 2005-02-22 08:17:27 UTC
"Type and replace" is issue 42661.  That is a separate (although related)
issue.  "Type and replace" is about how to prevent invalid combining character
sequences getting into your document.  The issue here is what happens if your
document contains an invalid combining character sequence; that can happen when
you load a document or when you turn off sequence input checking and "type and
replace".  In order to display invalid combining character sequences with
Uniscribe, it is necessary to transform invalid combining character sequences to
sequences that can be displayed by Uniscribe (e.g. by inserting dotted circle
glyphs) as part of the display process; this wouldn't change the logical content
of the document which would still contain invalid combining character sequences.
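To illustrate the point that only the displayed text changes, here is a small Python sketch that builds a display string plus a map from each display position back to the logical position in the stored text. It rests on assumptions made for this example only: the character classes and the cell rule are simplified, and `to_display` is a hypothetical helper rather than anything in OOo or Uniscribe.

```python
DOTTED_CIRCLE = "\u25cc"

# Simplified Thai character classes (an assumption for illustration only).
VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39\u0e3a")     # above/below vowels
TOP_MARKS = set("\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4d\u0e4e")  # tone marks and other top signs


def to_display(logical):
    """Return (display_text, origin), where origin[j] is the index in
    `logical` that display character j belongs to."""
    display, origin = [], []
    slots = 0  # 2 = vowel or top mark allowed, 1 = top mark only, 0 = nothing
    for i, ch in enumerate(logical):
        needs_circle = False
        if "\u0e01" <= ch <= "\u0e2e":
            slots = 2
        elif ch in VOWELS:
            needs_circle = slots != 2
            slots = 1
        elif ch in TOP_MARKS:
            needs_circle = slots == 0
            slots = 0
        else:
            slots = 0
        if needs_circle:
            # The inserted circle exists only on screen; it maps to the mark it carries.
            display.append(DOTTED_CIRCLE)
            origin.append(i)
        display.append(ch)
        origin.append(i)
    return "".join(display), origin


stored = "\u0e01\u0e48\u0e35"          # logical content of the document
shown, origin = to_display(stored)     # what would be handed to the layout engine
print(len(stored), len(shown), origin)
```

The stored string keeps its three characters; only the string handed to the layout engine gains the dotted circle, and the origin map lets cursor and selection positions be translated back to the logical text.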
Comment 15 samphan 2005-02-22 08:36:05 UTC
I'm wondering why Uniscribe doesn't support displaying invalid combining
character sequences. This is described here
http://www.microsoft.com/typography/otfntdev/thaiot/shaping.aspx#comb
and http://www.microsoft.com/typography/OpenType%20Dev/arabic/shaping.mspx#invalid
and
http://www.microsoft.com/typography/OpenType%20Dev/lao/shaping.mspx#invalid
Maybe it is implemented for every CTL language mentioned here
http://www.microsoft.com/typography/SpecificationsOverview.mspx
Comment 16 falko.tesch 2005-10-20 20:34:56 UTC
FT: Back to you Samphan. For the moment I do not see that we can do such a thing
without help from outside. Please provide a spec and patch/code first.
Please do not assign this issue to me again since I'm leaving this position. Thx
Comment 17 arthit 2008-04-23 08:01:14 UTC
Can any Windows user confirm whether this still occurs in the latest OOo?
Comment 18 Marcus 2017-05-20 11:29:19 UTC
Reset assignee to the default "issues@openoffice.apache.org".