Issue 111272

Summary: No hyphenation with Graphite fonts
Product: gsl Reporter: nemeth.lacko
Component: code Assignee: eric.savary
Status: CLOSED FIXED QA Contact: issues@gsl <issues>
Severity: Trivial    
Priority: P3 CC: devel, hdu, issues, timar74
Version: OOo 3.2   
Target Milestone: OOo 3.3   
Hardware: All   
OS: All   
Issue Type: DEFECT Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on: 111054, 111277    
Issue Blocks:    
Attachments:
Description Flags
Gentium (not Graphite) and Gentium Basic (Graphite) formatted paragraphs with and without hyphenation
none
New version of the test file with more test cases
none
Screenshot (missing hyphenation in Graphite enabled paragraph)
none
initial patch to use klbLetterBreak as preferred option
none
Bad justification and hyphen position related to the ffi -> f + fi ligature substitutions by Graphite
none
Test document for the hyphenation of words with Graphite character substitutions
none
using klbClipBreak instead
none
Example using 'surfing' which expects hyphenation mid-ligature
none
Example showing Myanmar line breaking, which regresses with current fix.
none
Correct rendering when klbHyphenBreak is used
none
Bad rendering when klbLetterBreak
none

Description nemeth.lacko 2010-04-30 09:29:20 UTC
Hyphenated paragraphs formatted with Graphite fonts lose their hyphenation, but
after changing the character style (for example, the background color), the
hyphenation comes back again.
Comment 1 nemeth.lacko 2010-04-30 09:40:37 UTC
I attach a test document with Gentium (not Graphite version) and Gentium Basic
(Graphite version) formatting, and a screenshot, too.
Comment 2 nemeth.lacko 2010-04-30 09:45:58 UTC
Created attachment 69180 [details]
Gentium (not Graphite) and Gentium Basic (Graphite) formatted paragraphs with and without hyphenation
Comment 3 nemeth.lacko 2010-04-30 09:54:21 UTC
Created attachment 69181 [details]
New version of the test file with more test cases
Comment 4 nemeth.lacko 2010-04-30 09:56:51 UTC
Created attachment 69183 [details]
Screenshot (missing hyphenation in Graphite enabled paragraph)
Comment 5 hdu@apache.org 2010-04-30 10:12:27 UTC
Initial debugging shows that Graphite tries to be "too smart" by only providing gr::klbWordBreak aligned 
visual break positions. Writer and EditEngine expect "stupid first fit intra-word" line break positions. There 
was a similar issue 93242 on OSX which was solved by emulating the stupid line break.
@kstribley: using klbHyphenBreak instead of klbWordBreak probably suffices here
Comment 6 hdu@apache.org 2010-04-30 10:22:35 UTC
Correction for a typo above: it was issue 92342
Comment 7 hdu@apache.org 2010-04-30 10:27:31 UTC
Yet another correction: using whole-letter breaking instead of klbWordBreak would match Writer and 
EditEngine's expectations most closely
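
The "first fit, whole letter" break that Writer and EditEngine expect can be sketched roughly as below. This is only an illustration: `firstFitBreak` and the plain list of per-character advance widths are hypothetical and not part of VCL or Graphite.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a VCL or Graphite function): returns the number of
// leading characters whose summed advance widths fit into maxWidth.
// Word boundaries are ignored on purpose; the caller's break iterator and
// hyphenator are expected to move the position to a better place afterwards.
std::size_t firstFitBreak(const std::vector<int>& charAdvances, int maxWidth)
{
    assert(maxWidth >= 0);
    int used = 0;
    std::size_t i = 0;
    for (; i < charAdvances.size(); ++i)
    {
        if (used + charAdvances[i] > maxWidth)
            break;                  // character i would overflow the line
        used += charAdvances[i];
    }
    return i;                       // equals size() when the whole text fits
}
```

With advances of 10 units per character and a line width of 25, the sketch returns 2 even when that position falls in the middle of a word.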
Comment 8 hdu@apache.org 2010-04-30 11:22:48 UTC
Created attachment 69189 [details]
initial patch to use klbLetterBreak as preferred option
Comment 9 hdu@apache.org 2010-04-30 11:25:35 UTC
Unfortunately even with klbLetterBreak as preferred line breaking option 
gr::RangeSegment::findNextBreakPoint() still returns only word-aligned break positions... a bug in 
Graphite?
Comment 10 nemeth.lacko 2010-04-30 11:27:01 UTC
I have found a problem with the substituted character sequences and hyphenation.
I attach a new test file and a screenshot.
Comment 11 nemeth.lacko 2010-04-30 11:28:27 UTC
Created attachment 69190 [details]
Bad justification and hyphen position related to the ffi -> f + fi ligature substitutions by Graphite
Comment 12 nemeth.lacko 2010-04-30 11:29:35 UTC
Created attachment 69191 [details]
Test document for the hyphenation of words with Graphite character substitutions
Comment 13 nemeth.lacko 2010-04-30 11:31:00 UTC
The last attached document uses Charis SIL font (a Graphite font with automatic
f-ligature substitution).
Comment 14 nemeth.lacko 2010-04-30 11:49:45 UTC
> Unfortunately even with klbLetterBreak as preferred line breaking option 
> gr::RangeSegment::findNextBreakPoint() still returns only word-aligned break
> positions... a bug in Graphite?

hdu: it seems, the Graphite subsystem uses only the hyphenation information of
the Graphite fonts for in-word hyphenation. This information is language
dependent (Asian Graphite fonts) or missing (SIL's latin fonts). From the
Graphite Application Programming Guide:

1.3.9 Graphite can serve not only as a rendering engine, but also as a
line-breaking engine for scripts whose line-breaking behavior can be described
by rules.(Graphite is not adequate to handle scripts that require dictionary
look-up for proper line-breaking.)

We can add basic in-word hyphenation to the language-specifics Graphite fonts,
but in-word hyphenation of Graphite is "not adequate" for most of the European
languages, too.
Comment 15 hdu@apache.org 2010-04-30 12:17:52 UTC
> Graphite subsystem uses only the hyphenation information of the Graphite fonts
> for in-word hyphenation

Ah, that would explain the behaviour of ignoring klbHyphenBreak. But why does it also fail for 
klbLetterBreak?

> in-word hyphenation of Graphite is "not adequate" for most of the European languages

Indeed. As I said Writer/EditEngine use their dedicated hyphenator to adjust to the most appropriate 
hyphen position after consulting VCL for providing the first-fit whole-letter line break position. So in 
the current concept the glyph shaping engine has no say in that.

For scripts where OOo's hyphenator does not have support yet the concept of providing hyphen 
suggestions from a smart font table is nice. I am wondering though which scripts would benefit from 
this? Would they use the '-' as hyphen character too? I guess there cannot be many. And for them it is 
probably easier to update OOo's hyphenator for them.
Comment 16 hdu@apache.org 2010-04-30 14:25:00 UTC
Created attachment 69196 [details]
using klbClipBreak instead
Comment 17 hdu@apache.org 2010-04-30 14:30:42 UTC
Using klbClipBreak instead of hyphen or whole-letter breaking fixes the problem as it does not require 
detailed knowledge of the script inside the generic font. Graphite still returns the full-word line break 
suggestion but using the trick from issue 92342 also solves this. The breakiterator and hyphenator that 
Writer/EditEngine use then do the rest to get proper hyphenation positions.
@nemeth: please check your scenarios using my patch above, issue 111277 is solved by it too
Comment 18 nemeth.lacko 2010-04-30 14:58:14 UTC
> For scripts where OOo's hyphenator does not have support yet the concept of
> providing hyphen suggestions from a smart font table is nice. I am wondering 
> though which scripts would benefit from this? Would they use the '-' as hyphen 
> character too? I guess there cannot be many. And for them it is 
> probably easier to update OOo's hyphenator for them.

Graphite has complete support for arbitrary changes at hyphenation points and at
the end and beginning of lines, and moreover for typesetting a whole line (for
example, Arabic uses optional elongated character variants at the end of words
instead of bigger spaces for justification). OOo's hyphenation implementation
has some problems in this area, see 

> please check your scenarios using my patch above, issue 111277 is solved by it too

It's fantastic. Many thanks for the quick fix. I will try to check it.
Comment 19 nemeth.lacko 2010-04-30 15:03:36 UTC
... Hyphenation implementation of OOo has some problems in this area, see Issue
71608 (Bad non-standard hyphenation of diaeresis and Unicode f ligatures) ...
Comment 20 devel 2010-05-01 15:21:03 UTC
Thanks very much @nemeth and @hdu for reporting and investigating this and
explaining more about what GetTextBreak() needs to return. I think the long term
solution will need to take into account the expanded text problem in issue
111054, which will exclude the use of Segment::findNextBreakPoint(...), since
there is no way to tell it in advance the extra char factor. We definitely want
to avoid recreating a RangeSegment there because of the performance hit, but as
@hdu's TODO suggests, that could probably be retrieved from the cache.

As far as the break weights are concerned, the information from the Graphite
GlyphInfo objects is returning a breakweight of +30 for all the letters, which
equates to klbLetterBreak, so in theory that should be sufficient, though
perhaps findNextBreakPoint requires a range like 
findNextBreakPoint(mnMinCharPos, gr::klbLetterBreak, gr::klbClipBreak,
targetWidth, &fBreakWidth ); I think using gr::klbClipBreak will probably cause
problems with SE Asian scripts.
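
To illustrate the idea of a preferred/worst break-weight range, here is a simplified model. It is not Graphite's actual algorithm: `Slot`, `findBreak` and the enum are hypothetical, and only the value 30 for a letter break comes from this issue; the other weights are assumptions chosen to keep the ordering word < hyphen < letter < clip.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative weights; only "30 == letter break" is stated in this issue.
enum BreakWeight { kWord = 15, kHyphen = 20, kLetter = 30, kClip = 40 };

struct Slot
{
    int advance;      // width of the character
    int breakWeight;  // weight of a break directly after this character
};

// Hypothetical model: returns how many characters go on the line. The last
// fitting break with weight <= preferred wins; if there is none, the last
// fitting break with weight <= worst is used; 0 means no allowed break fits.
std::size_t findBreak(const std::vector<Slot>& text, int maxWidth,
                      int preferred, int worst)
{
    assert(preferred <= worst);
    std::size_t bestPreferred = 0;
    std::size_t bestWorst = 0;
    int used = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        used += text[i].advance;
        if (used > maxWidth)
            return bestPreferred != 0 ? bestPreferred : bestWorst;
        if (text[i].breakWeight <= preferred) bestPreferred = i + 1;
        if (text[i].breakWeight <= worst)     bestWorst = i + 1;
    }
    return text.size();   // the whole text fits
}
```

In this model a long word whose letters all carry weight 30 gets no break at all when both limits are kWord, and an intra-word break as soon as the worst limit is relaxed to kLetter.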

I think the biggest problem may be with the fi and ffi ligatures. In the case of
in-suffi-cient it isn't too bad because the hyphenation point isn't
mid-ligature. However, with a word like surfing, it will be more difficult. I've
tweaked my patch for issue 111054 to use gr::klbLetterBreak and the best I get
is surfi-ng.  I'm not sure whether it is possible to get graphite to give a
different break-weight mid-ligature. If the GlyphInfo data is used, then it
certainly won't, since the info is per glyph.
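
The 'surfing' case can be illustrated with a small sketch that snaps a requested break position to a glyph cluster boundary. The representation (a sorted list of character offsets where clusters start) and both function names are hypothetical; Graphite exposes cluster information differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// clusterStarts: sorted character offsets at which a glyph cluster begins.
// For "surfing" with an fi ligature this would be {0, 1, 2, 3, 5, 6},
// because 'f' (offset 3) and 'i' (offset 4) share one glyph.
bool isClusterBoundary(const std::vector<std::size_t>& clusterStarts,
                       std::size_t pos, std::size_t textLen)
{
    assert(std::is_sorted(clusterStarts.begin(), clusterStarts.end()));
    return pos == textLen ||
           std::binary_search(clusterStarts.begin(), clusterStarts.end(), pos);
}

// Returns the first cluster boundary at or after pos (textLen if none).
std::size_t snapForward(const std::vector<std::size_t>& clusterStarts,
                        std::size_t pos, std::size_t textLen)
{
    auto it = std::lower_bound(clusterStarts.begin(), clusterStarts.end(), pos);
    return it == clusterStarts.end() ? textLen : *it;
}
```

A hyphenator asking for a break after "surf" (offset 4) lands inside the ligature; snapping forward gives offset 5, which corresponds to the "surfi-ng" result described above.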

Graphite has the ability to set start and end of line flags when a segment is
created. Currently these are not set, because it can cause trailing space to be
dropped in position calculations, which gives bad results in some of the other
layout methods. In the case of GetTextBreak it might be better to set at least
the end of line flag. The Segment::hasLineBoundaryContext() method could perhaps
be used to optimize this. However, afaik there is no way to know the start of
line context from the current OOo API in VCL. Even if the start and end of line
flags were set, it doesn't change the problem with hyphenation based on a
hyphenation engine external to Graphite. This would probably require an
extension to the Graphite API to allow this information to be passed into the
Graphite Text Source.

As regards scripts without OOo hyphenators, Myanmar script based languages are
one example. In Myanmar spaces are more common at phrase boundaries than word
boundaries. The Padauk graphite font has syllable based boundaries, which is an
improvement on space based though the best approach uses a combination of a
syllable algorithm and a word list. I hope to develop this for Burmese, soon. I
have tested it in the past, but the original word list I was given had copyright
issues, so I need to switch to another one.

My current patch for issue 111054 has a regression for justification, so I won't
post an update until I've diagnosed that.
Comment 21 devel 2010-05-03 07:14:10 UTC
Created attachment 69249 [details]
Example using 'surfing' which expects hyphenation mid-ligature
Comment 22 devel 2010-05-03 07:20:46 UTC
The latest patch attached to issue 111054, attachment 69248 [details], should render the
fiLigHyph document correctly. I got round the potential problem with the
ligature spanning the hyphenation point by disabling extra context being used to
create the segment when SAL_LAYOUT_COMPLEX_DISABLED is set and adjusting the
caching to not return a segment with a cluster spanning the requested mnEndCharPos.
Comment 23 devel 2010-05-03 07:23:03 UTC
Created attachment 69250 [details]
Example showing Myanmar line breaking, which regresses with current fix.
Comment 24 devel 2010-05-03 07:25:03 UTC
Created attachment 69251 [details]
Correct rendering when klbHyphenBreak is used
Comment 25 devel 2010-05-03 07:26:17 UTC
Created attachment 69252 [details]
Bad rendering when klbLetterBreak
Comment 26 devel 2010-05-03 07:40:20 UTC
The Myanmar attachments above show that line breaking occurs mid-syllable when
there are no spaces on a line. The first paragraph has Myanmar punctuation, so
presumably ICU prefers to break after those code points, which results in
reasonable rendering. In practice, Myanmar typists do normally type some spaces,
but it can be problematic in narrow columns.

I can see 2 possible ways to resolve the regression for Burmese with Padauk. 
1) implement a proper dictionary based line breaker for OOo
2) adjust break weight used in GetTextBreak according to whether
SAL_LAYOUT_COMPLEX_DISABLED is set

2) does not necessarily exclude 1), but it is dependent on there being no
complex scripts for which hyphenation is used.
1) is the best long term solution anyway, but would also need to be implemented
for Shan, Mon, Karen and probably several others.

NB: Although the Padauk font uses the klbHyphenBreak, no hyphen should be rendered.
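
Option 2) could look roughly like the sketch below. It rests on one reading of the attachments above (klbHyphenBreak renders Myanmar correctly, klbLetterBreak suits the external hyphenator for simple scripts). The flag value and the enum values are placeholders for illustration; the real constants live in the VCL and Graphite headers.

```cpp
#include <cassert>

// Placeholder standing in for SAL_LAYOUT_COMPLEX_DISABLED; the real value is
// defined in VCL and is only assumed here.
constexpr int kComplexDisabledFlag = 0x0100;

// Placeholder weights standing in for gr::klbHyphenBreak / gr::klbLetterBreak.
enum BreakWeight { kHyphenBreak = 20, kLetterBreak = 30 };

// Hypothetical helper: picks the break weight GetTextBreak would ask for.
// Complex layout disabled -> whole-letter breaks, so that OOo's own
// hyphenator decides; otherwise the font's rule-based breaks are kept.
BreakWeight preferredBreakWeight(int layoutFlags)
{
    assert(layoutFlags >= 0);
    return (layoutFlags & kComplexDisabledFlag) ? kLetterBreak : kHyphenBreak;
}
```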
Comment 27 nemeth.lacko 2010-05-03 08:56:36 UTC
> NB: Although the Padauk font uses the klbHyphenBreak, no hyphen should be
> rendered.

OOo's hyphenator supports Unicode, but an alternative or missing hyphen sign is
not supported yet. With optional hyphen sign support, OOo's hyphenator would be
suitable for dictionary based line breaking. (Maybe it is better than ICU's
method, because Liang's hyphenation algorithm is based on competing patterns.)
Comment 28 hdu@apache.org 2010-05-03 11:55:57 UTC
Using klbClipBreak is indeed overkill because conceptually klbLetterBreak would be sufficient for 
WE/EE's expectations. The problem was that Graphite seemed to ignore klbLetterBreak even when it 
was set as the preferred boundary.

Things such as start/end of line need support from WE/EE. IMHO these engines need to insert special 
codepoints so VCL has a chance to know the intent. Using the line break/ draw requests itself would 
not contain enough info as WE/EE switch "portions" even for simple attribute changes such as the text 
background color. Until Graphite gets these reliable instructions about where Writer wants its line 
starts/ends, Graphite must disable these special ligatures.

For the topic of ligature splitting/breaking etc. I suggest that Writer inserts codepoints such as ZW* to 
suggest/allow or disallow ligature splits. This would solve the problem you mentioned and other 
problems in languages which like compound words. In these, all components of a ligature must be from 
the same morpheme.

> As regards scripts without OOo hyphenators, Myanmar script based languages are one example.
> In Myanmar spaces are more common at phrase boundaries than word boundaries.

This is a good example of where the OOo hyphenator can and should be extended. AFAIK the 
hyphenator in "lingucomponent" just needs its syllable list extended for it.
Comment 29 devel 2010-05-05 08:49:33 UTC
> Until Graphite gets these reliable instructions about where Writer wants his line 
> starts/ends Graphite must disable these special ligatures.

Yes, it is disabled already, see graphite_layout.cxx line 547.
    maLayout.setStartOfLine(false);
    maLayout.setEndOfLine(false);

> > As regards scripts without OOo hyphenators, Myanmar script based languages
> > are one example.
> > In Myanmar spaces are more common at phrase boundaries than word boundaries.

> This is a good example of where the OOo hyphenator can and should be
> extended. AFAIK the 
> hyphenator in "lingucomponent" just needs its syllable list extended for it.

I think a word break iterator should be sufficient to identify words since there
are usually no spaces between them. It is on my to do list! Myanmar Unicode
keyboards are generally not using ZWSP between syllables (unlike Khmer) because
this makes it harder to implement a dictionary based word breaker.
Comment 30 hdu@apache.org 2010-05-06 12:53:01 UTC
The line breaking problem is also fixed by Keith's second patch in issue 111054.
Comment 31 hdu@apache.org 2010-05-12 14:47:46 UTC
@sba: please verify in CWS graphite02
Comment 32 stefan.baltzer 2010-05-31 15:46:44 UTC
SBA->ES: Please take over, thx.
Comment 33 eric.savary 2010-06-07 14:11:46 UTC
Verified in CWS graphite02