Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing
Description
nemeth.lacko
2010-04-30 09:29:20 UTC
I attach a test document formatted with Gentium (the non-Graphite version) and Gentium Basic (the Graphite version), and a screenshot, too.

Created attachment 69180 [details]
Gentium (not Graphite) and Gentium Basic (Graphite) formatted paragraphs with and without hyphenation
Created attachment 69181 [details]
New version of the test file with more test cases
Created attachment 69183 [details]
Screenshot (missing hyphenation in Graphite enabled paragraph)
Initial debugging shows that Graphite tries to be "too smart" by only providing gr::klbWordBreak aligned visual break positions. Writer and EditEngine expect "stupid first fit intra-word" line break positions. There was a similar issue 93242 on OSX which was solved by emulating the stupid line break.

@kstribley: using klbHyphenBreak instead of klbWordBreak probably suffices here

Correction for a typo above: it was issue 92342

Yet another correction: using whole-letter breaking instead of klbWordBreak would match Writer and EditEngine's expectations most closely

Created attachment 69189 [details]
initial patch to use klbLetterBreak as preferred option
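For reference, the "stupid first fit intra-word" break that Writer and EditEngine are said to expect can be sketched roughly as follows. This is only an illustration under stated assumptions: `firstFitBreak` and the per-character advance widths are hypothetical and are not part of VCL or Graphite.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not VCL or Graphite API.
// advances[i] is the assumed advance width of character i.
// Returns the index of the first character that no longer fits into
// maxWidth (the first-fit break at whole-letter granularity), or
// advances.size() if the whole run fits.
std::size_t firstFitBreak(const std::vector<long>& advances, long maxWidth)
{
    long used = 0;
    for (std::size_t i = 0; i < advances.size(); ++i)
    {
        if (used + advances[i] > maxWidth)
            return i;          // character i would overflow the line
        used += advances[i];
    }
    return advances.size();
}
```

The result ignores word boundaries on purpose; in the concept described above, the caller's break iterator and hyphenator would then move the position back to a linguistically acceptable one.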
Unfortunately even with klbLetterBreak as preferred line breaking option gr::RangeSegment::findNextBreakPoint() still returns only word-aligned break positions... a bug in Graphite?

I have found a problem with the substituted character sequences and hyphenation. I attach a new test file and a screenshot.

Created attachment 69190 [details]
Bad justification and hyphen position related to the ffi -> f + fi ligature substitutions by Graphite
Created attachment 69191 [details]
Test document for the hyphenation of words with Graphite character substitutions
The last attached document uses the Charis SIL font (a Graphite font with automatic f-ligature substitution).

> Unfortunately even with klbLetterBreak as preferred line breaking option
> gr::RangeSegment::findNextBreakPoint() still returns only word-aligned break
> positions... a bug in Graphite?
hdu: it seems the Graphite subsystem uses only the hyphenation information of
the Graphite fonts for in-word hyphenation. This information is language
dependent (Asian Graphite fonts) or missing (SIL's Latin fonts). From the
Graphite Application Programming Guide:
1.3.9 Graphite can serve not only as a rendering engine, but also as a
line-breaking engine for scripts whose line-breaking behavior can be described
by rules. (Graphite is not adequate to handle scripts that require dictionary
look-up for proper line-breaking.)
We can add basic in-word hyphenation to the language-specific Graphite fonts,
but in-word hyphenation of Graphite is "not adequate" for most of the European
languages, too.
> Graphite subsystem uses only the hyphenation information of the Graphite fonts
> for in-word hyphenation
Ah, that would explain the behaviour for ignoring klbHyphenBreak. But why does it also fail for klbLetterBreak?

> in-word hyphenation of Graphite is "not adequate" for most of the European languages
Indeed. As I said, Writer/EditEngine use their dedicated hyphenator to adjust to the most appropriate hyphen position after consulting VCL for the first-fit whole-letter line break position. So in the current concept the glyph shaping engine has no say in that.

For scripts where OOo's hyphenator does not have support yet, the concept of providing hyphen suggestions from a smart font table is nice. I am wondering though which scripts would benefit from this? Would they use the '-' as hyphen character too? I guess there cannot be many. And for them it is probably easier to update OOo's hyphenator for them.

Created attachment 69196 [details]
using klbClipBreak instead
Using klbClipBreak instead of hyphen or whole-letter breaking fixes the problem as it does not require detailed knowledge of the script inside the generic font. Graphite still returns the full-word line break suggestion but using the trick from issue 92342 also solves this. The breakiterator and hyphenator that Writer/EditEngine use then do the rest to get proper hyphenation positions.

@nemeth: please check your scenarios using my patch above, issue 111277 is solved by it too

> For scripts where OOo's hyphenator does not have support yet the concept of
> providing hyphen suggestions from a smart font table is nice. I am wondering
> though which scripts would benefit from this? Would they use the '-' as hyphen
> character too? I guess there cannot be many. And for them it is
> probably easier to update OOo's hyphenator for them.
Graphite has complete support for arbitrary changes at hyphenation points and at the end and beginning of lines, and moreover for typesetting a whole line (for example, Arabic uses optional elongated character variants at the end of words instead of bigger spaces for justification). OOo's hyphenation implementation has some problems in this area, see

> please check your scenarios using my patch above, issue 111277 is solved by it too
It's fantastic. Many thanks for the quick fix. I will try to check it.

... Hyphenation implementation of OOo has some problems in this area, see Issue 71608 (Bad non-standard hyphenation of diaeresis and Unicode f ligatures) ...

Thanks very much @nemeth and @hdu for reporting and investigating this and explaining more about what GetTextBreak() needs to return. I think the long-term solution will need to take into account the expanded text problem in issue 111054, which will exclude the use of Segment::findNextBreakPoint(...), since there is no way to tell it in advance the extra char factor.
We definitely want to avoid recreating a RangeSegment there because of the performance hit, but as @hdu's TODO suggests, that could probably be retrieved from the cache.

As far as the break weights are concerned, the information from the Graphite GlyphInfo objects is returning a breakweight of +30 for all the letters, which equates to klbLetterBreak, so in theory that should be sufficient, though perhaps findNextBreakPoint requires a range like
    findNextBreakPoint(mnMinCharPos, gr::klbLetterBreak, gr::klbClipBreak, targetWidth, &fBreakWidth);

I think using gr::klbClipBreak will probably cause problems with SE Asian scripts.

I think the biggest problem may be with the fi and ffi ligatures. In the case of in-suffi-cient it isn't too bad because the hyphenation point isn't mid-ligature. However, with a word like surfing, it will be more difficult. I've tweaked my patch for issue 111054 to use gr::klbLetterBreak and the best I get is surfi-ng. I'm not sure whether it is possible to get Graphite to give a different break weight mid-ligature. If the GlyphInfo data is used, then it certainly won't, since the info is per glyph.

Graphite has the ability to set start and end of line flags when a segment is created. Currently these are not set, because it can cause trailing space to be dropped in position calculations, which gives bad results in some of the other layout methods. In the case of GetTextBreak it might be better to set at least the end of line flag. The Segment::hasLineBoundaryContext() method could perhaps be used to optimize this. However, AFAIK there is no way to know the start of line context from the current OOo API in VCL. Even if the start and end of line flags were set, it doesn't change the problem with hyphenation based on a hyphenation engine external to Graphite. This would probably require an extension to the Graphite API to allow this information to be passed into the Graphite Text Source.
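The mid-ligature problem with "surfing" can be illustrated with a toy model. This is a sketch only: `nextClusterBreak`, `hyphenateAt` and the `clusterStart` flags are hypothetical and do not correspond to Graphite's GlyphInfo or Segment API. The assumption is that break information exists only per glyph cluster, so a position inside the fi ligature cannot be chosen.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model, not Graphite API: clusterStart[i] is true when character i
// begins a new glyph cluster. A ligature covers several characters, but
// only its first character is a cluster start.
// "Break at p" means the line ends before character p.
// Returns the first cluster-aligned break position at or after `wanted`.
std::size_t nextClusterBreak(const std::vector<bool>& clusterStart,
                             std::size_t wanted)
{
    std::size_t pos = wanted;
    while (pos < clusterStart.size() && !clusterStart[pos])
        ++pos;                 // skip positions inside a ligature
    return pos;
}

// Splits `word` before character `pos` and appends a hyphen to the first part.
std::string hyphenateAt(const std::string& word, std::size_t pos)
{
    return word.substr(0, pos) + "-" + word.substr(pos);
}
```

For "surfing" with an fi ligature the flags would be {1,1,1,1,0,1,1}. An external hyphenator asking for surf-ing wants position 4, which lies inside the ligature, so the nearest later cluster boundary is position 5, which gives surfi-ng as reported above.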
As regards scripts without OOo hyphenators, Myanmar script based languages are one example. In Myanmar, spaces are more common at phrase boundaries than word boundaries. The Padauk Graphite font has syllable-based boundaries, which is an improvement on space-based breaking, though the best approach uses a combination of a syllable algorithm and a word list. I hope to develop this for Burmese soon. I have tested it in the past, but the original word list I was given had copyright issues, so I need to switch to another one.

My current patch for issue 111054 has a regression for justification, so I won't post an update until I've diagnosed that.

Created attachment 69249 [details]
Example using 'surfing' which expects hyphenation mid-ligature
The latest patch attached to issue 111054, attachment 69248 [details], should render the fiLigHyph document correctly. I got round the potential problem with the ligature spanning the hyphenation point by disabling extra context being used to create the segment when SAL_LAYOUT_COMPLEX_DISABLED is set and adjusting the caching to not return a segment with a cluster spanning the requested mnEndCharPos.

Created attachment 69250 [details]
Example showing Myanmar line breaking, which regresses with current fix.
Created attachment 69251 [details]
Correct rendering when klbHyphenBreak is used
Created attachment 69252 [details]
Bad rendering when klbLetterBreak
The Myanmar attachments above show that line breaking occurs mid-syllable when there are no spaces on a line. The first paragraph has Myanmar punctuation, so presumably ICU prefers to break after those code points, which results in reasonable rendering. In practice, Myanmar typists do normally type some spaces, but it can be problematic in narrow columns.

I can see 2 possible ways to resolve the regression for Burmese with Padauk.
1) implement a proper dictionary based line breaker for OOo
2) adjust break weight used in GetTextBreak according to whether SAL_LAYOUT_COMPLEX_DISABLED is set

2) does not necessarily exclude 1), but it is dependent on there being no complex scripts for which hyphenation is used. 1) is the best long term solution anyway, but would also need to be implemented for Shan, Mon, Karen and probably several others.

NB: Although the Padauk font uses the klbHyphenBreak, no hyphen should be rendered.

> NB: Although the Padauk font uses the klbHyphenBreak, no hyphen should be
> rendered.
OOo's hyphenator supports Unicode, but an alternative or missing hyphen is not
supported yet. With support for an optional hyphen sign, OOo's hyphenator would
be suitable for dictionary-based line breaking. (Maybe it is better than ICU's
method, because Liang's hyphenation algorithm is based on competing patterns.)
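As a rough illustration of what "competing patterns" means in Liang's algorithm, here is a minimal sketch. The function `liangBreaks` and the three toy patterns in the note below are made up for this example; they are not OOo's hyphenator API or real dictionary data.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Minimal Liang-style matcher (illustrative only).
// Returns the positions p (break before word[p]) where the patterns allow
// a hyphen, keeping at least minLeft/minRight characters on each side.
std::vector<std::size_t> liangBreaks(const std::string& word,
                                     const std::vector<std::string>& patterns,
                                     std::size_t minLeft = 2,
                                     std::size_t minRight = 2)
{
    const std::string padded = "." + word + ".";   // '.' marks word edges
    std::vector<int> levels(padded.size() + 1, 0);

    for (const std::string& pat : patterns)
    {
        // Split the pattern into its letters and the digits between them.
        std::string letters;
        std::vector<int> points(1, 0);
        for (char c : pat)
        {
            if (std::isdigit(static_cast<unsigned char>(c)))
                points.back() = c - '0';
            else
            {
                letters += c;
                points.push_back(0);
            }
        }
        // Competing patterns: at each position the highest level wins.
        for (std::size_t at = padded.find(letters); at != std::string::npos;
             at = padded.find(letters, at + 1))
            for (std::size_t j = 0; j < points.size(); ++j)
                levels[at + j] = std::max(levels[at + j], points[j]);
    }

    std::vector<std::size_t> breaks;
    for (std::size_t p = minLeft; p + minRight <= word.size(); ++p)
        if (levels[p + 1] % 2 == 1)      // odd level: hyphen allowed here
            breaks.push_back(p);
    return breaks;
}
```

With the toy patterns "f1i", "r1f" and "r2f", the word "surfing" gets a single break at position 4 (surf-ing): "r1f" proposes a break before the f, but the higher, even level from "r2f" overrides it.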
Using klbClipBreak is indeed overkill because conceptually klbLetterBreak would be sufficient for
WE/EE's expectations. The problem was that Graphite seemed to ignore klbLetterBreak even when it
was set as the preferred boundary.

Things such as start/end of line need support from WE/EE. IMHO these engines need to insert special
codepoints so VCL has a chance to know the intent. The line break/draw requests themselves would
not contain enough info, as WE/EE switch "portions" even for simple attribute changes such as the text
background color. Until Graphite gets these reliable instructions about where Writer wants his line
starts/ends Graphite must disable these special ligatures.

For the topic of ligature splitting/breaking etc. I suggest that Writer inserts codepoints such as ZW* to
suggest/allow or disallow ligature splits. This would solve the problem you mentioned and other
problems in languages which favor compound words. In these, all components of a ligature must be from
the same morpheme.
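The ZW* suggestion could look roughly like the sketch below. It assumes U+200C ZERO WIDTH NON-JOINER as the codepoint that disallows a ligature, and it assumes the caller already knows the morpheme boundaries. `markMorphemeBoundaries` is a hypothetical helper, not an existing Writer or VCL function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr char32_t ZWNJ = U'\u200C';   // ZERO WIDTH NON-JOINER

// Hypothetical helper: inserts ZWNJ at each morpheme boundary so that a
// shaping engine will not form a ligature across it.
// A boundary b means "between text[b-1] and text[b]"; `boundaries` must be
// sorted ascending and contain no duplicates.
std::u32string markMorphemeBoundaries(const std::u32string& text,
                                      const std::vector<std::size_t>& boundaries)
{
    std::u32string out;
    std::size_t next = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (next < boundaries.size() && boundaries[next] == i)
        {
            out += ZWNJ;       // disallow a ligature across this boundary
            ++next;
        }
        out += text[i];
    }
    return out;
}
```

For the German compound "Auflage" (Auf + lage), a boundary at index 3 places ZWNJ between f and l, so an fl ligature would not span the two morphemes.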
> As regards scripts without OOo hyphenators, Myanmar script based languages are one example.
> In Myanmar spaces are more common at phrase boundaries than word boundaries.
This is a good example of where the OOo hyphenator can and should be extended. AFAIK the
hyphenator in "lingucomponent" just needs its syllable list extended for it.
> Until Graphite gets these reliable instructions about where Writer wants his line
> starts/ends Graphite must disable these special ligatures.
Yes, it is disabled already, see graphite_layout.cxx line 547.
    maLayout.setStartOfLine(false);
    maLayout.setEndOfLine(false);

> > As regards scripts without OOo hyphenators, Myanmar script based languages are one example.
> > In Myanmar spaces are more common at phrase boundaries than word boundaries.
> This is good example of the where the OOo hyphenator can and should be extended. AFAIK the
> hyphenator in "lingucomponent" just needs its syllable list extended for it.
I think a word break iterator should be sufficient to identify words since there are usually no spaces between them. It is on my to do list! Myanmar Unicode keyboards are generally not using ZWSP between syllables (unlike Khmer) because this makes it harder to implement a dictionary based word breaker.

The line breaking problem is also fixed by Keith's second patch in issue 111054.

@sba: please verify in CWS graphite02

SBA->ES: Please take over, thx.

Verified in CWS graphite02