Issue 71608

Summary: Bad non-standard hyphenation of diaeresis and Unicode f ligatures
Product: Writer Reporter: nemeth.lacko
Component: programming Assignee: AOO issues mailing list <issues>
Status: CONFIRMED --- QA Contact:
Severity: Trivial    
Priority: P3 CC: bart.knubben, fonts-bugs, issues, simonbr, thomas.lange
Version: OOo 2.0.4   
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: DEFECT Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description | Flags
Test document | none
screenshot | none
hyphenation pattern for test document (ISO 8859-1, only for Dutch example) | none
hyphenation patterns for test data (Unicode, Dutch and Greek) | none
dictionary.lst (link Unicode hyphenation patterns to en_GB (language of the test document)) | none
Better ISO-8859-1 hyphenation patterns: extended with patterns for omaatje and cafeetje. | none
screenshot (OOo messagebox with "reeel 1") | none
screenshot (OOo messagebox with "omatje 2") | none

Description nemeth.lacko 2006-11-16 09:09:57 UTC
The non-standard hyphenation support in OOo (hyphenation with alternative
spelling) seems to have an implementation bug: it doesn't break Dutch and Greek
words with a diaeresis correctly. Non-standard hyphenation mostly works well (for
example, Dutch omaatje -> oma- tje, cafeetje -> café- tje), but not with a
diaeresis: reëel -> ree- el is wrong, we need re- eel. Maybe we need a hardwired,
language-dependent patch... This is not a problem of the external hyphenator
component but of Writer's internal hyphenation support. Test data attached.
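For readers unfamiliar with non-standard hyphenation, the examples above can be sketched in Python as follows. This is only an illustration, not code from OOo or LibHnj: the function apply_break and its parameters (start, length, and a replacement string in which '=' marks the break) are hypothetical names chosen for this sketch.

```python
def apply_break(word, start, length, replacement):
    """Break `word` with an alternative spelling.

    word[start:start + length] is replaced by `replacement`, which must
    contain exactly one '=' marking the hyphenation point (hypothetical
    notation for this sketch). Returns the text that ends the first line
    (including the hyphen) and the text that starts the next line.
    """
    if replacement.count("=") != 1:
        raise ValueError("replacement needs exactly one '='")
    cut = replacement.index("=")
    line_end = word[:start] + replacement[:cut] + "-"
    next_line = replacement[cut + 1:] + word[start + length:]
    return line_end, next_line


# Change to the left of the hyphen: "aa" becomes "a" before the break.
print(apply_break("omaatje", 2, 2, "a="))        # ('oma-', 'tje')
# Change to the left of the hyphen: "ee" becomes "é" before the break.
print(apply_break("cafeetje", 3, 2, "\u00e9="))  # ('café-', 'tje')
# Change to the right of the hyphen: "ë" becomes "e" after the break.
print(apply_break("re\u00ebel", 2, 1, "=e"))     # ('re-', 'eel')
```

The third call is the case reported as broken here: the changed character ends up on the second line, not the first.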
Comment 1 nemeth.lacko 2006-11-16 09:11:29 UTC
Created attachment 40618 [details]
Test document
Comment 2 nemeth.lacko 2006-11-16 09:12:36 UTC
Created attachment 40619 [details]
screenshot
Comment 3 nemeth.lacko 2006-11-16 09:16:08 UTC
Created attachment 40620 [details]
hyphenation pattern for test document (ISO 8859-1, only for Dutch example)
Comment 4 nemeth.lacko 2006-11-16 09:17:12 UTC
Created attachment 40621 [details]
hyphenation patterns for test data (Unicode, Dutch and Greek)
Comment 5 nemeth.lacko 2006-11-16 09:18:47 UTC
Created attachment 40622 [details]
dictionary.lst (link Unicode hyphenation patterns to en_GB (language of the test document))
Comment 6 michael.ruess 2006-11-16 09:23:40 UTC
Reassigned to SBA.
Comment 7 nemeth.lacko 2006-11-16 09:24:57 UTC
Created attachment 40623 [details]
Better ISO-8859-1 hyphenation patterns: extended with patterns for omaatje and cafeetje.
Comment 8 stefan.baltzer 2006-11-21 08:30:02 UTC
SBA-TL: Please proceed.
Comment 9 thomas.lange 2007-02-09 10:52:32 UTC
When looking at it with SRC680 m200 I found the following:
- in SO the hyphenation position for reëel is re=ëel and the hyphenated word
becomes re=eel. As of m202 the hyphenated word is re=ëel.
- in OOo the hyphenated word is also re=ëel

The above results were directly obtained from the hyphenator.
(You may use the Basic script below to check)
Thus it is a problem of the specific implementations.
As for SO, nothing can be done but report this to the vendor,
and for OOo someone needs to patch the hyphenation patterns.

Thus I'm reassigning this issue to lingucomponent.

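Sub Main

' Pick the hyphenator implementation to test (OOo: LibHnj, SO: Proximity)
xH = createUnoService("org.openoffice.lingu.LibHnjHyphenator")
'xH = createUnoService("com.sun.star.lingu2.Proximity.Hyphenator")

' Locale of the hyphenation patterns to use (Dutch)
dim nl_NL as new com.sun.star.lang.Locale
nl_NL.Language = "nl"
nl_NL.Country  = "NL"

' Third argument: maximum number of characters before the hyphen
xHW = xH.hyphenate( "reëel", nl_NL, 3, DimArray() )
'xHW = xH.hyphenate( "Hundefutter", nl_NL, 3, DimArray() )

' Show the (possibly respelled) hyphenated word and the hyphen position
msgtxt = " " + xHW.getHyphenatedWord() + " " + xHW.getHyphenPos()
msgbox msgtxt

End Sub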
Comment 10 nemeth.lacko 2007-07-06 02:40:16 UTC
TL: Many thanks for the test and the example.
Comment 11 nemeth.lacko 2007-10-03 20:45:25 UTC
Nemeth->TL: I have tried the script with the attached data and got
"re=eel" (reeel 1) and oma=tje (omatje 2), so it seems to me that this is a bug
in OpenOffice.org's implementation, not in the LibHnj non-standard hyphenation
extension. Maybe hyphenpos=1 is wrongly forbidden by the 2-character limit.
Please check my example, not the default Dutch hyphenation patterns. The LibHnj
executable works well on my example. Thanks in advance, Laci
Comment 12 nemeth.lacko 2007-10-03 21:31:05 UTC
Created attachment 48684 [details]
screenshot (OOo messagebox with "reeel 1")
Comment 13 nemeth.lacko 2007-10-03 21:32:48 UTC
Created attachment 48685 [details]
screenshot (OOo messagebox with "omatje 2")
Comment 14 thomas.lange 2007-11-23 11:36:23 UTC
Testing with SRC680 m227:
- SO: reëel gets hyphenated in the document as ree-el
  but the hyphenator says it should be re-eel
- OOo: reëel gets hyphenated as re-ëel
  and the hyphenator says the same.

I don't know which hyphenator is right or wrong (and if the SO hyphenator result
is wrong, it can't be fixed on our side; it needs to be reported to the vendor).
But clearly, since the SO hyphenator says re-eel, an actual document should behave
the same way. Thus we have a problem with the algorithm here.

I don't see any problem with the OOo hyphenator unless someone says that its
result should not be re-ëel because that one is wrong.

Does someone have input on the correct hyphenation of reëel?

For the time being I will keep this issue, since there seems to be a
problem with the code for evaluating alternative spellings (as already expected).
Comment 15 thomas.lange 2007-11-23 11:40:23 UTC
TL-Nemeth: I missed that the correct hyphenation for reëel was already listed as
being re-eel. Thus the OOo hyphenator or its dictionary file needs to be fixed.

Since I will use this issue to fix the problem in the code for evaluating
alternative spellings please submit a new one for either or both of the above
changes in OOo.
Comment 16 nemeth.lacko 2007-11-23 13:10:32 UTC
Nemeth->TL: Thanks for your check and comment. This bug report was only a
theoretical problem with attached test data, because nobody was working on Dutch
or Greek non-standard hyphenation patterns back when I checked my
alternative/non-standard hyphenator patch into OpenOffice.org. But now we have
the result of the OpenTaal project, the extended Dutch hyphenation patterns, and
OpenOffice.org (and StarOffice) can't correctly handle half of the Dutch
non-standard hyphenations described by these patterns.

I believe OpenTaal's activity and results (see
http://www.linux.com/feature/116697 for example) and collaboration with OpenTaal
are very important for the future of OpenOffice.org, because we would have
officially certified spell checking and hyphenation in OpenOffice.org for at
least one language. I have modified the language specifics summary according to
your plan. Thanks in advance, Laci
Comment 17 thomas.lange 2007-11-26 13:32:38 UTC
When checking this I found the problem is not the SvxGetAltSpelling function
(which I suspected to be at fault). Instead it is with the actual implementation
that evaluates that result and does the line breaking.

That has two consequences:
a) If that one is to be fixed it needs to be fixed in each application 
   separately. Thus specific issues for Calc and Draw/Impress are required.
b) I was told the area that is affected by the required change is quite 
   tricky and troublesome to change.

Also it looks to me that the actual problem is not about the diaeresis at all,
but about the position of the text to be changed. Compare it to alternative
spelling in the now outdated German pre-reform spelling:
- in German Bäc-ker changed to Bäk-ker when hyphenated
- and in Dutch re-ëel should become re-eel
The difference is that in the German example the character to the left of the
hyphenation position changes (which is sufficient for German), whereas in the
Dutch example it is the one to the right.

The code parts that take care of alternative spellings in Writer are rather old
and were probably implemented for German at that time. No one needed text
changes to the right and thus it was never implemented... :-(
Comment 18 thomas.lange 2007-11-26 13:43:10 UTC
If that one gets fixed, it should be done in a future-proof way.
That is:
- the text change need not be directly next to the hyphen
- it should allow for changes of more than one letter to the left
- it should allow for changes of more than one letter to the right
- it should allow for all of the above at the same time

Basically speaking, it should be able to handle all possible results that the
function SvxGetAltSpelling may return. (And that one is flexible enough to allow
for completely new words...)
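To make the left/right distinction from the Bäcker and reëel examples concrete, here is a rough Python sketch. It is not the Writer or SvxGetAltSpelling code; changed_span and change_side are hypothetical helper names. It assumes the hyphen position is the 0-based index of the last character before the hyphen, which matches the "reeel 1" and "omatje 2" outputs quoted earlier in this issue. The changed region is found from the common prefix and suffix of the original and the alternative spelling, so it may be several letters long and need not touch the hyphen.

```python
def changed_span(original, alternative):
    """Return (start, end_in_original, end_in_alternative) of the region
    that differs, found by stripping the common prefix and suffix."""
    limit = min(len(original), len(alternative))
    prefix = 0
    while prefix < limit and original[prefix] == alternative[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and original[-1 - suffix] == alternative[-1 - suffix]):
        suffix += 1
    return prefix, len(original) - suffix, len(alternative) - suffix


def change_side(original, alternative, hyphen_pos):
    """Say where the text change lies relative to the hyphen.

    hyphen_pos is the 0-based index (in `alternative`) of the last
    character before the hyphen.
    """
    start, end_orig, end_alt = changed_span(original, alternative)
    if start == end_orig and start == end_alt:
        return "none"            # plain hyphenation, no respelling
    if end_alt <= hyphen_pos + 1:
        return "left"            # change ends before the hyphen
    if start > hyphen_pos:
        return "right"           # change starts after the hyphen
    return "both"                # change straddles the hyphen


print(change_side("B\u00e4cker", "B\u00e4kker", 2))  # left  (Bäk-ker)
print(change_side("re\u00ebel", "reeel", 1))         # right (re-eel)
print(change_side("abcd", "axyd", 1))                # both  (synthetic case)
```

A line breaker that only supports the "left" result handles the German case but not the Dutch one, which is the gap described above.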
Comment 19 thomas.lange 2007-11-26 15:08:18 UTC
Please take over. Thanks!
Comment 20 nemeth.lacko 2007-11-29 11:39:59 UTC
Nemeth->TL: Many thanks for your help. I also started to analyze this
problem in Writer a few years ago, but I had to stop when I found that the
problem lies outside the linguistic modules. Writer is the most important
component, so it would be fantastic to have at least a partial solution for the
text processor. Thanks, Laci
Comment 21 frank.meies 2007-12-04 07:46:16 UTC
.
Comment 22 nemeth.lacko 2010-03-10 10:41:33 UTC
This is also a problem for the hyphenation of f ligatures.

efficiency -> ef-ficiency (Not even a simple fi -> f=i hyphenation works.)

(By the way, the automatic OpenType solution for ligature handling also has
potential problems: some languages, for example German, don't use ligatures at
word part boundaries in compound words. The HYPHENMIN values also depend on
the usage of ligatures. The fi- can be at the end of a line in Hungarian, but
this hyphenation is deprecated with ligatures.)
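The ligature case can be illustrated with a small Python sketch. It assumes the word is stored with a Unicode ligature character from the U+FB00..U+FB06 block (for example U+FB03, the ffi ligature); expand_ligatures and split_expanded are hypothetical names, and this is not how Writer handles it. Breaking inside a ligature requires replacing one character by several, which is again a hyphenation with a text change.

```python
import unicodedata


def expand_ligatures(text):
    """Replace Latin ligature characters (U+FB00..U+FB06) by their
    compatibility decomposition; all other characters are kept as-is."""
    parts = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            parts.append(unicodedata.normalize("NFKD", ch))
        else:
            parts.append(ch)
    return "".join(parts)


def split_expanded(word, hyphen_pos):
    """Break the ligature-free form of `word` after index `hyphen_pos`
    (0-based index of the last character before the hyphen)."""
    expanded = expand_ligatures(word)
    return expanded[:hyphen_pos + 1] + "-", expanded[hyphen_pos + 1:]


word = "e\ufb03ciency"             # "efficiency" written with U+FB03
print(len(word))                   # 8 characters instead of 10
print(split_expanded(word, 1))     # ('ef-', 'ficiency')
```

Whether the remaining "fi" on the second line should be turned back into a ligature is left open here; that depends on the language rules mentioned above.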