Issue 71608

Summary:

Bad non-standard hyphenation of diaeresis and Unicode f ligatures

Product:

Writer

Reporter:

nemeth.lacko

Component:

programming

Assignee:

AOO issues mailing list <issues>

Status:

CONFIRMED ---

QA Contact:

Severity:

Trivial

Priority:

CC:

bart.knubben, fonts-bugs, issues, simonbr, thomas.lange

Version:

OOo 2.0.4

Target Milestone:

---

Hardware:

All

OS:

All

Issue Type:

DEFECT

Latest Confirmation in:

---

Developer Difficulty:

---

Attachments:

Description	Flags
Test document	none
screenshot	none
hyphenation pattern for test document (ISO 8859-1, only for Dutch example)	none
hyphenation patterns for test data (Unicode, Dutch and Greek)	none
dictionary.lst (link Unicode hyphenation patterns to en_GB (language of the test document))	none
Better ISO-8859-1 hyphenation patterns: extended with patterns for omaatje and cafeetje.	none
screenshot (OOo messagebox with "reeel 1")	none
screenshot (OOo messagebox with "omatje 2")	none

Description nemeth.lacko 2006-11-16 09:09:57 UTC

It seems that OOo's support for non-standard hyphenation (hyphenation with
alternative spelling) has an implementation bug: it doesn't break Dutch and
Greek words with diaeresis correctly. Non-standard hyphenation mostly works well
(for example, Dutch omaatje -> oma- tje, cafeetje -> café- tje), but not with
diaeresis: reëel -> ree- el is bad, we need re- eel. Maybe we need a hardwired,
language-dependent patch... This is not a problem of the external hyphenator
component, but of Writer's internal hyphenation support. Test data attached.

Comment 1 nemeth.lacko 2006-11-16 09:11:29 UTC

Created attachment 40618 [details]
Test document

Comment 2 nemeth.lacko 2006-11-16 09:12:36 UTC

Created attachment 40619 [details]
screenshot

Comment 3 nemeth.lacko 2006-11-16 09:16:08 UTC

Created attachment 40620 [details]
hyphenation pattern for test document (ISO 8859-1, only for Dutch example)

Comment 4 nemeth.lacko 2006-11-16 09:17:12 UTC

Created attachment 40621 [details]
hyphenation patterns for test data (Unicode, Dutch and Greek)

Comment 5 nemeth.lacko 2006-11-16 09:18:47 UTC

Created attachment 40622 [details]
dictionary.lst (link Unicode hyphenation patterns to en_GB (language of the test document))

Comment 6 michael.ruess 2006-11-16 09:23:40 UTC

Reassigned to SBA.

Comment 7 nemeth.lacko 2006-11-16 09:24:57 UTC

Created attachment 40623 [details]
Better ISO-8859-1 hyphenation patterns: extended with patterns for omaatje and cafeetje.

Comment 8 stefan.baltzer 2006-11-21 08:30:02 UTC

SBA-TL: Please proceed.

Comment 9 thomas.lange 2007-02-09 10:52:32 UTC

When looking at it with SRC680 m200 I found the following:
- in SO the hyphenation position for reëel is re=ëel and the hyphenated word
becomes re=eel. As of m202 the hyphenated word is re=ëel.
- in OOo the hyphenated word is also re=ëel

The above results were directly obtained from the hyphenator.
(You may use the Basic script below to check)
Thus it is a problem of the specific implementations.
As for SO, nothing can be done but report this to the vendor,
and for OOo someone needs to patch the hyphenation patterns.

Thus I'm reassigning this issue to lingucomponent.

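Sub Main

' Hyphenator service to test (LibHnj-based)
xH = createUnoService("org.openoffice.lingu.LibHnjHyphenator")
' Alternative hyphenator service; uncomment to test it instead
'xH = createUnoService("com.sun.star.lingu2.Proximity.Hyphenator")

' Locale for Dutch (Netherlands)
dim nl_NL as new com.sun.star.lang.Locale
nl_NL.Language = "nl"
nl_NL.Country  = "NL"

' Hyphenate the word; no extra properties are passed
xHW = xH.hyphenate( "reëel", nl_NL, 3, DimArray() )
'xHW = xH.hyphenate( "Hundefutter", nl_NL, 3, DimArray() )

' Show the hyphenated word and the hyphen position
msgtxt = " " & xHW.getHyphenatedWord() & " " & xHW.getHyphenPos()
msgbox msgtxt

End Sub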

Comment 10 nemeth.lacko 2007-07-06 02:40:16 UTC

TL: Many thanks for the test and the example.

Comment 11 nemeth.lacko 2007-10-03 20:45:25 UTC

Nemeth->TL: I have tried the script with the attached data, and I got
"re=eel" (reeel 1) and oma=tje (omatje 2), so it seems to me that this is a bug
in OpenOffice.org's implementation, not in the LibHnj non-standard hyphenation
extension. Maybe hyphenpos=1 is wrongly forbidden by the 2-character limit.
Please check my example, not the default Dutch hyphenation patterns. The LibHnj
executable works well on my example. Thanks in advance, Laci

Comment 12 nemeth.lacko 2007-10-03 21:31:05 UTC

Created attachment 48684 [details]
screenshot (OOo messagebox with "reeel 1")

Comment 13 nemeth.lacko 2007-10-03 21:32:48 UTC

Created attachment 48685 [details]
screenshot (OOo messagebox with "omatje 2")

Comment 14 thomas.lange 2007-11-23 11:36:23 UTC

Testing with SRC680 m227:
- SO: reëel gets hyphenated in the document as ree-el
  but the hyphenator says it should be re-eel
- OOo: reëel gets hyphenated as re-ëel
  and the hyphenator says the same.

I don't know which hyphenator is right or wrong (and if the SO hyphenator result
is wrong it can't be fixed on our side, it needs to be reported to the vendor).
But clearly, since the SO hyphenator says re-eel, an actual document should behave
similarly. Thus we have a problem with the algorithm here.

I don't see any problem with OOo hyphenator unless someone says that the result
from the OOo hyphenator should not be re-ëel because that one is wrong.

Does someone have input on the correct hyphenation of reëel?

For the time being I will keep this issue, since there seems to be a
problem with the code for evaluating alternative spellings (as already expected).

Comment 15 thomas.lange 2007-11-23 11:40:23 UTC

TL-Nemeth: I missed that the correct hyphenation for reëel was already listed as
being re-eel. Thus the OOo hyphenator or its dictionary file needs to be fixed.

Since I will use this issue to fix the problem in the code for evaluating
alternative spellings please submit a new one for either or both of the above
changes in OOo.

Comment 16 nemeth.lacko 2007-11-23 13:10:32 UTC

Nemeth->TL: Thanks for your check and comment. This bug report was only a
theoretical problem with attached test data, because nobody was working on Dutch
or Greek non-standard hyphenation patterns a year ago, when I checked my
alternative/non-standard hyphenator patch in OpenOffice.org. But now here is the
result of the OpenTaal project, the extended Dutch hyphenation patterns, and
OpenOffice.org (and StarOffice) can't correctly handle half of the Dutch
non-standard hyphenation described by those patterns.

I believe OpenTaal's activity and results (see
http://www.linux.com/feature/116697 for example) and collaboration with OpenTaal
are very important for the future of OpenOffice.org, because we would have
officially certified spell checking and hyphenation in OpenOffice.org for at
least one language. I have modified the language specifics summary according to
your plan. Thanks in advance, Laci

Comment 17 thomas.lange 2007-11-26 13:32:38 UTC

When checking this I found the problem is not the SvxGetAltSpelling function
(which I suspected to be at fault). Instead it is with the actual implementation
that evaluates that result and does the line breaking.

That has two consequences:
a) If that one is to be fixed it needs to be fixed in each application 
   separately. Thus specific issues for Calc and Draw/Impress are required.
b) I was told the area that is affected by the required change is quite
   tricky and troublesome to change.

Also it looks to me that the actual problem itself is not about the diaeresis at
all, but about the position of the text to be changed.
When comparing it to alternative spelling in the now outdated German pre-reform
spelling, the problem is this:
- in German Bäc-ker changed to Bäk-ker when getting hyphenated
- and in Dutch re-ëel should become re-eel
The difference is that in the German example the char to the left of the
hyphenation position changes (which is sufficient for German), whereas in the
Dutch example it is the one to the right.

The code parts that take care of alternative spellings in Writer are rather old
and were probably implemented for German at that time. No one needed text
changes to the right and thus it was never implemented... :-(
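
The left/right distinction described above can be made concrete with a small
sketch. The Python below is purely illustrative and is not Writer's code:
`changed_span` and `change_side` are hypothetical helper names, and the
hyphenated forms are typed in by hand rather than obtained from a hyphenator.
It only reports on which side of the hyphen the alternative spelling differs
from the original word.

```python
def changed_span(original, alternative):
    """Return (start, end_in_original, end_in_alternative) of the differing region."""
    shortest = min(len(original), len(alternative))
    # Length of the common prefix.
    prefix = 0
    while prefix < shortest and original[prefix] == alternative[prefix]:
        prefix += 1
    # Length of the common suffix, not overlapping the prefix.
    suffix = 0
    while (suffix < shortest - prefix
           and original[-1 - suffix] == alternative[-1 - suffix]):
        suffix += 1
    return prefix, len(original) - suffix, len(alternative) - suffix


def change_side(original, hyphenated):
    """Classify the spelling change relative to the single '-' in `hyphenated`."""
    hyphen_index = hyphenated.index("-")
    alternative = hyphenated.replace("-", "", 1)
    start, end_orig, end_alt = changed_span(original, alternative)
    if start == end_orig and start == end_alt:
        return "none"          # spelling is unchanged
    if end_alt <= hyphen_index:
        return "left"          # e.g. German Bäcker -> Bäk-ker
    if start >= hyphen_index:
        return "right"         # e.g. Dutch reëel -> re-eel
    return "both"


print(change_side("Bäcker", "Bäk-ker"))   # left
print(change_side("reëel", "re-eel"))     # right
```

A handler that only supports the "left" case, as described for Writer above,
would have nothing to apply for the Dutch example.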

Comment 18 thomas.lange 2007-11-26 13:43:10 UTC

If that one gets fixed, it should be done in a future-proof way.
That is:
- the text change need not be directly next to the hyphen
- it should allow for more than one letter changes to the left
- it should allow for more than one letter changes to the right
- it should allow for all of the above at the same time

Basically speaking it should be able to handle all possible results that the
function SvxGetAltSpelling may return. (And that one is flexible enough to allow
for complete new words...)
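
A minimal sketch of what such a general handler might look like, written in
Python for illustration only. `AltSpelling` and `apply_alt_spelling` are
hypothetical names and do not mirror the actual return value of
SvxGetAltSpelling; the assumption is simply that a change can be described by a
start index, a number of replaced characters, and a replacement string in which
"=" marks the hyphen. Because the replacement may contain unchanged letters as
well, the changed text does not have to sit directly next to the hyphen.

```python
from dataclasses import dataclass


@dataclass
class AltSpelling:
    change_pos: int   # index in the original word where the replaced text starts
    change_len: int   # number of original characters that get replaced
    replacement: str  # new text; exactly one "=" marks the hyphen position


def apply_alt_spelling(word, alt):
    """Return (text before the hyphen, text after the hyphen)."""
    if alt.replacement.count("=") != 1:
        raise ValueError("replacement must contain exactly one '='")
    left_new, _, right_new = alt.replacement.partition("=")
    before = word[:alt.change_pos] + left_new
    after = right_new + word[alt.change_pos + alt.change_len:]
    return before, after


# Change to the right of the hyphen (Dutch).
print(apply_alt_spelling("reëel", AltSpelling(2, 1, "=e")))       # ('re', 'eel')
# Change to the left of the hyphen (German pre-reform spelling).
print(apply_alt_spelling("Bäcker", AltSpelling(2, 1, "k=")))      # ('Bäk', 'ker')
# More than one letter on both sides at once (Hungarian asszony -> asz-szony).
print(apply_alt_spelling("asszony", AltSpelling(1, 3, "sz=sz")))  # ('asz', 'szony')
```

The same function covers all four requirements listed above because it never
assumes on which side of the hyphen the replaced characters lie.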

Comment 19 thomas.lange 2007-11-26 15:08:18 UTC

Please take over. Thanks!

Comment 20 nemeth.lacko 2007-11-29 11:39:59 UTC

Nemeth->TL: Many thanks for your help. I also started to analyze this
problem in Writer a few years ago, but I had to stop when I found that the
problem lay outside the linguistic modules. The most important component is
Writer, so it would be fantastic if we had at least a partial solution for the
word processor. Thanks, Laci

Comment 21 frank.meies 2007-12-04 07:46:16 UTC

Comment 22 nemeth.lacko 2010-03-10 10:41:33 UTC

This is also a problem for the hyphenation of f ligatures.

eﬃciency -> ef-ﬁciency (Not even a simple ﬁ -> f=i hyphenation works.)

(By the way, the automatic OpenType solution for ligature handling also has
potential problems: some languages, for example German, don't use ligatures at
word part boundaries in compound words. Also, the HYPHENMIN values depend on
the usage of ligatures. The fi- can be at the end of a line in Hungarian, but
this hyphenation is deprecated with ligatures.)
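
One way a hyphenator front end might cope with the ligature case is to expand
the Unicode f ligatures (U+FB00 to U+FB06) into plain letters before break
points are looked up. The Python sketch below is an illustration under that
assumption, not what Writer or LibHnj does; `expand_ligatures` and
`break_expanded` are hypothetical names, and the break position is supplied by
hand instead of coming from hyphenation patterns. It does not re-form the ﬁ
ligature in the second part and does not address the compound-word or
HYPHENMIN concerns mentioned above.

```python
import unicodedata


def expand_ligatures(word):
    """Replace the Latin ligatures U+FB00..U+FB06 by their plain letters."""
    parts = []
    for ch in word:
        if "\uFB00" <= ch <= "\uFB06":
            # Compatibility decomposition, e.g. U+FB03 -> "ffi".
            parts.append(unicodedata.normalize("NFKD", ch))
        else:
            # Everything else (including letters such as ë) is left untouched.
            parts.append(ch)
    return "".join(parts)


def break_expanded(word, pos):
    """Split the ligature-free form of `word` after `pos` characters."""
    expanded = expand_ligatures(word)
    return expanded[:pos], expanded[pos:]


print(expand_ligatures("e\uFB03ciency"))    # efficiency
print(break_expanded("e\uFB03ciency", 2))   # ('ef', 'ficiency')
```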