Issue 58558 - Discretionary hyphenation patch with Unicode support
Summary: Discretionary hyphenation patch with Unicode support
Status: CLOSED FIXED
Alias: None
Product: lingucomponent
Classification: Code
Component: other
Version: OOo 2.0.1
Hardware: All All
Importance: P3 Trivial
Target Milestone: OOo 2.0.2
Assignee: stefan.baltzer
QA Contact: issues@lingucomponent
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-28 13:11 UTC by nemeth.lacko
Modified: 2006-06-12 15:24 UTC
CC List: 6 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Standalone version of the improved Altlinux Libhnj hyphenator (131.46 KB, application/octet-stream)
2006-01-27 15:03 UTC, nemeth.lacko
Test data (with libhyphen for Linux) (43.89 KB, application/octet-stream)
2006-01-27 17:45 UTC, nemeth.lacko
screenshot of discretionary hyphenation test example in OOo Writer (89.18 KB, image/png)
2006-01-27 17:47 UTC, nemeth.lacko
discretionary hyphenation README (3.52 KB, text/plain)
2006-04-12 12:31 UTC, nemeth.lacko

Description nemeth.lacko 2005-11-28 13:11:05 UTC
OpenOffice.org supports alternations (spelling changes) at hyphenation points:
http://api.openoffice.org/docs/common/ref/com/sun/star/linguistic2/XPossibleHyphens.html
but OOo's hyphenator (AltLinux libhnj) does not support them.

This is crucial for Hungarian hyphenation:

Hungarian has eight two-character letters (cs, dz, gy, ly, ny, sz, ty, zs)
plus one three-character letter (dzs). The long (doubled) forms of these letters
are written in simplified form (ccs, ddz, ddzs, ggy, lly, nny, ssz, tty, zzs), but
hyphenated in full form (cs-cs, dz-dz, dzs-dzs, gy-gy, ly-ly, ny-ny, sz-sz,
ty-ty, zs-zs). For example, "asszonnyal" (with a woman) is hyphenated
asz-szony-nyal. These long forms are very frequent.
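
A minimal Python sketch of the expansion described above. It only rewrites the
simplified long forms into their hyphenated full forms; it is not a Hungarian
hyphenator and not the libhnj code. The function name and table are made up
for this illustration, and the sketch assumes that every occurrence of these
letter sequences really is a long letter, which can fail where the letters
belong to different parts of a word.

```python
import re

# Simplified long form -> hyphenated full form, taken from the list above.
LONG_FORMS = {
    "ddzs": "dzs-dzs",
    "ccs": "cs-cs",
    "ddz": "dz-dz",
    "ggy": "gy-gy",
    "lly": "ly-ly",
    "nny": "ny-ny",
    "ssz": "sz-sz",
    "tty": "ty-ty",
    "zzs": "zs-zs",
}

# Longest alternatives first, so that "ddzs" is tried before "ddz".
_LONG_FORM_RE = re.compile("|".join(sorted(LONG_FORMS, key=len, reverse=True)))


def expand_long_forms(word):
    """Return word with each simplified long letter spelled out and hyphenated."""
    return _LONG_FORM_RE.sub(lambda m: LONG_FORMS[m.group(0)], word)


print(expand_long_forms("asszonnyal"))
```

The call above reproduces the asz-szony-nyal example: "ssz" becomes "sz-sz" and
"nny" becomes "ny-ny".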

It is also useful for the old (but not outdated) German hyphenation.
For example: Schiffahrt -> Schiff-fahrt.
Comment 1 nemeth.lacko 2005-11-28 13:15:18 UTC
Daniel, Björn: please help to extend the German TeX hyphenation patterns with
alternations according to the old German orthography. I will publish an extended
AltLinux libhnj library soon. With the modified library, you can define
hyphenation rules like this for old German:
schiffahrt/schiff=fahrt
or with an index and a deletion length:
schiffahrt/ff=f,5,2
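
To illustrate the second form, here is a small Python sketch of one way to read
the two numbers: the index as a 1-based character position, and the deletion
length as the number of characters replaced from that position. The function
name is hypothetical and this is not the libhnj implementation.

```python
def apply_change(word, replacement, start, length):
    """Replace `length` characters of `word`, starting at the 1-based
    position `start`, with `replacement` ('=' marks the hyphenation point)."""
    begin = start - 1
    return word[:begin] + replacement + word[begin + length:]


# Rule "schiffahrt/ff=f,5,2": replace 2 characters from position 5 with "ff=f".
print(apply_change("schiffahrt", "ff=f", 5, 2))

# Rule "schiffahrt/schiff=fahrt", read here as replacing the whole word.
print(apply_change("schiffahrt", "schiff=fahrt", 1, len("schiffahrt")))
```

Both calls give schiff=fahrt, so under this reading the two rules describe the
same hyphenation.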

Regards,

Laci

Comment 2 ooolist2007 2005-11-28 18:38:12 UTC
Laci, are you aware of the discussions in the lingucomponent list that show  
problems with German hyphenation (subject "state of hyphenation code"): namely  
that TeX hyphenation differs from OOo hyphenation although it shouldn't? Will  
you address that issue, too? The problem I guess is that I never found out  
where *exactly* the bug is.  
  
I won't be able to spend any time working on the old German orthography, but  
I'd be happy to solve the problem mentioned above (which affects new spelling  
but probably also old spelling).  
 
Comment 3 nemeth.lacko 2005-11-29 01:19:24 UTC
Daniel, I think you only have a TeX configuration problem (see my response on
the list). Please check it.

Thanks,

Laci
Comment 4 ooolist2007 2006-01-06 19:41:53 UTC
Laci, see my reply on the mailing list. Unfortunately it's not that simple :-(
Comment 5 ooolist2007 2006-01-14 15:18:44 UTC
The problem with German hyphenation that I mentioned in an earlier comment has
now been solved (see the lingucomponent mailing list), i.e. this bug report can
now be used for its original issue again :-)
 
Comment 6 nemeth.lacko 2006-01-27 15:03:36 UTC
Created attachment 33618
Standalone version of the improved Altlinux Libhnj hyphenator
Comment 7 nemeth.lacko 2006-01-27 15:16:20 UTC
CWS "hyphenator2" contains the improved hyphenator with discretionary hyphenation
and Unicode support. I am attaching the standalone version of the extended AltLinux
Libhnj hyphenator, the test data (see the README in the compressed directory) and a
screenshot.

About testing: the source now has a lot of test data. For example, tests/base
contains and checks 10% of /usr/share/dict/words (make check), and "make
valgrind" also performs memory debugging on these tests. A previous version of
this discretionary hyphenation patch was published with the official Hungarian
OpenOffice.org 2.0.1. The attached test data contains Unicode and ISO8859-1 tests.

Discretionary hyphenation and Unicode support are competitive features (MS
Office does discretionary hyphenation and, I think, Unicode as well).
Comment 8 nemeth.lacko 2006-01-27 17:45:04 UTC
Created attachment 33621
Test data (with libhyphen for Linux)
Comment 9 nemeth.lacko 2006-01-27 17:47:15 UTC
Created attachment 33622
screenshot of discretionary hyphenation test example in OOo Writer
Comment 10 nemeth.lacko 2006-01-27 18:06:48 UTC
.
Comment 11 nemeth.lacko 2006-01-27 18:08:37 UTC
.
Comment 12 stefan.baltzer 2006-02-03 15:27:27 UTC
SBA: Verified in CWS hyphenator2.
Comment 13 simonbr 2006-02-07 19:33:55 UTC
Hi Laci, 

I'm trying to understand how this works; is the following correct?

In "a1atje./a=t,1,3", the part "a1atje." specifies the pattern on which the rule
applies (i.e. "aatje" at the end of a word), and the part "a=t,1,3" means that
the 3 letters in this pattern starting at position 1 ("aat") are replaced by the
hyphenated sequence "a=t" i.e. aatje -> a=tje

What does the '1' in the pattern mean?

Can I simply add rules like this at the end of hyph_nl_NL.dic?
Comment 14 nemeth.lacko 2006-02-07 20:56:56 UTC
Hi Simon,

> In "a1atje./a=t,1,3", the part "a1atje." specifies the pattern on which the rule
> applies (i.e. "aatje" at the end of a word), and the part "a=t,1,3" means that
> the 3 letters in this pattern starting at position 1 ("aat") are replaced by the
> hyphenated sequence "a=t" i.e. aatje -> a=tje

Yes, it does.
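
The reading confirmed above can be sketched in Python as follows. This is only
an illustration, not the libhnj code: the function name is hypothetical, only a
trailing "." (end of word) is handled, and the digits in the pattern are simply
ignored here. The word "omaatje" is an input chosen for this example.

```python
import re


def apply_pattern_rule(word, rule):
    """Apply one 'pattern/replacement,start,length' rule to word.

    Returns the word with '=' at the hyphenation point, or None when the
    pattern does not match.
    """
    pattern, change = rule.split("/")
    replacement, start, length = change.split(",")
    letters = re.sub(r"[0-9]", "", pattern)  # "a1atje." -> "aatje."
    at_word_end = letters.endswith(".")
    core = letters.rstrip(".")
    if at_word_end:
        if not word.endswith(core):
            return None
        match_at = len(word) - len(core)
    else:
        match_at = word.find(core)
        if match_at < 0:
            return None
    # start is a 1-based position inside the matched pattern letters.
    begin = match_at + int(start) - 1
    return word[:begin] + replacement + word[begin + int(length):]


print(apply_pattern_rule("aatje", "a1atje./a=t,1,3"))
print(apply_pattern_rule("omaatje", "a1atje./a=t,1,3"))
```

The first call gives a=tje, as in the question; the second gives oma=tje because
the position is counted from the start of the matched pattern, not of the word.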

> What does the '1' in the pattern mean?

Odd numbers mark break points in Liang's algorithm. For a hyph_nl_NL patch, you
will need large odd numbers to switch the discretionary patterns on.
See README.hyphen in the attached library.

> Can I simply add rules like this at the end of hyph_nl_NL.dic?

Yes, you can, but the result must be processed with the substrings.pl program.

First, you need some hyphenation pattern analysis:

$ grep 'e[0-9]*e[0-9]*t[0-9]*j[0-9]' hyph_nl_NL.dic
.pee5tj2
ee3tj2
hee3tj2
plee5tj2
slee5tj2

$ grep 'e[0-9]e[0-9]' hyph_nl_NL.dic
.al3e4e4
.met5e4e2
.ne4e4
.on4te4e2
.op5e4e4
.p4e4e4
.s4e4e3
afe4e4
a3ge4e4
ave4e4
...

Even numbers mark "no break" points. We need an odd number greater than 4
to permit discretionary hyphenation:

e5etje/é=t,1,3
a5atje/a=t,1,3
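
Why the number has to be greater than 4 can be seen in a simplified Python
sketch of the level rule in Liang's algorithm: every matching pattern
contributes its numbers, the highest number wins at each position, and an odd
winner permits a break. The function names are hypothetical, the replacement
part after "/" is left out, minimum distances from the word edges are not
modelled, and "afeetje" is just a made-up test string.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'afe4e4' into its letters and gap levels."""
    letters = re.sub(r"[0-9]", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)  # level of the gap after `pos` letters
        else:
            pos += 1
    return letters, levels


def gap_levels(word, patterns):
    """Return the highest level found for every gap of '.word.'."""
    dotted = "." + word + "."
    result = [0] * (len(dotted) + 1)
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = dotted.find(letters)
        while start >= 0:
            for offset, level in enumerate(levels):
                result[start + offset] = max(result[start + offset], level)
            start = dotted.find(letters, start + 1)
    return result


def break_after(word, patterns):
    """Return the letter counts after which a break is allowed (odd level)."""
    levels = gap_levels(word, patterns)
    # Gap g lies before dotted[g], i.e. after g - 1 letters of the word.
    return [g - 1 for g in range(2, len(word) + 1) if levels[g] % 2 == 1]


base = ["afe4e4", "ee3tj2"]
print(break_after("afeetje", base))
print(break_after("afeetje", base + ["e5etje"]))
print(break_after("afeetje", base + ["e5etje", "afe6etje"]))
```

With the two existing patterns the level 4 values suppress every break, so the
first list is empty. Adding e5etje puts a 5 between the two e's and the second
call reports a break after three letters. A still higher even number, as in
the third call, switches that break off again.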

(there is also another discretionary hyphenation group in Dutch:
``Dutch has two unusual hyphenations: The diminutive tje (for example) strootje
is hyphenated stro-tje the final o of the first syllable being deleted in the
hyphenated word. Where the gei(diaeresis), transliterated (geI), geu(diaereses),
transliterated (geU), gee(diaeresis), transliterated (geE), koo(diaeresis),
transliterated (koO), hyphenate ge-i, ge-e, ge-u and ko-o the second vowel
having lost the diaeresis.'': http://www.hyphenologist.co.uk/man-4-415.htm)

If you need exceptions, you can add patterns whose hyphenation point is marked
with a greater number:

example6etje (no break point)

or

example7etje (no discretionary hyphenation)

See also README.discretionary in the altlinuxHyph2.tar.gz.

I also suggest making a Dutch test file.

Regards,
Laci

Comment 15 nemeth.lacko 2006-04-12 12:31:16 UTC
Created attachment 35655
discretionary hyphenation README
Comment 16 stefan.baltzer 2006-06-12 15:24:43 UTC
SBA: Closed.