Issue 109543 - Update Hyphen hyphenation library (improved hyphenation) and English hyphenation patterns
Status: CLOSED FIXED
Alias: None
Product: lingucomponent
Classification: Code
Component: other
Version: OOo 3.2
Hardware: Unknown All
Importance: P3 Trivial
Target Milestone: 3.4.1
Assignee: stefan.baltzer
QA Contact: issues@lingucomponent
URL: https://sourceforge.net/projects/huns...
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-23 16:19 UTC by nemeth.lacko
Modified: 2017-05-20 09:01 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Improved English hyphenation dictionaries (92.53 KB, application/x-compressed)
2010-11-27 03:09 UTC, nemeth.lacko
A list with ~450 words with bad (<) and fixed (>) hyphenation. (17.70 KB, text/plain)
2010-11-29 22:12 UTC, nemeth.lacko

Description nemeth.lacko 2010-02-23 16:19:06 UTC
Please update Hyphen library to fix hyphenation and improve hyphenation of words
with Unicode ligatures.

The lefthyphenmin calculation for UTF-8 encoded hyphenation dictionaries had an
error that resulted in missing hyphenation points in some words with diacritics.
See tests/lhmin.* in the Hyphen 2.5 source distribution
(https://sourceforge.net/projects/hunspell/files/Hyphen/2.5/hyphen-2.5.tar.gz/download).
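
The report does not say what the faulty calculation looked like. As a minimal
sketch of the general idea only, assuming break points are stored as byte
offsets into the UTF-8 encoded word, the left minimum has to be counted in
characters rather than bytes. The function names and the sample words below are
hypothetical and are not taken from the Hyphen source:

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def chars_before(data: bytes, offset: int) -> int:
    """Count characters (not bytes) in data[:offset]."""
    return sum(1 for b in data[:offset] if not is_continuation(b))


def apply_left_min(word: str, byte_breaks: set, left_min: int) -> set:
    """Keep the byte offsets that have at least left_min characters before them.

    byte_breaks holds candidate break points as byte offsets into the
    UTF-8 encoding of word (an assumption made for this sketch).
    """
    data = word.encode("utf-8")
    return {off for off in byte_breaks if chars_before(data, off) >= left_min}


# "ža|ba": the break sits at byte offset 3 but after only 2 characters,
# so a byte-based comparison and a character-based one disagree for left_min=3.
print(apply_left_min("žaba", {3}, 3))
# "re|ši|tev": byte offsets 2 and 5 correspond to 2 and 4 characters.
print(apply_left_min("rešitev", {2, 5}, 2))
```

For pure ASCII words the two ways of counting agree, which is consistent with
the problem showing up only in words with diacritics.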

The improved en_US hyphenation dictionary
(https://sourceforge.net/projects/hunspell/files/Hyphen/2.5/hyph_en_US.zip/download)
of Hyphen 2.5 also contains hyphenation patterns for words with Unicode f
ligatures. The new UTF-8 encoded dictionary was checked on the Linux words list.
The apostrophe patch for the en_US hyphenation patterns of OpenOffice.org has
been converted to recognize typographic apostrophes, too.
Comment 1 rene 2010-02-23 16:45:54 UTC
nemethl: sorry, but it doesn't build; missed a file in the tarball?

[...]
make[1]: Entering directory `/tmp/hyphen-2.5'
perl ./substrings.pl hyphen.us3 hyphen.us4 UTF-8 2 3 >/dev/null
cat hyphen.us4 | /bin/sed -f ./ooopatch.sed >hyph_en_US.dic
/bin/sed: couldn't open file ./ooopatch.sed: No such file or directory
make[1]: *** [hyph_en_US.dic] Error 4
make[1]: Leaving directory `/tmp/hyphen-2.5'
make: *** [all-recursive] Error 1
Comment 2 nemeth.lacko 2010-02-23 17:49:24 UTC
Rene: thanks, I have updated the file.
Comment 3 nemeth.lacko 2010-03-12 10:56:20 UTC
I have found a newly introduced problem in the hyphenation of OpenOffice.org 3.2.
I have solved the ugliest cases, hyphenation points one character away from a
dash (e.g. something—t=wo, ad=d-on), in the new release of the improved English
dictionaries (http://extensions.services.openoffice.org/hu/project/dict-en-fixed),
but I will make a new Hyphen release to solve the others.
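
One way to picture the intended behaviour is to apply the minimum left and right
distances to each dash-separated part of the word instead of to the whole
string. The following is a rough sketch under stated assumptions, not the Hyphen
implementation: the "=" break marker follows the examples in this issue, while
the function names and the default minimums of 2 and 3 are hypothetical.

```python
import re

BREAK = "="  # break marker used in this issue's examples


def _filter_part(part: str, left_min: int, right_min: int) -> str:
    """Drop breaks that leave too few letters on either side within one part."""
    chunks = part.split(BREAK)
    total = sum(len(c) for c in chunks)
    result = chunks[0]
    seen = len(chunks[0])  # letters to the left of the candidate break
    for chunk in chunks[1:]:
        if seen >= left_min and total - seen >= right_min:
            result += BREAK
        result += chunk
        seen += len(chunk)
    return result


def filter_breaks(marked: str, left_min: int = 2, right_min: int = 3,
                  separators: str = "-\u2014") -> str:
    """Apply the minimums separately to every dash-separated part.

    The default values and the separator set (hyphen-minus and em dash)
    are assumptions chosen for illustration.
    """
    # The capturing group keeps the separators so the word can be rebuilt.
    pieces = re.split("([" + re.escape(separators) + "])", marked)
    out = []
    for piece in pieces:
        if len(piece) == 1 and piece in separators:
            out.append(piece)
        else:
            out.append(_filter_part(piece, left_min, right_min))
    return "".join(out)


print(filter_breaks("ad=d-on"))
print(filter_breaks("something\u2014t=wo"))
```

With these assumed minimums, both one-character breaks next to the dash are
removed, while breaks further inside a part are left alone.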
Comment 4 nemeth.lacko 2010-07-13 11:20:42 UTC
Confirmed by the Slovenian NLP. 

(2010/7/13 Martin Srebotnjak <miles@filmsi.net>:
> Hello, Laszlo and Caolan,
>
> Slovenian users reported having problems with Slovenian hyphenation in
> OpenOffice.org. Mojca Miklavec who worked on updates of LaTeX
> hyphenation reported it already some time ago. Now we tested it and we
> are baffled.
>
> We first noticed problems with words that have syllables starting
> with our special characters like "č", "š" and "ž" (words like
> "zaživeti", "načeloma" and "rešitev"). OpenOffice.org does not offer
> hyphenation before those syllables; the expected hyphenation of these
> words would be "za-ži-ve-ti", "na-če-lo-ma", "re-ši-tev". But we also
> found words without č, š or ž that are not hyphenated properly, like
> "poleteti"; OpenOffice.org splits it like "pole-te-ti", while it
> should be "po-le-te-ti". Here is a nice online tool that can display
> the current OpenOffice.org hyphenation for Slovenian:
> http://www.ushuaia.pl/hyphen/?ln=en
>
> The same patterns are used in LaTeX and reportedly work fine. We
> checked the file and noticed it was in ISO-1 and not in UTF, but that
> does not seem to be the problem, as I converted them to UTF and had
> same problem. I even created a test dict pack with it (with UTF-8
> hyphenation patterns) here:
> http://dl.dropbox.com/u/4316668/pack-sl.oxt
>
> We tested this on OOO330m0 and on 3.2.1 and on older versions and the
> problems are the same. Obviously this goes on from the start just no
> one noticed it. I first contacted Thomas Lange, and after checking
> that the patterns do include the rules for above mentioned words and
> that the encoding itself might not be the problem, he mentioned, that
> the hyphenation included in OpenOffice.org might not be equal to the
> LaTeX hyphenation. So I looked up who the owners of the
> Hunspell/Hyphen project are and found you. :)
>
> So, I have a plea for help - could you look into Slovenian hyphenation
> rules and Hyphen code at least for these few words and see what the
> problem might be? If it is something trivial we would try to run for
> 3.3 release, otherwise we need to plan needed work for future
> versions.)
Comment 5 nemeth.lacko 2010-11-27 02:39:54 UTC
The new NOHYPHEN feature of Hyphen 2.7 can fix the hyphenation problem of words
containing hyphen characters, as well as the old problem with apostrophes.
Comment 6 nemeth.lacko 2010-11-27 03:09:15 UTC
Created attachment 75148
Improved English hyphenation dictionaries
Comment 7 nemeth.lacko 2010-11-27 03:17:32 UTC
The attached English hyphenation dictionaries (an improved version of the latest
English hyphenation patterns of OOo) solve both causes of the hyphenation
problems with hyphen and apostrophe characters:
1. missing word boundary patterns, i.e. the TeX "1foo." pattern didn't match
"1foo's" in OpenOffice.org (but this was not a problem for TeX);
2. bad hyphenmin values.
Comment 8 nemeth.lacko 2010-11-27 03:22:17 UTC
(A little correction: the "1foo." pattern did match the word "barfoo's" in OOo,
thanks to a difficult trick, but this was not true for words with hyphen
characters, or for words with other apostrophe positions and combinations.)
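
The word boundary problem can be reproduced with a small, generic Liang-style
pattern matcher. This is only a sketch of the technique, assuming single-digit
weights and "." as the word boundary marker as in TeX patterns; it is not the
Hyphen code, and "foo", "barfoo" and "baz" are placeholder strings.

```python
import re


def parse_pattern(pattern: str):
    """Split a pattern such as '1foo.' into its letters and gap weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    i = 0
    for ch in pattern:
        if ch.isdigit():
            weights[i] = int(ch)  # weight of the gap before letters[i]
        else:
            i += 1
    return letters, weights


def hyphenate(word: str, patterns) -> str:
    """Mark breaks with '=' where the highest matching weight is odd."""
    text = "." + word + "."  # boundary markers only at the ends of the token
    levels = [0] * (len(text) + 1)
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for k, w in enumerate(weights):
                levels[start + k] = max(levels[start + k], w)
            start = text.find(letters, start + 1)
    out = []
    for j, ch in enumerate(word):
        # levels[j + 1] is the gap in front of word[j]
        if j > 0 and levels[j + 1] % 2 == 1:
            out.append("=")
        out.append(ch)
    return "".join(out)


def hyphenate_parts(word: str, patterns) -> str:
    """Hyphenate each hyphen-separated part as a word of its own."""
    return "-".join(hyphenate(part, patterns) for part in word.split("-"))


patterns = ["1foo."]
print(hyphenate("barfoo", patterns))            # boundary pattern applies
print(hyphenate("barfoo-baz", patterns))        # "foo." is not at the token end
print(hyphenate_parts("barfoo-baz", patterns))  # each part gets its own boundary
```

When the whole hyphenated token is matched as one word, the pattern ending in
"." can only apply at the very end, so the first part loses its break point.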
Comment 9 thomas.lange 2010-11-29 07:48:12 UTC
tl->nemeth: Is there some action for me to take right now?
Comment 10 nemeth.lacko 2010-11-29 22:08:16 UTC
nemeth->tl: As far as I know, this is an important fix for some Indic languages
with UTF-8 encoded hyphenation patterns. Moreover, the new English hyphenation
patterns solve several hyphenation problems, too. I would be glad of your help.

Test cases (words with hyphens) for the improved Hyphen library and English
dictionaries:

old: en=glish-speaker
new: eng=lish-speaker

old: non-metropolitan
new: non-met=ro=pol=i=tan

old: un=der-sh=er=iff
new: un=der-sher=iff

old: twen=ty-one
new: twenty-one
Comment 11 nemeth.lacko 2010-11-29 22:12:17 UTC
Created attachment 75168
A list with ~450 words with bad (<) and fixed (>) hyphenation.
Comment 12 nemeth.lacko 2010-12-01 10:09:58 UTC
There is a new bug fix release of the library:

http://sourceforge.net/projects/hunspell/files/Hyphen/2.7/hyphen-2.7.1.tar.gz/download
Comment 13 thomas.lange 2010-12-02 09:54:04 UTC
tl->nemeth: 
We use your extension from
http://extensions.services.openoffice.org/en/project/dict-en-fixed with OOo.
That one has a hyph_en_US.dic and a hyph_en_GB.dic, am I correct to assume that
the word list diff should be applied to both of them?

Also I'm going to keep the extension identifier but will just modify the version
entry to match the current date.
Comment 14 nemeth.lacko 2010-12-02 10:52:55 UTC
nemeth->tl: you are correct, these are the replacements of the latest hyphenation 
patterns. Thanks in advance for the extension modification, too.
Comment 15 thomas.lange 2010-12-03 08:54:34 UTC
Updated hyphen library in OOo to v2.7.1.

Files changed:
M hyphen\makefile.mk
M ooo.lst
A hyphen\hyphen-2.7.1.patch
R hyphen\hyphen-2.4.patch

Applying the nohyphenfix.txt patch file is still outstanding.
Comment 16 thomas.lange 2010-12-03 08:54:48 UTC
.
Comment 17 thomas.lange 2010-12-03 08:55:17 UTC
.
Comment 18 nemeth.lacko 2010-12-03 15:59:17 UTC
nemeth->tl: many thanks for it.
Comment 19 thomas.lange 2010-12-09 10:24:58 UTC
tl->nemeth: Since there seems to be nothing left to do for this I'm setting this
to fixed. If there is something left to do for me with that word list patch let
me know. 
And thanks for the update! ^_^
Comment 20 thomas.lange 2010-12-10 10:05:33 UTC
TL->SBA: Please verify. Thanks!
Comment 21 stefan.baltzer 2011-01-21 16:39:21 UTC
Verified in CWS tl84.
Comment 22 nemeth.lacko 2011-01-21 23:55:42 UTC
nemeth->tl,sba: many thanks again for the integration and the verification.