Issue 74034 - Alternative BreakIterator_ja based on morphological analysis
Summary: Alternative BreakIterator_ja based on morphological analysis
Status: CONFIRMED
Alias: None
Product: Internationalization
Classification: Code
Component: i18npool
Version: 680m201
Hardware: All Linux, all
Importance: P3 Trivial with 1 vote
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-01-31 06:59 UTC by bluedwarf
Modified: 2017-05-20 11:27 UTC
CC List: 8 users

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
The proposed patch for i18npool module (6.96 KB, patch)
2007-01-31 07:01 UTC, bluedwarf
The improved patch (added switch option) (9.32 KB, patch)
2007-02-01 02:20 UTC, bluedwarf
The copyright notice of "ipadic" (3.71 KB, text/plain)
2007-02-01 02:57 UTC, bluedwarf
Improved patch (depends on i76536) (10.85 KB, patch)
2007-04-20 09:05 UTC, bluedwarf

Description bluedwarf 2007-01-31 06:59:06 UTC
I've created a patch that provides an alternative BreakIterator_ja based on
morphological analysis by MeCab. MeCab is a Japanese morphological
analyser available from

 http://mecab.sourceforge.net/

It analyses Japanese sentences accurately enough that this patch provides a
more sophisticated BreakIterator for Japanese than the current one, which is
based only on a word dictionary.

However, integrating this patch into the OOo vanilla source tree still raises
some problems. Could you give me some ideas for solving the following
problems? I'm not very familiar with OOo development.

Problem 1:
The MeCab library is not installed by default on most Linux systems. We need at
least a configure option to determine whether OOo links to MeCab.

Problem 2:
MeCab is a thread-safe library as long as one MeCab::Tagger instance is not
shared by multiple threads. That means my patch is also safe as long as one
BreakIterator_ja instance is not shared by multiple threads. If an instance is
shared by multiple threads, access to it must be protected by a mutex.
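
The mutex requirement in Problem 2 could be met roughly as in the sketch below. This is not taken from the attached patch. StubTagger is a hypothetical stand-in for a non-thread-safe analyser such as a single MeCab::Tagger instance (it only splits on spaces, so the sketch runs without MeCab), and LockedTagger is a hypothetical wrapper name.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for an analyser that keeps mutable state between
// calls, as one MeCab::Tagger instance does. It only splits on spaces.
class StubTagger {
public:
    std::vector<std::string> split(const std::string& text) {
        scratch_.clear();  // shared mutable state: unsafe if two threads race here
        std::string word;
        for (char c : text) {
            if (c == ' ') {
                if (!word.empty()) scratch_.push_back(word);
                word.clear();
            } else {
                word += c;
            }
        }
        if (!word.empty()) scratch_.push_back(word);
        return scratch_;   // returns a copy of the result
    }
private:
    std::vector<std::string> scratch_;
};

// Hypothetical wrapper: every call into the shared tagger is serialised
// by one mutex, so the wrapper itself may be shared between threads.
class LockedTagger {
public:
    std::vector<std::string> split(const std::string& text) {
        std::lock_guard<std::mutex> guard(mutex_);
        return tagger_.split(text);
    }
private:
    std::mutex mutex_;
    StubTagger tagger_;
};

// Usage sketch: the same LockedTagger object can be handed to several threads.
void usage_example() {
    LockedTagger shared;
    assert(shared.split("a bc d").size() == 3);
}
```

An alternative that avoids locking altogether would be one tagger instance per thread. Which of the two fits better depends on how BreakIterator_ja instances are actually shared inside OOo, which this report does not state.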
Comment 1 bluedwarf 2007-01-31 07:01:47 UTC
Created attachment 42584
The proposed patch for i18npool module
Comment 2 pavel 2007-01-31 07:03:51 UTC
reassign to Karl.
Comment 3 maho.nakata 2007-01-31 07:30:00 UTC
bluedwarf:
For problem 1:
Ask mh about the legal issues. I think integration as an external project might
not be so difficult. According to http://mecab.sourceforge.net/ , MeCab
is licensed under LGPL/GPL/BSD, so I believe there are not many problems.
BTW: we need a dictionary. How do we prepare one?
Comment 4 ooo 2007-01-31 10:35:03 UTC
The first obstacle I see in using MeCab is that all available documentation seems
to be in Japanese only. Who is going to maintain that? Second, as Maho
mentioned, we seem to need a dictionary.

Btw, the attached patch replaces the current implementation (and somehow
erroneously duplicates the BreakIterator_ko ctor/dtor). If we decide to use
MeCab, I would prefer making it a configurable alternative instead, as long as we
don't know whether it suits our needs or works on every platform we support.

There also seems to be room for improvement in MeCab itself: ucstable.h comes
with three static encoding tables containing 64k short int entries each,
consisting mostly of 0x0000, which makes up 384k of nearly wasted memory. Maybe
we could reuse our own textencoding converters instead, or make use of
MECAB_USE_UTF8_ONLY, which wouldn't need these tables. Which raises another
question: why is this back-and-forth conversion between UCS2 and EUC_JP
(or maybe UTF-8) needed at all if MeCab internally uses UCS2 anyway?
It seems to lack an interface for UCS2.
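
The 384k figure in the comment above can be checked with a few compile-time constants. The constant names are made up for this sketch, and it assumes that "short int" is 2 bytes, which the comment implies but does not state.

```cpp
#include <cstddef>
#include <cstdint>

// Figures quoted in the comment: three tables, 64k entries each,
// one 16-bit value per entry (assumption: short int is 2 bytes).
constexpr std::size_t kTables = 3;
constexpr std::size_t kEntriesPerTable = 64 * 1024;
constexpr std::size_t kBytesPerEntry = sizeof(std::uint16_t);

constexpr std::size_t kTotalBytes = kTables * kEntriesPerTable * kBytesPerEntry;

// 3 * 65536 * 2 = 393216 bytes, which is 384 * 1024.
static_assert(kTotalBytes == 393216, "3 * 65536 * 2 bytes");
static_assert(kTotalBytes / 1024 == 384, "384 KiB, matching the comment");
```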
Comment 5 bluedwarf 2007-02-01 02:20:29 UTC
Created attachment 42619
The improved patch (added switch option)
Comment 6 karl.hong 2007-02-01 02:32:20 UTC
.
Comment 7 bluedwarf 2007-02-01 02:55:35 UTC
Thanks maho and er for your comments.

I've added a switch option to the new improved patch to decide whether
OOo links to MeCab. If you want to enable MeCab, set the ENABLE_MECAB environment
variable to "TRUE". As for the interface that MeCab lacks, I'll try to
create a new patch for MeCab later.

The license situation of the dictionary is confusing. The dictionary that the MeCab
developer recommends is "ipadic", which is licensed under a nearly open-source license.
I will attach the license terms later.

Even if the dictionary cannot be integrated, that is not a big problem. If
a dictionary is not installed properly, the MeCab::createTagger method
returns NULL and the new improved patch uses fallback code based on
the current BreakIterator. Users may install a dictionary themselves in order to
use the alternative MeCab-based BreakIterator.
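
The fallback described above might look roughly like the following sketch. It is not the code from the attachment. WordBreaker, DictionaryBreaker, MorphBreaker, try_create_morph and make_breaker are hypothetical names, and the boolean parameter merely simulates whether a dictionary is installed so that the example runs without MeCab.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical interface standing in for the two break-iterator back ends.
struct WordBreaker {
    virtual ~WordBreaker() {}
    virtual std::string name() const = 0;
};

// Stand-in for the existing word-list based implementation.
struct DictionaryBreaker : WordBreaker {
    std::string name() const override { return "dictionary"; }
};

// Stand-in for the MeCab-backed implementation.
struct MorphBreaker : WordBreaker {
    std::string name() const override { return "mecab"; }
};

// Hypothetical factory mimicking the behaviour reported for
// MeCab::createTagger: a null pointer when no usable dictionary is found.
std::unique_ptr<WordBreaker> try_create_morph(bool dictionary_installed) {
    if (!dictionary_installed) return nullptr;
    return std::unique_ptr<WordBreaker>(new MorphBreaker());
}

// Prefer the morphological back end; fall back to the current one.
std::unique_ptr<WordBreaker> make_breaker(bool dictionary_installed) {
    std::unique_ptr<WordBreaker> breaker = try_create_morph(dictionary_installed);
    if (!breaker) breaker.reset(new DictionaryBreaker());
    assert(breaker);  // one of the two back ends is always available
    return breaker;
}
```

With this shape, `make_breaker(false)` yields the dictionary-based back end and `make_breaker(true)` the morphological one. Word boundaries therefore depend on what is installed on the user's system.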
Comment 8 bluedwarf 2007-02-01 02:57:06 UTC
Created attachment 42621
The copyright notice of "ipadic"
Comment 9 maho.nakata 2007-02-01 02:59:20 UTC
bluedwarf:
Please ask mh for legal review. Others are not responsible for this...
Comment 10 maho.nakata 2007-02-01 03:01:51 UTC
bluedwarf:
In ftp://ooopackages.good-day.net/pub/OpenOffice.org/sources/OOo_1.1.5m58_source.tar.bz2
we have ./i18npool/source/breakiterator/data/ja.dic.
Do you think we can use it?
I don't like such a fallback method, as the result can differ from system to system.

And please discuss the license terms for IPADIC with mh, not here.
Comment 11 bluedwarf 2007-02-01 03:15:08 UTC
maho:
OK, I'll write a mail to mh about legal issues.
And ja.dic is just a word list. It's not a dictionary for morphological analysis.


Comment 12 bluedwarf 2007-02-22 08:40:44 UTC
I discussed MeCab integration with the MeCab developer. He gave me the
following helpful advice.

=====
First of all, MeCab is designed to be independent of specific character
encodings. So it works correctly as long as the character encoding of the
input string is the same as that of the dictionary.

Thus, in principle, we can pass a UCS-2(BE|LE) string to MeCab through the
current interface without having to create a new interface, provided the
MeCab dictionary is encoded in UCS-2(BE|LE). However, we need a lot
of modifications to support a UCS-2(BE|LE) dictionary because MeCab uses
"char *" strings and treats 0x00 as the end of a string.

In addition, the comment "All internal codes are represented in UCS2,"
in ucs.h implies that MeCab calls *_to_ucs2 functions to determine the
type of characters included in unknown words. The process for known
words and the one for unknown words are distinct. Only the latter
calls *_to_ucs2.
=====

In other words, MeCab does not needlessly convert every UTF-8 string to and
from UCS2. Writing patches for this seems to be very difficult and, in my
humble opinion, such patches would not affect OOo's performance. In
conclusion, the most practical approach is for OOo to pass UTF-8 strings to
MeCab. Of course, we should set MECAB_USE_UTF8_ONLY = 1 in order to
remove the useless conversion tables.
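
To illustrate why UTF-8 suits a "char *" interface while UCS-2 does not, here is a self-contained sketch. All function names are made up for the example, and the converter is hand-written so that it has no dependencies. Inside OOo the conversion would presumably go through the existing string classes instead, which is an assumption and not something stated in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Converts UTF-16 code units to UTF-8 bytes. Unpaired surrogates are
// replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;                                   // consume the low surrogate
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     // stray low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Serialises code units as big-endian byte pairs, the layout a UCS-2BE
// encoded dictionary or input buffer would use.
std::string ucs2be_bytes(const std::u16string& in) {
    std::string out;
    for (char16_t unit : in) {
        out += static_cast<char>(unit >> 8);
        out += static_cast<char>(unit & 0xFF);
    }
    return out;
}

// Length as seen by code that treats the buffer as a NUL-terminated char*.
std::size_t c_string_length(const std::string& bytes) {
    return std::strlen(bytes.c_str());
}

// Usage sketch: "A" survives as UTF-8 but is cut off as UCS-2BE,
// because its first byte there is 0x00.
void usage_example() {
    assert(c_string_length(utf16_to_utf8(u"A")) == 1);
    assert(c_string_length(ucs2be_bytes(u"A")) == 0);
}
```

UTF-8 never produces a 0x00 byte for a non-NUL character, so the existing "char *" interface can carry it unchanged.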

I'm sorry, but the legal issue has not been solved yet. The external
project pages explain how to integrate external source code, so I
decided not to send a mail to mh and will instead follow the instructions
on the external project website.
Comment 13 ooo 2007-04-16 16:36:25 UTC
Martin, what is the progress with the legal affairs?
Comment 14 bluedwarf 2007-04-20 09:05:31 UTC
Created attachment 44561
Improved patch (depends on i76536)
Comment 15 bluedwarf 2007-04-20 09:15:21 UTC
The new patch doesn't link to the system library but depends on libmecab.so
generated by the new top-level module "mecab". See i76536 for details of the new module.

To make the alternative BreakIterator available, the OOO_MECAB_DIC_DIR
environment variable must be set correctly before launching OOo. Its value
is the directory where the MeCab dictionary is installed (for example,
/usr/local/lib/mecab/dic/ipadic).
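
Reading that variable could be sketched as below. This is an illustration only. mecab_args_from_env is a hypothetical helper, and passing the directory through MeCab's "-d" (dictionary directory) option is an assumption about how the patch forwards the value, not something this issue confirms.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: builds an argument string for MeCab::createTagger
// from OOO_MECAB_DIC_DIR. Returns an empty string when the variable is
// unset or empty, in which case the caller would use the old iterator.
// Assumption: the directory is forwarded via MeCab's "-d" option.
std::string mecab_args_from_env() {
    const char* dir = std::getenv("OOO_MECAB_DIC_DIR");
    if (dir == nullptr || *dir == '\0') return std::string();
    return std::string("-d ") + dir;
}

// Usage sketch (setenv is POSIX, not standard C++).
void usage_example() {
    setenv("OOO_MECAB_DIC_DIR", "/usr/local/lib/mecab/dic/ipadic", 1);
    assert(mecab_args_from_env() == "-d /usr/local/lib/mecab/dic/ipadic");
}
```

A directory path containing spaces would need extra care with this approach, since the option and its value travel in a single string.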
Comment 16 ooo 2007-08-28 14:51:53 UTC
Martin (quarterly nag screen popping up),

What is the progress with the legal affairs for this library?

  Eike
Comment 17 Mathias_Bauer 2009-06-10 11:11:51 UTC
As khong no longer works on OOo, we must find somebody else. As the legal
questions still have not been answered, I now assign the issue to mh and in the
meantime will try to find out who can take over in case the legal part has been
solved.
Comment 18 Martin Hollmichel 2009-06-10 11:28:37 UTC
I don't see any problems with including the mecab src under the terms and conditions
of the BSD license, please proceed.
Comment 19 Mathias_Bauer 2009-06-12 16:03:06 UTC
@bluedwarf: are you still around? Now that khong isn't working on OOo anymore, it
seems that my team will take this over. I'm afraid that we will not be able to work
on it without your help.

What do you think about the following plan: we integrate the patch for 3.2 and
rework it a bit so that the old and the new break iterator can both be used,
selectable by a switch. Default will be the old one. We can provide an extension
doing the switch to the new one so that we can give the new break iterator a
broader test based on OOo 3.2.
Comment 20 bluedwarf 2009-06-12 18:29:14 UTC
I'm still here. Thank you for your kind assistance.

Your plan seems good, go ahead. If you have any problems with my patch, let me know
and I will take care of them.
Comment 21 Pedro Giffuni 2011-12-01 15:38:39 UTC
FWIW, Oliver is working on the break iterator as part of the IP Clearance.
Comment 22 Oliver-Rainer Wittmann 2011-12-01 16:03:16 UTC
I am sorry - currently I have no clue about the break iterator.
Comment 23 Rob Weir 2013-03-11 15:03:44 UTC
I'm adding this comment to all open issues with Issue Type == PATCH.  We have 220 such issues, many of them quite old.  I apologize for that.  

We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0.

If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know.

On the other hand, if the patch is no longer relevant, please let us know that as well.

If you have any general questions or want to discuss this further, please send a note to our dev mailing list:  dev@openoffice.apache.org

Thanks!

-Rob
Comment 24 Marcus 2017-05-20 11:27:35 UTC
Reset assignee to the default "issues@openoffice.apache.org".