Issue 71449 - hunspell: contains large utf_lst table
Summary: hunspell: contains large utf_lst table
Status: CLOSED FIXED
Alias: None
Product: General
Classification: Code
Component: spell checking
Version: 3.3.0 or older (OOo)
Hardware: PC Linux, all
Importance: P3 Trivial
Target Milestone: ---
Assignee: stefan.baltzer
QA Contact: issues@lingucomponent
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-11-11 12:30 UTC by caolanm
Modified: 2013-02-24 20:42 UTC
CC List: 5 users

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
how about this... (17.31 KB, patch)
2006-11-11 15:11 UTC, caolanm
actually, this instead I think, bubble the language down always (17.35 KB, patch)
2006-11-12 14:54 UTC, caolanm
Unicode test data (to check ős->Ős casing without Hunspell's conversion table) (6.48 KB, application/vnd.sun.xml.writer)
2007-03-22 22:58 UTC, nemeth.lacko

Description caolanm 2006-11-11 12:30:20 UTC
hunspell for spellchecking contains a huge utf_lst table for uppercasing and
lowercasing characters. It apparently covers all of Unicode, and each entry
holds the Unicode code point plus the matching upper/lower code points. That's
a pretty big damn table.

We have ICU in OOo, and its uchar.h provides u_tolower and u_toupper. Can we
rejig hunspell to use those at runtime to determine the uppercase and lowercase
of a Unicode character, and drop this table?
Comment 1 caolanm 2006-11-11 12:31:52 UTC
reassigning
Comment 2 caolanm 2006-11-11 15:11:21 UTC
Created attachment 40518 [details]
how about this...
Comment 3 caolanm 2006-11-11 15:13:42 UTC
Would that patch fit your needs? It uses an ifdef: inside OOo it calls the ICU
toupper/tolower, while a standalone hunspell still includes and uses the table.

before: du ../../../unxlngi6.pro/lib/libhunspell.so
212     ../../../unxlngi6.pro/lib/libhunspell.so

after:  du ../../../unxlngi6.pro/lib/libhunspell.so
164     ../../../unxlngi6.pro/lib/libhunspell.so
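
(Illustration only, not the attached patch.) The ifdef idea above could look roughly like the sketch below. The macro name USE_ICU_CASING, the helper names unicode_toupper/unicode_tolower, and the four-entry table are assumptions made up for this example; only u_toupper and u_tolower from ICU's unicode/uchar.h are real API. The standalone branch stands in for the full utf_lst table with a handful of entries.

```cpp
// Hypothetical build flag: define USE_ICU_CASING when building inside a
// tree that already ships ICU; leave it undefined for a standalone build.
#ifdef USE_ICU_CASING

#include <unicode/uchar.h>  // ICU: u_toupper(UChar32), u_tolower(UChar32)

// Ask ICU at run time, so no casing table is compiled into the library.
inline unsigned short unicode_toupper(unsigned short c) {
    return static_cast<unsigned short>(u_toupper(static_cast<UChar32>(c)));
}
inline unsigned short unicode_tolower(unsigned short c) {
    return static_cast<unsigned short>(u_tolower(static_cast<UChar32>(c)));
}

#else

// Standalone build: fall back to a built-in table. A real table would
// cover far more code points; these entries are only for illustration.
struct CaseEntry {
    unsigned short c;      // code point
    unsigned short upper;  // its uppercase form
    unsigned short lower;  // its lowercase form
};

static const CaseEntry kCaseTable[] = {
    {0x0041, 0x0041, 0x0061},  // A
    {0x0061, 0x0041, 0x0061},  // a
    {0x0150, 0x0150, 0x0151},  // O with double acute (uppercase)
    {0x0151, 0x0150, 0x0151},  // o with double acute (lowercase)
};

// Linear lookup; returns nullptr when the code point has no entry.
inline const CaseEntry* find_case_entry(unsigned short c) {
    for (const CaseEntry& e : kCaseTable) {
        if (e.c == c) return &e;
    }
    return nullptr;
}

// Code points without an entry are returned unchanged.
inline unsigned short unicode_toupper(unsigned short c) {
    const CaseEntry* e = find_case_entry(c);
    return e ? e->upper : c;
}
inline unsigned short unicode_tolower(unsigned short c) {
    const CaseEntry* e = find_case_entry(c);
    return e ? e->lower : c;
}

#endif
```

Callers only ever see unicode_toupper/unicode_tolower, so the choice between ICU and the table stays a build-time decision.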
Comment 4 caolanm 2006-11-12 14:54:45 UTC
Created attachment 40531 [details]
actually, this instead I think, bubble the language down always
Comment 5 nemeth.lacko 2006-11-13 10:30:51 UTC
Target: 2.2

Caolan: I'm very glad of your nice patch. I will put it into Hunspell 1.5 and
make a CWS. Thank you very much! Laci
Comment 6 mmeeks 2006-11-13 13:53:04 UTC
Hi Caolan, nice work :-)

OTOH - the huge memory chew we see from loading the dictionaries is prolly more
significant.

For myspell we had a nice patch: i#50842# that mmapped the spelling
dictionaries, and saved nearly 3Mb for an en-US locale.

It mostly involved some changes to the various string routines to terminate on
newline/special-character instead of '\0' - and well, we've never got around to
porting it to hunspell sadly.
Comment 7 nemeth.lacko 2006-11-14 01:45:30 UTC
Unfortunately, I couldn't use Michael's patch for Hunspell.

I plan a build-time dictionary pre-compression for OpenOffice.org.
For example, using alias compression of the integrated Hunspell, nearly 3/4 MB
RAM saved for en_US (~5.5 MB -> 4.8), 3 MB for hu_HU (17->14), and 9 MB for
Arabic (18->9).

Thomas: OOo doesn't seem to share dictionaries when I run several OOo processes
on my Linux machine. Do I need a network installation or some special parameter
to share the dictionaries between the processes? I believe you mentioned
dictionary sharing on Lingu-dev.

Comment 8 mmeeks 2006-11-14 09:36:34 UTC
Hi there,

> I plan a build-time dictionary pre-compression for OpenOffice.org.
> For example, using alias compression of the integrated Hunspell,
> nearly 3/4 MB RAM saved for en_US (~5.5 MB -> 4.8), 3 MB for
> hu_HU (17->14), and 9 MB for Arabic (18->9).

So - the main memory win for us came, not from shrinking the size of the
dictionary on disk, but from not duplicating all those strings into malloc'd
memory [ which has a substantial malloc overhead per string ].

Also - of course for thin-clients, the mmapped memory is shared, where heap
allocated memory cannot be, so we win yet more.
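
(Illustration only, not the i#50842# patch.) A minimal POSIX sketch of the idea: map the .dic file read-only and compare words in place, treating newline (or the '/' flag separator) as the terminator instead of copying each entry into a malloc'd, '\0'-terminated string. MappedFile, map_file, unmap_file and contains_word are hypothetical names for this example, and the linear scan is only for brevity; a real spell checker would index the entries.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a whole file mapped into memory.
struct MappedFile {
    const char* data = nullptr;
    size_t size = 0;
};

// Map `path` read-only. Returns false on error or for an empty file.
bool map_file(const char* path, MappedFile& out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return false;
    }
    const size_t len = static_cast<size_t>(st.st_size);
    // File-backed read-only pages can be shared by the kernel between
    // processes, unlike per-process heap copies of every string.
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return false;
    out.data = static_cast<const char*>(p);
    out.size = len;
    return true;
}

void unmap_file(MappedFile& f) {
    if (f.data) munmap(const_cast<char*>(f.data), f.size);
    f.data = nullptr;
    f.size = 0;
}

// Look for `word` among the lines of the mapping without copying them.
// An entry ends at '/', at '\n', or at the end of the mapping.
bool contains_word(const MappedFile& f, const char* word) {
    const size_t wlen = std::strlen(word);
    const char* p = f.data;
    const char* end = f.data + f.size;
    while (p < end) {
        const char* eol = static_cast<const char*>(
            std::memchr(p, '\n', static_cast<size_t>(end - p)));
        if (!eol) eol = end;
        const char* stop = static_cast<const char*>(
            std::memchr(p, '/', static_cast<size_t>(eol - p)));
        if (!stop) stop = eol;
        if (static_cast<size_t>(stop - p) == wlen &&
            std::memcmp(p, word, wlen) == 0) {
            return true;
        }
        p = (eol < end) ? eol + 1 : end;
    }
    return false;
}
```

Note that this sketch does not skip the entry count on the first line of a .dic file, so a lookup for that number would also match.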
Comment 9 tml 2006-11-14 09:53:27 UTC
I worked on the attempt to use a similar memory-mapping approach for hunspell,
as for the earlier code, but unfortunately it was much uglier. I could check if
I can find the attempt still on disk somewhere, if people are interested.
Comment 10 nemeth.lacko 2006-11-14 10:25:17 UTC
I believe the most efficient and flexible method is to generate memory
footprints (in fact, special binary data files) from the OOo dictionaries at
build time, and to use them at run time via mmap, similar to Python byte code
compilation and usage (py->pyc).

Comment 11 pavel 2007-01-15 19:24:09 UTC
any update on status?
Comment 12 nemeth.lacko 2007-01-18 09:15:20 UTC
Fixed. (I will put it in CWS hunspell2 today.)
Comment 13 nemeth.lacko 2007-03-22 22:56:01 UTC
Test: size of libhunspell.so is ~133 kB instead of 180 kB (Unicode casing table
removed), and spell checking still works with Unicode dictionaries and data.

(Attachment: Hungarian Unicode test data
Test environment: Hungarian aff and dic file from OpenOffice.org CVS
(dictionaries/hu_HU/hu_HU*) or a simple
====hu.aff====
SET UTF-8
==============

and

====hu.dic====
1
ős
==============

and add

DICT hu HU hu

to the dictionary.lst.)
Comment 14 nemeth.lacko 2007-03-22 22:58:35 UTC
Created attachment 43882 [details]
Unicode test data (to check ős->Ős casing without Hunspell's conversion table)
Comment 15 nemeth.lacko 2007-03-22 22:59:47 UTC
SBA: Thanks for your help in advance, Laci.
Comment 16 nemeth.lacko 2007-08-02 16:51:06 UTC
I will reopen this issue after the Hunspell integration, because the Windows
build doesn't work with this patch, so I have switched it off for Windows in CWS
hunspell2. According to the OpenOffice.org Wiki page on ICU
(http://wiki.services.openoffice.org/wiki/ICU), Windows seems to need special
configuration, but using ICU is not recommended.

For future developments, in comments of CWS hunspell2 Thomas has suggested to
use OOo internal Unicode functions:

> TL->Laci: The usual way to make uppercase/lowercase conversion or isAlpha test
> would be to make use of CharClass and SysLocale.
> See unotools/charclass.hxx and svtools/syslocale.hxx
> It is used like 
>   GetSysLocale().GetCharClass()....
> CharClass has all the functions you like, though usually for strings...
> ER also recommended to use those functions.

Comment 17 nemeth.lacko 2007-08-06 15:00:37 UTC
new target: 2.4
Comment 18 stefan.baltzer 2007-12-11 15:25:32 UTC
SBA: Verified in CWS hunspell2.
Comment 19 thorsten.ziehm 2009-07-20 14:52:21 UTC
This issue is closed automatically and wasn't rechecked in a current version of
OOo. The fix should have been integrated in OOo for more than half a year. If
you think this issue isn't fixed in a current version (OOo 3.1), please reopen
it and change the field 'Target Milestone' accordingly.

If you want to download a current version of OOo =>
http://download.openoffice.org/index.html
If you want to know more about the handling of fixed/verified issues =>
http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues