Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing
|Summary:||Language fingerprint for Luxembourgish (File attached)|
|Status:||CLOSED FIXED||QA Contact:||issues@lingucomponent <issues>|
|Version:||OOo 3.0 Beta|
|Target Milestone:||OOo 3.2|
|Issue Type:||PATCH||Latest Confirmation in:||---|
Description michel_w 2008-06-29 20:31:10 UTC
Hi, I've built a fingerprint file for Luxembourgish (a minority language spoken in and around Luxembourg). The file is encoded in UTF-8. It would be nice if this could make it into the OOo3 final. Regards, Michel Weimerskirch
Comment 1 michel_w 2008-06-29 20:32:15 UTC
Created attachment 54814 [details] Language fingerprint for Luxembourgish
Comment 2 thomas.lange 2008-06-30 09:04:49 UTC
TL: Sorry, it is a little bit late for OOo 3.0, since code freeze is already dead ahead and all CWS basically should go into QA by today (which is already quite late). Setting target to OOo 3.1 and taking ownership, since I probably have to do some code changes as well. TL->michel_w: Would a guessed locale with only the language part 'lb' be sufficient, or are there variants of Luxembourgish that would require us to set e.g. the country part as well?
Comment 3 michel_w 2008-06-30 09:21:14 UTC
@TL: Well, too bad... I guess I should have paid attention to the release plan. But well, better late than never ;-) As to your question: The language part "lb" is sufficient. There are no variants. Regards, Michel
Comment 4 thomas.lange 2008-09-09 13:36:09 UTC
Comment 5 thomas.lange 2009-01-20 10:53:12 UTC
As discussed with SBA, setting target to OOo 3.2, as time is already short and many CWS are still in the queue.
Comment 6 michel_w 2009-01-23 10:28:13 UTC
That's really a pity. I have implemented a proofreader for Luxembourgish using the new grammar checker API of OOo 3.0.1. It will be released during the next weeks. A spell checking dictionary is already available. I was hoping that language guessing for Luxembourgish could at least be part of OOo 3.1, as it is an integral part of the user experience for proofreading multilingual texts. Delaying that feature by a few more months is unfortunate.
Comment 7 thomas.lange 2009-01-23 10:41:52 UTC
Yes, it is. But AFAIK there are already more than 60 CWS in the queue for 3.1 and we already need to cut short on those. Thus it was decided that only serious issues should be fixed now. :-(
Comment 8 thomas.lange 2009-02-02 12:07:30 UTC
tl->michel_w: Since we missed OOo 3.1 I want to take care of this right away now. Can you attach some short Luxembourgish text sample that can be used to verify the result?
Comment 9 michel_w 2009-02-02 13:29:28 UTC
michel_w->tl: Ok, thanks. Basically any text from the Luxembourgish Wikipedia will do: http://lb.wikipedia.org I'm going to attach one that I've more-or-less randomly selected. (You said "short", so I hope it's ok).
Comment 10 michel_w 2009-02-02 13:30:39 UTC
Created attachment 59831 [details] Luxembourgish text sample (from lb.wikipedia.org)
Comment 11 thomas.lange 2009-02-03 13:28:26 UTC
Fixed in CWS tl66. Files changed:
- scp2\source\ooo\file_ooo.scp
- scp2\source\ooo\module_hidden_ooo.scp
- libtextcat\data\new_fingerprints\fpdb.conf
- libtextcat\data\new_fingerprints\lm\luxembourgish.lm
Comment 12 thomas.lange 2009-02-03 13:48:23 UTC
tl->michel_w: At first I worried a bit that the fingerprint file might not work, since it is rather different from the other ones used in OOo (see e.g. Sun\StarOffice 9\Basis\share\fingerprint\swedish.lm): it does not have the second column with the numbers. But it seems to work fine without them. Just curious: Where did you get it? Or with what tool did you create it? I'm asking because in OOo there are still some broken fingerprints that cannot be used. And maybe those can be recreated to do their work...
Comment 13 michel_w 2009-02-03 17:10:01 UTC
michel_w->tl: I created it myself using text_cat (http://odur.let.rug.nl/~vannoord/TextCat/). Text_cat is based on this paper: http://citeseer.ist.psu.edu/68861.html (Figure 3 explains the algorithm quite well). The numbers in the second column represent the number of times each n-gram appears in the original sample. They are not used by the algorithm and can thus be safely removed AFAIK (which is what I did).
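For readers unfamiliar with the approach, here is a rough Python sketch of rank-based n-gram language guessing in the spirit of the paper mentioned above. It is not the text_cat or libtextcat code: the function names (`ngrams`, `fingerprint`, `out_of_place`, `guess`), the `_` padding, the 1..5 n-gram range and the cutoff of 400 entries are all assumptions made for illustration. It may help show why the counts column can be dropped: a profile is stored only as an ordered list, and the comparison looks at rank positions.

```python
from collections import Counter


def ngrams(text, max_n=5):
    """Yield character n-grams (length 1..max_n) of each whitespace-separated
    token, with the token padded by '_' on both sides (an assumption here)."""
    for token in text.split():
        padded = "_" + token + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def fingerprint(text, top=400):
    """Return the `top` most frequent n-grams, most frequent first.
    Counts are used only for sorting and are then discarded."""
    counts = Counter(ngrams(text))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, lang_fp):
    """Sum of rank differences between two fingerprints; n-grams missing
    from the language profile get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_fp)}
    max_penalty = len(lang_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )


def guess(text, profiles):
    """Return the name of the profile with the smallest distance to `text`.
    `profiles` maps a language name to a fingerprint list."""
    doc_fp = fingerprint(text)
    return min(profiles, key=lambda name: out_of_place(doc_fp, profiles[name]))
```

In this sketch a profile built with `fingerprint(sample_text)` could be written out one n-gram per line, in rank order, with no second column, and `guess` would still behave the same when the list is read back.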
Comment 14 thomas.lange 2009-05-05 06:31:57 UTC
Comment 15 stefan.baltzer 2009-05-07 09:38:02 UTC
Verified in CWS tl66.