Issue 43583

Summary: RFE: Spellchecker API doesn't appropriate for languages without any space between words like Thai
Product: App Dev
Reporter: samphan
Component: api
Assignee: AOO issues mailing list <issues>
Status: CONFIRMED ---
QA Contact:
Severity: Trivial
Priority: P3
CC: arthit, davidf, eleonora46, hin.stone, issues, jjc, markpeak, nusorn, tantai
Version: 3.3.0 or older (OOo)
Keywords: oooqa
Target Milestone: ---
Hardware: All
OS: All
Issue Type: FEATURE
Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on:
Issue Blocks: 41707

Description samphan 2005-02-26 14:17:53 UTC
Thai text has no spaces between words. This often breaks assumptions built into
existing code, and the API com.sun.star.linguistic2.XSpellChecker is one example.
XSpellChecker is the interface for spell checking. It has two methods :-

bool isValid(string aWord, Locale, aProperties);
XSpellAlternatives spell(string aWord, Locale, aProperties);

see
http://api.openoffice.org/docs/common/ref/com/sun/star/linguistic2/XSpellChecker.html

Normally isValid() is used to check whether a word is correctly spelled. If the
word is misspelled, spell() is then used to get suggested spelling alternatives
from the dictionaries. This works as long as you get the boundary of the
spelling error right, which is the case for Western text. However, for text in
languages without spaces between words, such as Thai, it is usually impossible
to know the boundary of the spelling error correctly before trying to find
suggestions.

I'll use English written without spaces between words as an example :-

Theytrytomanufacturetables. (They try to manufacture tables)
	Suppose the word 'manufacture' is misspelled as 'manifacture' :-
Theytrytomanifacturetables.
	The word breaker (the one that works with Thai) will break the words like
this :-
They|try|to|man|ifacture|tables|.
	Each word will pass isValid() until 'ifacture', which will be flagged as
misspelled. Calling spell() on it may return 'facture' as a suggestion, which
is incorrect! Accepting that suggestion gives :-
Theytrytomanfacturetables. (They try to manfacture tables).

This is because XSpellChecker doesn't have the whole picture of the input
string. It only sees one segmented word (sent by the ICU word breaker) at a
time, so it cannot do any better.
To make the situation worse, ICU DictionaryBasedBreakIterator uses a dictionary
for Thai which will be different from the one used by the Thai spellchecker.
(Thai spellcheckers have been implemented in Thai versions of OOo like
OpenOfficeTLE and Pladao but the quality is not good enough to be usable because
of the issues mentioned here).
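
The failure described above can be reproduced with a toy model. The following is
a minimal sketch, not ICU's actual algorithm: it assumes a greedy longest-match
breaker, a tiny hand-picked word list, and lowercase ASCII input. The names
Segment, longestMatch and segment are hypothetical and exist only for this
illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One piece of the broken text; 'known' is false when no dictionary word matched.
struct Segment { std::string text; bool known; };

// Length of the longest dictionary word starting at text[pos], or 0 if none.
std::size_t longestMatch(const std::string& text, std::size_t pos,
                         const std::set<std::string>& dict)
{
    assert(pos < text.size());
    for (std::size_t len = text.size() - pos; len > 0; --len)
        if (dict.count(text.substr(pos, len)))
            return len;
    return 0;
}

// Greedy dictionary-based breaker (assumed behaviour, for illustration only).
std::vector<Segment> segment(const std::string& text,
                             const std::set<std::string>& dict)
{
    std::vector<Segment> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t len = longestMatch(text, pos, dict);
        if (len > 0) {
            out.push_back(Segment{text.substr(pos, len), true});
            pos += len;
        } else {
            // Unknown run: extend it until some dictionary word can start again.
            const std::size_t start = pos;
            do { ++pos; } while (pos < text.size() && longestMatch(text, pos, dict) == 0);
            out.push_back(Segment{text.substr(start, pos - start), false});
        }
    }
    return out;
}
```

With the word list {they, try, to, man, manufacture, tables}, the input
"theytrytomanifacturetables" comes out as they|try|to|man|ifacture|tables, so a
per-word spell checker only ever gets to see "ifacture".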


How to do it appropriately for languages without spaces between words :-
1) send the whole string to the spellchecker
2) have a function that iterates over the misspelled words in the string

	For example :-

class XSpellChecker2
{
	XSpellAlternatives* spell(const OUString& string, int start, const Locale&,
aProperties, int& errorBegin, int& errorEnd);
}

	Then you can loop through the spelling errors like this :-

int begin = 0, end = 0, i = 0;
while( i < str.getLength() )
{
	XSpellAlternatives* xAlt = spellchecker2.spell(str, i, locale, emptyProps,
begin, end);
	if (xAlt == NULL)
		break;	// assumed: NULL means no further error after position i
	OUString spellingError = str.copy(begin, end-begin);
	DisplayAlternatives(spellingError, xAlt);
	i = end;	// continue right after the reported error
}


How this solves the problem for Thai :-
	This is a possible algorithm for Thai.

They|try|to|man<ifacture>tables|.

	XSpellChecker2::spell() for Thai will iterate thru the correct words until
'ifacture'. Then it will try to find the suggestions for :-
- 'ifacture', found 'facture'
- then plus one word before, 'manifacture' found 'manufacture'
- then plus one word after, 'ifacturetables' - not found
- then plus one word before and after, 'manifacturetables' - not found
	The algorithm will select the result with the longest misspelling -
'manifacture' and suggest 'manufacture', correctly.
e.g.
xAlt = spellchecker2.spell("Theytrytomanifacturetables", 0, locale, emptyProps,
begin, end);
// begin = 9, end = 20 (end is exclusive), xAlt contains 'manufacture'
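
The expansion steps in the English walk-through can be sketched as follows. This
is only an illustration under stated assumptions: the text has already been
broken into segments, the suggester is a toy that accepts any dictionary word
within one edit, and all names (Segment, SpellError, withinOneEdit, suggest,
findError) are hypothetical rather than part of any OOo or MySpell API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Segment { std::string text; bool known; };   // output of a word breaker
struct SpellError { std::size_t begin, end; std::string suggestion; };  // end is exclusive

// True if a and b differ by at most one substituted, inserted or deleted character.
bool withinOneEdit(const std::string& a, const std::string& b)
{
    if (a.size() > b.size()) return withinOneEdit(b, a);   // make a the shorter one
    if (b.size() - a.size() > 1) return false;
    std::size_t i = 0, j = 0;
    bool edited = false;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j]) { ++i; ++j; continue; }
        if (edited) return false;
        edited = true;
        if (a.size() == b.size()) ++i;   // substitution consumes a char of a as well
        ++j;                             // otherwise skip the extra char in b
    }
    return true;
}

// Toy suggester: first dictionary word within one edit of 'word', or "" if none.
std::string suggest(const std::string& word, const std::set<std::string>& dict)
{
    for (const std::string& entry : dict)
        if (entry != word && withinOneEdit(entry, word))
            return entry;
    return "";
}

// Finds the first unknown segment, then tries it alone, with the previous word,
// with the next word, and with both. The longest candidate that yields a
// suggestion wins. Returns false if every segment is known.
bool findError(const std::vector<Segment>& segs, const std::set<std::string>& dict,
               SpellError& out)
{
    std::vector<std::size_t> starts;     // offset of each segment in the text
    std::size_t offset = 0;
    for (const Segment& s : segs) { starts.push_back(offset); offset += s.text.size(); }
    assert(starts.size() == segs.size());

    for (std::size_t k = 0; k < segs.size(); ++k) {
        if (segs[k].known) continue;
        const std::size_t lo = (k > 0) ? k - 1 : k;
        const std::size_t hi = (k + 1 < segs.size()) ? k + 1 : k;
        // Fallback: report the bare unknown segment without a suggestion.
        out = SpellError{starts[k], starts[k] + segs[k].text.size(), ""};
        std::size_t bestLen = 0;
        for (std::size_t first = lo; first <= k; ++first) {
            for (std::size_t last = k; last <= hi; ++last) {
                std::string cand;
                for (std::size_t i = first; i <= last; ++i) cand += segs[i].text;
                const std::string s = suggest(cand, dict);
                if (!s.empty() && cand.size() > bestLen) {
                    bestLen = cand.size();
                    out = SpellError{starts[first], starts[first] + cand.size(), s};
                }
            }
        }
        return true;
    }
    return false;
}
```

For the segmentation they|try|to|man|ifacture|tables and a dictionary that
contains both "facture" and "manufacture", the sketch reports the span covering
"manifacture" with the suggestion "manufacture" instead of the shorter
"ifacture". Offsets here count bytes of a std::string, which matches characters
only for ASCII; a Thai implementation would have to count in OUString units.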

An example in Thai:-
ใช้คอมพวเตอร์ได้
	The word คอมพิวเตอร์ is misspelled as คอมพวเตอร์. However คอ is a word in Thai.
	Segmented as :-
ใช้|คอ<มพวเตอร์>ได้

Old algorithm :- 
	flag 'มพวเตอร์' as misspelling, find no suggestion.
New algorithm :-
- Word breaking found that the segment "มพวเตอร์" is not a Thai word
- try "มพวเตอร์", fail
- try "คอมพวเตอร์" suggest "คอมพิวเตอร์"
- try "มพวเตอร์ได้", fail
- try "คอมพวเตอร์ได้", fail
- Found that misspelling is "คอมพวเตอร์", suggest "คอมพิวเตอร์"
Comment 1 arthit 2005-02-28 19:42:03 UTC
confirmed.
Comment 2 jf6386 2005-03-01 13:55:57 UTC
Hi,
I think this problem occurs not only in oriental languages (Thai,
Chinese, ...) but probably in all languages that join words, e.g.
German.
Comment 3 eleonora 2005-03-02 08:47:46 UTC
For languages that do use spaces as word boundaries but also form compound
words, collecting the whole vocabulary is the right way. See
http://tkltrans.sourceforge.net/tklspell/compound.htm for an explanation.
That study may also show some pitfalls of the algorithm suggested here. I
admit I have no better solution for Thai than the one Samphan suggests.

I think that for languages where writing is more word based, like Chinese or
Japanese, Samphan's solution is quite good. For languages where writing is
more character based, there are some serious pitfalls, and there Samphan's
suggested solution is not very good.
Eleonora
Comment 4 eleonora 2005-03-02 12:40:14 UTC
Some thoughts.
English examples:
She arab bin ary
she a rabbi nary
shea rabby nary
shear abby nary   --- abby could be wrong

There are innumerable possibilities for wrong
and good words in a simple sentence.
Where should the loop continue?
after she? shea? shear?
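
To make the ambiguity concrete, here is a small sketch that lists every way a
string can be split entirely into dictionary words. It assumes lowercase ASCII
input and a made-up word list; enumerate and allSegmentations are hypothetical
helper names, not part of any spell checker.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Depth-first search over all dictionary words that can start at 'pos'.
// 'current' holds the words chosen so far, separated by spaces.
void enumerate(const std::string& text, std::size_t pos,
               const std::set<std::string>& dict,
               const std::string& current, std::vector<std::string>& out)
{
    assert(pos <= text.size());
    if (pos == text.size()) {            // consumed the whole text: one full reading
        out.push_back(current);
        return;
    }
    for (std::size_t len = 1; pos + len <= text.size(); ++len) {
        const std::string word = text.substr(pos, len);
        if (dict.count(word))
            enumerate(text, pos + len, dict,
                      current.empty() ? word : current + " " + word, out);
    }
}

// All complete segmentations of 'text', shortest first word first.
std::vector<std::string> allSegmentations(const std::string& text,
                                          const std::set<std::string>& dict)
{
    std::vector<std::string> out;
    enumerate(text, 0, dict, "", out);
    return out;
}
```

With the word list {she, shea, shear, a, arab, bin, binary, rabbi, nary}, the
string "shearabbinary" already has three complete readings: "she a rabbi nary",
"she arab binary" and "shea rabbi nary". A checker that resumes after the first
accepted word has no way to tell which of them the writer meant.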
Comment 5 eleonora 2005-03-03 11:11:20 UTC
Samphan,
This is not a standard spell checking problem. Besides the standard dic/aff
word files, you must also provide a sentence-to-word-breaking program or
subroutine that breaks each sentence down into words. The spell checker can
then check the words according to the standard rules and offer replacements,
if any. Your word breaking algorithm must use a dictionary that enables it to
break the sentence into words, and perhaps a POS (part-of-speech) tagger to
help find the optimal segmentation. The final flow would look like this:
myspell gets the sentence - myspell calls your breaker program/subroutine -
myspell checks the sentence, now broken into words, and passes back the
suggestions, if any.
myspell gets the next sentence, etc.
The only support you can expect from the spell checker is an interface to
your sentence breaking program/subroutine.
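
The flow described in this comment might look roughly like the sketch below. It
is not MySpell's interface: Breaker, Misspelling and checkSentence are
hypothetical names, the dictionary is a plain word set, and the breaker is
assumed to return words that tile the sentence without gaps.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// A misspelled word and its offset in the sentence.
struct Misspelling { std::size_t begin; std::string word; };

// Pluggable sentence-to-word breaker supplied by the language module (assumed).
typedef std::function<std::vector<std::string>(const std::string&)> Breaker;

// The checker hands the sentence to the breaker, then checks each word in turn.
std::vector<Misspelling> checkSentence(const std::string& sentence,
                                       const Breaker& breaker,
                                       const std::set<std::string>& dict)
{
    std::vector<Misspelling> errors;
    std::size_t offset = 0;
    for (const std::string& word : breaker(sentence)) {
        // The words must appear in order and cover the sentence.
        assert(sentence.compare(offset, word.size(), word) == 0);
        if (dict.count(word) == 0)
            errors.push_back(Misspelling{offset, word});
        offset += word.size();
    }
    return errors;
}
```

Note that this design still inherits whatever boundaries the breaker chooses, so
a breaker that yields man|ifacture makes the checker report "ifacture".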
Comment 6 eleonora 2005-03-03 12:51:08 UTC
Samphan, if and when you use the ICU word breaker, you had better make sure
that it uses the same dictionary as myspell. This would reduce problems like
the They|try|to|man|ifacture|tables| case you mention. Since ICU is an open
source project, this should be possible.
Comment 7 nemeth.lacko 2005-03-03 13:04:50 UTC
I suggest extending or replacing the ICU DictionaryBasedBreakIterator with a
more sophisticated (Thai)POSTaggerAndSpellCheckerSuggestionBasedBreakIterator.
I think this is not a spell checking problem, but better word breaking also
needs the help of the spell checker's suggestion mechanism.

But I can imagine a simple (half) solution too. Modify, extend or replace the
ICU DictionaryBasedBreakIterator so that, for Thai text, it joins an unknown
word with its known neighbours:

Theytrytomanifacturetables ->
(They|try|to|man|ifacture|tables|)->
They|try|to|manifacturetables|

MySpell checks "manifacturetables" as a compound word and suggests
"manufacturetables". Not so pretty, but it works. You don't need
new APIs.

(Perhaps it's not a right solution, because it modifies the
hyphenation or type-setting. I don't know.)
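
The joining step of this half solution could be sketched as below. This is an
assumption about how such a break iterator might post-process its output, not
ICU code; Segment and mergeUnknownWithNeighbours are hypothetical names, and
groups that touch each other are simply coalesced into one chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Segment { std::string text; bool known; };   // output of a word breaker

// Glue every unknown segment to the segment before and after it.
std::vector<std::string> mergeUnknownWithNeighbours(const std::vector<Segment>& segs)
{
    // Mark each unknown segment and its immediate neighbours for merging.
    std::vector<bool> merge(segs.size(), false);
    for (std::size_t i = 0; i < segs.size(); ++i) {
        if (segs[i].known) continue;
        merge[i] = true;
        if (i > 0) merge[i - 1] = true;
        if (i + 1 < segs.size()) merge[i + 1] = true;
    }
    // Concatenate consecutive marked segments; copy the rest unchanged.
    std::vector<std::string> out;
    for (std::size_t i = 0; i < segs.size(); ++i) {
        if (merge[i] && i > 0 && merge[i - 1]) {
            assert(!out.empty());
            out.back() += segs[i].text;
        } else {
            out.push_back(segs[i].text);
        }
    }
    return out;
}
```

Applied to they|try|to|man|ifacture|tables, this yields
they|try|to|manifacturetables, which a compound-aware checker can then examine
as a single word.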

Laci
Comment 8 nemeth.lacko 2005-03-03 18:54:56 UTC
> (Perhaps it's not a right solution, because it modifies the
> hyphenation or type-setting. I don't know.)

This is not a problem once you correct the word or add the unknown word to the
custom dictionary, because the break iterator then recomputes the word
boundaries. But at first this is not so trivial for users.

Laci


Comment 9 Martin Hollmichel 2005-05-22 07:30:52 UTC
set target to OOo Later.
Comment 10 falko.tesch 2005-10-20 20:57:49 UTC
FT: I'm leaving so I will re-assign this issue to requirement default user
Comment 11 ace_dent 2008-05-16 01:42:33 UTC
OpenOffice.org Issue Tracker - Feedback Request.

The Issue you raised is currently assigned to 'Requirements' pending review, but
has not been updated within the last 2+ years. Please consider re-testing with
one of the latest versions of OOo, as the problem(s) may have already been
addressed. Either use the recent stable version:
http://download.openoffice.org/index.html
or consider trying the new OOo 3 BETA (still in testing):
http://download.openoffice.org/3.0beta/
 
Please report back the outcome so this Issue may be Closed or Progressed as
necessary - otherwise it may be Resolved as Invalid in the future. You may also
wish to search for (and note) any duplicates of this Issue that may have
advanced further by checking the Issue Tracker:
http://www.openoffice.org/issues/query.cgi
 
Many thanks,
Andrew
 
Cleaning-up and Closing old Issues as part of:
~ The Grand Bug Squash, pre v3 ~
http://marketing.openoffice.org/3.0/announcementbeta.html
Comment 12 samphan 2009-08-10 08:18:02 UTC
Confirming that the issue is still valid.
Comment 13 sungkhum 2011-02-06 14:18:20 UTC
Just started working on a solution for Khmer word breaking with the friendly
folks at ICU - hopefully there will be some good progress!