Issue 103402

Summary: need to skip diacritics in Hebrew spellchecking
Product: General
Reporter: alan
Component: spell checking
Assignee: AOO issues mailing list <issues>
Status: CONFIRMED ---
QA Contact:
Severity: Trivial
Priority: P3
CC: amiadb, elisko, issues, kaplanlior, nemeth.lacko, okhayat, yba
Version: 3.3.0 or older (OOo)   
Target Milestone: ---   
Hardware: Unknown   
OS: All   
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description                    Flags
proposed patch                 none
revised - changed a < to <=    none

Description alan 2009-07-08 07:18:20 UTC
Hebrew is usually written without diacritics. However, sometimes the diacritics
are written as special marks located within, above, or below consonants. The
diacritics are represented internally as separate Unicode characters. Hebrew
dictionaries check for words without diacritics and will continue to do so for
the foreseeable future.

This patch filters the diacritics out of a word before spellchecking it.
(Using breakiterator is not appropriate, since we don't want word-breaking at
the diacritics.)

I don't know whether this functionality is needed for other languages as well,
perhaps Arabic or Persian, or maybe some LTR languages. The patch is written in
a generalized way, so that adding a language is fairly easy:

1) add another "case LANGUAGE_WHATEVER" to the "switch (nLanguage)" statement,
and create a string with the diacritics to be skipped 

2) add "|| nLanguage == LANGUAGE_WHATEVER" to the assignments of the boolean
variables
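
To illustrate the approach described above, here is a minimal, self-contained sketch of the idea. It is not the code from the attached patch. The Lang enum and the function names diacriticsFor and stripDiacritics are hypothetical stand-ins (the patch itself switches on nLanguage and the LANGUAGE_* constants), and the Hebrew list is an assumption that covers only the vowel points U+05B0..U+05BD, U+05BF, U+05C1, U+05C2 and U+05C7, not cantillation marks.

```cpp
#include <string>

// Hypothetical language identifiers, used only for this sketch.
// The real patch switches on nLanguage (LANGUAGE_* constants).
enum class Lang { Hebrew, English };

// Step 1 of the description: one "case" per language, each returning
// a string with the diacritics to be skipped. An empty string means
// "nothing to skip for this language".
std::u16string diacriticsFor(Lang lang)
{
    switch (lang)
    {
        case Lang::Hebrew:
            // Assumed list: Hebrew vowel points (niqqud) only.
            return u"\u05B0\u05B1\u05B2\u05B3\u05B4\u05B5\u05B6"
                   u"\u05B7\u05B8\u05B9\u05BA\u05BB\u05BC\u05BD"
                   u"\u05BF\u05C1\u05C2\u05C7";
        default:
            return std::u16string();
    }
}

// Copy the word, dropping every character listed for the language.
// The result is what would be handed to the spell checker.
std::u16string stripDiacritics(const std::u16string& word, Lang lang)
{
    const std::u16string skip = diacriticsFor(lang);
    if (skip.empty())
        return word;

    std::u16string result;
    result.reserve(word.size());
    for (char16_t c : word)
    {
        if (skip.find(c) == std::u16string::npos)
            result.push_back(c);
    }
    return result;
}
```

With this shape, supporting another language amounts to adding one more case with its own string of characters; a word in a language without an entry is passed through unchanged.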
Comment 1 alan 2009-07-08 07:19:35 UTC
Created attachment 63421 [details]
proposed patch
Comment 2 alan 2009-07-08 07:35:24 UTC
Created attachment 63425 [details]
revised - changed a < to <=
Comment 3 elisko 2009-07-08 10:34:58 UTC
As I understand it, Sanskrit-based languages such as Hindi also employ diacritics.
Comment 4 kaplanlior 2010-08-14 19:02:09 UTC
#99796 has a very similar problem; I think the two should be fixed together
(probably with the same code). Note that it is not the same problem, just a
similar one.
Comment 5 thomas.lange 2010-08-18 07:45:17 UTC
taking ownership as well.

tl->ayaniger: If you provide patches for the linguistic module, please assign
them directly to me. If by bad luck I don't see them on the issues ML and nobody
else assigns them to me, they will just loiter around, probably until someone
makes a new comment and they appear on the ML once more.

tl->nemeth: won't it be possible to take care of this in the spell check
dictionary or hunspell itself? I'm just asking because removing them in the
SpellCheckerDispatcher will have the following two side effects:

a) the replacement word will probably also not provide diacritics which may look
somewhat odd if all the surrounding text is using them.

b) if there ever were another spell checker implementation for Hebrew that could
properly work with diacritics and provide them in replacements as well, then the
patch will effectively suppress that feature. 

Thus I'm a little hesitant until I'm told that this patch really has to be the
solution to take.
Comment 6 kaplanlior 2010-08-21 18:16:21 UTC
#51772 also has a very similar problem; I think the two should be fixed together
(probably with the same code). Note that it is not the same problem, just a
similar one.
Comment 7 Martin Hollmichel 2011-03-16 11:56:13 UTC
Set target to 3.x; not relevant for the 3.4 release.
Comment 8 Rob Weir 2013-03-11 15:01:35 UTC
I'm adding this comment to all open issues with Issue Type == PATCH.  We have 220 such issues, many of them quite old.  I apologize for that.  

We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0.

If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know.

On the other hand, if the patch is no longer relevant, please let us know that as well.

If you have any general questions or want to discuss this further, please send a note to our dev mailing list:  dev@openoffice.apache.org

Thanks!

-Rob
Comment 9 kaplanlior 2013-03-12 11:44:15 UTC
The patch is Hebrew-specific; I think it should be more general.