Issue 51661

Summary: Quote marks in 2.0 Hebrew wordbreaking
Product: Internationalization Reporter: alan
Component: code Assignee: stefan.baltzer
Status: CLOSED FIXED QA Contact: issues@l10n <issues>
Severity: Trivial    
Priority: P3 CC: issues, ooo, yba
Version: OOo 2.0 Beta   
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: PATCH Latest Confirmation in: ---
Developer Difficulty: ---
Description Flags
Changes to existing breakiterator code
New file rules for Hebrew wordbreaking
Hebrew test file withquotes for Karl
Sample doc - note misspelled words broken by a double-quote
9 different quote characters in Hebrew words
Hebrew dictionary files and dictionary.lst none

Description alan 2005-07-07 09:46:24 UTC
Hebrew wordbreaking in 1.1 does not see a double quote mark as the end of a
word. This is correct behavior. In 2.0 beta m104, this does not happen, and a
quote mark *is* seen as a word-breaker.

See the thread at:
for details.

I have tried to integrate my changes to 1.1 into m104, but unsuccessfully. I'm
posting the changes I made to m104, so that others can examine them, and find
out what's wrong or missing.
Comment 1 alan 2005-07-07 09:49:23 UTC
Created attachment 27755 [details]
Changes to existing breakiterator code
Comment 2 alan 2005-07-07 09:50:35 UTC
Created attachment 27756 [details]
New file rules for Hebrew wordbreaking
Comment 3 ooo 2005-07-07 12:16:47 UTC
Grabbing issue.
Comment 4 ooo 2005-09-05 18:09:36 UTC
Hi Karl,

As I won't find the time to dive into this in the next days/weeks, could you please
have a look at this one? See also the mails on the dev@l10n list of the thread
starting with the message mentioned above. If there is an easy solution, just
add it to my CWS locales201.

Comment 5 karl.hong 2005-09-23 01:03:08 UTC
I assume that both edit and dictionary modes need to treat the double quote as
part of the word. I created two files, edit_word_he.txt and dict_word_he.txt.

I tested it on another language. If someone could upload a Hebrew file with
double quotes for me to test, that would be great. Thanks in advance.
Comment 6 alan 2005-09-23 09:11:18 UTC
Created attachment 29837 [details]
Hebrew test file withquotes for Karl
Comment 7 karl.hong 2005-09-23 19:13:08 UTC
Thanks, Alan. 

Ready for QA.

re-open issue and reassign to
Comment 8 karl.hong 2005-09-23 19:13:14 UTC
reassign to
Comment 9 karl.hong 2005-09-23 19:13:25 UTC
reset resolution to FIXED
Comment 10 oc 2005-09-26 15:50:50 UTC
Hi Stefan, please take over

re-open issue and reassign to
Comment 11 oc 2005-09-26 15:51:02 UTC
reassign to
Comment 12 oc 2005-09-26 15:51:13 UTC
reassign to
Comment 13 oc 2005-09-26 15:51:21 UTC
reset resolution to FIXED
Comment 14 alan 2005-09-28 12:53:25 UTC
Karl, there still seems to be a problem when I try to spellcheck the sample
document. I'm attaching a screenshot. Note that in the 6th and 7th lines, toward
the left, there are two identical words, one above the other, that are marked as
misspelled. Those words have double-quotes in the middle, but the red line stops
at the double-quote. It should continue past the quote, to the end of the word.
The same problem exists in the text in the top-right cell of the table. Also in
the left cell of the table's second row.
Comment 15 alan 2005-09-28 12:56:52 UTC
Created attachment 29960 [details]
Sample doc - note misspelled words broken by a double-quote
Comment 16 stefan.baltzer 2005-10-06 17:07:08 UTC
SBA->ayaninger: When I compare the CWS build and an OOo installation WITHOUT
this break iterator patch, I see no difference in treatment of quotes. 
I will attach a document with a couple of "quotes" (single and double). Their
Unicode IDs are 2018, 2019, 201B, 201C, 201E, 05F2, 05F4, 05D9, 05F3. 

Please comment
(1) which ones should be treated as "character" and which ones as "quote" (=word
(2) Which ones are commonly used (= can be inserted directly) when typing Hebrew? 

Consequently (difference=none), I must regard this issue as "not fixed".
-> Back to NEW and reassigned to Karl.
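
To help answer questions (1) and (2), the Unicode name and general category of each listed code point can be looked up. The snippet below is only a sketch using Python's standard `unicodedata` module; it is not part of the patch. The helper name `describe` is hypothetical, and 0022 (the plain ASCII double quote from the issue description) is added to the list as an assumption about what is relevant here.

```python
import unicodedata

# Code points listed in this comment, plus U+0022 (ASCII double quote),
# which is added here as an assumption about what is relevant.
CODE_POINTS = [0x2018, 0x2019, 0x201B, 0x201C, 0x201E,
               0x05F2, 0x05F4, 0x05D9, 0x05F3, 0x0022]

def describe(cp):
    """Return (U+XXXX label, general category, Unicode name) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))

if __name__ == "__main__":
    for cp in CODE_POINTS:
        print(*describe(cp))
```

A category starting with "L" marks a letter, while one starting with "P" marks punctuation. How punctuation inside a word is treated is still decided by the break iterator rules.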

re-open issue and reassign to
Comment 17 stefan.baltzer 2005-10-06 17:07:31 UTC
reassign to
Comment 18 stefan.baltzer 2005-10-06 17:07:38 UTC
reset resolution to FIXED
Comment 19 stefan.baltzer 2005-10-06 17:09:33 UTC
Comment 20 stefan.baltzer 2005-10-06 17:14:46 UTC
Created attachment 30187 [details]
9 different quote characters in Hebrew words
Comment 21 karl.hong 2005-10-06 18:46:29 UTC
Karl->SBA: None of your quotes is what they want. They want the English, or
ASCII, double quote (0022). You can see it as $MidLetter in the attachment "New
file rules for Hebrew wordbreaking".

I made both word type modes, dictionary and edit, take (0022) as a mid letter.
In Alan's attached document, HebrewQuoteTest.odt, when you do word travel
(Ctrl+Arrow key), you will see that (") is part of a word as a mid letter.

Karl->Alan: I don't have a Hebrew spellchecker, so I could not see what you see
in your screenshot. To test word breaking in the spellchecker, which uses
DICTIONARY_WORD mode, here is a StarBasic program. You can change it to a
different language and get a different word boundary:

Sub Main
dim lo as new

bd=bi.getWordBoundary(st, 0, lo, ty, TRUE)
print st, bd.startPos, bd.endPos

bd=bi.getWordBoundary(st, 0, lo, ty, TRUE)
print st, bd.startPos, bd.endPos

End Sub
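
The mid-letter behavior described in this comment can also be approximated outside the office suite. The following is only a simplified Python sketch, not the ICU rule file or the break iterator code: the function name `word_boundary` and its `mid_letters` parameter are hypothetical, and it assumes that a word is a run of alphabetic characters and that a mid letter counts as part of the word only when it has a letter on both sides.

```python
def word_boundary(text, pos, mid_letters=frozenset()):
    """Return (start, end) of the word containing text[pos].

    Simplified model (an assumption, not the real break iterator):
    letters form words, and a character in mid_letters stays inside
    the word only if it is directly surrounded by letters.
    """
    def in_word(i):
        ch = text[i]
        if ch.isalpha():
            return True
        return (ch in mid_letters
                and 0 < i < len(text) - 1
                and text[i - 1].isalpha()
                and text[i + 1].isalpha())

    if not in_word(pos):
        # Not inside a word: report the single character at pos.
        return (pos, pos + 1)
    start = pos
    while start > 0 and in_word(start - 1):
        start -= 1
    end = pos + 1
    while end < len(text) and in_word(end):
        end += 1
    return (start, end)


# Hebrew abbreviation: TSADI, HE, ASCII double quote, LAMED, then a second word.
SAMPLE = "\u05E6\u05D4\"\u05DC \u05D4\u05D2\u05D9\u05E2"

if __name__ == "__main__":
    print(word_boundary(SAMPLE, 0))                     # quote breaks the word
    print(word_boundary(SAMPLE, 0, frozenset('"')))     # quote is a mid letter
```

With the quote in `mid_letters`, the boundary for position 0 covers the whole abbreviation. Without it, the word ends at the quote, which matches the spell-checking symptom reported in this issue.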

Comment 22 alan 2005-10-07 12:02:05 UTC
Yes, Karl is correct, we are referring to the English ASCII double-quote.
However, it would be proper to treat all the other characters you listed in the
same way, as "characters", and not as word-breakers. Take a look at Jonathan
Ben-Avraham's background explanation, which I quoted in my comments to Issue 51772.

Word travel using Ctrl-<Arrow> does jump over the quote marks. I ran your
StarBasic program, and saw the results, which also show that the quote marks do
not break the word. Nevertheless, in spell checking the word is broken at the
quote marks. I am attaching Hebrew dictionaries and dictionary.lst, which you
can install in share/dict/ooo, so you can take a look.
Comment 23 alan 2005-10-07 12:05:12 UTC
Created attachment 30203 [details]
Hebrew dictionary files and dictionary.lst
Comment 24 stefan.baltzer 2005-10-11 11:45:32 UTC
SBA: I corrected the status to "Fixed". Thomas Lange is digging a little into
Karl's code in order to find out why the Hebrew spellchecker is not accepting the
entire word (with ASCII 0022) while cursor travelling behaves like "this is one
word". The outcome will probably lead to another issue that will not be fixed
within this CWS.
Comment 25 stefan.baltzer 2005-10-12 15:28:42 UTC
SBA: Verified in CWS i18n20.
Follow up is issue 51772.
Comment 26 stefan.baltzer 2006-03-22 17:17:52 UTC
SBA: OK in Master (and still OK in OOo 2.02).