Issue 85411 - ZWSP is no longer considered as a word separation character for spellchecking
Summary: ZWSP is no longer considered as a word separation character for spellchecking
Status: CLOSED FIXED
Alias: None
Product: Internationalization
Classification: Code
Component: code
Version: OOo 2.3.1 RC1
Hardware: All All
Importance: P3 Trivial
Target Milestone: ---
Assignee: stefan.baltzer
QA Contact: issues@l10n
URL:
Keywords:
Depends on:
Blocks: 88888
 
Reported: 2008-01-20 12:49 UTC by lists
Modified: 2013-08-07 15:02 UTC
CC List: 7 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Example in 2.1 (24.23 KB, image/png)
2008-01-20 12:50 UTC, lists
Example in OOo 2.4 (29.06 KB, image/png)
2008-01-20 12:51 UTC, lists
Patch to change ZWSP to be a spacing character (6.71 KB, patch)
2008-07-13 13:50 UTC, lists
Patch for ICU 3.6 to add ZWSP as a word boundary (9.77 KB, patch)
2008-07-22 14:16 UTC, lists

Description lists 2008-01-20 12:49:42 UTC
Up to version 2.1 of OpenOffice, the ZWSP character (ZERO WIDTH SPACE, U+200B) was
usable as a word separation character for purposes of line-breaking and
spell-checking. It is used in languages that do not use spaces to separate words
(Lao, Khmer, Burmese).

In OpenOffice 2.4 it no longer works for spell-checking. Words separated by a
ZWSP are considered as one word. It works correctly for line-breaking.

Nemeth Lazlo mentioned on the list that this is probably a tokenization problem
inside the ICU library, as ICU is the one that divides the text for analysis by
Hunspell.

The issue is probably related to the upgrade to the new version of ICU after OOo 2.1

I am attaching two pictures, one for OOo 2.1 and one for 2.4. They show the same
texts, in Khmer and English, separated by ZWSP and separated by spaces. In the
2.4 picture we can see that when the words are separated with ZWSP, the text is
flagged as incorrect. In the 2.1 picture we can see that the text is considered
correct in both cases (spaces and ZWSP).
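
The symptom described above can be illustrated with a minimal sketch. It is not OOo or ICU code: `split_words` is a hypothetical helper, and the regular expressions only stand in for the real break iterator. It shows how the same text yields one token or two depending on whether ZWSP counts as a separator.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def split_words(text, zwsp_separates=True):
    """Split text into words on whitespace, optionally also on ZWSP.

    Hypothetical helper for illustration only; it is not the OOo tokenizer.
    """
    # Python's \s does not match U+200B, so it has to be listed explicitly.
    pattern = r"[\s\u200b]+" if zwsp_separates else r"\s+"
    return [word for word in re.split(pattern, text) if word]

sample = "hello" + ZWSP + "world"

# ZWSP treated as a separator: two words reach the spellchecker.
two_words = split_words(sample, zwsp_separates=True)

# ZWSP not treated as a separator: one long "word" reaches the spellchecker.
one_word = split_words(sample, zwsp_separates=False)
```

With `zwsp_separates=False` the spellchecker would receive the whole run as a single unknown word, which matches what the 2.4 picture shows.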
Comment 1 lists 2008-01-20 12:50:41 UTC
Created attachment 51022 [details]
Example in 2.1
Comment 2 lists 2008-01-20 12:51:16 UTC
Created attachment 51023 [details]
Example in OOo 2.4
Comment 3 eric.savary 2008-01-21 09:50:23 UTC
Reassigned to SBA
Comment 4 lists 2008-06-13 06:38:23 UTC
The problem has been identified.

After version 2.6, ICU changed ZWSP from being a spacing character to being a format
character, "not to confuse developers". The only result of this change is that
ZWSP is no longer a break separator in ICU, because format characters are not
supposed to be word separators.

This change in ICU was also integrated in the UNICODE database of character
properties, in spite of the fact that the UNICODE character tables identify ZWSP
clearly as a spacing character.

Now ICU claims that the change back should take place in UNICODE, even if they
understand that ICU is broken, as ZWSP does not have the behaviour expected in
UNICODE (a word-separator).
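
The property in question can be inspected outside ICU. The sketch below uses Python's standard `unicodedata` module, which reflects the Unicode data bundled with the Python 3 interpreter and not the ICU copy discussed here. The `describe` helper is hypothetical. In that data U+200B carries the general category Cf (format) rather than Zs (space separator).

```python
import unicodedata

ZWSP = "\u200b"
SPACE = " "

def describe(ch):
    """Return (name, general category, is-whitespace) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())

zwsp_info = describe(ZWSP)
space_info = describe(SPACE)

# A whitespace-based split therefore leaves ZWSP-separated words joined.
joined = ("a" + ZWSP + "b").split()
```

Any tokenizer that derives its separators from the general category alone will treat ZWSP like other format characters and will not break on it.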

The temporary solution, for 3.0, is to revert ZWSP to be a spacing character in
ICU. I am attaching a patch for this change, for ICU 4.0d2, as I understand this
is what will finally be integrated in OpenOffice 3.0.

This patch has to be integrated in the overall patch to ICU that OpenOffice has,
but as its format is different from the format of the diff that I can produce, I am
only attaching the specific patch to change the properties of ZWSP.

This patch fixes a regression that exists since after OOo 2.1, and it is
necessary to use OOo in Khmer, which is now mandatory in all Cambodian schools.



Comment 5 ooo 2008-07-11 16:16:56 UTC
Javier, as mentioned in a mail, please attach a patch for ICU 3.6 as that is
what we'll continue to use for OOo3.0; we'd need to declare this issue as a show
stopper then. We may need a different patch for ICU 4.0 later when we switch to
ICU 4.0 for OOo3.1, if there isn't an ICU 4.1 by then that would fix the ZWSP.

Thanks
  Eike
Comment 6 lists 2008-07-13 13:48:39 UTC
I am attaching the patch for 3.6. I had done it already, but then I thought that
it was 4.0 that would go in. I am attaching it.

Thanks. I will now write to the release list.
Comment 7 lists 2008-07-13 13:50:12 UTC
Created attachment 55092 [details]
Patch to change ZWSP to be a spacing character
Comment 8 Martin Hollmichel 2008-07-14 09:09:11 UTC
accepted for 3.0 release.
Comment 9 ooo 2008-07-14 10:24:42 UTC
Karl, could you please take over and create a CWS for this patch only, targeted
to OOo3.0
Thanks
  Eike
Comment 10 lists 2008-07-22 14:08:28 UTC
Karl,

I think I found the problem.

Testing of the second build showed no change whatsoever. 

What I was changing were the text files that have all the descriptions of the
Unicode characters. I thought that this was the only place where this information
was present, and that these changes would be taken into account at compilation time.

But I was wrong. The compilation process does NOT use those text files by
default, so no changes were taking place in the code. 

It is necessary to run some applications (in the source of ICU) that
automatically generate three files: 

source/common/uchar_props_data.c
source/common/ubidi_props_data.c
source/common/ucase_props_data.c

I did that and got the new source that should solve the problem (I hope).

I am attaching a new patch (ChangeZWSPtoSpacingCharForICU36_v3.diff) in which
changes for these three files have been automatically generated from the data.

The original .txt files are also changed in the patch, but they do not affect
the result; they only show which changes have been taken into the .c files.

Comment 11 lists 2008-07-22 14:16:38 UTC
Created attachment 55291 [details]
Patch for ICU 3.6 to add ZWSP as a word boundary
Comment 12 karl.hong 2008-07-24 17:26:23 UTC
Fixed in cws i18n44.
Comment 13 karl.hong 2008-07-24 17:40:54 UTC
ready for QA.
Comment 14 michael.ruess 2008-08-08 12:07:50 UTC
We will take this fix into OO 3.1 branch. 
The affected area needs a lot of QA resources for accurate testing. These
resources cannot currently be granted this shortly before a major OOo release.
Comment 15 lists 2008-08-08 12:53:17 UTC
This issue completely breaks OpenOffice for three countries, making it useless.

It is accepted as a showstopper for 3.0, and the training materials for Ministry
of Education in Cambodia are being developed based on OpenOffice 3.0 (OpenOffice
is mandatory in all schools).

The changes made include simple changes in files in the
i18npool/source/breakiterator/data directory, and also changes in ICU. 

We just finished building, and checked that the changes in the files in
i18npool/source/breakiterator/data are sufficient, and that it is not necessary
to add to the patch to ICU (no changes in ICU are necessary). Changing ICU 
would be a more complete solution, but changing the files in
i18npool/source/breakiterator/data is sufficient to solve the main problem.

Comment 16 ooo 2008-08-08 14:30:39 UTC
@mru:
> We will take this fix into OO 3.1 branch. 

Who decided that?

> The affected area needs a lot of QA resources for accurate testing.

What gives you this impression?

Note that the effective changes in CWS i18n44 are not the patch that is
attached to this issue, changes to the ICU were backed out again.

The change in CWS i18n44 only excludes the ZWSP character from the list
of control characters in breakiterator data files, and adds ZWSP to the
handling of whitespace for the breakiterators. Languages not using ZWSP
as a word delimiter are not even affected.
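
The effect of the change described above can be sketched with a toy segmenter. This is not the code in CWS i18n44 or in the breakiterator data files: the character sets, `classify`, `segment`, `segment_old` and `segment_new` are all hypothetical, and the rule that the control class is checked first is an assumption made for this illustration.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

# Hypothetical character classes, loosely modelled on the description above.
OLD_CONTROL = {ZWSP, "\u200c", "\u200d"}  # ZWSP listed with control/format chars
NEW_CONTROL = {"\u200c", "\u200d"}        # ZWSP excluded from that list
OLD_WHITESPACE = {" ", "\t", "\n"}
NEW_WHITESPACE = OLD_WHITESPACE | {ZWSP}  # ZWSP added to whitespace handling

def classify(ch, whitespace, control):
    """Assign a character to a class; control is checked first in this sketch."""
    if ch in control:
        return "control"
    if ch in whitespace:
        return "space"
    return "letter"

def segment(text, whitespace, control):
    """Split text into words; only characters classified as space end a word."""
    words, current = [], []
    for ch in text:
        if classify(ch, whitespace, control) == "space":
            if current:
                words.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

def segment_old(text):
    return segment(text, OLD_WHITESPACE, OLD_CONTROL)

def segment_new(text):
    return segment(text, NEW_WHITESPACE, NEW_CONTROL)

sample = "ab" + ZWSP + "cd ef"
old_words = segment_old(sample)  # ZWSP stays inside the first word
new_words = segment_new(sample)  # ZWSP now ends a word
```

For any text that contains no ZWSP the two variants return the same list, which is the property the paragraph above relies on.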

I strongly urge you to reconsider and retarget this issue and CWS i18n44
to OOo3.0 again, I'm re-adding the dependency to issue 88888 now.

Thanks
  Eike
Comment 17 michael.ruess 2008-08-08 15:00:10 UTC
@er:
MRU, UL and TZ (aka thorstenziehm in Issuetracker) decided this
The following facts were decisive for us:
- changes and fixes in the break iterator lead to various regressions in the past
- currently low QA resources due to vacation
- the issue was reported on "Version 2.3.1", so why not stopper in 2.4.x
- the issue was untouched for a very long time; also it was three weeks in DEV
with state "fixed" and reached the QA right in the phase of release time
- the CWS was built on a milestone, where linguistic was broken for many
languages, so a re-synch would also be necessary

These reasons made us decide to reject it as a "stopper".
Comment 18 lists 2008-08-09 04:36:10 UTC
The change from ICU 2.6 to 3.6 introduced several issues that made 2.3 and 2.4
unusable for Khmer, so they were not used. The spellchecker was developed
recently, which is what led to detecting the problem; it was then
traced back to earlier versions. We realised that it worked in 2.1, but not in
later versions.

It was not clear until the last minute which version of ICU was going to be used
in OpenOffice 3.0. We started working on ICU 4.0 and then had to change to ICU
3.6, and then ended up finding out that the issue was in OOo's own code,
copied from ICU.

The change spans one single Unicode character, ZWSP (ZERO WIDTH SPACE), which
is taken out of the control and format groups for word separation (for word
selection and spellchecking), making it a normal character that can be used for
word separation (what it is supposed to be). Format characters are not used for
word separation. This is done already for line-breaking (in a different way),
but not for word separation, which affects spellchecking.

The change cannot affect anything that does not use the ZWSP character, which is
used only in Khmer (Cambodian), Lao, and Myanmar text. None of these languages
can be treated correctly without this issue being fixed.

Cambodia is the only country in the world that has mandated the use of
OpenOffice in all its education system... and implemented it. It is still using
OpenOffice 2.1. Not being able to use OpenOffice 3.0 - when we have told them
that the problems had been fixed (based on acceptance of the showstopper) -
would really send the wrong message to the MOST friendly government in the world
to OpenOffice.

Comment 19 ooo 2008-08-09 13:55:07 UTC
I do not accept the decision to not include the fix of this issue. As
Javier already stated, not including the fix will render OOo3.0
completely useless for a language spoken in a country that was the first
in the world to _mandate_ use of OOo in school and education.

> - changes and fixes in the break iterator lead to various regressions
> in the past

True. But this one affects ZWSP only. I just reviewed the changes on the
branch. Languages not using ZWSP will not see any difference, and
languages using ZWSP now will be usable.

> - currently low QA resources due to vacation

The QA left to be done would be to start the automated tests, and to
verify for some arbitrary sample language that the break iterator's
treatment of whitespace is not broken.

> - the issue was untouched for a very long time; also it was three
> weeks in DEV with state "fixed" and reached the QA right in the phase of
> release time

Status FIXED in DEV was because Javier actually tested builds including
fixes because no one at Sun is able to read or write Khmer. And while
doing so the fix was recoded because it turned out that changes to the
ICU were not necessary. Unfortunately when the issue was reassigned to
QA the CWS was not set rfQA, so that may have caused some unnecessary
delay of about a week or so.

> - the CWS was built on a milestone, where linguistic was broken for many
> languages, so a re-synch would also be necessary

Well, too bad, but not a real problem. I resynced to m28, new install
sets are ready. wntmsci12.pro is still building but should also be ready
soon. Just pick it up later.
Comment 20 michael.ruess 2008-08-11 13:47:54 UTC
OK, I have talked with ER about this and he reassured me that the fix is not
risky at all; that only the use of the ZWSP is affected.
Also, the fact that Javier already tested builds based on that code limits the
QA resources needed for this.
Comment 21 lists 2008-08-11 18:02:53 UTC
Thanks!
Comment 22 thorsten.ziehm 2009-07-20 14:53:27 UTC
This issue is closed automatically and wasn't rechecked in a current version of
OOo. The fix should have been integrated in OOo for more than half a year. If
you think this issue isn't fixed in a current version (OOo 3.1), please reopen
it and change the field 'Target Milestone' accordingly.

If you want to download a current version of OOo =>
http://download.openoffice.org/index.html
If you want to know more about the handling of fixed/verified issues =>
http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues