Apache OpenOffice (AOO) Bugzilla – Issue 85411
ZWSP is no longer considered as a word separation character for spellchecking
Last modified: 2013-08-07 15:02:29 UTC
Up to version 2.1 of OpenOffice, the ZWSP character (ZERO WIDTH SPACE, U+200B) was usable as a word separation character for purposes of line-breaking and spell-checking. It is used in languages that do not use spaces to separate words (Lao, Khmer, Burmese). In OpenOffice 2.4 it no longer works for spell-checking: words separated by a ZWSP are considered as one word. It still works correctly for line-breaking. Németh László mentioned on the list that this is probably a tokenization problem inside the ICU library, as ICU is what divides the text for analysis by Hunspell. The issue is probably related to the upgrade to the new version of ICU after OOo 2.1. I am attaching two pictures, one for OOo 2.1 and one for 2.4. They show the same texts, in Khmer and English, separated by ZWSP and separated by spaces. In the 2.4 picture we can see that when we separate with ZWSP, the text is marked as incorrect. In the 2.1 picture we can see that the text is considered correct in both cases (spaces and ZWSP).
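To make the symptom concrete, here is a minimal Python sketch of the two behaviours described above. It is only a toy tokenizer, not the ICU or Hunspell code path; the function name `split_words` and the English sample text are assumptions made for illustration.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def split_words(text, zwsp_is_separator):
    """Toy tokenizer (hypothetical): split on whitespace, optionally also on ZWSP."""
    # Python's \s does not match U+200B, so ZWSP must be listed explicitly.
    pattern = r"[\s\u200b]+" if zwsp_is_separator else r"\s+"
    return [word for word in re.split(pattern, text) if word]

sample = "spell" + ZWSP + "check"

# Behaviour reported for OOo 2.1: ZWSP ends a word.
words_like_21 = split_words(sample, zwsp_is_separator=True)

# Behaviour reported for OOo 2.4: the whole run reaches the spellchecker as one word.
words_like_24 = split_words(sample, zwsp_is_separator=False)

print(len(words_like_21), len(words_like_24))
```

In the second case the spellchecker receives a single token that still contains the invisible ZWSP, which would explain why correctly spelled text is flagged.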
Created attachment 51022 [details] Example in 2.1
Created attachment 51023 [details] Example in OOo 2.4
Reassigned to SBA
The problem has been identified. ICU, after 2.6, changed the ZWSP from being a spacing character to being a format character, "not to confuse developers". The only result of this change is that ZWSP is no longer a break-separator in ICU, because format characters are not supposed to be word separators. This change in ICU was also integrated in the Unicode database of character properties, in spite of the fact that the Unicode character tables clearly identify ZWSP as a spacing character. Now ICU claims that the change back should take place in Unicode, even if they understand that ICU is broken, as ZWSP does not have the behaviour expected in Unicode (a word separator). The temporary solution, for 3.0, is to revert ZWSP to be a spacing character in ICU. I am attaching a patch for this change, for ICU 4.0d2, as I understand this is what will finally be integrated in OpenOffice 3.0. This patch has to be integrated in the overall patch to ICU that OpenOffice has, but as its format is different from the format of the diff that I can do, I am only attaching the specific patch to change the properties of ZWSP. This patch fixes a regression that has existed since OOo 2.1, and it is necessary in order to use OOo in Khmer, which is now mandatory in all Cambodian schools.
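The property change described above can be observed from any environment that ships the Unicode character database. The following Python sketch uses the standard `unicodedata` module rather than ICU, so it only shows what the Unicode data bundled with Python reports, not what a given ICU version does.

```python
import unicodedata

ZWSP = "\u200b"
SPACE = " "

# General category of each character: "Cf" is "format", "Zs" is "space separator".
zwsp_category = unicodedata.category(ZWSP)
space_category = unicodedata.category(SPACE)

print(unicodedata.name(ZWSP), zwsp_category)    # ZERO WIDTH SPACE Cf
print(unicodedata.name(SPACE), space_category)  # SPACE Zs

# Because ZWSP is a format character, generic whitespace tests skip it.
print(ZWSP.isspace())  # False
```

Any word-break logic that relies only on these generic properties will therefore not break at ZWSP unless it handles the character explicitly.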
Javier, as mentioned in a mail, please attach a patch for ICU 3.6 as that is what we'll continue to use for OOo3.0; we'd need to declare this issue as a show stopper then. We may need a different patch for ICU 4.0 later when we switch to ICU 4.0 for OOo3.1, if there isn't an ICU 4.1 by then that would fix the ZWSP. Thanks Eike
I am attaching the patch for 3.6. I had done it already, but then I thought that it was 4.0 that would go in. Thanks. I will now write to the release list.
Created attachment 55092 [details] Patch to change ZWSP to be a spacing character
accepted for 3.0 release.
Karl, could you please take over and create a CWS for this patch only, targeted to OOo3.0? Thanks, Eike
Karl, I think I found the problem. Testing of the second build showed no change whatsoever. What I was changing were the text files that have all the descriptions of the Unicode characters. I thought that this was the only place where this information was present, and that these changes would be taken into account at compilation time. But I was wrong. The compilation process does NOT use those text files by default, so no changes were taking place in the code. It is necessary to run some applications (in the source of ICU) that automatically generate three files:
source/common/uchar_props_data.c
source/common/ubidi_props_data.c
source/common/ucase_props_data.c
I did that and got the new source that should solve the problem (I hope). I am attaching a new patch (ChangeZWSPtoSpacingCharForICU36_v3.diff) in which the changes for these three files have been automatically generated from the data. The original .txt files are also changed in the patch, but they do not affect the result; they only show which changes have been taken into the .c files.
Created attachment 55291 [details] Patch for ICU 3.6 to add ZWSP as a word boundary
Fixed in cws i18n44.
ready for QA.
We will take this fix into OO 3.1 branch. The affected area needs a lot of QA resources for accurate testing. These resources cannot currently be granted this shortly before a major OOo release.
This issue completely breaks OpenOffice for three countries, making it useless. It is accepted as a showstopper for 3.0, and the training materials for the Ministry of Education in Cambodia are being developed based on OpenOffice 3.0 (OpenOffice is mandatory in all schools). The changes made include simple changes in files in the i18npool/source/breakiterator/data directory, and also changes in ICU. We just finished building, and checked that the changes in the files in i18npool/source/breakiterator/data are sufficient, and that it is not necessary to add to the patch to ICU (no changes in ICU are necessary). Changing ICU would be a more complete solution, but changing the files in i18npool/source/breakiterator/data is sufficient to solve the main problem.
@mru:
> We will take this fix into OO 3.1 branch.
Who decided that?
> The affected area needs a lot of QA resources for accurate testing.
What gives you this impression? Note that the effective changes in CWS i18n44 are not the patch that is attached to this issue; changes to the ICU were backed out again. The change in CWS i18n44 only excludes the ZWSP character from the list of control characters in breakiterator data files, and adds ZWSP to the handling of whitespace for the breakiterators. Languages not using ZWSP as a word delimiter are not even affected. I strongly urge you to reconsider and retarget this issue and CWS i18n44 to OOo3.0 again. I'm re-adding the dependency to issue 88888 now. Thanks Eike
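The shape of the change described for CWS i18n44 (take ZWSP out of the control/format group, add it to whitespace handling) can be modelled roughly as follows. This Python sketch is not the breakiterator rule syntax or the actual data files; the function names and the `zwsp_is_whitespace` flag are hypothetical and only illustrate why text without ZWSP would be unaffected.

```python
import unicodedata

ZWSP = "\u200b"

def is_separator(ch, zwsp_is_whitespace):
    """Whitespace ends a word; after the fix, ZWSP counts as whitespace too."""
    return ch.isspace() or (zwsp_is_whitespace and ch == ZWSP)

def is_ignorable(ch, zwsp_is_whitespace):
    """Control/format characters never end a word (toy model of the rule)."""
    if zwsp_is_whitespace and ch == ZWSP:
        return False  # the fix: ZWSP is excluded from the control/format group
    return unicodedata.category(ch) in ("Cc", "Cf")

def words(text, zwsp_is_whitespace):
    out, current = [], []
    for ch in text:
        if is_separator(ch, zwsp_is_whitespace):
            if current:
                out.append("".join(current))
                current = []
        elif is_ignorable(ch, zwsp_is_whitespace):
            continue  # format character: does not break the word, dropped here
        else:
            current.append(ch)
    if current:
        out.append("".join(current))
    return out

with_zwsp = "one" + ZWSP + "two"
print(words(with_zwsp, zwsp_is_whitespace=False))  # one word
print(words(with_zwsp, zwsp_is_whitespace=True))   # two words
print(words("one two", False) == words("one two", True))  # unchanged without ZWSP
```

In this model the flag only changes the outcome for input that actually contains U+200B, which matches the claim that languages not using ZWSP are not affected.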
@er: MRU, UL and TZ (aka thorstenziehm in Issuetracker) decided this. The following facts were decisive for us:
- changes and fixes in the break iterator led to various regressions in the past
- currently low QA resources due to vacation
- the issue was reported on "Version 2.3.1", so why was it not a stopper in 2.4.x?
- the issue was untouched for a very long time; also it was three weeks in DEV with state "fixed" and reached the QA right in the phase of release time
- the CWS was built on a milestone where linguistic was broken for many languages, so a re-synch would also be necessary
These reasons made us decide to reject it as a "stopper".
The change from ICU 2.6 to 3.6 introduced several issues that made 2.3 and 2.4 unusable for Khmer, so they were not used. The spellchecker was developed recently, which is what led to detecting the problem; it was then tracked back to earlier versions. We realised that it worked in 2.1, but not in later versions. It was not clear until the last minute which version of ICU was going to be used in OpenOffice 3.0. We started working on ICU 4.0 and then had to change to ICU 3.6, and then ended up finding out that the issue was in OOo code itself, copied from ICU. The change spans one single Unicode character, ZWSP (ZERO WIDTH SPACE), which is taken out of the control and format groups for word separation (for word selection and spellchecking), making it a normal character that can be used for word separation (what it is supposed to be). Format characters are not used for word separation. This is done already for line-breaking (in a different way), but not for word separation, which affects spellchecking. The change cannot affect anything that does not use the ZWSP character, which is used only in Khmer (Cambodian), Lao, and Myanmar text. None of these languages can be treated correctly without this issue being fixed. Cambodia is the only country in the world that has mandated the use of OpenOffice in all its education system... and implemented it. It is still using OpenOffice 2.1. Not being able to use OpenOffice 3.0, when we have told them that the problems had been fixed (based on acceptance of the showstopper), would really send the wrong message to the MOST friendly government in the world to OpenOffice.
I do not accept the decision to not include the fix of this issue. As Javier already stated, not including the fix will render OOo3.0 completely useless for a language spoken in a country that was the first in the world to _mandate_ use of OOo in school and education.
> - changes and fixes in the break iterator lead to various regressions
> in the past
True. But this one affects ZWSP only. I just reviewed the changes on the branch. Languages not using ZWSP will not see any difference, and languages using ZWSP will now be usable.
> - currently low QA resources due to vacation
The QA left to be done would be to start the automated tests, and to verify for some arbitrary sample language that the break iterator's treatment of whitespace is not broken.
> - the issue was untouched for a very long time; also it was three
> weeks in DEV with state "fixed" and reached the QA right in the phase of
> release time
Status FIXED in DEV was because Javier actually tested builds including fixes, because no one at Sun is able to read or write Khmer. And while doing so the fix was recoded because it turned out that changes to the ICU were not necessary. Unfortunately, when the issue was reassigned to QA the CWS was not set rfQA, so that may have caused some unnecessary delay of about a week or so.
> - the CWS was build on a milestone, where linguistic was broken for many
> languages, so a re-synch would also be necessary
Well, too bad, but not a real problem. I resynced to m28, new install sets are ready. wntmsci12.pro is still building but should also be ready soon. Just pick it up later.
OK, I have talked with ER about this and he reassured me that the fix is not risky at all; only the use of the ZWSP is affected. Also, the fact that Javier already tested builds based on that code limits the QA resources needed for this.
Thanks!
This issue was closed automatically and was not rechecked in a current version of OOo. The fix should have been integrated in OOo for more than half a year. If you think this issue isn't fixed in a current version (OOo 3.1), please reopen it and change the field 'Target Milestone' accordingly. If you want to download a current version of OOo => http://download.openoffice.org/index.html If you want to know more about the handling of fixed/verified issues => http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues