Apache OpenOffice (AOO) Bugzilla – Issue 107843
Words containing em dash fail spell check
Last modified: 2013-08-07 14:44:16 UTC
Two words separated by an em dash fail spell check. This looks like an old bug come back to life. Bug not present in 3.1.1 and 2.4.3.
Sorry, I meant to say words *separated* by an em dash.
@SBA: please take over. In "something--like--this" every word will be marked as wrong. Apparently the spell checker reads it as: "something-like-this".
SBA->TL: Please proceed, thank you.
tl->nemeth: Hi Lazlo, what do you think about this? The easiest solution would be to replace all dashes that are accepted by the breakiterator by ASCII dashes before handing them to the spell checker/hyphenator/thesaurus. This of course has the drawback that the result will also only contain ASCII dashes. Thus when replacing a misspelled word with e.g. em-dash with the correct suggestion the em-dash will implicitly be replaced as well. Otherwise we would need a more complex logic just to keep the same type of dash in the result.
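A minimal sketch of the replacement idea described above, including the drawback and the "more complex logic" needed to keep the original dash type. This is an illustration only: the function names are hypothetical and none of this is taken from the OOo code base.

```python
# Hypothetical helpers, not OOo code: normalise dashes before spell checking
# and try to put the original dash characters back into a suggestion.
EN_DASH = "\u2013"
EM_DASH = "\u2014"
ALL_DASHES = "-" + EN_DASH + EM_DASH


def to_ascii_dashes(word: str) -> str:
    """Replace en and em dashes with the ASCII hyphen-minus."""
    return word.replace(EN_DASH, "-").replace(EM_DASH, "-")


def restore_dashes(original: str, suggestion: str) -> str:
    """Re-insert the dash characters of `original` into `suggestion`, in order.

    Only attempted when both strings contain the same number of dashes;
    otherwise the suggestion is returned unchanged (ASCII dashes only).
    """
    original_dashes = [c for c in original if c in ALL_DASHES]
    if suggestion.count("-") != len(original_dashes):
        return suggestion
    dashes = iter(original_dashes)
    return "".join(next(dashes) if c == "-" else c for c in suggestion)


misspelled = "Bose" + EN_DASH + "Einsten"
checked_form = to_ascii_dashes(misspelled)          # what the checker would see
fixed = restore_dashes(misspelled, "Bose-Einstein")  # suggestion with en dash restored
```

Without a step like `restore_dashes`, accepting a suggestion would silently turn the en dash or em dash into an ASCII hyphen, which is the drawback mentioned in the comment.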
nemeth->tl: em dash (U+2014) is a real word separator; en dash (U+2013) is for ranges and relationships, so it has a similar role (see http://en.wikipedia.org/wiki/Dash#Em_dash). I think em dash and en dash have to be word separators. Checking em dash and en dash usage is better done with grammar checkers. By the way, Unicode and the 8-bit Windows encodings contain en dash and em dash, so the problem is specific to the dictionary encoding, too. I believe Hunspell supports default word breaking at en dashes with UTF-8 encoded dictionaries. For other encodings, you can convert en dashes to double hyphens (--) temporarily, and Hunspell will handle it correctly:
$ ~/hunspell-1.2.8/src/tools/hunspell -d /opt/openoffice.org3/share/uno_packages/cache/uno_packages/SdbQvF_/dict-en.oxt/en_US
Bose--Einsten
& Bose--Einsten 9 0: Bose--Einstein, Bose--Einsteinium, Bose--Eisenstein, Bose--Reinstate, Bose--Minster, Einsteinium, Reinstatement, Liechtenstein, Rubinstein
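The encoding-dependent workaround suggested above could be sketched as follows. The function name and the idea of passing the dictionary encoding as a string are assumptions made for illustration; this is not how the lingucomponent wrapper is actually written.

```python
# Hypothetical sketch: convert en dashes to "--" only when the dictionary's
# encoding cannot represent U+2013.
EN_DASH = "\u2013"


def prepare_for_dictionary(word: str, dictionary_encoding: str) -> str:
    """Return `word` in a form the dictionary encoding can represent."""
    try:
        EN_DASH.encode(dictionary_encoding)
    except UnicodeEncodeError:
        # The encoding has no en dash, so fall back to a double hyphen.
        return word.replace(EN_DASH, "--")
    return word


word = "Bose" + EN_DASH + "Einstein"
latin1_form = prepare_for_dictionary(word, "iso-8859-1")  # no en dash in Latin-1
utf8_form = prepare_for_dictionary(word, "utf-8")         # left untouched
```

With a Windows code page such as `cp1252`, which does contain the en dash, the word is also left untouched.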
Added CC. What a PITA. For those of us--like me--who write this way, this sentence will contain two spelling errors.
Just to emphasise that this applies to both en-dash and em-dash.
*** Issue 109202 has been marked as a duplicate of this issue. ***
I don't know if it's related to this, but autocorrect/word completion also fails to recognize em-dash as a word separator. I just filed Issue 109221 for that one.
I would say that Issue 109221 is indeed related, probably part of this problem. I'm glad you brought it up, as it might have been missed. Remember to vote for this issue!
This is worse than annoying. In some text I edit regularly, monthly dates are separated from comment text by em dashes, and the resulting cacophony of wiggly red lines makes the spell checker unusable, because each separate instance is treated as a new misspelled word. I've been using OOo since the original StarOffice days and this is the worst bug I've seen over that whole period. It deserves to go to the head of the pack and get a patch tout de suite. It is not a minor issue since it may force me to use MS Word on Windows (in which I'm not fluent) until it is fixed, and downgrade my Linux installations. A nightmare.
nemeth->tl: It would be good to provide an immediate fix through a (semi?)automatic dictionary update, using UTF-8 versions of the English dictionaries with the following BREAK rules:
BREAK 2
BREAK –
BREAK -
If this dictionary update is possible and you are interested in it, I will make the UTF-8 versions of the English dictionaries with the BREAK rules above.
tl->nemeth: I just finished the last 'must be' feature for OOo 3.3 and I'm now going to spend some days on fixing some regression-like issues (such as this one) before implementing some more dialogs. Based on your earlier comment here (Wed Jan 13 08:19:24), I was planning to remove the entries for em-dash and en-dash from the mid-letter definition in the breakiterator, which will make them word breaking characters again. Then only the hyphen/minus chara will remain as non-word-breaking from our last change to the breakiterator. Will that be fine with you? Or should I leave the en-dash as word-break chara as well? However, I cannot convert en-dash to '--' depending on the encoding type of the dictionary, since that is not available in the API. The encoding type is of course available in the hunspell wrapper in lingucomponent. But I think it would be cleaner to handle this issue directly in hunspell. If you can point me to a rough location I would be willing to look into it myself. What do you think? However, the first issue to solve is if em-dash AND en-dash BOTH should become word-breaking charas again or if it should just be em-dash. From reading the provided Wikipedia link I would say both of them should become word breaking again. What do you think?
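To illustrate what removing the dashes from the mid-letter definition changes, here is a toy word splitter. It is only a simplified model of a "mid-letter" rule (a character that does not break a word when it sits between two letters); it does not reproduce the actual breakiterator rules or the contents of the dictionary data files, and all names are hypothetical.

```python
# Toy model of word breaking with a configurable set of mid-letter characters.
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def split_words(text: str, mid_letters: set) -> list:
    """Split text into words; mid-letter characters join two adjacent letters."""
    words = []
    current = ""
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            current += ch
        elif ch in mid_letters and current and current[-1].isalpha() and next_is_letter:
            current += ch  # keep the word going across this character
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


sample = "us" + EM_DASH + "like me" + EM_DASH + "who use an add-on"

# Behaviour reported in this issue: all three dashes act as mid-letter characters.
regressed = split_words(sample, {"-", EN_DASH, EM_DASH})

# Proposed behaviour: only hyphen/minus stays non-word-breaking.
proposed = split_words(sample, {"-"})
```

In the first call the spell checker would receive "us", em dash, "like" as a single word; in the second it receives "us" and "like" separately while "add-on" stays whole.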
@tl: "... if em-dash AND en-dash BOTH should become word-breaking charas again or if it should just be em-dash. ... I would say both of them should become word breaking again." Absolutely: Both em-dash and en-dash are word breaking. That's how they worked before, and how they should continue to work. The en-dash is not a replacement for a hyphen, because they perform different roles.
I have made a fix, which you can install with the Extension Manager of OpenOffice.org: http://numbertext.org/tmp/dict-en.oxt
Release notes
Spelling dictionaries:
2010-03-09 (nemeth AT OOo)
- UTF-8 encoded dictionary:
  - fix em-dash problem of OOo 3.2 by BREAK
  - suggesting words with typographical apostrophes
  - recognizing words with Unicode f ligatures
- add phonetic suggestion (Copyright (C) 2000 Björn Jacke, see the end of the file)
en-US hyphenation dictionary:
- UTF-8 encoding
- Unicode ligature support
I tried to create a new project on the OpenOffice.org extension site, but the site is not working yet (service unavailable, Guru Meditation XID: 708565805).
nemeth->tl, paddylandau: Yes, en-dash and em-dash have to be word-breaking characters.
En-dash handling is also fixed. The relevant section from the affix files:
BREAK 3
BREAK —
BREAK –
BREAK -
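A rough sketch of the effect of these BREAK rules: a word that is not in the dictionary is split at the break characters, and it is accepted if every part is itself a known word. This is a simplified model under that assumption, with a toy word list; it is not Hunspell's actual algorithm, which has more cases (for example anchored break patterns).

```python
import re

# The three break characters from the affix section above.
BREAK_PATTERNS = ["\u2014", "\u2013", "-"]

# Toy word list standing in for a real dictionary (illustrative only).
TOY_DICTIONARY = {"something", "like", "this", "add-on"}


def is_correct(word: str, dictionary=TOY_DICTIONARY, breaks=BREAK_PATTERNS) -> bool:
    """Accept a word if it is known, or if all parts between breaks are known."""
    if word in dictionary:
        return True
    splitter = "|".join(re.escape(b) for b in breaks)
    parts = [part for part in re.split(splitter, word) if part]
    if len(parts) < 2:
        return False  # nothing to split, so the word is simply unknown
    return all(part in dictionary for part in parts)


em_dash_ok = is_correct("something\u2014like\u2014this")
en_dash_ok = is_correct("like\u2013this")
typo_found = not is_correct("something\u2014lik")
```

Because the whole word is looked up first, dictionary entries that contain a hyphen (such as "add-on" in the toy list) are still accepted as they are.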
OpenOffice.org extension for the fix: dict-en-fixed http://extensions.services.openoffice.org/hu/project/dict-en-fixed
@nemeth Thank you, it works perfectly for me! One odd thing (which is not a problem, just curious) is that when a word is misspelled, it highlights both words and the em-dash or en-dash. But when the word is corrected, it correctly removes the red line.
Ok, I just proposed this issue as a show stopper for OOo 3.2.1 since it is a regression and easy to fix. In any case this one will be fixed at least in OOo 3.3.
nemeth->tl: Thanks, I agree with you, this is a real show stopper for 3.2.1. nemeth->paddylandau: Thanks for the test and feedback. The long red lines are the same as before, but now the spell checker can recognize the words in the character sequences with en or em dashes.
A new issue for the dictionary update: Issue 110007.
Approved as show stopper, setting target to OOo 3.2.1.
I have found a related problem in hyphenation, too. I have solved the ugliest cases, hyphenation one character away from a dash (e.g. something—t=wo, ad=d-on), with the new release of the improved English dictionaries (http://extensions.services.openoffice.org/hu/project/dict-en-fixed), but I will make a new Hyphen release to solve the others.
Fixed in CWS os140. Files changed:
M i18npool\source\breakiterator\data\dict_word_prepostdash.txt
M i18npool\source\breakiterator\data\dict_word.txt
Correction: Fixed in sw321bf01.
*** Issue 110374 has been marked as a duplicate of this issue. ***
*** Issue 110442 has been marked as a duplicate of this issue. ***
OD->SBA: Please check and verify in internal installation set of cws sw321bf01 - Thx.
Verified in CWS sw321bf01.
*** Issue 109221 has been marked as a duplicate of this issue. ***
*** Issue 111347 has been marked as a duplicate of this issue. ***
*** Issue 111626 has been marked as a duplicate of this issue. ***
Hello craig_icg, *, have you tested it with a newer build than OOO320m8? I have tested it with the German-language version of OOO320m18 under Debian SID/Experimental AMD64, where it works. As I have no English version (or a version for another language, OS, or architecture) installed, it would be nice if you could test it on your system again ... ;) TIA Thomas.
Verified fixed in both:
3.2.1 RC1 (OOO320m18 build 9498 as in Help -> About)
3.2.1 RC2 (OOO320m18 build 9502 as in Help -> About)
I am closing this. Thank you for fixing this bug.
*** Issue 112099 has been marked as a duplicate of this issue. ***