Issue 107843 - Words containing em dash fail spell check
Summary: Words containing em dash fail spell check
Status: CLOSED FIXED
Alias: None
Product: Writer
Classification: Application
Component: editing
Version: OOO320m8
Hardware: PC Windows XP
Importance: P3 Trivial with 10 votes
Target Milestone: ---
Assignee: stefan.baltzer
QA Contact: issues@sw
URL:
Keywords: oooqa, regression
Duplicates: 109202 109221 110374 110442 111347 111626 112099
Depends on:
Blocks: 109046
 
Reported: 2009-12-22 20:23 UTC by craig_icg
Modified: 2013-08-07 14:44 UTC
CC List: 6 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Description craig_icg 2009-12-22 20:23:31 UTC
Two words separated by an em dash fail spell check. This looks like an old bug
come back to life. Bug not present in 3.1.1 and 2.4.3.
Comment 1 craig_icg 2009-12-22 20:30:09 UTC
Sorry, I meant to say words *separated* by an em dash.
Comment 2 eric.savary 2010-01-04 15:11:16 UTC
@SBA: please take over.

In "something--like--this" every word will be marked as wrong.
Apparently the spell checker reads it as: "something-like-this".
Comment 3 stefan.baltzer 2010-01-04 16:32:19 UTC
SBA->TL: Please proceed, thank you.
Comment 4 thomas.lange 2010-01-04 16:42:34 UTC
tl->nemeth: Hi Lazlo, what do you think about this?

The easiest solution would be to replace all dashes that are accepted by the
breakiterator by ASCII dashes before handing them to the spell
checker/hyphenator/thesaurus. 
This of course has the drawback that the result will also only contain ASCII
dashes. Thus when replacing a misspelled word with e.g. em-dash with the correct
suggestion the em-dash will implicitly be replaced as well.
Otherwise we would need a more complex logic just to keep the same type of dash
in the result.
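
A minimal sketch of the replacement idea and its drawback, in Python rather than
the OOo code base. The function names are hypothetical and the suggestion is
hard-coded to stand in for whatever the spell checker would return.
restore_dashes illustrates the "more complex logic" needed to keep the original
dash type; it only handles the simple case where the number of dashes is
unchanged.

```python
# Hypothetical sketch, not OpenOffice.org code.
DASHES = {"\u2013": "-", "\u2014": "-"}  # en dash, em dash -> ASCII hyphen


def normalize_dashes(word):
    """Replace en and em dashes with ASCII hyphens before spell checking."""
    return "".join(DASHES.get(ch, ch) for ch in word)


def restore_dashes(original, suggestion):
    """Put the original dash characters back into a suggestion.

    Only works when the suggestion has as many hyphens as the original has
    dashes or hyphens; otherwise the suggestion is returned unchanged.
    """
    original_dashes = [ch for ch in original if ch in DASHES or ch == "-"]
    if suggestion.count("-") != len(original_dashes):
        return suggestion
    remaining = iter(original_dashes)
    return "".join(next(remaining) if ch == "-" else ch for ch in suggestion)


word = "Bose\u2014Einsten"                   # misspelled, with an em dash
normalized = normalize_dashes(word)          # what the checker would see
suggestion = "Bose-Einstein"                 # assumed checker suggestion
restored = restore_dashes(word, suggestion)  # em dash put back
```

Without the restore step, accepting the suggestion would silently turn the em
dash into an ASCII hyphen.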
Comment 5 nemeth.lacko 2010-01-13 08:19:24 UTC
nemeth->tl: em dash (U+2014) is a real word separator, and en dash (U+2013) is
for ranges and relationships, so it has a similar role (see
http://en.wikipedia.org/wiki/Dash#Em_dash). I think em dash and en dash have to
be word separators.
Checking em dash and en dash usage is better with grammar checkers.

(By the way, Unicode and 8-bit Windows encodings contain en dash and em dash, so
the problem also depends on the dictionary encoding. I believe Hunspell
supports default word breaking at en dashes with UTF-8 encoded dictionaries. For
other encodings, you can convert en dashes to double hyphens (--) temporarily,
and Hunspell will handle it correctly:

$ ~/hunspell-1.2.8/src/tools/hunspell -d
/opt/openoffice.org3/share/uno_packages/cache/uno_packages/SdbQvF_/dict-en.oxt/en_US
Bose--Einsten
& Bose--Einsten 9 0: Bose--Einstein, Bose--Einsteinium, Bose--Eisenstein,
Bose--Reinstate, Bose--Minster, Einsteinium, Reinstatement, Liechtenstein,
Rubinstein)
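
A rough sketch of the temporary conversion, in Python. The helper names are
hypothetical, and testing whether the dictionary encoding can represent the
en dash is an assumption of this sketch (the comment above only says "other
encodings"). It relies on the fact that Windows-1252 contains the en dash while
ISO-8859-1 does not.

```python
# Hypothetical sketch, not Hunspell or OpenOffice.org code.
EN_DASH = "\u2013"


def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def prepare_for_dictionary(word, encoding):
    """Return (text to hand to the checker, whether en dashes were replaced)."""
    if EN_DASH in word and not can_encode(word, encoding):
        return word.replace(EN_DASH, "--"), True
    return word, False


latin1 = prepare_for_dictionary("Bose\u2013Einstein", "iso-8859-1")
cp1252 = prepare_for_dictionary("Bose\u2013Einstein", "cp1252")
utf8 = prepare_for_dictionary("Bose\u2013Einstein", "utf-8")
```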
Comment 6 T. J. Frazier 2010-02-06 09:31:18 UTC
Added CC. What a PITA. For those of us--like me--who write this way, this
sentence will contain two spelling errors.
Comment 7 paddylandau 2010-02-12 14:22:40 UTC
Just to emphasise that this applies to both en-dash and em-dash.
Comment 8 michael.ruess 2010-02-12 14:27:13 UTC
*** Issue 109202 has been marked as a duplicate of this issue. ***
Comment 9 Joe Smith 2010-02-12 20:03:48 UTC
I don't know if it's related to this, but autocorrect/word completion also fails
to recognize em-dash as a word separator.

I just filed Issue 109221 for that one.
Comment 10 paddylandau 2010-02-13 10:34:20 UTC
I would say that Issue 109221 is indeed related, probably part of this problem.

I'm glad you brought it up, as it might have been missed.

Remember to vote for this issue!
Comment 11 kenirving 2010-03-09 11:52:16 UTC
This is worse than annoying. In some text I edit regularly, monthly dates are
separated from comment text by em dashes, and the resulting cacophony of wiggly
red lines makes the spell checker unusable, because each separate instance is
treated as a new misspelled word. I've been using OOo since the original
StarOffice days and this is the worst bug I've seen over that whole period. It
deserves to go to the head of the pack and get a patch tout de suite. It is not
a minor issue since it may force me to use MS Word on Windows (in which I'm not
fluent) until it is fixed, and downgrade my Linux installations. A nightmare.
Comment 12 nemeth.lacko 2010-03-09 12:41:18 UTC
nemeth->tl: It would be good to provide an immediate fix through a
(semi-)automatic dictionary update, using UTF-8 versions of the English
dictionaries with the following BREAK rules:

BREAK 2
BREAK –
BREAK -

If this dictionary update is possible and you are interested in it, I will make
the UTF-8 versions of the English dictionaries with the previous BREAK rules.
Comment 13 thomas.lange 2010-03-09 14:39:35 UTC
tl->nemeth: I just finished the last 'must be' feature for OOo 3.3 and I'm now
going to spend some days fixing regression-like issues (such as this one)
before implementing some more dialogs.

From your earlier comment here (Wed Jan 13 08:19:24), I was planning to remove
the entries for em-dash and en-dash from the mid-letter definition in the
breakiterator, which will make them word-breaking characters again. Then only
the hyphen/minus character will remain as non-word-breaking from our last change
to the breakiterator.
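
To illustrate the effect of that change, here is a simplified word splitter in
Python. It is only a sketch of the mid-letter idea: the function and the sample
text are hypothetical, and the real breakiterator rules in dict_word.txt are
more involved than this.

```python
# Hypothetical sketch of "mid-letter" word breaking, not the real breakiterator.
def split_words(text, mid_letters):
    """Split text into words.

    A character in mid_letters stays inside a word only when it has a letter
    on both sides; every other non-letter character ends the current word.
    """
    words, current = [], ""
    for i, ch in enumerate(text):
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha() or (ch in mid_letters and prev_is_letter and next_is_letter):
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


text = "add-on users\u2014like me\u2014write this"
# Regressed behaviour: hyphen, en dash and em dash are all mid-letter characters.
before_fix = split_words(text, {"-", "\u2013", "\u2014"})
# Planned behaviour: only the hyphen/minus remains a mid-letter character.
after_fix = split_words(text, {"-"})
```

With all three dashes treated as mid-letter characters, "users\u2014like" reaches
the spell checker as a single word; with only the hyphen left, it is split into
"users" and "like" while "add-on" stays intact.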

Will that be fine with you? Or should I leave the en-dash as word-break
character as well?

However, I cannot convert en-dash to '--' depending on the encoding type of the
dictionary, since that is not available in the API. The encoding type is of
course available in the hunspell wrapper in lingucomponent, but I think it would
be cleaner to handle this issue directly in hunspell. If you can point me to a
rough location, I would be willing to look into it myself.
What do you think?

However, the first issue to solve is whether em-dash AND en-dash should BOTH
become word-breaking characters again or whether it should just be em-dash.
From reading the provided Wikipedia link, I would say both of them should become
word-breaking again.
What do you think?
Comment 14 paddylandau 2010-03-09 15:14:34 UTC
@tl: "... if em-dash AND en-dash BOTH should become word-breaking characters
again or if it should just be em-dash. ... I would say both of them should
become word breaking again."

Absolutely: Both em-dash and en-dash are word breaking. That's how they worked
before, and how they should continue to work.

The en-dash is not a replacement for a hyphen, because they perform different roles.
Comment 15 nemeth.lacko 2010-03-09 17:55:11 UTC
I have made a fix that you can install via the Extension Manager of OpenOffice.org:

http://numbertext.org/tmp/dict-en.oxt

Release notes 

Spelling dictionaries:

2010-03-09 (nemeth AT OOo)
  - UTF-8 encoded dictionary:
       - fix em-dash problem of OOo 3.2 by BREAK
       - suggesting words with typographical apostrophes
       - recognizing words with Unicode f ligatures
  - add phonetic suggestion (Copyright (C) 2000 Björn Jacke, see the end of the
file)

en-US hyphenation dictionary:

- UTF-8 encoding
- Unicode ligature support

I tried to create a new project on the OpenOffice.org extension site, but the
site is not working yet (service unavailable, Guru Meditation XID: 708565805).

nemeth->tl, paddylandau: Yes, en-dash and em-dash have to be word-breaking
characters.
Comment 16 nemeth.lacko 2010-03-09 18:03:35 UTC
En-dash handling is also fixed. The relevant section from the affix files:

BREAK 3
BREAK —
BREAK –
BREAK -
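
As a rough illustration of what such BREAK rules achieve, the Python sketch
below accepts a compound if it is in the word list itself or if it can be split
at one of the three break patterns into parts that are all accepted. This is a
simplified model with a made-up word list, not Hunspell's actual algorithm,
which differs in detail.

```python
# Hypothetical sketch of BREAK-style checking, not Hunspell's implementation.
BREAKS = ["\u2014", "\u2013", "-"]  # em dash, en dash, hyphen (as in the rules above)


def is_correct(word, dictionary, breaks=BREAKS):
    """Accept a word if it is in the dictionary, or if it can be split at a
    break pattern into two non-empty parts that are both accepted."""
    if word in dictionary:
        return True
    for pattern in breaks:
        pos = word.find(pattern)
        while pos != -1:
            left, right = word[:pos], word[pos + len(pattern):]
            if (left and right
                    and is_correct(left, dictionary, breaks)
                    and is_correct(right, dictionary, breaks)):
                return True
            pos = word.find(pattern, pos + 1)
    return False


dictionary = {"something", "like", "this", "add-on"}  # made-up word list
ok = is_correct("something\u2014like\u2014this", dictionary)
bad = is_correct("somethin\u2014like", dictionary)
hyphen_only = is_correct("something\u2014like", dictionary, ["-"])
```

With only the hyphen as a break pattern, the em-dash compound is rejected, which
matches the behaviour reported in this issue.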
Comment 17 nemeth.lacko 2010-03-09 19:09:04 UTC
OpenOffice.org extension for the fix: dict-en-fixed

http://extensions.services.openoffice.org/hu/project/dict-en-fixed

Comment 18 paddylandau 2010-03-09 19:17:23 UTC
@nemeth

Thank you, it works perfectly for me!

One odd thing (which is not a problem, just curious) is that when a word is
misspelled, it highlights both words and the em-dash or en-dash. But when the
word is corrected, it correctly removes the red line.
Comment 19 thomas.lange 2010-03-10 08:43:14 UTC
.
Comment 20 thomas.lange 2010-03-10 08:54:55 UTC
Ok, I just proposed this issue as a show stopper for OOo 3.2.1 since it is a
regression and easy to fix. In any case this one will be fixed at least in OOo 3.3.
Comment 21 nemeth.lacko 2010-03-10 09:10:20 UTC
nemeth->tl: Thanks, I agree with you, this is a real show stopper for 3.2.1.

nemeth->paddylandau: Thanks for the test and feedback. The long red lines are the
same as before, but now the spell checker can recognize the words in the
character sequences with en or em dashes.
Comment 22 nemeth.lacko 2010-03-10 11:46:10 UTC
A new issue for the dictionary update: Issue 110007.
Comment 23 thomas.lange 2010-03-11 11:23:52 UTC
Approved as show stopper, setting target to OOo 3.2.1.
Comment 24 nemeth.lacko 2010-03-12 10:54:02 UTC
I have found a related problem in hyphenation, too. I have solved the ugliest
case, hyphenation at a 1-character distance from dashes (e.g. something—t=wo,
ad=d-on), in the new release of the improved English dictionaries
(http://extensions.services.openoffice.org/hu/project/dict-en-fixed), but I will
make a new Hyphen release to solve the others.
Comment 25 thomas.lange 2010-03-15 08:56:29 UTC
Fixed in CWS os140.

Files changed:
M  i18npool\source\breakiterator\data\dict_word_prepostdash.txt
M  i18npool\source\breakiterator\data\dict_word.txt
Comment 26 thomas.lange 2010-03-15 09:13:40 UTC
Correction: Fixed in sw321bf01.
Comment 27 eric.savary 2010-03-24 23:39:26 UTC
*** Issue 110374 has been marked as a duplicate of this issue. ***
Comment 28 Oliver Brinzing 2010-03-25 18:04:06 UTC
.
Comment 29 eric.savary 2010-03-28 11:21:29 UTC
*** Issue 110442 has been marked as a duplicate of this issue. ***
Comment 30 Oliver-Rainer Wittmann 2010-03-31 13:54:45 UTC
OD->SBA: Please check and verify in internal installation set of cws sw321bf01 -
Thx.
Comment 31 stefan.baltzer 2010-04-01 15:00:50 UTC
Verified in CWS sw321bf01.
Comment 32 michael.ruess 2010-04-23 14:31:47 UTC
*** Issue 109221 has been marked as a duplicate of this issue. ***
Comment 33 eric.savary 2010-05-03 17:54:20 UTC
*** Issue 111347 has been marked as a duplicate of this issue. ***
Comment 34 eric.savary 2010-05-15 21:50:13 UTC
*** Issue 111626 has been marked as a duplicate of this issue. ***
Comment 35 thackert 2010-05-30 17:16:27 UTC
Hello craig_icg, *,
have you tested it with a newer build than OOO320m8? I have tested it with
Germanophone version of OOO320m18 under Debian SID/Experimental AMD64, where it
works. As I have no English (or other NL resp. OS/architecture) version
installed, it would be nice, if you could test it on your system again ... ;)
TIA
Thomas.
Comment 36 craig_icg 2010-05-31 13:45:38 UTC
Verified fixed in both:

3.2.1 RC1 (OOO320m18 build 9498 as in Help -> About)
3.2.1 RC2 (OOO320m18 build 9502 as in Help -> About)

I am closing this. Thank you for fixing this bug.
Comment 37 eric.savary 2010-06-04 02:47:04 UTC
*** Issue 112099 has been marked as a duplicate of this issue. ***