Issue 113558 - Change Case broken by language tags and/or ligatures
Summary: Change Case broken by language tags and/or ligatures
Alias: None
Product: Writer
Classification: Application
Component: code
Version: OOO330m1
Hardware: PC Windows, all
Importance: P2 Trivial
Target Milestone: ---
Assignee: writerneedsconfirm
QA Contact: issues@sw
Depends on:
Reported: 2010-07-31 06:10 UTC by jurf
Modified: 2010-08-03 02:32 UTC
CC List: 4 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---

Examples (input / expected / actual output) (21.26 KB, application/vnd.oasis.opendocument.text)
2010-07-31 06:12 UTC, jurf

Description jurf 2010-07-31 06:10:08 UTC
Casing options broken by language tags and/or ligatures

Issue 1601, marked Fixed
and with CWS tl74 included in OOo-dev300m85 (tested) and OOO330m2 (not tested,
but likely identical), implements three new and welcome options in Format |
Change case, namely:

	Sentence case
	Capitalize Every Word
	tOGGLE cASE

Whilst I've not tested tOGGLE cASE (it's not something I need), I have spent a
good while poking Sentence case and Capitalize Every Word with a stick. Both
functions are, unfortunately, very buggy. The implementation of Capitalize Every
Word is especially bad, with a high probability of data loss (disappearing text
with no guarantee that Undo works properly). So far, I've seen the bugs
triggered by either language mark-up or ligatures (the latter not necessarily in
text selections), which are actually the only conditions I've been testing for.
As such, it's likely there are other triggers, too.

The data loss is particularly troubling as the "undo" function, even if given
sufficient steps, does not necessarily restore the original text correctly. And
even that assumes that the user is half-expecting trouble.

Issue present in both Writer and Calc (not tested others), and in both cases is

I'm attaching an ODT file to this issue. It contains several examples you can
try out yourself, together with mock-ups of expected and actual results.



In brief, the main problems I've found so far are:

Sentence case
- The presence of language mark-up within selected text confuses the parser,
causing it to consider the marked-up section as a new sentence, thus
capitalizing two or more words in the middle of a sentence.

Capitalize Every Word
- Language mark-up causes similar miscalculations, but more exaggerated,
potentially causing data loss (see attached file)
- The presence of ligatures, either within selected text, or before it (but in
the same paragraph) causes similar problems.
- Applying Capitalize Every Word to multiple selections further exacerbates the
problem.



I'm not a programmer, but I think the primary cause of the bugs in either
function is a miscalculation of selection bounds, which leads to at times
extremely severe offset errors both as regards the selection area and the bounds
of the text itself. Among the causes would appear to be:
1. the parser gives language declarations a width (two characters for each
"tag", apparently, being one for the opening, another for the closure);
2. the parser miscounts the length of ligatures (Unicode U+FB00 to U+FB06) whether
or not they're selected, which causes both selections and actual words processed
to expand to the right - if there's no room at the end of the paragraph for this
expansion, text disappears;
3. multiple selections are incorrectly handled (it appears as though errors in
one selection block are carried over to the next, and so on). This may simply be
symptomatic of the first two potential causes, but it may also be compounded by
buffers not being cleared. Or something (TM).
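
To illustrate cause 2, here is a minimal Python sketch. It is not OpenOffice's code, and the function names are made up for the example. It only shows that upper-casing a ligature changes the length of the text, so any routine that assumes the converted text is as long as the original ends up overwriting characters it should have left alone:

```python
def capitalize_naive(paragraph: str, start: int, end: int) -> str:
    """Deliberately buggy: assumes upper-casing never changes the length."""
    converted = paragraph[start:end].upper()
    # Overwrites len(converted) characters, which is only correct when
    # len(converted) == end - start.
    return paragraph[:start] + converted + paragraph[start + len(converted):]


def capitalize_safe(paragraph: str, start: int, end: int) -> tuple[str, int]:
    """Replaces exactly the selected range and returns the new end offset."""
    converted = paragraph[start:end].upper()
    return paragraph[:start] + converted + paragraph[end:], start + len(converted)


text = "\ufb01sh and chips"           # begins with the single-character "fi" ligature
print(capitalize_naive(text, 0, 3))   # FISHand chips  (the space is lost)
print(capitalize_safe(text, 0, 3))    # ('FISH and chips', 4)
```

The safe variant also returns the new end offset, which is what a caller would need in order to keep later selections in the same paragraph aligned.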

The problem was exacerbated, I think, by the original test case
which is just plain text: no formatting, no language tags, no awkward characters
such as non-diphthong ligatures (ff, fi, fl, etc.)



The following is a simple example of the buggy behaviour of Sentence case, to
give you an idea of the type of problem. See the attached file for many more
examples (all different) of both Sentence case and Capitalize Every Word:

Input:		the rapide brown fox [with "rapide" marked as French]
Expected:	The rapide brown fox
Output:		The Rapide Brown fox

The underlying code (from content.xml) is this, where T3 is default format, and
T4 is French:

<text:p text:style-name="Standard">
	<text:span text:style-name="T3">The </text:span>
	<text:span text:style-name="T4">Rapide BroWn Fox-Like Creat</text:span>
	<text:span text:style-name="T3">ure</text:span>
</text:p>


Given the possibility of data loss, I reckon this should be a SHOWSTOPPER for
3.3 - but I'll leave it to one of the experts to decide and, if so, add it to
the meta issue.

Comment 1 jurf 2010-07-31 06:12:59 UTC
Created attachment 70901
Examples (input / expected / actual output)
Comment 2 jurf 2010-07-31 06:35:39 UTC
My apologies, I merged two examples into one in my post. It should have
contained these two separate examples:

1. Sentence case

Input:		the rapide brown fox [with 'rapide' marked as French]

Output:		The Rapide Brown fox

Underlying code:
<text:p text:style-name="P1">The <text:span
text:style-name="T4">Rapide</text:span> Brown fox</text:p>
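
As an illustration only (a hypothetical Python sketch, not OpenOffice's actual code), the Sentence case output above is exactly what one would get if the "start of sentence" state were reset at every formatting run instead of being carried across the whole paragraph. The function name, the (text, language) run model and the reset_per_run flag are all assumptions made for this example:

```python
def sentence_case_runs(runs, reset_per_run=False):
    """Sentence-case a paragraph given as a list of (text, language) runs.

    reset_per_run=True mimics the suspected bug: every run is treated
    as if it began a new sentence.
    """
    result = []
    at_sentence_start = True
    for text, lang in runs:
        if reset_per_run:
            at_sentence_start = True
        chars = []
        for ch in text:
            if ch.isalpha():
                chars.append(ch.upper() if at_sentence_start else ch.lower())
                at_sentence_start = False
            else:
                chars.append(ch)
                if ch in ".!?":
                    at_sentence_start = True
        result.append(("".join(chars), lang))
    return result


runs = [("the ", "en"), ("rapide", "fr"), (" brown fox", "en")]
print("".join(t for t, _ in sentence_case_runs(runs)))
# The rapide brown fox
print("".join(t for t, _ in sentence_case_runs(runs, reset_per_run=True)))
# The Rapide Brown fox
```

Carrying the state across runs leaves the language mark-up untouched and gives the expected result. Resetting it per run reproduces the reported output.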

2. Capitalize Every Word

Input:		the rapide brown fox-like creature [with 'rapide' marked as French]

Output:		The Rapide BroWn Fox-Like Creature [with everything from 'Rapide' to 'Creat'
inclusive marked French]

Underlying code:
<text:p text:style-name="Body">The <text:span text:style-name="T4">Rapide BroWn
Fox-Like Creat</text:span>ure</text:p>
Comment 3 jurf 2010-08-01 05:25:39 UTC
One more comment...

After posting this report yesterday, I started playing with the new user
dictionary interface in M85 (the default for new user dictionary files has
changed from binary to UTF-8). There are bugs there, too, which may possibly be
related to the casing errors. So, please don't treat the following as a separate
bug report (it's in the wrong place for that, I know), but instead as a clue to
the possible cause of the casing errors.

In short, when I add a word to a user dictionary that contains a double-byte
character (e.g. a letter combined with an unusual accent, such as a dot underneath),
or if the user dictionary already contains such words, things start getting
buggy: in some cases, an *incomplete* copy of the last word in the list gets
appended to the dictionary; in other cases, the word is not added to the
selected dictionary at all, but to another one.

Again, I'm not a programmer, but if I were to bet on it, I'd guess there's a
possibility that both sets of errors are caused by a bug in a text parsing
library used by both the casing and user dictionary routines.

Reason for saying this is that all the errors - casing and dictionary - appear
to involve miscalculating text bounds.

The parallel is particularly compelling when comparing Capitalize Every Word's
mangling of text with ligatures (which could be counted as one, two or more
characters), to the dictionary parser's mangling of user dictionaries that
contain non-precomposed characters with combining accents (which could also be
counted as one, two or more characters).
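
To make the "one, two or more characters" point concrete, here is a small Python sketch using only the standard unicodedata module. The counts helper is a name invented for this example, and nothing here is taken from OpenOffice itself. It shows how the same visible text yields different lengths depending on whether you count code points, UTF-16 code units or UTF-8 bytes, and on whether the text is composed or decomposed:

```python
import unicodedata

ligature = "\ufb01"            # LATIN SMALL LIGATURE FI, one code point
precomposed = "\u1e0d"         # LATIN SMALL LETTER D WITH DOT BELOW
decomposed = unicodedata.normalize("NFD", precomposed)   # "d" + U+0323


def counts(s):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-16-le")) // 2, len(s.encode("utf-8"))


print(counts(ligature))                                  # (1, 1, 3)
print(counts(unicodedata.normalize("NFKC", ligature)))   # (2, 2, 2)
print(counts(precomposed))                               # (1, 1, 3)
print(counts(decomposed))                                # (2, 2, 3)
```

Any code that mixes these units when computing text bounds could plausibly produce the kind of offset errors described in this report.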

Has something recently changed in a text parsing component?
Comment 4 Olaf Felka 2010-08-02 07:32:56 UTC
*** Issue 113568 has been marked as a duplicate of this issue. ***
Comment 5 Olaf Felka 2010-08-02 07:33:50 UTC
Comment 6 eric.savary 2010-08-02 14:57:08 UTC
As it stands, this issue is invalid because it deals with different problems in a
single report.
I have split it into issue 113584 and issue 113587.

Feel free to file a separate task for the dictionary problem after having checked
that there is no duplicate for this.
Comment 7 eric.savary 2010-08-02 14:57:19 UTC
Comment 8 jurf 2010-08-03 02:32:42 UTC
Thanks for splitting this.
FYI, the dictionary bug was solved in CWS tl81, integrated in OOO330m2.