Issue 125726

Summary: Semantical text treatment for comparison (when searching for changes between texts)
Product: Writer
Reporter: Anton <zao>
Component: editing
Assignee: AOO issues mailing list <issues>
Status: UNCONFIRMED ---
QA Contact:
Severity: Major
Priority: P5 (lowest)
CC: ramona.tripa, toki.kantoor
Version: 4.1.1
Target Milestone: ---
Hardware: All
OS: All
Issue Type: DEFECT
Latest Confirmation in: 4.1.1
Developer Difficulty: ---
Attachments:
- Two text files with different layout making comparison look ugly (flags: none)
- Basic comparison of docs with identical content but different formatting (flags: none)
- Extra space determines subsequent content to be treated as new insertion/deletion (flags: none)

Description Anton 2014-10-08 12:13:42 UTC
Created attachment 84047
Two text files with different layout making comparison look ugly

The OOWriter v4.1.1 built-in "Edit->Compare Document..." produces rather unusable results when comparing texts with different layouts.

Semantically identical sentences are considered different just because there is a <CR><LF> somewhere before the semantic terminator (e.g. a full stop, question mark, or exclamation mark) in one of the samples being compared.

If the comparator code disregarded such non-semantic entities and paid attention only to semantic terminators, its output would look much nicer.
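
To illustrate the idea, here is a minimal Python sketch of a sentence-level comparison. It is not OpenOffice code: the helper names split_sentences and changed_sentences are hypothetical, and it assumes that ".", "!" and "?" are the only semantic terminators and that every line break is purely a layout artifact.

```python
import re
from difflib import SequenceMatcher

# Hypothetical helpers for illustration only; not part of OpenOffice.
def split_sentences(text):
    """Collapse all whitespace runs (including CR/LF), then split after . ! or ?"""
    flat = " ".join(text.split())
    # Split at the whitespace that follows a sentence terminator.
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]

def changed_sentences(old, new):
    """Return (tag, old_sentences, new_sentences) for every block that differs."""
    a, b = split_sentences(old), split_sentences(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

original = "The cat sat on the mat. It was warm.\r\nThe dog barked."
reflowed = "The cat sat\r\non the mat. It was\r\nwarm. The dog barked."
edited = "The cat sat\r\non the mat. It was cold. The dog barked."

print(changed_sentences(original, reflowed))  # same sentences, different layout
print(changed_sentences(original, edited))    # one sentence really changed
```

With these inputs the reflowed text yields no differences, while the edited text yields a single replaced sentence.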

Here are my two real-life examples that produce an unusable comparison result.
Comment 1 Ramona 2015-01-24 23:35:06 UTC
I have also encountered this issue with both the recent OpenOffice 4.1.1 and the older 4.1.0, on Windows XP as well as on Mac OS X Yosemite (the issue is not configuration-dependent).

The issue is: in its current implementation, the “Compare Document” feature gives quite confusing results when the edited version has the same content as the original but a different layout, i.e. a different flow of text, as exemplified by Anton, the original reporter.

Steps to reproduce
1. Take a document and an edited version of that document. Ensure the two docs have the same content but a different flow of text, with the edited version including changes such as:
- line breaks
- paragraphs split into multiple subsequent paragraphs or sentences
- bullets inserted
- additional commas inserted
- extra space between two words in a sentence.

2. Open the edited document and then go to “Edit” -> “Compare Document...”.

3. Using the file selection dialog which appears, select the original document and confirm the dialog.

Results
OpenOffice combines both documents into the reviewer's doc.
The fragments of text affected by the new formatting are marked as new insertions / deletions.  

Expected
The OpenOffice “Help” on comparing documents states: “All text passages that occur in the reviewer's document but not in the original are identified as having been inserted, and all text passages that got deleted by the reviewer are identified as deletions”.

As the description suggests, the user expects new content to be marked as an insertion and content that no longer exists to be marked as a deletion.

The "Compare Document" feature, however, does not distinguish between semantic changes and formatting changes of the type mentioned above. Both are treated alike.

Thus, passages or fragments of sentences marked as (new) insertions are in fact old passages with new formatting, or fragments now preceded by a (previously forgotten) comma. Similarly, passages marked as deletions are still included in the edited version, only in a slightly different form.

Even a single extra space inserted between two words in the edited version of the doc causes the subsequent content (up to the next punctuation mark) to be marked as an insertion / deletion.
Please refer to the screen captures attached.
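
For comparison, a word-level diff that ignores the amount of whitespace between words does not react to an extra space at all. The sketch below is only an illustration in Python, not how Writer compares documents; the function name word_changes is hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical helper for illustration only; not part of OpenOffice.
def word_changes(old, new):
    """Compare two texts word by word, ignoring how much whitespace separates the words."""
    # str.split() with no argument treats any run of whitespace as one separator.
    a, b = old.split(), new.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

print(word_changes("one two three.", "one  two three."))  # extra space only
print(word_changes("one two three.", "one, two three."))  # comma added
print(word_changes("one two three.", "one two four."))    # word changed
```

In this sketch the extra space produces no change, the added comma is reported as a change to a single word ("one" becomes "one,"), and the rest of the sentence is left unmarked.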

Treating formal changes like semantic changes increases confusion and diminishes usability.

I have found two older reports in the database that point to essentially the same issue. They include unhappy comments from people for whom this feature is essential to their (proofreading) work, as well as suggestions arising from comparisons with similar products:
https://issues.apache.org/ooo/show_bug.cgi?id=49217
https://issues.apache.org/ooo/show_bug.cgi?id=54195
Comment 2 Ramona 2015-01-24 23:37:07 UTC
Created attachment 84466
Basic comparison of docs with identical content but different formatting
Comment 3 Ramona 2015-01-24 23:38:13 UTC
Created attachment 84467
Extra space determines subsequent content to be treated as new insertion/deletion
Comment 4 jonathon 2015-01-27 20:15:54 UTC
Apache OpenOffice does not include a semantic markup language. Consequently, it cannot differentiate between changes in raw content, semantic markup, and presentation markup.

<CR><LF> can be (ab)used as semantic markup, presentation markup, or raw content. The safest assumption is therefore that it is raw content, and on that basis "Edit>Compare Document..." works as expected.

The fix would be to add a semantic markup language, something that would vastly increase the complexity of creating content.
Comment 5 Anton 2015-02-02 15:20:21 UTC
As the topic starter, I'm disappointed by jonathon's reply (from 2015-01-27 20:15:54 UTC). The behaviour is far from "as expected", unfortunately.

As Ramona has clearly shown above, this is a known and long-standing bug that makes the red-lining feature in OOWriter useless.
I have to use MSOffice2010 to do this (text comparison) job.

For now, there is no need for a "semantic mark-up language" or anything as complex.
An easy fix would be to simply ignore the <CR><LF>, <EOL>, etc. while comparing the content.
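
As a rough sketch of what "ignore the line breaks" could mean, the Python snippet below replaces every CR LF, lone CR, or lone LF with a single space before comparing. The names ignore_line_breaks and same_content are hypothetical. Note that this treats every line break as insignificant, which is exactly the assumption Comment 4 considers unsafe, so a real fix would probably need to be optional.

```python
import re

# Hypothetical helpers for illustration only; not part of OpenOffice.
def ignore_line_breaks(text):
    """Replace each CR LF, lone CR, or lone LF (plus adjacent spaces/tabs) with one space."""
    return re.sub(r"[ \t]*(?:\r\n|\r|\n)[ \t]*", " ", text).strip()

def same_content(old, new):
    """True if the two texts differ only in where their line breaks fall."""
    return ignore_line_breaks(old) == ignore_line_breaks(new)

print(same_content("A long sentence\r\nsplit in two.", "A long sentence split in two."))
print(same_content("A long sentence.", "A short sentence."))
```

The first call compares two texts that differ only by a <CR><LF>; the second compares texts whose wording differs.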

Once this basic and USEFUL implementation is in place, further enhancements (such as detecting changed italics and/or bulleting) could be introduced later.

We must first have a useful basic feature, which is clearly missing now.
All the more so because the DeltaXMLODTCompare extension is no longer supported or developed, to the extent that it does not handle ODF v1.2.

I hope this will be addressed as soon as possible.

Thank you in advance.

Regards,
Tony