Issue 44691 - RTF: two pages instead of one due to wrong frame positions
Summary: RTF: two pages instead of one due to wrong frame positions
Status: CONFIRMED
Alias: None
Product: Writer
Classification: Application
Component: open-import
Version: 680m84
Hardware: All All
Importance: P3 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-03-10 10:23 UTC by richlv
Modified: 2014-11-02 07:17 UTC
CC List: 2 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: 4.1.1
Developer Difficulty: ---


Attachments
Demonstration of AOO 4.1.1 Extra Page from RTF (4.74 KB, text/rtf)
2014-11-01 19:48 UTC, orcmid
Screen Shot of the Apache 4.1.1 Anonymized Document View (53.88 KB, image/png)
2014-11-01 19:54 UTC, orcmid
The Character Encoding of the Original RTF - Screenshot (30.08 KB, image/png)
2014-11-01 22:50 UTC, orcmid
AOO 4.1.1 Proper Treatment of Original Character Set (30.61 KB, image/png)
2014-11-01 23:00 UTC, orcmid
AOO 4.1.1 Proper Display of Corrected RTF (31.52 KB, image/png)
2014-11-01 23:09 UTC, orcmid
Corrected RTF for proper character-set treatment (1.14 KB, text/rtf)
2014-11-01 23:14 UTC, orcmid
Image of the Corrected RTF file (114.75 KB, image/png)
2014-11-01 23:26 UTC, orcmid
How about Latvian? (32.48 KB, image/png)
2014-11-02 07:14 UTC, orcmid

Description richlv 2005-03-10 10:23:36 UTC
i have a msword document that has a couple of problems. first, it is displayed 
with wrong diacritic symbols (which probably is the same as issue 27371).

second (and a reason for this issue) is pagination. in msword the document 
consists of one page only. when open in oo.org, it has an empty page inserted.

i can send the document to interested qa/dev contact.
Comment 1 michael.ruess 2005-03-10 11:00:38 UTC
Please send the document to mru@openoffice.org. Thanks a lot!
Comment 2 michael.ruess 2005-03-11 13:43:54 UTC
MRU->FLR: I will give you the document via mail, it is confidential. 
The frames are all anchored on the wrong page, because an obsolete page break has
been set in the second paragraph.
Comment 3 michael.ruess 2005-03-11 13:49:09 UTC
Reassigned to FLR.
Comment 4 Mathias_Bauer 2006-01-20 16:53:26 UTC
We will not finish this before the 2.0.2 code freeze -> retargeting to 3.0
Comment 5 Mathias_Bauer 2006-07-04 14:11:31 UTC
Due to missing resources, retargeted to "OOo Later"
Comment 6 Mathias_Bauer 2006-08-30 15:16:32 UTC
reassigning to hbrinkm
Comment 7 orcmid 2014-10-30 19:36:42 UTC
I am changing this report to UNCONFIRMED.

There are no attachments and insufficient details to even create a test case for this report.

If the original problem was about a DOC format problem opening in OpenOffice.org, there is no basis to tie this to RTF problems or to Mac problems.

I recommend that this bug be closed for insufficient detail to be dealt with at this point, unless richlv is around and can verify that the problem remains using a current Apache OpenOffice distribution.
Comment 8 orcmid 2014-10-30 19:38:07 UTC
(In reply to orcmid from comment #7)
> I am changing this report to UNCONFIRMED.
> 
> There are no attachments and insufficient details to even create a test case
> for this report.
> 
> I recommend that this bug be closed for insufficient detail to dealt with at
> this point, unless richlv is around and can verify that the problem remains
> using a current Apache OpenOffice distribution.

I posted #7 before I was ready.  Above is all I have to say.
Comment 9 richlv 2014-10-30 19:53:04 UTC
a testcase document was provided 9 years ago - by now i don't recall what document that was, and i'm not sure i'm in possession of it anymore either. unless mru still has the document, there probably is no need to keep this issue open.

...ok, this is scary. i actually found the testcase document in my sent mails.

will try to test it now :)
Comment 10 richlv 2014-10-31 00:32:55 UTC
the problem is still reproducible with:
AOO411m6(Build:9775)  -  Rev. 1617669
2014-08-13 09:23 - Linux x86_64

creating a more limited doc is not that easy as i don't have access to microsoft office.
Comment 11 orcmid 2014-10-31 01:09:59 UTC
(In reply to richlv from comment #10)
> the problem still reproducible with :
> AOO411m6(Build:9775)  -  Rev. 1617669
> 2014-08-13 09:23 - Linux x86_64
> 
> creating a more limited doc is not that easy as i don't have access to
> microsoft office.

Without having a document to examine, I can't tell much.  I have Microsoft Office as well as Apache OpenOffice.  

You can send the original Microsoft version to me and I will (1) confirm the defect and (2) figure out how to make a demonstration that carries no private information.

<mailto:orcmid@apache.org>
Comment 12 orcmid 2014-10-31 03:58:30 UTC
(In reply to orcmid from comment #11)
> (In reply to richlv from comment #10)
> > the problem still reproducible with :
> > AOO411m6(Build:9775)  -  Rev. 1617669
> > 2014-08-13 09:23 - Linux x86_64
[ ... ]
> You can send the original Microsoft version to me and I will (1) confirm the
> defect and (2) figure out how to make a demonstration that carries no
> private information.

I have received the .DOC file which is the basis for this defect report.  It is an RTF that appears to be machine-generated.  All text of the 1-page document is in fields that are placed at fixed positions on the page.

When Apache OpenOffice opens the document, it is recognized as RTF and a blank page is produced before the single-page document.  This blank page apparently arises because there is an initial empty paragraph, which causes the fields to go to a new page to avoid collision.  (LibreOffice 4.3 has a similar problem, and a page break shows up too.)

Deleting the first paragraph of the blank page makes the document a single page with correct formatting.  (It does not save successfully as an RTF, but it saves fine as an ODT.)

It will take a little more effort to disguise the private information in the document so it can be attached here as a demonstration of an RTF that triggers this problem. (Oddly, Microsoft Word does not do this, although there are other deviations that arise because of the fixed formatting of the document, with all fields placed at specific coordinates on the page.)
Comment 13 orcmid 2014-10-31 16:41:14 UTC
(In reply to orcmid from comment #12)
 
> I have received the .DOC file which is the basis for this defect report.  It
> is an RTF that appears to be machine-generated.  All text of the 1-page
> document is in fields that are placed to fixed positions on the page.
[ ... ]
> It will take a little more effort to disguise the private information in the
> document so it can be attached here as a demonstration of an RTF that
> triggers this problem. (Oddly, Microsoft Word does not do this although
> there are other deviations that arise because of the fixed formatting of the
> document as all fields placed to specific coordinates on the page.)

I should add that there are noticeable differences between how "diacritical marks" appear when the subject RTF is viewed in Microsoft Office Word 2013 on an en-US configured system and with Apache OpenOffice 4.1.1 on the same system. I have no idea which is correct and they might both be incorrect as well as different.

It seems to me that richlv is correct in not piggy-backing that problem onto this issue.

BACKGROUND

The most workable way to overcome the layout and character-glyph problem is to produce documents such as the one tied to this issue using PDF/A on a system that shows the correct text and layout.  That's especially the case for brittle documents, probably computer-derived, that are intended only for viewing/printing and not collaborative editing.  The kind of fidelity expected in interchange and especially after format conversion is quite far from the target reach of office-productivity and personal-productivity software applications such as OpenOffice (and Microsoft Office).

Part of the problem appears to be related to font mappings and how substitutions are made.  This is always a problem between older 8-bit document formats and Unicode-based ones.  In the case of RTF, the subject document is specified to be in \rtf0\ansi format, but the font it calls for is "\fswiss Arial Baltic;".  How this gets mapped on computers that do not have a font from the 8-bit world with exactly that name, nor any correctly-mapped substitution to an available sans-serif font, is always and forever a crapshoot.

There is an unintended consequence for the use of fields, however. Automatic font substitutions will lead to different metrics and usually-minor differences in appearances.  They can be ignored in most cases, but not when someone has used field definitions to attempt some sort of pixel-perfect layout fidelity.  Those two situations collide, making this application of RTF a brittle "works-for-me" disconnect.  The document will appear fine to its creator, but successful delivery of the same appearance on another platform is at best coincidence.  And unless the parties think to exchange screen captures, neither will understand what the other is observing and what the differences are.

That's a more-global problem.  My point is that some tweaking may be possible, but the situation is simply unstable and I don't think, from a triage perspective, there is not much prospect for a solution, certainly not in the near-term. (And this bug report is over 10 years old already.)

The bottom line is that our tools do not provide for control at this level of fine detail, nor do we even possess good forensic tools to aid in recognizing the discrepancies, let alone repair them.  It seems to me that we are very far from the needs and expectations of casual users, and these details are beyond the reach of most power users as well.
Comment 14 orcmid 2014-10-31 16:55:01 UTC
(In reply to orcmid from comment #13)

> I should add that there are noticeable differences between how "diacritical
> marks" appear when the subject RTF is viewed in Microsoft Office Word 2013
> on an en-US configured system and with Apache OpenOffice 4.1.1 on the same
> system. I have no idea which is correct and they might both be incorrect as
> well as different.

[I managed to say "not" where it contradicted what I meant. Here is what I meant to say.]

My point is that some tweaking may be possible, but the situation is simply unstable and I don't think, from a triage perspective, there is much prospect for a solution, certainly not in the near-term. (And this bug report is over 10 years old already.)
Comment 15 orcmid 2014-11-01 19:26:27 UTC
WORKAROUND #1:

If you have an RTF (possibly a .DOC that is actually an RTF) and it opens with a blank page shown in front of the first page of the document, try this.

In Apache OpenOffice 4.1.1, click menu item "View".  Click "Web Layout".  This will switch to a web view which, for single-page documents, won't have the field conflict.

WORKAROUND #2:
Notice, in the web view, that the cursor is in the upper left of the page, and not in the form.  Press the Delete key once.  Then go to the "View" menu and click "Print Layout."  The blank page should be gone.

DON'T SAVE THE DOCUMENT THOUGH. The export as RTF is not reliable enough.  The opened document can be printed though.  

DEVELOPERS AND QA:
 For these Slovak language forms, the \rtf1\ansi form of the RTF file is in Windows-1250 encoding.  Be careful about that.
Comment 16 orcmid 2014-11-01 19:48:45 UTC
Created attachment 84145 [details]
Demonstration of AOO 4.1.1 Extra Page from RTF

This is an anonymized RTF that demonstrates the stray blank first page from certain RTF files.  Viewing this file in Apache OpenOffice 4.1.1 should show an extra leading page and a number of fields arranged on the second page.  These are the same initial fields, with different content, of the document that this issue is based on.
Comment 17 orcmid 2014-11-01 19:54:30 UTC
Created attachment 84146 [details]
Screen Shot of the Apache 4.1.1 Anonymized Document View

This screen shot demonstrates how the form with its arrangement of fields appears following a stray blank page in Apache OpenOffice 4.1.1.

Note that there is no use of non-ASCII characters in this version of the document, so any issues about transposition of Windows-1250 character codes to correctly-presented Unicode are not reflected here.
Comment 18 orcmid 2014-11-01 22:50:02 UTC
Created attachment 84147 [details]
The Character Encoding of the Original RTF - Screenshot

THE ORIGINAL RTF HANDLES CHARACTER ENCODING AND DIACRITICAL MARKS INCORRECTLY

Although the blank page before the correct first page of the document is an open problem, the problems of character-set encoding and diacritical marks are problems with the RTF, not how it is handled by any of Apache OpenOffice 4.1.1, LibreOffice 4.3, or Microsoft Office Word 2013 in non-Slovak editions.

The problem with original RTF (and Microsoft Office before Office 97) is that the only means for recording text is in 8-bit character encodings.  When the \rtf1\ansi format is used, the problem is to know *which* 8-bit character-encoding is involved.  Normally, RTF would use 7-bit ASCII coding.  Generally, depending on the default code page of earlier operating system versions and Microsoft Office versions, a particular character-set such as Windows 1252 (US) and language such as 1033 (en-US) would be used.

There are provisions in RTF to be specific about what character-set encoding is being employed.  And most RTF processors will accept 8-bit character codes even though original RTF was only good for 7-bit character codes.

None of this is done in the original RTF document that is associated with the current issue.  Note that this is different from other issues about character sets that have to do with whether or not the characters are rendered pleasantly, which is a matter of font choice.  This is about the computer byte codes used to specify which characters are meant.
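
A minimal sketch, in Python, of checking whether an RTF header says anything about its encoding. It only looks for the \ansicpg and \fcharset control words. The two header strings are hypothetical stand-ins written for illustration; they are not the confidential file behind this issue.

```python
import re

def declared_encoding(rtf_text):
    """Report the code page and font charsets an RTF header declares, if any."""
    cpg = re.search(r"\\ansicpg(\d+)", rtf_text)
    charsets = re.findall(r"\\fcharset(\d+)", rtf_text)
    return {
        "ansicpg": int(cpg.group(1)) if cpg else None,
        "fcharsets": sorted({int(c) for c in charsets}),
    }

# Hypothetical headers (assumptions for illustration only).
undeclared = r"{\rtf1\ansi{\fonttbl{\f6\fswiss Arial Baltic;}}\lang1033 text}"
declared = (
    r"{\rtf1\ansi\ansicpg1250\deff0"
    r"{\fonttbl{\f6\fswiss\fcharset238 Arial Baltic;}}\lang1051 text}"
)

print(declared_encoding(undeclared))
print(declared_encoding(declared))
```

When both values come back empty, a consumer has nothing to go on except its own local default code page.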

The screen shot with this comment shows a fragment of the RTF original document which I have reduced to a single paragraph having a selection of the special characters that are found in the original (but not the anonymized version that I produced to demonstrate the blank-page problem).

Notice, in the bottom margin, that I had to instruct my software to present that RTF format using the Windows 1250 code page, the 8-bit character set that is designed for Eastern European languages.  The characters, "ÇÍ âç îđîî ěěî ç îţ ěěî ďâçâ" can be examined to confirm that these are some of the expected special Slovak language characters in the original document.
Comment 19 orcmid 2014-11-01 23:00:13 UTC
Created attachment 84148 [details]
AOO 4.1.1 Proper Treatment of Original Character Set

Because nothing about the original RTF document identifies it as using Slovak language and the Windows 1250 (East European) character set, when the document, reduced to its special characters, is opened in AOO 4.1.1 in an en-US configuration on Windows 8.1, AOO has little choice but to employ the local default, Windows 1252.  As a result, the document of only special letters that I produced is shown as

"ÇÍ âç îðîî ììî ç îþ ììî ïâçâ" which is for Windows 1252, not the

"ÇÍ âç îđîî ěěî ç îţ ěěî ďâçâ" which is in the RTF interpreted as Windows 1250.

Note, also, that Apache OpenOffice is indicating, in the status line, that the document is in English (USA).  In fact, the original RTF specifies that, wherever \lang1033 is seen in that RTF.
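
To illustrate the effect described in this comment, the Python sketch below decodes one run of bytes under both code pages. The byte values are an assumption, reconstructed from the characters quoted above rather than read from the original file.

```python
# Assumed 8-bit codes of the sample text, reconstructed from the quoted characters.
raw = bytes.fromhex(
    "c7 cd 20 e2 e7 20 ee f0 ee ee 20 ec ec ee 20 "
    "e7 20 ee fe 20 ec ec ee 20 ef e2 e7 e2"
)

as_1250 = raw.decode("cp1250")  # East European reading
as_1252 = raw.decode("cp1252")  # Western (en-US default) reading

print(as_1250)
print(as_1252)

# Collect the character pairs where the two readings disagree.
differences = sorted({(a, b) for a, b in zip(as_1250, as_1252) if a != b})
print(differences)
```

Only the bytes 0xEC, 0xEF, 0xF0 and 0xFE read differently here; the rest map to the same letters in both code pages, which is why the two strings look so similar.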
Comment 20 orcmid 2014-11-01 23:09:10 UTC
Created attachment 84149 [details]
AOO 4.1.1 Proper Display of Corrected RTF

I modified the little RTF document with nothing but the few Slovak alphabet characters to make explicit that the document is in the Slovak language and that it is using the Windows 1250 character set.

There is now the correct presentation of the character codes in the file.

(Note that the font is being shown as Arial Baltic in both views.  I suspect that there was a substitution made, but the name specified in the RTF was used.)
Comment 21 orcmid 2014-11-01 23:14:08 UTC
Created attachment 84150 [details]
Corrected RTF for proper character-set treatment

Here is the corrected RTF of the sample file that was reduced to have only a few special characters.  This should display as shown in the AOO 4.1.1 Proper Display of Corrected RTF.  This should work on any configuration of AOO 4.1.1 or LibreOffice 4.3, with any internationalization, so long as good Unicode fonts are available.
Comment 22 orcmid 2014-11-01 23:26:06 UTC
Created attachment 84151 [details]
Image of the Corrected RTF file

This is the complete RTF file, shown as a text file, that provides the correction necessary to have the language identified as Slovak (\deflang1051 and everywhere that \lang1051 appears).  It also adds enough information to the definition of font 6 to let any consumer know that charset238 (corresponding to code page windows 1250) is to be understood as expected.  So whatever font AOO ends up using (since Arial Baltic is likely not available), the correct understanding of the character-encoding will be employed.

(I often speak of Windows 1250 as providing a character set.  It is really what is called a code page, and its character set is #238.)

I could have made the change for all of \f1-\f15, but only \f6 is needed in this particular file.

This is the solution to the presentation of Slovak language and the 8-bit Windows-1250 code page in RTF for consumption on all configurations of OpenOffice installations.  The RTF must be made specific so that this will work.
If there is any control over the original documents, they should be upgraded to work with Unicode and avoid the code-page/character-set problems altogether.

(I haven't found the source of the extra blank page in front of the document. I think I've earned my pay for today just the same and am going to have my weekend now.)
Comment 23 richlv 2014-11-02 03:10:44 UTC
for the record, original document is in latvian ;)
Comment 24 orcmid 2014-11-02 06:08:58 UTC
(In reply to richlv from comment #23)
> for the record, original document is in latvian ;)

Ahah!

I will make the changes in my copies, although I don't think that will change the code page.  If anything different comes out of it, I'll post corrections.

I should have paid more attention to the mention of Riga [;<).
Comment 25 orcmid 2014-11-02 07:14:04 UTC
Created attachment 84153 [details]
How about Latvian?

Rich's comment explains why I was having trouble getting Microsoft Word 2013 to cooperate.  It was showing the Baltic code page, not the East European one, regardless of what I did to the RTF.  I suspect Word is designed to recognize the "Arial Baltic" font name.

This image reflects changes to the RTF to explicitly identify Latvian and use the Baltic code page.  You can see the difference in language in the status line.

The changes to the RTF that I posted consist of the following modifications:

    \ansicpg1250 -> \ansicpg1257
    \deff0 -> \deff6
    \deflang1051 -> \deflang1062
    \fcharset238 -> \fcharset186
    \cpg1250 -> \cpg1257
    \lang1051 -> \lang1062

I also changed the {\f6\fs20 ...} having the text to {\f6\fs18 ...} to keep the text in one line within the field by dropping from 10pt to 9pt.
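
A rough Python sketch of applying the control-word substitutions listed above to RTF source text. The sample header is hypothetical and the helper name `retarget` is made up for illustration; the font-size change is left out.

```python
import re

# Control-word substitutions from the list above (1250/Slovak -> 1257/Latvian).
SUBSTITUTIONS = {
    r"\ansicpg1250": r"\ansicpg1257",
    r"\deff0": r"\deff6",
    r"\deflang1051": r"\deflang1062",
    r"\fcharset238": r"\fcharset186",
    r"\cpg1250": r"\cpg1257",
    r"\lang1051": r"\lang1062",
}

def retarget(rtf_text):
    """Replace each listed control word wherever it appears as a whole word."""
    for old, new in SUBSTITUTIONS.items():
        # (?!\d) keeps e.g. \deff0 from matching the start of \deff01.
        pattern = re.escape(old) + r"(?!\d)"
        rtf_text = re.sub(pattern, lambda m, new=new: new, rtf_text)
    return rtf_text

# Hypothetical header, not the file attached to this issue.
sample = (
    r"{\rtf1\ansi\ansicpg1250\deff0\deflang1051"
    r"{\fonttbl{\f6\fswiss\fcharset238\cpg1250 Arial Baltic;}}"
    r"\lang1051\f6\fs20 text}"
)
print(retarget(sample))
```

Each pattern starts with the backslash, so \lang1051 does not match inside \deflang1051 and \cpg1250 does not match inside \ansicpg1250.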

This screen shot shows what I hope are the correct Latvian characters.  This is produced by Apache OpenOffice 4.1.1, LibreOffice 4.3, and Microsoft Word 2013 when opening the RTF as now modified for Latvian.
Comment 26 richlv 2014-11-02 07:17:10 UTC
those are latvian diacritic characters (it doesn't make any sensible word, though :) )
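
For reference, a small Python sketch of how one run of 8-bit codes reads under the three code pages discussed in this issue. The \fcharset-to-code-page pairs (0 and 1252, 238 and 1250, 186 and 1257) are the usual RTF ones; the byte values are an assumption, reconstructed from the characters quoted in comment 18 rather than read from the original document.

```python
# RTF \fcharsetN values paired with the matching Python codec names.
FCHARSET_TO_CODEC = {0: "cp1252", 238: "cp1250", 186: "cp1257"}

# Assumed 8-bit codes of the sample text (reconstructed, not from the file).
raw = bytes.fromhex(
    "c7 cd 20 e2 e7 20 ee f0 ee ee 20 ec ec ee 20 "
    "e7 20 ee fe 20 ec ec ee 20 ef e2 e7 e2"
)

readings = {cs: raw.decode(codec) for cs, codec in FCHARSET_TO_CODEC.items()}
for charset, text in readings.items():
    print(charset, FCHARSET_TO_CODEC[charset], text)
```

Under the Baltic code page the bytes come out as Latvian letters with macrons, cedillas and carons, which matches the correction for Latvian.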