Issue 128549 - Wrong accentuated characters from old .rtf files
Summary: Wrong accentuated characters from old .rtf files
Status: CONFIRMED
Alias: None
Product: Writer
Classification: Application
Component: open-import
Version: 4.1.13
Hardware: All All
Importance: P5 (lowest) Normal
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: regression
Duplicates: 128550
Depends on:
Blocks:
 
Reported: 2023-01-02 21:56 UTC by Alain Filhol (linus38120)
Modified: 2023-01-06 04:27 UTC (History)
CC List: 2 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: 4.2.0-dev
Developer Difficulty: ---


Attachments
A RTF file dating back to 1990 containing French accentuated characters and pict-images. (82.66 KB, text/rtf)
2023-01-02 21:56 UTC, Alain Filhol (linus38120)
screenshot from OOo 2.1, AOO 4.1.13 and LO 7.3.6.2 (156.49 KB, image/png)
2023-01-04 07:31 UTC, Czesław Wolański

Description Alain Filhol (linus38120) 2023-01-02 21:56:54 UTC
Created attachment 87159
A RTF file dating back to 1990 containing French accentuated characters and pict-images.

Let me first stress that OpenOffice for macOS is the last software I know of that can still open early-1990s RTF files containing PICT (or PCT) bitmap/vector images. This is an invaluable tool when, like me, you are digging through old corporate archives for a historical work. Long live OpenOffice!

Both LibreOffice and Microsoft Word import only the text from those old .rtf files but lack a converter for embedded PICT images.
However, OpenOffice has a problem with accented characters in these old files. Each one is displayed as a wrong character, as shown below:
OpenOffice: Ž   ‘ ‰ Š ” Ÿ ž Ÿ  Ï 
Hexa      : C5BD EF8690 EF868F E28098 E280B0 C5A0 E2809D C5B8 C5BE C5B8 EF868D C38F
Characters: é ê è ë â ä î ü û ü ç œ
Hexa      : C3A9 C3AA C3A8 C3AB C3A2 C3A4 C3AE C3BC C3BB C3BC C3A7 C593

This is not a critical bug, but it should be relatively easy to fix, since both LibreOffice and Word display the correct accented characters.

The attached file is an .rtf file dating back to 1990 that contains many accented characters and some PICT images.
It was prepared on classic Mac OS with the vintage WriteNow 4.0 <https://en.wikipedia.org/wiki/WriteNow>, which wrongly adds a space after each accented character.
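
The "Hexa" rows in the table above are UTF-8 byte sequences. As a rough illustration only, the sketch below decodes such a sequence back to a Unicode code point, which makes it easier to see which character OpenOffice actually produced. The helper names hexToBytes and decodeUtf8 are hypothetical and not taken from any OpenOffice source; the decoder handles only the 1- to 3-byte forms that occur in the table.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: turn a hex string such as "C5BD" into raw bytes.
std::vector<std::uint8_t> hexToBytes(const std::string& hex) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
        out.push_back(static_cast<std::uint8_t>(
            std::stoi(hex.substr(i, 2), nullptr, 16)));
    return out;
}

// Hypothetical helper: decode ONE UTF-8 sequence of 1 to 3 bytes.
// Returns 0 for anything else (longer forms are not needed for the table).
char32_t decodeUtf8(const std::vector<std::uint8_t>& b) {
    if (b.size() == 1 && b[0] < 0x80)
        return b[0];
    if (b.size() == 2 && (b[0] & 0xE0) == 0xC0 && (b[1] & 0xC0) == 0x80)
        return static_cast<char32_t>(((b[0] & 0x1F) << 6) | (b[1] & 0x3F));
    if (b.size() == 3 && (b[0] & 0xF0) == 0xE0 &&
        (b[1] & 0xC0) == 0x80 && (b[2] & 0xC0) == 0x80)
        return static_cast<char32_t>(((b[0] & 0x0F) << 12) |
                                     ((b[1] & 0x3F) << 6) | (b[2] & 0x3F));
    return 0;
}
```

With these helpers, decodeUtf8(hexToBytes("C5BD")) gives U+017D ("Ž") where C3A9 gives the expected U+00E9 ("é"). The three-byte sequence EF8690 decodes to U+F190, which lies in the Unicode Private Use Area, so it has no visible glyph in the "OpenOffice" row.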
Comment 1 Peter 2023-01-03 19:48:33 UTC
*** Issue 128550 has been marked as a duplicate of this issue. ***
Comment 2 Czesław Wolański 2023-01-04 07:30:19 UTC
The author of this report might be, I presume,
also the author of the bug report filed with the LibreOffice Bugzilla.
tdf#152697 - Problems with old .rtf files
https://bugs.documentfoundation.org/show_bug.cgi?id=152697

In the 4th comment thereto, Regina Henschel advised using OpenOffice.org 2.1 portable to open and convert .pct images.

I thought it might be worth checking whether OpenOffice.org 2.1 or older releases
experience the problem pointed out by the reporter, i.e.
"accented characters from those old files".

Result: 1.1.5 and 2.1 seem OK.

See the attached image with screenshots.
Comment 3 Czesław Wolański 2023-01-04 07:31:31 UTC
Created attachment 87160
screenshot from OOo 2.1, AOO 4.1.13 and LO 7.3.6.2
Comment 4 damjan 2023-01-05 01:26:30 UTC
Confirming based on screenshot. The character handling is a regression from OpenOffice 2.1.

The source code for Writer's RTF parsing is in:
main/sw/source/filter/rtf
which subclasses the lower-level RTF parser in:
main/svtools/source/svrtf

In the attached RTF, the subtitle text "Mac et vidŽ o !" is encoded as:

00000240  70 36 34 20 5c 62 20 4d  61 63 20 65 74 20 76 69  |p64 \b Mac et vi|
00000250  64 5c 27 38 65 20 6f 20  21 20 5c 66 73 32 38 20  |d\'8e o ! \fs28 |

so the "\'8e " (5c 27 38 65 20) is coming through as "Ž " instead of "é ".

That trailing space is shown in OpenOffice, LibreOffice and Calibre. So let's ignore it for now.

We need to find where the 4 characters "\'8e" (5c 27 38 65) are parsed and why they are coming through as "Ž" (U+017D) instead of "é" (U+00E9).
Comment 5 damjan 2023-01-05 16:37:36 UTC
Our lower-level RTF parser is in main/svtools/source/svrtf/parrtf.cxx. SvRTFParser::_GetNextToken() calls SvRTFParser::ScanText(), which parses the "\'8e" as a single byte written in hexadecimal. Other permissively licensed open-source projects, such as rtf.js, do the same (https://github.com/tbluemel/rtf.js/blob/master/src/rtfjs/parser/Parser.ts#L422).

And the RTF 1.0 spec from https://latex2rtf.sourceforge.net/RTF-Spec-1.0.txt confirms it:

---snip---
  \'hh            A hexadecimal value, based on the specified
                  character set (may be used to identify 8-bit
                  values).
---snip---
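
As a minimal sketch of what the spec excerpt describes (this is not the code in SvRTFParser::ScanText(), and the function names hexDigit and parseHexEscape are made up for illustration), a \'hh escape can be reduced to one 8-bit value like this:

```cpp
#include <cstddef>
#include <string>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
int hexDigit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// If text[pos..] starts with an RTF \'hh escape, return the byte value
// (0..255) it denotes; otherwise return -1.
// The result is still just a byte: which character it stands for depends
// on the character set in effect, which this function knows nothing about.
int parseHexEscape(const std::string& text, std::size_t pos) {
    if (pos + 4 > text.size() || text[pos] != '\\' || text[pos + 1] != '\'')
        return -1;
    const int hi = hexDigit(text[pos + 2]);
    const int lo = hexDigit(text[pos + 3]);
    if (hi < 0 || lo < 0)
        return -1;
    return hi * 16 + lo;
}
```

For the subtitle bytes quoted in comment 4, parseHexEscape("vid\\'8e o", 3) yields 0x8E.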

So "\'8e" becomes the byte 0x8e, but then how does that become "é"?

What is this "specified character set"?

The file begins with:

00000000  7b 5c 72 74 66 30 5c 6d  61 63 20 0d 7b 5c 63 6f  |{\rtf0\mac .{\co|

and the RTF spec says under the "THE CHARACTER SET" section:

---snip---
    \mac           Apple Macintosh
---snip---

The "\mac" should be parsed in SvRTFParser::Continue() where we have:

---snip---
    654         case RTF_MACTYPE:       
    655             SetEncoding( eCodeSet = RTL_TEXTENCODING_APPLE_ROMAN );     
    656             break;
---snip---

as svtools/inc/svtools/rtfkeywd.hxx had:

---snip---
#define OOO_STRING_SVTOOLS_RTF_MAC "\\mac"
---snip---

Our character set conversions are generally done under main/sal/textenc, and in main/sal/textenc/tcvtlab1.tab we have:

---snip---
static sal_uInt16 const aImplAPPLEROMANToUniTab[APPLEROMANUNI_END - APPLEROMANUNI_START + 1] =
{
/*       0       1       2       3       4       5       6       7 */
/*       8       9       A       B       C       D       E       F */
    0x00C4, 0x00C5, 0x00C7, 0x00C9, 0x00D1, 0x00D6, 0x00DC, 0x00E1, /* 0x80 */
    0x00E0, 0x00E2, 0x00E4, 0x00E3, 0x00E5, 0x00E7, 0x00E9, 0x00E8, /* 0x80 */
---snip---

which would translate the byte 0x8E into "é" (U+00E9), the expected character.

But we got "Ž" (U+017D) instead in this sample document. Searching that file for "17D" we see it comes from the table for the MS 1252 encoding:

---snip---
static sal_uInt16 const aImplMS1252ToUniTab[MS1252UNI_END - MS1252UNI_START + 1] =
{
/*       0       1       2       3       4       5       6       7 */
/*       8       9       A       B       C       D       E       F */
    0x20AC,      0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, /* 0x80 */
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152,      0, 0x017D,      0, /* 0x80 */
---snip---

and indeed, if we look at the constructor for SvRTFParser, we see that's the encoding it initially sets:

---snip---
SvRTFParser::SvRTFParser( SvStream& rIn, sal_uInt8 nStackSize )
    : SvParser( rIn, nStackSize ),
    eUNICodeSet( RTL_TEXTENCODING_MS_1252 ),    // default ist ANSI-CodeSet
    nUCharOverread( 1 )
{
    // default ist ANSI-CodeSet
    SetSrcEncoding( RTL_TEXTENCODING_MS_1252 );
    bRTF_InTextRead = false;
}
---snip---

But SvRTFParser::Continue() must be getting called after the constructor, and it seems to set the "mac" encoding, so why is the wrong encoding still used?
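
To make the two lookups in this comment concrete, here is a small sketch that copies only the 0x80..0x8F rows quoted above and indexes them the same way (byte minus 0x80). The names kAppleRomanRow80, kMs1252Row80, Charset and toUnicode are hypothetical and are not the identifiers used in main/sal/textenc.

```cpp
#include <cstdint>

// Rows 0x80..0x8F only, copied from the two table excerpts above.
static const std::uint16_t kAppleRomanRow80[16] = {
    0x00C4, 0x00C5, 0x00C7, 0x00C9, 0x00D1, 0x00D6, 0x00DC, 0x00E1,
    0x00E0, 0x00E2, 0x00E4, 0x00E3, 0x00E5, 0x00E7, 0x00E9, 0x00E8,
};
static const std::uint16_t kMs1252Row80[16] = {
    0x20AC,      0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152,      0, 0x017D,      0,
};

enum class Charset { AppleRoman, Ms1252 };

// Look up one byte in the chosen table. Returns 0 for bytes outside the
// excerpt (0x80..0x8F) and for entries the table leaves undefined.
std::uint16_t toUnicode(Charset cs, std::uint8_t byte) {
    if (byte < 0x80 || byte > 0x8F)
        return 0;
    const std::uint16_t* row =
        (cs == Charset::AppleRoman) ? kAppleRomanRow80 : kMs1252Row80;
    return row[byte - 0x80];
}
```

The same byte 0x8E gives 0x00E9 ("é") through the Apple Roman row and 0x017D ("Ž") through the MS 1252 row, so the symptom is consistent with the right byte being decoded with the wrong table.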
Comment 6 damjan 2023-01-05 18:29:36 UTC
(In reply to damjan from comment #5)
> But SvRTFParser::Continue() must be getting called after the constructor,
> and it seems to set the "mac" encoding, so why is the wrong encoding still
> used?

Putting a breakpoint on SvParser::SetSrcEncoding(), and backtracing when it's called, shows it's called from the following places, in order:

1. The constructor, with "eEnc=1" meaning RTL_TEXTENCODING_MS_1252:

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/svparser.cxx:142


2. The CallParser() method, itself called from editeng/source/rtf/svxrtf.cxx method RtfReader::Read():

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc7a85 in SvRTFParser::CallParser() (this=0x80db69e10) at source/svrtf/parrtf.cxx:593


3. The Continue() method when it finds the "\mac" instruction, now with "eEnc=2" meaning (the good) RTL_TEXTENCODING_APPLE_ROMAN:

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=2) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc7e63 in SvRTFParser::SetEncoding(unsigned short) (this=0x80db69e10, eEnc=2) at source/svrtf/parrtf.cxx:688
#2  0x0000000801dc7d02 in SvRTFParser::Continue(int) (this=0x80db69e10, nToken=262) at source/svrtf/parrtf.cxx:655


4. SvxRTFParser::ReadColorTable() still with the good "eEnc=2":

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=2) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc6a76 in SvRTFParser::_GetNextToken() (this=0x80db69e10) at source/svrtf/parrtf.cxx:268
#2  0x0000000801dcd1bf in SvParser::GetNextToken() (this=0x80db69e10) at source/svrtf/svparser.cxx:439
#3  0x00000008045fbd04 in SvxRTFParser::ReadColorTable() (this=0x80db69e10) at source/rtf/svxrtf.cxx:464


5. SvxRTFParser::ReadFontTable(), now with THE BAD "eEnc=1" !!!!!!!!!!!!!!

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc7e63 in SvRTFParser::SetEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/parrtf.cxx:688
#2  0x00000008045fbd83 in SvxRTFParser::ReadFontTable() (this=0x80db69e10) at source/rtf/svxrtf.cxx:513

6. SvxRTFParser::ReadFontTable() again.
7. SvxRTFParser::ReadFontTable() again
8. SvxRTFParser::ReadFontTable() again but now with eEnc=2.

9. SvxRTFParser::ReadStyleTable() with eEnc=2:

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=2) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc6a76 in SvRTFParser::_GetNextToken() (this=0x80db69e10) at source/svrtf/parrtf.cxx:268
#2  0x0000000801dcd1bf in SvParser::GetNextToken() (this=0x80db69e10) at source/svrtf/svparser.cxx:439
#3  0x00000008045fc241 in SvxRTFParser::ReadStyleTable() (this=0x80db69e10) at source/rtf/svxrtf.cxx:362


10. SvxRTFParser::ReadStyleTable() with eEnc=2.
11. SvxRTFParser::RTFPardPlain() with eEnc=1:

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc7e63 in SvRTFParser::SetEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/parrtf.cxx:688
#2  0x00000008045f851e in SvxRTFParser::RTFPardPlain(int, SfxItemSet**) (this=this@entry=0x80db69e10, bPard=bPard@entry=0, ppSet=ppSet@entry=0x7fffffffc248) at source/rtf/rtfitem.cxx:1969

12. SvxRTFParser::ReadAttr() with eEnc=1:

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc7e63 in SvRTFParser::SetEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/parrtf.cxx:688
#2  0x00000008045f6238 in SvxRTFParser::ReadAttr(int, SfxItemSet*) (this=0x80db69e10, nToken=1801, pSet=<optimized out>) at source/rtf/rtfitem.cxx:692

13. ReadBmpData() methods with eEnc=1:

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db69e10, eEnc=1) at source/svrtf/svparser.cxx:142
#1  0x00000008045f455b in SvxRTFParser::ReadBmpData(Graphic&, SvxRTFPictureType&) (this=0x80db69e10, rGrf=..., rPicType=...) at source/rtf/rtfgrf.cxx:304
#2  0x000000081010c3a2 in SwRTFParser::ReadBitmapData() (this=0x80db69e10) at source/filter/rtf/rtffly.cxx:1492


And a large number of other methods, ending with one that sets it back to eEnc=2:

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db6ad10, eEnc=2) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc6a97 in SvRTFParser::_GetNextToken() (this=0x80db6ad10) at source/svrtf/parrtf.cxx:273
#2  0x0000000801dcd1bf in SvParser::GetNextToken() (this=0x80db6ad10) at source/svrtf/svparser.cxx:439
#3  0x0000000801dc7da8 in SvRTFParser::Continue(int) (this=0x80db6ad10, nToken=2059) at source/svrtf/parrtf.cxx:675
#4  0x00000008045fb6b4 in SvxRTFParser::Continue(int) (this=0x80db6ad10, nToken=2) at source/rtf/svxrtf.cxx:175
#5  0x000000081011afe3 in SwRTFParser::Continue(int) (this=0x80db6ad10, nToken=0) at source/filter/rtf/swparrtf.cxx:337
#6  0x0000000801dc7ad3 in SvRTFParser::CallParser() (this=0x80db6ad10) at source/svrtf/parrtf.cxx:600




Now if we also put a breakpoint on SvRTFParser::GetHexValue() to check when the \'8e is parsed relative to the setting of the text encoding, we see the most recent call set eEnc to 1:

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db6ad10, eEnc=1) at source/svrtf/svparser.cxx:142
#1  0x0000000801dc7e63 in SvRTFParser::SetEncoding(unsigned short) (this=0x80db6ad10, eEnc=1) at source/svrtf/parrtf.cxx:688
#2  0x00000008045f6238 in SvxRTFParser::ReadAttr(int, SfxItemSet*) (this=0x80db6ad10, nToken=1801, pSet=<optimized out>) at source/rtf/rtfitem.cxx:692

And then another call from SvxRTFParser::ReadBmpData():

#0  SvParser::SetSrcEncoding(unsigned short) (this=0x80db6ad10, eEnc=1) at source/svrtf/svparser.cxx:142
#1  0x00000008045f4f31 in SvxRTFParser::ReadBmpData(Graphic&, SvxRTFPictureType&) (this=0x80db6ad10, rGrf=..., rPicType=...) at source/rtf/rtfgrf.cxx:578

sets it to 1, and it appears that the last call to SetSrcEncoding(), the one that sets it back to 2, only happens after all the "\'xx"-style text has already been parsed.

So that explains why the wrong code page is used: almost every damn method from its editeng subclass SvxRTFParser calls SetSrcEncoding() with RTL_TEXTENCODING_MS_1252. Next let's explore that class to find out why.
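
The ordering problem seen in the backtraces can be reproduced with a toy model. Everything in the sketch below (ToyRtfParser, its methods, decodeByte, replay) is hypothetical and heavily simplified; it only assumes, as the eEnc=1 values in the backtraces suggest, that the font's charset resolves to MS 1252 and that the font-related code overwrites the source encoding after "\mac" has set it.

```cpp
#include <cstdint>

enum class Charset { Ms1252, AppleRoman };

// Toy decoder: only byte 0x8E is mapped, using the values from the tables
// quoted in comment 5. Every other byte is passed through unchanged.
std::uint16_t decodeByte(Charset cs, std::uint8_t byte) {
    if (byte != 0x8E)
        return byte;
    return cs == Charset::AppleRoman ? 0x00E9 : 0x017D;
}

// Toy parser state: one mutable "current source encoding", as in SvParser.
struct ToyRtfParser {
    Charset docCharset = Charset::Ms1252;  // constructor default
    Charset srcCharset = Charset::Ms1252;

    // "\mac" in the header: sets the document-wide charset.
    void onMacKeyword() { docCharset = srcCharset = Charset::AppleRoman; }

    // Font-related code: overwrites the source encoding with the font's.
    void onFontSwitch(Charset fontCharset) { srcCharset = fontCharset; }

    // "\'hh" in the body: decoded with whatever encoding is current.
    std::uint16_t onHexEscape(std::uint8_t byte) const {
        return decodeByte(srcCharset, byte);
    }
};

// Replays header, optional font switch, then the "\'8e" escape.
std::uint16_t replay(bool fontOverrides) {
    ToyRtfParser p;
    p.onMacKeyword();
    if (fontOverrides)
        p.onFontSwitch(Charset::Ms1252);
    return p.onHexEscape(0x8E);
}
```

In this model replay(true) returns 0x017D ("Ž") and replay(false) returns 0x00E9 ("é"). It does not say which of the real calls is at fault, only that any late write of MS 1252 into the shared encoding state is enough to produce the symptom.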
Comment 7 damjan 2023-01-06 04:27:27 UTC
A hack such as the following, that stops some of the font-related code in main/editeng/source/rtf from calling SetEncoding(), gets the "é" and other characters to show correctly:

---snip---
diff --git a/main/editeng/source/rtf/rtfitem.cxx b/main/editeng/source/rtf/rtfitem.cxx
index 33b0a48153..363ff16aff 100644
--- a/main/editeng/source/rtf/rtfitem.cxx
+++ b/main/editeng/source/rtf/rtfitem.cxx
@@ -689,7 +689,7 @@ SET_FONTALIGNMENT:
                                        SetScriptAttr( eCharType, *pSet, aTmpItem );
                                        if( RTF_F == nToken )
                                        {
-                                               SetEncoding( rSVFont.GetCharSet() );
+//                                             SetEncoding( rSVFont.GetCharSet() );
                                                RereadLookahead();
                                        }
                                }
@@ -1963,7 +1963,7 @@ void SvxRTFParser::RTFPardPlain( int bPard, SfxItemSet** ppSet )
             if (nDfltFont != -1)
             {
                 const Font& rSVFont = GetFont(sal_uInt16(nDfltFont));
-                SetEncoding(rSVFont.GetCharSet());
+//                SetEncoding(rSVFont.GetCharSet());
             }
             else
                            SetEncoding(GetCodeSet());
---snip---


Bug 125495 dealt with a similar issue and allegedly fixed it, but reverting that patch doesn't fix this bug. What was done in bug 68639 might explain why the system locale overwrites the Mac locale, although I am not yet sure that it causes this bug.

The more I look at this, the more complex it gets :-/.