Bug 60936

Summary: Figure out charset in Word 6.0 files
Product: POI
Reporter: Tim Allison <tallison>
Component: HWPF
Assignee: POI Developers List <dev>
Severity: enhancement    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Bug60936 test doc

Description Tim Allison 2017-03-29 16:21:26 UTC
On TIKA-2313, Steven Hall submitted an example Word 6.0 file whose extracted text is garbage.

From what I can tell so far, our more modern code to check for isUnicode in TextPieceTable should not be used on Word 6.0 files.  If I disable that, the text is correctly extracted.

We should figure out what mechanism was used in Word 6.0 files to determine codepage, and we should look into disabling the isUnicode check for Word 6.0 files.
Comment 1 Tim Allison 2017-03-30 16:11:32 UTC
I can't easily figure out how the code page was encoded.

I could only find Win1252 encoded docs (on a quick look) in Tika's regression corpus.

I was able to generate a win1250 file via OpenOffice, which I'll attach shortly.

From that file, it looks like the codepage _might_ be encoded in 2 ways.

1) (pure guess) in the font information, the value "EE" at offset 133B is the code for Windows-1250.

2) "0504" at 0F5E-0F5F specifies the Czech language

To test my guesses, I tried modifying each.

1) If I modify the "EE" to "00" (default, ANSI), the text is still correctly rendered in Word.

2) However, if I modify the 0504 to 0409 (U.S. English), the text is corrupted.

This suggests that Word and OpenOffice infer the code page from the language and prefer that to the font's charset value...unless I'm wrong about "EE".
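
A minimal sketch of how the two bytes quoted above can be read as a language ID, assuming "0504" is the on-disk byte order and that the value is stored little-endian, as multi-byte integers in the Word binary format are. The class and method names are hypothetical and not part of POI.

```java
final class LidBytes {
    // Read an unsigned 16-bit little-endian value starting at the given offset.
    static int readUShortLE(byte[] data, int offset) {
        int low = data[offset] & 0xFF;        // first byte on disk is the low byte
        int high = data[offset + 1] & 0xFF;   // second byte on disk is the high byte
        return low | (high << 8);
    }
}
```

Under that assumption the bytes 05 04 read as 0x0405 (decimal 1029, Czech), and the bytes 09 04 read as 0x0409 (decimal 1033, U.S. English).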

I propose opening a half-step issue (60942) to avoid the Unicode check for Word 6.0.  This at least prevents quite a few exceptions in our test corpus.
Comment 2 Tim Allison 2017-03-30 17:51:48 UTC
Created attachment 34891 [details]
Bug60936 test doc

This should include a small r with caron.

I generated this file with OpenOffice.

I'll commit this as an ignored test when I commit the fix to BUG 60942.
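
To illustrate why the codepage matters for this file, here is a small sketch that decodes one byte under two codepages. It assumes the r with caron is stored as the single byte 0xF8, which is its position in Windows-1250; I have not confirmed that against the attachment. The class name is hypothetical.

```java
import java.nio.charset.Charset;

final class CaronDemo {
    // Decode a single byte value using the named single-byte charset.
    static String decode(int singleByte, String charsetName) {
        byte[] bytes = new byte[] {(byte) singleByte};
        return new String(bytes, Charset.forName(charsetName));
    }
}
```

With these inputs, decode(0xF8, "windows-1250") yields U+0159 (small r with caron), while decode(0xF8, "windows-1252") yields U+00F8 (small o with stroke), which is the kind of corruption a wrong default codepage would produce.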
Comment 3 Tim Allison 2017-03-30 17:54:30 UTC
For the record, I also checked the FIB's LID (the system's default language), and that was 1033 (English)...so that doesn't help with this doc.
Comment 4 Tim Allison 2017-03-30 18:06:10 UTC
See https://bz.apache.org/bugzilla/show_bug.cgi?id=50955 where one commenter supports our guess that 6.0 does not contain Unicode.  We still need to figure out how to get the right codepage.
Comment 5 Tim Allison 2017-03-31 18:37:10 UTC

*** This bug has been marked as a duplicate of bug 50955 ***
Comment 6 Tim Allison 2017-03-31 18:53:41 UTC
>1) If I modify the "EE" to "00" default, ansi, the text is still correctly rendered in Word.

>2) However, if I modify the 0504 to 0409 (U.S. English), the text is corrupted.

However, if I modify "EE" to the Cyrillic codepage, the text is corrupted.  This suggests that the correct way to handle codepage/languages is:

1) get the default language from the block starting 02 75 ...
2) somehow map that to a codepage
3) if the font's codepage is "00", use the default language->codepage, otherwise use the font's codepage
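
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method are hypothetical (not POI API), the lookup tables are deliberately tiny, and the fallbacks for unknown values are my own choice. The font values 0xEE and 0xCC are the Windows EASTEUROPE_CHARSET and RUSSIAN_CHARSET constants. Finding the default language in the file (step 1) is left to the caller.

```java
import java.nio.charset.Charset;

final class Word6CharsetGuess {
    // Step 2: map a default language ID to a codepage name (partial, illustrative table).
    static String codepageForLid(int lid) {
        switch (lid) {
            case 0x0405: return "windows-1250"; // Czech
            case 0x0409: return "windows-1252"; // U.S. English
            default:     return "windows-1252"; // assumption: fall back to Western European
        }
    }

    // Map a font charset byte to a codepage name, or null if this sketch does not know it.
    static String codepageForFontCharset(int fontCharset) {
        switch (fontCharset) {
            case 0xEE: return "windows-1250"; // EASTEUROPE_CHARSET
            case 0xCC: return "windows-1251"; // RUSSIAN_CHARSET (Cyrillic)
            default:   return null;
        }
    }

    // Step 3: prefer the font's codepage unless it is 0x00, then use the language default.
    static Charset resolve(int fontCharset, int defaultLid) {
        if (fontCharset != 0x00) {
            String fromFont = codepageForFontCharset(fontCharset);
            if (fromFont != null) {
                return Charset.forName(fromFont);
            }
            // assumption: an unrecognized font charset falls through to the language default
        }
        return Charset.forName(codepageForLid(defaultLid));
    }
}
```

For example, resolve(0x00, 0x0405) gives windows-1250 and resolve(0xCC, 0x0405) gives windows-1251, matching the "EE" to "00" and "EE" to Cyrillic experiments described above.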