60936 – Figure out charset in Word 6.0 files

Status:	RESOLVED DUPLICATE of bug 50955

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HWPF
Version:	3.16-dev
Hardware:	PC All

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-03-29 16:21 UTC by Tim Allison
Modified:	2017-03-31 18:53 UTC
CC List:	0 users

Attachments
Bug60936 test doc (6.50 KB, application/msword) 2017-03-30 17:51 UTC, Tim Allison

Description Tim Allison 2017-03-29 16:21:26 UTC

On TIKA-2313, Steven Hall submitted an example Word 6.0 file whose extracted text is garbage.

From what I can tell so far, our more modern code to check for isUnicode in TextPieceTable should not be used on Word 6.0 files.  If I disable that, the text is correctly extracted.

We should figure out what mechanism was used in Word 6.0 files to determine codepage, and we should look into disabling the isUnicode check for Word 6.0 files.
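
As a rough illustration of why the isUnicode path produces garbage on 8-bit text, here is a minimal, self-contained sketch. It does not use POI; the class name, the decode helper, and its isUnicode parameter are hypothetical stand-ins for the check described above, and windows-1252 is just an example codepage.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Word6DecodeSketch {
    // Hypothetical helper: decode raw text bytes either as UTF-16LE or as an 8-bit codepage.
    static String decode(byte[] raw, boolean isUnicode, Charset codepage) {
        return isUnicode
                ? new String(raw, StandardCharsets.UTF_16LE)
                : new String(raw, codepage);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] raw = "Word".getBytes(cp1252); // four single-byte characters

        System.out.println(decode(raw, false, cp1252)); // Word
        // Treated as UTF-16LE, the same four bytes collapse into two unrelated code units.
        System.out.println(decode(raw, true, cp1252).length()); // 2
    }
}
```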

Comment 1 Tim Allison 2017-03-30 16:11:32 UTC

I can't easily figure out how the code page is encoded.

I could only find Win1252 encoded docs (on a quick look) in Tika's regression corpus.

I was able to generate a win1250 via OpenOffice, which I'll attach shortly.

From that file, it looks like the codepage _might_ be encoded in 2 ways.

1) (pure guess) In the font information, the value "EE" at offset 133B is the charset code for Windows-1250.

2) "0504" at offsets 0F5E-0F5F specifies the Czech language.
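
A small sketch of how the "0504" byte pair could be read, assuming it is a 16-bit language ID stored low byte first (the Czech LCID is 0x0405, which fits that reading). The class and method names are made up for illustration and are not POI API.

```java
final class LidBytesSketch {
    // Combine two bytes stored low-byte-first into a 16-bit language ID.
    static int readLittleEndianLid(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    public static void main(String[] args) {
        byte[] stored = {0x05, 0x04}; // the "0504" pair seen in the hex dump
        int lid = readLittleEndianLid(stored, 0);
        System.out.printf("0x%04X (%d)%n", lid, lid); // 0x0405 (1029)
    }
}
```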


To test my guesses, I tried modifying each.

1) If I change the "EE" to "00" (the default, ANSI), the text is still correctly rendered in Word.

2) However, if I change the 0504 to 0409 (U.S. English), the text is corrupted.

This means that Word and OpenOffice are inferring the code page from the language and preferring that over the font's charset value...unless I'm wrong about "EE".
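
To make the kind of corruption concrete, here is a minimal sketch that decodes a single byte under two codepages. The byte 0xF8 is chosen for illustration (it is the Windows-1250 encoding of a small r with caron); the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;

final class CodepageMismatchSketch {
    // Decode 8-bit text with the named codepage.
    static String decode(byte[] raw, String codepageName) {
        return new String(raw, Charset.forName(codepageName));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xF8}; // one byte of 8-bit document text
        System.out.println(decode(raw, "windows-1250")); // U+0159, r with caron
        System.out.println(decode(raw, "windows-1252")); // U+00F8, o with stroke
    }
}
```

The bytes in the file are identical in both cases; only the codepage chosen by the reader differs.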

I propose opening a half-step issue (60942) to avoid the Unicode check for Word 6.0.  This at least prevents quite a few exceptions in our test corpus.

Comment 2 Tim Allison 2017-03-30 17:51:48 UTC

Created attachment 34891
Bug60936 test doc

This should include a small r with caron.

I generated this file with OpenOffice.

I'll commit this as an ignored test when I commit the fix to BUG 60942.

Comment 3 Tim Allison 2017-03-30 17:54:30 UTC

For posterity, I also checked the FIB's LID (the system's default language), and that was 1033 (English)...so that doesn't help with this doc.

Comment 4 Tim Allison 2017-03-30 18:06:10 UTC

See https://bz.apache.org/bugzilla/show_bug.cgi?id=50955 where one commenter supports our guess that 6.0 does not contain Unicode.  We still need to figure out how to get the right codepage.

Comment 5 Tim Allison 2017-03-31 18:37:10 UTC


*** This bug has been marked as a duplicate of bug 50955 ***

Comment 6 Tim Allison 2017-03-31 18:53:41 UTC

>1) If I modify the "EE" to "00" default, ansi, the text is still correctly rendered in Word.

>2) However, if I modify the 0504 to 0409 (U.S. English), the text is corrupted.

However, if I modify "EE" to the Cyrillic codepage, the text is corrupted.  This suggests that the correct way to handle codepages/languages is:

1) get the default language from the block starting 02 75 ...
2) somehow map that language to a codepage
3) if the font's codepage is "00", use the default language's codepage; otherwise use the font's codepage
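
The three steps above could be sketched roughly as follows. This is an untested-against-real-files illustration, not POI code: the class name, both lookup tables, and the windows-1252 fallback are assumptions, and the tables hold only a few well-known Windows charset and language IDs. Locating the language block itself (step 1) is left out, so the language ID is passed in directly.

```java
import java.nio.charset.Charset;
import java.util.Map;

final class Word6CodepageSketch {
    // Font charset byte -> codepage (Windows font charset constants; partial table).
    static final Map<Integer, String> FONT_CHARSET = Map.of(
            0xEE, "windows-1250",  // EASTEUROPE_CHARSET
            0xCC, "windows-1251"); // RUSSIAN_CHARSET

    // Language ID -> default ANSI codepage (partial table, for illustration only).
    static final Map<Integer, String> LANGUAGE = Map.of(
            0x0405, "windows-1250",  // Czech
            0x0409, "windows-1252",  // U.S. English
            0x0419, "windows-1251"); // Russian

    // Step 3: a non-zero, known font charset wins; otherwise use the language default.
    // Falling back to windows-1252 for unknown languages is an assumption.
    static Charset resolve(int fontCharset, int lid) {
        if (fontCharset != 0x00 && FONT_CHARSET.containsKey(fontCharset)) {
            return Charset.forName(FONT_CHARSET.get(fontCharset));
        }
        return Charset.forName(LANGUAGE.getOrDefault(lid, "windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(resolve(0x00, 0x0405)); // windows-1250
        System.out.println(resolve(0xCC, 0x0405)); // windows-1251
    }
}
```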