Bug 60952

Summary: Figure out how to map font to runs/text pieces in Word 6.0 files
Product: POI
Reporter: Tim Allison <tallison>
Component: HWPF
Assignee: POI Developers List <dev>
Status: NEW ---    
Severity: enhancement    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: triggering file

Description Tim Allison 2017-04-04 02:23:42 UTC
Created attachment 34897: triggering file

In bug 50955, I found that the hack of using the first non-default/non-symbol font in the font table in Word 6.0 files worked fairly well.  There was one file out of ~1300 for which this failed.

I'm attaching that file.  The issue in this file is that cp1257 comes before cp1251 in the font table, so the first-font hack selects cp1257.
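
To make the failure mode concrete, here is a minimal sketch of the first-font hack, not the actual HWPF code. The `FontEntry` class, the font names, the `guessCodePage` helper and the cp1252 fallback are all hypothetical; the charset ids are the usual Windows LOGFONT values, and "non-default" is assumed to mean skipping the ANSI and DEFAULT charsets.

```java
import java.util.Arrays;
import java.util.List;

public class FirstFontHeuristic {
    // Windows font charset ids (LOGFONT lfCharSet values).
    static final int ANSI_CHARSET = 0;
    static final int DEFAULT_CHARSET = 1;
    static final int SYMBOL_CHARSET = 2;
    static final int BALTIC_CHARSET = 186;   // cp1257
    static final int RUSSIAN_CHARSET = 204;  // cp1251

    /** Hypothetical stand-in for one font table entry; not a POI class. */
    static final class FontEntry {
        final String name;
        final int charsetId;

        FontEntry(String name, int charsetId) {
            this.name = name;
            this.charsetId = charsetId;
        }
    }

    /** Maps the two charset ids used in this sketch to a code page, or -1 if unmapped. */
    static int codePageFor(int charsetId) {
        switch (charsetId) {
            case BALTIC_CHARSET:
                return 1257;
            case RUSSIAN_CHARSET:
                return 1251;
            default:
                return -1;
        }
    }

    /** Returns the code page of the first font that is not ANSI, DEFAULT or SYMBOL. */
    static int guessCodePage(List<FontEntry> fontTable) {
        for (FontEntry font : fontTable) {
            boolean skipped = font.charsetId == ANSI_CHARSET
                    || font.charsetId == DEFAULT_CHARSET
                    || font.charsetId == SYMBOL_CHARSET;
            if (!skipped) {
                return codePageFor(font.charsetId);
            }
        }
        return 1252; // assumed fallback when no candidate font exists
    }

    public static void main(String[] args) {
        // Made-up font table with a Baltic font listed before a Cyrillic one.
        List<FontEntry> table = Arrays.asList(
                new FontEntry("Default font", ANSI_CHARSET),
                new FontEntry("Symbol font", SYMBOL_CHARSET),
                new FontEntry("Baltic font", BALTIC_CHARSET),
                new FontEntry("Cyrillic font", RUSSIAN_CHARSET));
        System.out.println(guessCodePage(table)); // prints 1257
    }
}
```

With this ordering the sketch returns 1257 even if the text itself is Cyrillic, which is the situation described above.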

I wasn't able to figure out how to 1) determine that cp1251 should be the default or 2) map the font encodings to runs/text pieces.

The test file comes from Common Crawl.
Comment 1 Nick Burch 2017-04-04 12:29:34 UTC
There are some potentially insightful comments in the AbiWord wv source at https://github.com/AbiWord/wv/blob/master/text.c#L123 . They suggest that in Word 6 or 7 the charset can depend on the font, and that the Far East flag in the FIB controls whether a 1-byte or 2-byte encoding is used.
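
A rough sketch of checking that flag, to show the kind of test involved. It assumes the flags word sits at offset 0x0A of the FIB and that the Far East bit is 0x4000, as in the Word 97 FIB layout; whether Word 6/7 files use the same position would need to be verified. The `FarEastFlag` class and `isFarEast` helper are hypothetical, not POI API.

```java
public class FarEastFlag {
    // Assumption: same positions as the Word 97 FIB; unverified for Word 6/7.
    static final int FLAGS_OFFSET = 0x0A;
    static final int FAR_EAST_MASK = 0x4000;

    /** Reads the little-endian 16-bit flags word and tests the assumed Far East bit. */
    static boolean isFarEast(byte[] fibHeader) {
        if (fibHeader.length < FLAGS_OFFSET + 2) {
            throw new IllegalArgumentException("FIB header too short");
        }
        int flags = (fibHeader[FLAGS_OFFSET] & 0xFF)
                | ((fibHeader[FLAGS_OFFSET + 1] & 0xFF) << 8);
        return (flags & FAR_EAST_MASK) != 0;
    }

    public static void main(String[] args) {
        byte[] header = new byte[12];           // made-up header, all flags clear
        System.out.println(isFarEast(header));  // prints false
        header[FLAGS_OFFSET + 1] = 0x40;        // set bit 0x4000 of the flags word
        System.out.println(isFarEast(header));  // prints true
    }
}
```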
Comment 2 Tim Allison 2017-04-04 12:33:15 UTC
Nice!  Does this mean we ought to go looking for "regular" 7.0 .doc files that contain Big5/Shift-JIS, too?