Bug 60952 - Figure out how to map font to runs/text pieces in Word 6.0 files
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.16-dev
Hardware: PC All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2017-04-04 02:23 UTC by Tim Allison
Modified: 2017-04-04 12:33 UTC

triggering file (45.50 KB, application/msword)
2017-04-04 02:23 UTC, Tim Allison

Description Tim Allison 2017-04-04 02:23:42 UTC
Created attachment 34897: triggering file

In bug 50955, I found that the hack of using the first non-default/non-symbol font in the font table to pick the encoding for Word 6.0 files worked fairly well. It failed for only one file out of ~1300.

I'm attaching that file. The issue in this file is that a cp1257 font comes before the cp1251 font in the font table, so the hack picks cp1257 when the text needs cp1251.
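For illustration, here is a minimal Java sketch of that heuristic and of how it goes wrong on a file like this one. It is not POI's actual code: FontEntry, CodePageGuesser, and the font names in the usage note are hypothetical, and reading "default" as the Windows ANSI (0) and DEFAULT (1) charset values is an assumption. The charset-to-code-page numbers are the standard Windows ones.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one font table entry: a name plus its charset byte.
final class FontEntry {
    final String name;
    final int chs;

    FontEntry(String name, int chs) {
        this.name = name;
        this.chs = chs;
    }
}

final class CodePageGuesser {
    // Windows charset identifiers that the heuristic skips (assumed meaning of
    // "non-default/non-symbol").
    static final int ANSI = 0;
    static final int DEFAULT = 1;
    static final int SYMBOL = 2;
    static final int FALLBACK_CODE_PAGE = 1252;

    // Maps a Windows charset identifier to a code page; -1 if not handled here.
    static int codePageFor(int chs) {
        switch (chs) {
            case 161: return 1253; // Greek
            case 162: return 1254; // Turkish
            case 177: return 1255; // Hebrew
            case 178: return 1256; // Arabic
            case 186: return 1257; // Baltic
            case 204: return 1251; // Russian
            case 238: return 1250; // Eastern Europe
            default:  return -1;
        }
    }

    // Returns the code page of the first font that is not ANSI, DEFAULT, or
    // SYMBOL and that has a known mapping; falls back to 1252 otherwise.
    static int guess(List<FontEntry> fonts) {
        for (FontEntry f : fonts) {
            if (f.chs == ANSI || f.chs == DEFAULT || f.chs == SYMBOL) {
                continue;
            }
            int cp = codePageFor(f.chs);
            if (cp > 0) {
                return cp;
            }
        }
        return FALLBACK_CODE_PAGE;
    }
}
```

With a made-up font table of "Times New Roman" (0), "Symbol" (2), "Times New Roman Baltic" (186), "Times New Roman Cyr" (204), in that order, guess returns 1257 because the Baltic entry is reached first. Swapping the last two entries yields 1251, which shows that the result depends only on table order and not on which font the text actually uses.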

I wasn't able to figure out how to 1) determine that cp1251 should be the default or 2) map the font encodings to runs/text pieces.

The test file comes from Common Crawl.
Comment 1 Nick Burch 2017-04-04 12:29:34 UTC
There are some potentially insightful comments in the AbiWord source at https://github.com/AbiWord/wv/blob/master/text.c#L123 . They suggest that in Word 6 or 7 the charset can depend on the font, and that the Far East flag in the FIB controls whether a 1-byte or 2-byte encoding is used.
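A rough Java sketch of what per-font decoding could look like if that reading is right. Everything here is an assumption for illustration: TextRun and RunDecoder are hypothetical names, the way a run gets its font index is not worked out in this bug, and the exact interaction between the Far East flag and the font charset (here: use a double-byte charset only when the flag is set and the font's charset is a known double-byte one) is a guess, not something taken from wv or POI.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

// Hypothetical run of raw bytes tagged with an index into the font table.
final class TextRun {
    final byte[] bytes;
    final int fontIndex;

    TextRun(byte[] bytes, int fontIndex) {
        this.bytes = bytes;
        this.fontIndex = fontIndex;
    }
}

final class RunDecoder {
    // Picks a Java charset name from a font's Windows charset identifier.
    // Double-byte charsets are considered only when farEast is set (assumption).
    static String charsetName(int chs, boolean farEast) {
        if (farEast) {
            switch (chs) {
                case 128: return "Shift_JIS"; // Japanese
                case 136: return "Big5";      // Traditional Chinese
                default:  break;              // fall through to single-byte
            }
        }
        switch (chs) {
            case 186: return "windows-1257"; // Baltic
            case 204: return "windows-1251"; // Russian
            case 238: return "windows-1250"; // Eastern Europe
            default:  return "windows-1252"; // fallback for ANSI and unknowns
        }
    }

    // Decodes each run with the charset of the font it points at.
    // fontChs[i] is the charset identifier of font table entry i.
    static String decode(List<TextRun> runs, int[] fontChs, boolean farEast) {
        StringBuilder sb = new StringBuilder();
        for (TextRun run : runs) {
            String name = charsetName(fontChs[run.fontIndex], farEast);
            sb.append(new String(run.bytes, Charset.forName(name)));
        }
        return sb.toString();
    }
}
```

As a usage example, a run holding the bytes CF F0 E8 E2 E5 F2 decodes to the Russian word "Привет" when its font entry has charset 204, and to unrelated Baltic letters when it is decoded through an entry with charset 186. That is the same kind of garbling a wrong document-wide guess produces.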
Comment 2 Tim Allison 2017-04-04 12:33:15 UTC
Nice!  Does this mean we ought to go looking for "regular" 7.0 .doc files that contain Big5/Shift-JIS, too?