Summary: Figure out how to map font to runs/text pieces in Word 6.0 files
Product: POI
Reporter: Tim Allison <tallison>
Component: HWPF
Assignee: POI Developers List <dev>
Description Tim Allison 2017-04-04 02:23:42 UTC
Created attachment 34897: triggering file

In bug 50955, I found that the hack of using the first non-default/non-symbol font in the font table in Word 6.0 files worked fairly well. It failed for one file out of ~1300, which I'm attaching. The issue in this file is that cp1257 comes before cp1251 in the font table. I wasn't able to figure out how to 1) determine that cp1251 should be the default or 2) map the font encodings to runs/text pieces. The test file comes from Common Crawl.
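For readers who want to see why font-table order matters, here is a minimal standalone sketch of the "first non-default/non-symbol font" heuristic. It is not POI's code: the class name, the `pick` method, the small charset-to-codepage map, and the windows-1252 fallback are all assumptions made for illustration. Treating ANSI (0) as a "default" value is also an assumption. The numeric values are the standard Windows charset identifiers (186 = Baltic/cp1257, 204 = Russian/cp1251).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of POI.
class FirstFontCharsetHeuristic {
    // Standard Windows charset identifiers.
    static final int ANSI_CHARSET = 0;
    static final int DEFAULT_CHARSET = 1;
    static final int SYMBOL_CHARSET = 2;

    // Deliberately tiny map; a real table would cover many more charsets.
    static final Map<Integer, String> CODEPAGES = new HashMap<>();
    static {
        CODEPAGES.put(186, "windows-1257"); // BALTIC_CHARSET
        CODEPAGES.put(204, "windows-1251"); // RUSSIAN_CHARSET
        CODEPAGES.put(238, "windows-1250"); // EASTEUROPE_CHARSET
    }

    // Returns the codepage of the first font whose charset is not
    // ANSI/default/symbol, in font-table order.
    static String pick(int[] fontCharsets) {
        for (int chs : fontCharsets) {
            if (chs == ANSI_CHARSET || chs == DEFAULT_CHARSET || chs == SYMBOL_CHARSET) {
                continue;
            }
            String codepage = CODEPAGES.get(chs);
            if (codepage != null) {
                return codepage;
            }
        }
        return "windows-1252"; // assumed fallback when nothing else matches
    }

    public static void main(String[] args) {
        // Font table order as described for the attached file: Baltic before Russian.
        System.out.println(pick(new int[] {0, 2, 186, 204}));
    }
}
```

With the Baltic entry first, the sketch returns windows-1257 even if the text is actually cp1251, which matches the failure described above. Nothing in the font table alone says which entry should win.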
Comment 1 Nick Burch 2017-04-04 12:29:34 UTC
There are some potentially insightful comments in the AbiWord source at https://github.com/AbiWord/wv/blob/master/text.c#L123 . They suggest that in Word 6 or 7 the charset can depend on the font, and that the Far East flag in the FIB controls whether a 1-byte or 2-byte encoding is used.
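If the charset really does depend on the font, decoding would have to happen per run rather than once per document. The sketch below only illustrates that idea for the 1-byte case. The `Run` class and `decode` method are hypothetical, it does not read any Word structures, and it ignores the Far East (2-byte) case entirely.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of POI or wv.
class PerRunDecoder {
    // A run of raw bytes plus the codepage assumed to come from its font.
    static final class Run {
        final byte[] bytes;
        final String charsetName;

        Run(byte[] bytes, String charsetName) {
            this.bytes = bytes;
            this.charsetName = charsetName;
        }
    }

    // Decodes each run with its own charset and concatenates the results.
    static String decode(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            sb.append(new String(run.bytes, Charset.forName(run.charsetName)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] latin = {'H', 'i', ' '};
        // The bytes of a Russian word encoded in cp1251.
        byte[] cyrillic = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8,
                           (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        List<Run> runs = Arrays.asList(
                new Run(latin, "windows-1252"),
                new Run(cyrillic, "windows-1251"));
        System.out.println(decode(runs));
    }
}
```

Decoding the second run with windows-1257 instead would produce Baltic letters rather than Cyrillic, which is the kind of garbling a wrong document-wide guess causes.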
Comment 2 Tim Allison 2017-04-04 12:33:15 UTC
Nice! Does this mean we ought to go looking for "regular" 7.0 .doc files that contain Big5/Shift-JIS, too?