Created attachment 34897 (triggering file). In bug 50955, I found that the hack of using the first non-default/non-symbol font in the font table of Word 6.0 files worked fairly well: it failed on only one file out of ~1300. I'm attaching that file. The issue is that cp1257 comes before cp1251 in its font table, so the hack picks cp1257 even though cp1251 is the right encoding for the text. I wasn't able to figure out how to 1) determine that cp1251 should be the default or 2) map the font encodings to runs/text pieces. The test file comes from Common Crawl.
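For readers following along, here is a minimal sketch of the heuristic described above and of how it goes wrong on a file like this one. It is written in Python for brevity and is not the code from bug 50955. The function name `guess_codec`, the fallback to cp1252, and the choice to skip ANSI (0) along with DEFAULT (1) and SYMBOL (2) are all assumptions made for illustration. The charset numbers are the standard Windows GDI charset identifiers. The font table is reduced to a plain list of charset bytes.

```python
# Hypothetical sketch of the "first non-default/non-symbol font" hack.
# Keys are Windows GDI charset identifiers; values are Python codec names.
CHARSET_TO_CODEC = {
    128: "cp932",   # SHIFTJIS_CHARSET
    136: "cp950",   # CHINESEBIG5_CHARSET
    161: "cp1253",  # GREEK_CHARSET
    162: "cp1254",  # TURKISH_CHARSET
    177: "cp1255",  # HEBREW_CHARSET
    178: "cp1256",  # ARABIC_CHARSET
    186: "cp1257",  # BALTIC_CHARSET
    204: "cp1251",  # RUSSIAN_CHARSET
    222: "cp874",   # THAI_CHARSET
    238: "cp1250",  # EASTEUROPE_CHARSET
}

# Assumption: ANSI (0), DEFAULT (1) and SYMBOL (2) carry no useful hint.
SKIPPED = {0, 1, 2}


def guess_codec(font_charsets, fallback="cp1252"):
    """Return the codec for the first informative charset in the font table."""
    for chs in font_charsets:
        if chs in SKIPPED:
            continue
        codec = CHARSET_TO_CODEC.get(chs)
        if codec is not None:
            return codec
    return fallback


# A font table shaped like the failing file: Baltic is listed before Russian.
font_table = [0, 2, 186, 204]
raw = "Привет".encode("cp1251")  # stand-in for the document's text bytes

guessed = guess_codec(font_table)  # picks cp1257 because it comes first
print(guessed, raw.decode(guessed, errors="replace"))  # mojibake
print("cp1251", raw.decode("cp1251"))                  # intended text
```

The sketch only shows why table order matters. It does not address either open question, namely how to pick cp1251 as the default or how to tie a font's charset to individual runs.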
There are some potentially insightful comments in the AbiWord wv source at https://github.com/AbiWord/wv/blob/master/text.c#L123 . They suggest that in Word 6 or 7 the charset can depend on the font, and that in those versions the Far East flag in the FIB controls whether a 1-byte or 2-byte encoding is used.
Nice! Does this mean we ought to go looking for "regular" 7.0 .doc files that contain Big5/Shift-JIS, too?