Bug 60953 - Improve Big5 handling for Word 6.0
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.16-dev
Hardware: PC All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-04 12:16 UTC by Tim Allison
Modified: 2017-04-05 02:05 UTC
CC List: 0 users



Attachments
Example bilingual English/Chinese Big5 Word 6.0 file (243.00 KB, application/msword)
2017-04-04 12:16 UTC, Tim Allison
Another example file (676.50 KB, application/msword)
2017-04-05 02:05 UTC, Tim Allison

Description Tim Allison 2017-04-04 12:16:42 UTC
Created attachment 34898
Example bilingual English/Chinese Big5 Word 6.0 file

While working on Bug 50955, I found that Microsoft used its own encoding of Big5, which includes zero padding for ASCII characters.
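
To make the zero-padding idea concrete, here is a rough sketch of how such text might be decoded. It is not the code from Bug 50955. It assumes that every character occupies two bytes, that the low (trail) byte comes first, and that an ASCII character has 0x00 as its high byte; none of those layout details are confirmed in this report. The class and method names are hypothetical, and Java's built-in "Big5" charset (available in JDKs that ship the extended charsets) may not match Microsoft's variant for every code point.

```java
import java.nio.charset.Charset;

public class PaddedBig5Sketch {

    /**
     * Hypothetical helper: decodes two-byte units in which ASCII characters
     * are padded with a zero byte. ASSUMPTION: low byte first, high byte second.
     */
    static String decode(byte[] data) {
        if (data.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        // Java's stock Big5 table; it may differ from Microsoft's variant.
        Charset big5 = Charset.forName("Big5");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i += 2) {
            int low = data[i] & 0xFF;
            int high = data[i + 1] & 0xFF;
            if (high == 0) {
                // Zero-padded unit: treat the remaining byte as a single-byte character.
                sb.append((char) low);
            } else {
                // Double-byte character: reorder to lead byte, trail byte for the decoder.
                sb.append(new String(new byte[] {(byte) high, (byte) low}, big5));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' padded with zero, followed by Big5 0xA440 stored low byte first.
        byte[] sample = {0x41, 0x00, 0x40, (byte) 0xA4};
        System.out.println(decode(sample));
    }
}
```

If the real byte order turns out to be the reverse, only the two lines that read `low` and `high` would need to swap.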

I included some code that ought to be cleaned up.

An example of Big5 used to encode English is already in our set: Bug51944.doc.

Some notes will follow.

I'm also attaching a better bilingual Big5 English/Chinese example from Apache Tika's Common Crawl corpus.

Many thanks, again, to Common Crawl, Dominik Stadler and Rackspace.
Comment 1 Tim Allison 2017-04-04 12:18:53 UTC
It would also be handy if we could find some Shift-JIS examples.

Word95.doc has a Shift-JIS encoded font, but the text is all single-byte English. Given that we can't map from fonts to text pieces, it isn't clear to me whether this is actually what Shift-JIS looks like or whether the English is really Times New Roman.
Comment 3 Tim Allison 2017-04-05 02:05:29 UTC
Created attachment 34899
Another example file

This file comes from the same source as the other attachment. It contains a few 0xf9xx characters that the first attachment does not.