Bug 60953

Summary: Improve Big5 handling for Word 6.0
Product: POI
Reporter: Tim Allison <tallison>
Component: HWPF
Assignee: POI Developers List <dev>
Status: NEW ---    
Severity: enhancement    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Example bilingual English/Chinese Big5 Word 6.0 file
Another example file

Description Tim Allison 2017-04-04 12:16:42 UTC
Created attachment 34898
Example bilingual English/Chinese Big5 Word 6.0 file

While working on Bug 50955, I found that Microsoft had its own encoding of Big5, which included zero padding for ASCII characters.
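
To make the zero-padding idea concrete, here is a minimal sketch of a decoder for such text. It is not the code from Bug 50955. It assumes every character occupies two bytes, that an ASCII character is stored as the character byte followed by 0x00, and that a double-byte character is stored little-endian (trail byte first, lead byte second). The class name PaddedBig5Sketch is hypothetical, and the JDK's stock "Big5" charset stands in for Microsoft's variant, which may map some code points differently.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch only; the byte layout below is an assumption, not confirmed by the report. */
final class PaddedBig5Sketch {

    // Assumption: the JDK "Big5" charset is close enough to the MS variant for illustration.
    private static final Charset BIG5 = Charset.forName("Big5");

    /** Decodes two-byte units, treating a zero high byte as padding on an ASCII character. */
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        // Walk the buffer two bytes at a time; a trailing odd byte is ignored.
        for (int i = 0; i + 1 < data.length; i += 2) {
            int low = data[i] & 0xFF;
            int high = data[i + 1] & 0xFF;
            if (high == 0) {
                // Zero-padded single-byte character: emit the low byte as-is.
                sb.append((char) low);
            } else {
                // Assumed little-endian storage: restore lead/trail order before decoding.
                byte[] pair = {(byte) high, (byte) low};
                sb.append(new String(pair, BIG5));
            }
        }
        return sb.toString();
    }
}
```

Under these assumptions, the bytes 48 00 69 00 A4 A4 decode to "Hi" followed by the character U+4E2D. If the real files store the lead byte first, only the order of the two bytes in `pair` would need to change.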

I included some code that ought to be cleaned up.

An example of Big5 used to encode English is already in our set: Bug51944.doc.

Some notes will follow.

I'm also attaching a better bilingual Big5 English/Chinese example from Apache Tika's Common Crawl corpus.

Many thanks, again, to Common Crawl, Dominik Stadler and Rackspace.
Comment 1 Tim Allison 2017-04-04 12:18:53 UTC
It would also be handy if we could find some Shift-JIS examples.

Word95.doc has a Shift-JIS encoded font, but the text is all single-byte English.  Given that we can't map from fonts to text pieces, it isn't clear to me whether this is actually what Shift-JIS looks like or whether the English is really Times New Roman.
Comment 3 Tim Allison 2017-04-05 02:05:29 UTC
Created attachment 34899
Another example file

This file comes from the same source as the other attachment.  It contains a few 0xf9xx characters that the first attachment does not.