Created attachment 34898
Example bilingual English/Chinese Big5 Word 6.0 file

While working on Bug 50955, I found that Microsoft has its own encoding of Big5, which zero-pads ASCII characters. I included some code that ought to be cleaned up. An example of Big5 used to encode English is already in our set: Bug51944.doc. Some notes will follow.

I'm also attaching a better bilingual Big5 English/Chinese example from Apache Tika's Common Crawl corpus. Many thanks, again, to Common Crawl, Dominik Stadler and Rackspace.
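To make the zero-padding point concrete, here is a rough sketch of how such text might be decoded. It is not the code attached to the bug. It assumes (this is an assumption, not something confirmed above) that the text is stored as 16-bit little-endian units, with the high byte set to zero for single-byte characters and the Big5 lead/trail bytes swapped for double-byte ones. The function name `decode_padded_cp950` is hypothetical; the `cp950` codec is Python's built-in one.

```python
def decode_padded_cp950(data: bytes) -> str:
    """Decode zero-padded cp950 text.

    ASSUMPTION: each character occupies two bytes in little-endian order
    (low byte first). A zero high byte marks a padded single-byte character.
    """
    out = []
    # Walk the buffer two bytes at a time; a trailing odd byte is ignored.
    for i in range(0, len(data) - 1, 2):
        low, high = data[i], data[i + 1]
        if high == 0:
            # Zero-padded single byte (e.g. ASCII): drop the padding.
            raw = bytes([low])
        else:
            # Double-byte character: restore lead byte, then trail byte.
            raw = bytes([high, low])
        out.append(raw.decode("cp950", errors="replace"))
    return "".join(out)


# 'A' padded with a zero byte, then two Big5 characters (0xA4A4, 0xA4E5)
# stored low byte first.
sample = b"A\x00" + b"\xa4\xa4" + b"\xe5\xa4"
print(decode_padded_cp950(sample))
```

If the byte order in real files turns out to be the other way around, only the two `bytes([...])` lines need to change.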
It would also be handy if we could find some Shift-JIS examples. Word95.doc has a Shift-JIS encoded font, but the text is all single-byte English. Given that we can't map from fonts to text pieces, it isn't clear to me whether this is actually what Shift-JIS looks like or whether the English is really Times New Roman.
Useful references:
https://en.wikipedia.org/wiki/Code_page_950
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
Created attachment 34899
Another example file

This file comes from the same source as the other attachment. It contains a few 0xf9xx characters that the original file did not.