Bug 47304 - WordDocument uses platform default encoding
Status: RESOLVED FIXED
Product: POI
Classification: Unclassified
Component: HDF
Version: 3.5-dev
Hardware: PC Mac OS X 10.4
Importance: P2 normal
Target Milestone: ---
Assigned To: POI Developers List
Reported: 2009-06-02 22:32 UTC by Jelmer Kuperus
Modified: 2015-03-22 13:34 UTC (History)



Attachments
example (21.50 KB, application/msword)
2009-06-02 22:33 UTC, Jelmer Kuperus

Description Jelmer Kuperus 2009-06-02 22:32:10 UTC
When the following code is used to read the attached Word document, the text is not read correctly on Mac OS X:

// Open the attached document
WordDocument wordDoc = new WordDocument(new FileInputStream("test.doc"));

// Collect the extracted text in memory
StringWriter docTextWriter = new StringWriter();
PrintWriter writer = new PrintWriter(docTextWriter);
wordDoc.writeAllText(writer);
writer.flush();  // make sure all text has reached the StringWriter
writer.close();

System.out.println(docTextWriter.toString());


The reason is that, when the text found is not Unicode, its bytes are decoded with the platform default encoding, whereas windows-1252 should be used.

Here's the offending code

if(unicode)
{
 ....
}
else
{
   String sText = new String(_header, start, end-start);
   out.write(sText);
}

On Windows the platform default encoding is windows-1252; on OS X it is MacRoman, so the same bytes decode to different characters.

To fix this, the line

String sText = new String(_header, start, end-start);

should be changed to

String sText = new String(_header, start, end-start, "windows-1252");
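The difference can be reproduced without POI. The following is a minimal sketch, not the POI code itself: the class name Cp1252Demo, the helper decodeCp1252 and the sample bytes are made up for illustration. It decodes the same byte range once with the platform default encoding and once with an explicit windows-1252 charset. It passes a Charset object rather than the charset name used in the fix above, which avoids the checked UnsupportedEncodingException but is otherwise equivalent.

```java
import java.nio.charset.Charset;

class Cp1252Demo {
    // Code page assumed for non-Unicode text, as proposed in this report.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper: decodes buf[start, end) as windows-1252,
    // independent of the platform default encoding.
    static String decodeCp1252(byte[] buf, int start, int end) {
        return new String(buf, start, end - start, CP1252);
    }

    public static void main(String[] args) {
        // Sample bytes (illustrative): 0x93 and 0x94 are curly double quotes
        // in windows-1252, 0xE9 is e-acute.
        byte[] raw = { (byte) 0x93, 'c', 'a', 'f', (byte) 0xE9, (byte) 0x94 };

        // Platform dependent: the result varies with Charset.defaultCharset().
        String viaDefault = new String(raw, 0, raw.length);

        // Platform independent: always decoded as windows-1252.
        String viaCp1252 = decodeCp1252(raw, 0, raw.length);

        System.out.println("default (" + Charset.defaultCharset() + "): " + viaDefault);
        System.out.println("windows-1252: " + viaCp1252);
    }
}
```

On a machine whose default encoding is not windows-1252, the two printed lines are expected to differ for the non-ASCII bytes, which matches the behaviour described for OS X.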
Comment 1 Jelmer Kuperus 2009-06-02 22:33:23 UTC
Created attachment 23746
example
Comment 2 Dominik Stadler 2015-03-22 13:34:06 UTC
I have applied the suggested fix in r1668367, although WordDocument is nowadays deprecated in favour of WordExtractor and HWPFDocument. As far as I can see, WordExtractor already handles this correctly.