Bug 47304

Summary: WordDocument uses platform default encoding
Product: POI
Reporter: Jelmer Kuperus <jelmer>
Component: HDF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.5-dev   
Target Milestone: ---   
Hardware: PC   
OS: Mac OS X 10.4   
Attachments: example

Description Jelmer Kuperus 2009-06-02 22:32:10 UTC
When using the following code to read the attached Word document, the text is not read correctly on Mac OS X:

WordDocument wordDoc = new WordDocument(new FileInputStream("test.doc"));

StringWriter docTextWriter = new StringWriter();
PrintWriter writer = new PrintWriter(docTextWriter);
wordDoc.writeAllText(writer);
writer.flush();

System.out.println(docTextWriter.toString());


The reason is that when the text found is not Unicode, the platform default encoding is used to decode the document bytes, whereas windows-1252 should be used.

Here's the offending code:

if(unicode)
{
 ....
}
else
{
   String sText = new String(_header, start, end-start);
   out.write(sText);
}

On Windows the platform default encoding is typically windows-1252; on OS X it is MacRoman.
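To illustrate the mismatch, here is a minimal sketch that decodes the same bytes with different charsets. The class name DefaultEncodingDemo, the decode helper, and the sample bytes are made up for this example and are not part of POI. The MacRoman line is guarded because x-MacRoman belongs to the JDK's extended charsets and may not be present on every runtime.

```java
import java.nio.charset.Charset;

public class DefaultEncodingDemo {
    // Hypothetical helper: decode raw bytes with an explicitly named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in windows-1252.
        byte[] raw = {(byte) 0x93, 'h', 'i', (byte) 0x94};

        // This is what new String(bytes) without a charset argument relies on.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("windows-1252:    " + decode(raw, "windows-1252"));

        // Only attempt MacRoman when the runtime actually provides it.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println("x-MacRoman:      " + decode(raw, "x-MacRoman"));
        }
    }
}
```

If both charsets are available, the two decoded lines should differ in their first and last characters, which is the kind of corruption described above.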

To fix this, the line

String sText = new String(_header, start, end-start);

should be changed to

String sText = new String(_header, start, end-start, "windows-1252");
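As a side note, the String constructor that takes a charset name declares the checked UnsupportedEncodingException, so the surrounding method has to handle or declare it. A possible variant, sketched here under the assumption that Java 6 or later is available, passes a Charset object instead. The class Cp1252Range, the method decodeRange, and the sample buffer are hypothetical names for illustration only and do not reflect the change that was actually committed.

```java
import java.nio.charset.Charset;

public class Cp1252Range {
    // Looked up once; assumes the runtime provides windows-1252.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode bytes [start, end) of buf as windows-1252,
    // independent of the platform default encoding.
    static String decodeRange(byte[] buf, int start, int end) {
        return new String(buf, start, end - start, CP1252);
    }

    public static void main(String[] args) {
        // Stand-in for the document buffer; 0xE9 is e-acute in windows-1252.
        byte[] header = {0x00, 'c', 'a', 'f', (byte) 0xE9, 0x00};
        System.out.println(decodeRange(header, 1, 5));
    }
}
```

The result no longer depends on the machine the code runs on, which is the point of the proposed fix.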
Comment 1 Jelmer Kuperus 2009-06-02 22:33:23 UTC
Created attachment 23746
example
Comment 2 Dominik Stadler 2015-03-22 13:34:06 UTC
I have applied the suggested fix with r1668367, although WordDocument is nowadays deprecated in favour of WordExtractor and HWPFDocument. WordExtractor already handles this correctly as far as I can see.