When using the following code to read the attached Word document, the text is not read correctly on Mac OS X:

    WordDocument wordDoc = new WordDocument(new FileInputStream("test.doc"));
    StringWriter docTextWriter = new StringWriter();
    PrintWriter writer = new PrintWriter(docTextWriter);
    wordDoc.writeAllText(writer);
    writer.flush();
    docTextWriter.close();
    System.out.println(docTextWriter.toString());

The reason is that the platform default encoding is used to decode the text when it is not Unicode, whereas windows-1252 should be used. Here's the offending code:

    if (unicode) {
        ....
    } else {
        String sText = new String(_header, start, end-start);
        out.write(sText);
    }

On Windows the platform default encoding is typically windows-1252; on OS X it's MacRoman. The same bytes therefore decode to different characters depending on where the code runs.

To fix this,

    String sText = new String(_header, start, end-start);

should be changed to

    String sText = new String(_header, start, end-start, "windows-1252");
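To illustrate the difference outside of POI, here is a minimal standalone sketch. The class name Cp1252Demo, the helper decode and the sample bytes are made up for illustration and are not part of POI; the helper only mirrors the proposed one-line change. The MacRoman comparison assumes the JDK ships the "x-MacRoman" charset, so it is guarded.

```java
import java.nio.charset.Charset;

public class Cp1252Demo {
    // Charset the report says should be used for non-Unicode text runs.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper mirroring the proposed fix:
    // new String(_header, start, end - start, "windows-1252")
    static String decode(byte[] data, int start, int end) {
        return new String(data, start, end - start, CP1252);
    }

    public static void main(String[] args) {
        // "caf" followed by byte 0xE9, which is e-acute in windows-1252.
        byte[] data = { 0x63, 0x61, 0x66, (byte) 0xE9 };

        // Platform-dependent: this is what the original code effectively did.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + new String(data, 0, data.length));

        // Platform-independent: the charset is named explicitly.
        System.out.println("windows-1252: " + decode(data, 0, data.length));

        // Optional comparison; x-MacRoman is an extended charset and may be absent.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println("x-MacRoman: "
                    + new String(data, 0, data.length, Charset.forName("x-MacRoman")));
        }
    }
}
```

Only the line that names the charset explicitly gives the same result on every platform; the first line depends on the JVM's default charset.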
Created attachment 23746: example
I have applied the suggested fix in r1668367, although WordDocument is nowadays deprecated in favour of WordExtractor and HWPFDocument. As far as I can see, WordExtractor already handles this correctly.