When I read an MS Word .doc file using the HDF classes, I hit a big problem. If the data is non-Unicode and contains only English characters, there is no problem. But when the document uses a Unicode or UTF-8 character set, not all of the data is read: extraction stops part-way through, and the amount of lost text grows with the document. For example, if demo.doc contains "ýýüü", the extracted text is only "ýýü". My example is given below:

public class Deneme {
    public static void main(String[] args) {
        testDoc deneme = new testDoc("demo.doc", "demo.txt");
        deneme.getText();
    }
}

//------- this code writes a doc file to a txt file -------
//------- get the HDF libs from jakarta.poi (scratchpad at the moment) -------
import org.apache.poi.hdf.extractor.util.*;
import org.apache.poi.hdf.extractor.data.*;
import org.apache.poi.hdf.extractor.*;
import java.util.*;
import java.io.*;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.POIFSDocument;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.util.LittleEndian;

class testDoc extends Deneme {
    String origFileName;
    String tempFile;
    WordDocument wd;

    testDoc(String origFileName, String tempFile) {
        this.tempFile = tempFile;
        this.origFileName = origFileName;
    }

    public void getText() {
        try {
            wd = new WordDocument(origFileName);
            // Writer out = new BufferedWriter(new FileWriter(tempFile)); // old version
            Writer out = new OutputStreamWriter(new FileOutputStream(tempFile), "utf-8");
            wd.writeAllText(out);
            out.flush();
            out.close();
        } catch (Exception eN) {
            System.out.println("Error reading document: " + origFileName + "\n" + eN.toString());
        }
    } // end of getText
} // end of class

The problem starts in wd.writeAllText(out). When we look at this method, we see that the end integer does not reach the true end point when the file is a Unicode MS Word document. Thank you for your support.
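One detail in the reporter's code worth noting: the commented-out FileWriter was replaced with an OutputStreamWriter because FileWriter always uses the platform default charset, which can mangle non-ASCII characters such as "ýýüü". A minimal, self-contained sketch of that pattern (the file name and sample string are illustrative, not from POI):

```java
import java.io.*;

public class Utf8WriteDemo {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("demo", ".txt");
        // Write with an explicit charset instead of the platform default.
        try (Writer out = new OutputStreamWriter(new FileOutputStream(tmp), "UTF-8")) {
            out.write("ýýüü");
        }
        // Read it back with the same charset to confirm a clean round trip.
        try (Reader in = new InputStreamReader(new FileInputStream(tmp), "UTF-8")) {
            char[] buf = new char[16];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n)); // prints ýýüü
        }
        tmp.delete();
    }
}
```

The same four characters written through a plain FileWriter could come back corrupted on a non-UTF-8 platform, which is a separate failure mode from the writeAllText bug discussed here.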
Fixed by Raghuram Velega. I fixed the bug, which was in the WordDocument.java class. Replace the following segment of the code:

public void writeAllText(Writer out) throws IOException {
    int textStart = Utils.convertBytesToInt(_header, 0x18);
    int textEnd = Utils.convertBytesToInt(_header, 0x1c);
    ArrayList textPieces = findProperties(textStart, textEnd, _text.root);
    int size = textPieces.size();

    for (int x = 0; x < size; x++) {
        TextPiece nextPiece = (TextPiece) textPieces.get(x);
        int start = nextPiece.getStart();
        int end = nextPiece.getEnd();
        boolean unicode = nextPiece.usesUnicode();
        char ch;
        if (unicode) {
            // Unicode pieces store two bytes per character, so walk
            // twice the character count in steps of two bytes.
            for (int y = start; y < end + (end - start); y += 2) {
                ch = (char) Utils.convertBytesToShort(_header, y);
                out.write(ch);
            }
        } else {
            for (int y = start; y < end; y += 1) {
                ch = (char) Utils.convertUnsignedByteToInt(_header[y]);
                out.write(ch);
            }
        }
    }
}

It should work then. Thanks, raghu
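The heart of the fix is that a Unicode text piece stores each character as two little-endian bytes, so the loop must advance the byte offset by 2 and cover twice the character count. A standalone sketch of that decoding step (the byte array and the leChar helper are illustrative stand-ins for POI's internal buffer and Utils.convertBytesToShort):

```java
public class Utf16LeDecodeDemo {
    // Combine a little-endian byte pair into one char,
    // mirroring what POI's Utils.convertBytesToShort does (illustrative helper).
    static char leChar(byte[] buf, int off) {
        return (char) ((buf[off] & 0xFF) | ((buf[off + 1] & 0xFF) << 8));
    }

    public static void main(String[] args) {
        // "ýü" as UTF-16LE bytes: U+00FD -> FD 00, U+00FC -> FC 00
        byte[] piece = { (byte) 0xFD, 0x00, (byte) 0xFC, 0x00 };
        StringBuilder sb = new StringBuilder();
        for (int y = 0; y < piece.length; y += 2) { // step two bytes per character
            sb.append(leChar(piece, y));
        }
        System.out.println(sb); // prints ýü
    }
}
```

Stepping by 1 instead of 2, or stopping the loop at the character count rather than the byte count, drops half the characters, which is exactly the truncation the original report describes.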
Please attach this as a patch per the instructions at http://jakarta.apache.org/poi/getinvolved/index.html
It doesn't work for long Unicode Word documents.
HDF is not currently supported.