Bug 17824

Summary: about reading MS .doc file
Product: POI Reporter: Taylan Ozgur YILDIRIM <tdyildirim>
Component: HDF Assignee: POI Developers List <dev>
Status: RESOLVED INVALID    
Severity: major    
Priority: P3    
Version: unspecified   
Target Milestone: ---   
Hardware: Sun   
OS: other   

Description Taylan Ozgur YILDIRIM 2003-03-10 12:59:25 UTC
When I read an MS Word .doc file using the HDF classes, I have a big problem. If 
my data is not Unicode and contains only English characters, there is no problem. 
But when the document uses Unicode or UTF-8 characters, not all of the data is 
read: reading stops before the end of the text. For example, if demo.doc 
contains:  ýýüü 
then what we read back is: ýýü
and the amount of missing text grows as more such characters are added.

My example is given below.

public class Deneme {

	public static void main(String[] args) {
		
	testDoc deneme = new testDoc("demo.doc","demo.txt");
	deneme.getText();
	}
}

-----------------------------
//------- this code writes the doc file to txt -------
//------- get the HDF libs from jakarta.poi (scratchpad at the moment) -------
import org.apache.poi.hdf.extractor.util.*;
import org.apache.poi.hdf.extractor.data.*;
import org.apache.poi.hdf.extractor.*;
import java.util.*;
import java.io.*;
import javax.swing.*;

import java.awt.*;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.POIFSDocument;
import org.apache.poi.poifs.filesystem.DocumentEntry;

import org.apache.poi.util.LittleEndian;

class testDoc extends Deneme {
    String origFileName;
    String tempFile;
    WordDocument wd;

    testDoc(String origFileName, String tempFile) {
        this.tempFile = tempFile;
        this.origFileName = origFileName;
    }

    // Extracts all text from the .doc file and writes it to tempFile as UTF-8.
    public void getText() {
        try {
            wd = new WordDocument(origFileName);
            //Writer out = new BufferedWriter(new FileWriter(tempFile)); // the old version
            Writer out = new OutputStreamWriter(new FileOutputStream(tempFile), "utf-8");

            wd.writeAllText(out);
            out.flush();
            out.close();
        }
        catch (Exception eN) {
            System.out.println("Error reading document:" + origFileName + "\n" + eN.toString());
        }
    } // end of getText

} // end of class
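
To rule out the output side of the example, a minimal sketch like the one below can check that a UTF-8 OutputStreamWriter round-trips the sample characters without loss. WriterRoundTrip and roundTrip are hypothetical names, not part of POI; the sketch writes to memory instead of demo.txt and uses StandardCharsets.UTF_8 in place of the "utf-8" string.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterRoundTrip {

    // Hypothetical helper (not part of POI): encodes text through a
    // UTF-8 writer into memory, then decodes the bytes back to a String.
    static String roundTrip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The characters from demo.doc as they appear in this report.
        String sample = "\u00fd\u00fd\u00fc\u00fc";
        String back = roundTrip(sample);
        System.out.println(back.length() + " chars, equal: " + sample.equals(back));
    }
}
```

If the round trip preserves all four characters, the loss is more likely on the reading side (wd.writeAllText) than in the writer.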
 
 
------------------------
The problem starts in 
wd.writeAllText(out);
Looking at this method, the end integer does not reach the actual end of the 
text when the MS Word file contains Unicode.

Thank you for your support.
Comment 1 Taylan Ozgur YILDIRIM 2003-03-12 05:43:13 UTC
Fixed by Raghuram Velega:
I fixed the bug, which was in the WordDocument.java class. Replace the
corresponding segment of the code with the following:

  public void writeAllText(Writer out) throws IOException
  {
    int textStart = Utils.convertBytesToInt(_header, 0x18);
    int textEnd = Utils.convertBytesToInt(_header, 0x1c);
    ArrayList textPieces = findProperties(textStart, textEnd, _text.root);
    int size = textPieces.size();

    for (int x = 0; x < size; x++)
    {
      TextPiece nextPiece = (TextPiece)textPieces.get(x);
      int start = nextPiece.getStart();
      int end = nextPiece.getEnd();
      boolean unicode = nextPiece.usesUnicode();
      char ch;
      if (unicode) {
        // Two bytes per character: read twice as many bytes as (end - start).
        for (int y = start; y < end + (end - start); y += 2) {
          ch = (char)Utils.convertBytesToShort(_header, y);
          out.write(ch);
        }
      } else {
        // One byte per character.
        for (int y = start; y < end; y += 1) {
          ch = (char)Utils.convertUnsignedByteToInt(_header[y]);
          out.write(ch);
        }
      }
    }
  }

It should work then,

thanks,
raghu
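
The idea behind the change can be shown in isolation. The sketch below is not POI code: PieceDecoder and decodePiece are hypothetical names. It assumes, as the loop above does, that start is a byte offset into the stream, that the piece length is counted in characters, and that Unicode pieces hold two little-endian bytes per character. The single-byte branch maps each byte directly to a code point, as the code above does, which is a simplification.

```java
import java.nio.charset.StandardCharsets;

public class PieceDecoder {

    // Hypothetical helper, not part of POI.
    // stream:    raw bytes of the document stream
    // start:     byte offset where the piece begins
    // charCount: number of characters in the piece
    // unicode:   true if each character occupies two little-endian bytes
    static String decodePiece(byte[] stream, int start, int charCount, boolean unicode) {
        StringBuilder text = new StringBuilder(charCount);
        if (unicode) {
            int endByte = start + 2 * charCount; // two bytes per character
            for (int y = start; y < endByte; y += 2) {
                int low = stream[y] & 0xFF;
                int high = stream[y + 1] & 0xFF;
                text.append((char) ((high << 8) | low));
            }
        } else {
            int endByte = start + charCount; // one byte per character
            for (int y = start; y < endByte; y++) {
                text.append((char) (stream[y] & 0xFF));
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The sample characters from the report, stored as UTF-16LE: 4 characters, 8 bytes.
        String sample = "\u00fd\u00fd\u00fc\u00fc";
        byte[] unicodeBytes = sample.getBytes(StandardCharsets.UTF_16LE);
        String decoded = decodePiece(unicodeBytes, 0, 4, true);
        System.out.println("all characters recovered: " + sample.equals(decoded));
    }
}
```

Keeping the byte range as start + 2 * charCount in one place makes it easier to see whether the end bound is expressed in bytes or in characters, which is where the truncation described in this report would come from.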
Comment 2 Andy Oliver 2003-03-12 13:35:15 UTC
Attach this as a patch per the instructions
http://jakarta.apache.org/poi/getinvolved/index.html 
Comment 3 Taylan Ozgur YILDIRIM 2003-03-12 14:17:20 UTC
It doesn't work for long Unicode Word documents.
Comment 4 David Fisher 2009-11-19 21:18:29 UTC
HDF is not currently supported.