Bug 17824 - about reading ms. doc file
Summary: about reading ms. doc file
Alias: None
Product: POI
Classification: Unclassified
Component: HDF
Version: unspecified
Hardware: Sun other
Importance: P3 major
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2003-03-10 12:59 UTC by Taylan Ozgur YILDIRIM
Modified: 2009-11-19 21:18 UTC
Description Taylan Ozgur YILDIRIM 2003-03-10 12:59:25 UTC
When I read an MS .doc file using the HDF classes, I run into a big problem. If
my data is not Unicode and contains only English characters, there is no
problem. But when the text uses Unicode or UTF-8 characters, not all of the data
is read: reading stops partway through the text. For example, if demo.doc
contains:  ýýüü
then what we read back is:  ýýü
and the missing part keeps growing like this.

My example is given below.

public class Deneme {

	public static void main(String[] args) {
		testDoc deneme = new testDoc("demo.doc","demo.txt");
	}
}

//------- this code writes the .doc file out as .txt -----------
//------- get the HDF libs from jakarta.poi (in the scratchpad at the moment) -------
import org.apache.poi.hdf.extractor.util.*;
import org.apache.poi.hdf.extractor.data.*;
import org.apache.poi.hdf.extractor.*;
import java.util.*;
import java.io.*;
import javax.swing.*;

import java.awt.*;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.POIFSDocument;
import org.apache.poi.poifs.filesystem.DocumentEntry;

import org.apache.poi.util.LittleEndian;

class testDoc extends Deneme {
	String origFileName;
	String tempFile;
	WordDocument wd;

	testDoc(String origFileName, String tempFile) {
		// constructor body is missing from the report
	}

	public void getText() {
		try {
			wd = new WordDocument(origFileName);
			//Writer out = new BufferedWriter(new FileWriter(tempFile)); // the old version
			Writer out = new OutputStreamWriter(new FileOutputStream(tempFile), "utf-8");
			// rest of the try block is missing from the report
		}
		catch (Exception eN) {
			System.out.println("Error reading document:" + origFileName + "\n" + eN.toString());
		}
	} // end of getText

} // end of class
The problem starts in the method that reads the text. When we look at this
method, we see that the end integer does not reach the real end point when the
MS .doc file uses Unicode.

Thank you for your support.
Comment 1 Taylan Ozgur YILDIRIM 2003-03-12 05:43:13 UTC
Fixed by Raghuram Velega:
I fixed the bug, which was in the WordDocument.java class. Replace the
following segment of the code:

  public void writeAllText(Writer out) throws IOException
  {
    int textStart = Utils.convertBytesToInt(_header, 0x18);
    int textEnd = Utils.convertBytesToInt(_header, 0x1c);
    ArrayList textPieces = findProperties(textStart, textEnd, 
    int size = textPieces.size();

    for(int x = 0; x < size; x++)
    {
      TextPiece nextPiece = (TextPiece)textPieces.get(x);
      int start = nextPiece.getStart();
      int end = nextPiece.getEnd();
      boolean unicode = nextPiece.usesUnicode();
      char ch;
      if(unicode)
      {
        // Unicode piece: two bytes per character, so read twice as many bytes
        for(int y = start; y < end + (end - start); y += 2)
        {
          ch = (char)Utils.convertBytesToShort(_header, y);
          out.write(ch);
        }
      }
      else
      {
        // Non-Unicode piece: one byte per character
        for(int y = start; y < end; y += 1)
        {
          ch = (char)Utils.convertUnsignedByteToInt(_header[y]);
          out.write(ch);
        }
      }
    }
  }

It should work then.
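
For readers who want to see the effect of the changed loop bound in isolation, here is a minimal standalone sketch. It does not use the POI classes: PieceDecodeSketch and decodePiece are hypothetical names, and it assumes that (end - start) in the fix above counts characters, so a Unicode piece occupies twice that many bytes. The non-Unicode branch simply widens each byte to a char, as the code above does, which only matches the real code page for Latin-1 characters.

```java
import java.nio.charset.StandardCharsets;

public class PieceDecodeSketch {

    // Hypothetical helper, not part of POI. start is a byte offset into buf;
    // charCount is the number of characters in the piece (assumed to be what
    // end - start means in the fix above).
    static String decodePiece(byte[] buf, int start, int charCount, boolean unicode) {
        StringBuilder sb = new StringBuilder(charCount);
        if (unicode) {
            // Two bytes per character, little-endian, so the byte range is
            // 2 * charCount long rather than charCount.
            for (int y = start; y < start + 2 * charCount; y += 2) {
                int lo = buf[y] & 0xFF;
                int hi = buf[y + 1] & 0xFF;
                sb.append((char) ((hi << 8) | lo));
            }
        } else {
            // One byte per character, widened to an unsigned value.
            for (int y = start; y < start + charCount; y++) {
                sb.append((char) (buf[y] & 0xFF));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00fd\u00fd\u00fc\u00fc"; // the sample string from the report
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);

        // Reading 2 * 4 bytes recovers all four characters.
        System.out.println(decodePiece(utf16, 0, 4, true).equals(text));

        // Treating the 8-byte range as if it held only 2 characters loses the tail.
        System.out.println(decodePiece(utf16, 0, 2, true).length());
    }
}
```

If a loop stops at a byte offset computed from the character count, it only covers half of a two-byte-per-character piece, which is consistent with the truncated output described in the report.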

Comment 2 Andy Oliver 2003-03-12 13:35:15 UTC
Attach this as a patch per the instructions
Comment 3 Taylan Ozgur YILDIRIM 2003-03-12 14:17:20 UTC
It doesn't work for long Unicode Word documents.
Comment 4 David Fisher 2009-11-19 21:18:29 UTC
HDF is not currently supported.