Bug 20060

Summary: [PATCH] HDF text extraction patch
Product: POI
Reporter: Serge Huber <shuber2>
Component: HDF
Assignee: POI Developers List <dev>
Severity: enhancement    
Priority: P3    
Version: 2.0-dev   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: CVS diff patch to enable text extraction of HDF documents
The new CVS diff, including Serge's modifications

Description Serge Huber 2003-05-20 14:15:08 UTC
I have been working on integration of POI with Lucene, mostly to get Word file
indexing working well enough to fit my needs. Despite the fact that I still have
some problems with some "complex" files, the result is acceptable for now.

I must admit that my modifications are quite "hacky", and I'm not sure if they
are fit for a real patch. Anyway, they work reasonably well for me, so they
might be useful to others (see results below).

The modifications I've made are:
- deactivate formatting parsing. I didn't need it so I commented out the
"findFormatting" in the WordDocument class
- small patches here and there to remove exceptions
- modifications to fall back to the main stream document text if parsing the
piece tables seemed to give nothing (there seem to be a lot of problems with
some files here, but I don't know the format well enough to be sure what
I'm doing). It also seems the binary file format document is not telling us
everything that is really going on here :(
- modifications in the writeAllText method of the WordDocument
- added @author tags in the modified files to comply with submission guidelines.

The results I got:
- I tested on the 384 Word files I found on my computer
- 1 couldn't be parsed at all because of a signature problem (POIFS problem?)
- 3 were actually RTF files so they are ignored
- 5 files seemed to have problems with piece tables. If I "Save As..." the files
to turn them into "simple" files, the text extraction works fine. The piece table
seemed to always point me to text after the value of fib.fcMax. Here I made a
patch that reverts to the main document text stream in this case
- 4 files had piece tables that covered some of the main document stream and
some parts outside, which means I only got part of the text in my extractions.
- the rest of the files worked very well !
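
The fallback used for the piece-table cases above could be sketched roughly as
follows. This is not the code from the attached patch: the class
PieceFallbackSketch, the Piece type and all parameter names are hypothetical,
the text is assumed to be single-byte (ISO-8859-1) for simplicity, and fcMin is
assumed to be the start offset that pairs with fib.fcMax.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

class PieceFallbackSketch {

    /** Hypothetical piece descriptor: a byte range inside the main stream. */
    static final class Piece {
        final int start;   // offset of the first byte of the piece
        final int length;  // number of bytes in the piece
        Piece(int start, int length) { this.start = start; this.length = length; }
    }

    /**
     * Concatenates the text of all pieces. If there are no pieces, or any piece
     * reaches past fcMax (or past the end of the stream), the piece table is
     * ignored and the plain range [fcMin, fcMax) is returned instead.
     */
    static String extractText(byte[] stream, int fcMin, int fcMax, List<Piece> pieces) {
        int limit = Math.min(fcMax, stream.length);
        boolean usable = !pieces.isEmpty();
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            long end = (long) p.start + p.length;
            if (p.start < 0 || p.length < 0 || end > limit) {
                usable = false;  // piece points outside the main text range
                break;
            }
            sb.append(new String(stream, p.start, p.length, StandardCharsets.ISO_8859_1));
        }
        if (usable) {
            return sb.toString();
        }
        // Fallback: read the main document text stream directly.
        int from = Math.max(0, Math.min(fcMin, stream.length));
        int to = Math.max(from, limit);
        return new String(stream, from, to - from, StandardCharsets.ISO_8859_1);
    }
}
```

A sketch like this rejects the whole piece table on the first bad piece, which
matches the "5 files" case but would not recover the partial text in the
"4 files" case.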

I'm sorry to say that most of these files are not test cases I could send off
as they are, since some of the data is personal and/or not for public eyes. I
also seemed to have problems with the test case files included in POI, which
don't even work in the real MS Word!

Basically what I can do now is this: I have a class with a method that looks
like this:

        public String HDFExtractor.getHDFContent(File f);

That gives me a String containing all the text of an HDF encoded file. I then
index this into Lucene to do the text indexing. It doesn't work with every Word
file I've encountered but it's better than nothing for me.
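
Since extraction does not succeed for every file, a batch run before indexing
probably wants to skip the failures rather than abort. The following is only a
sketch of that idea: HdfBatchSketch and the TextExtractor interface are
hypothetical stand-ins (the latter for a method like getHDFContent), and the
Lucene indexing step itself is deliberately left out.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HdfBatchSketch {

    /** Hypothetical stand-in for a method like getHDFContent(File). */
    interface TextExtractor {
        String extract(File f) throws Exception;
    }

    /**
     * Runs the extractor over every file. Files that throw or yield no text
     * are appended to 'failed'; the rest are returned as file -> text, ready
     * to be handed to whatever indexer is in use.
     */
    static Map<File, String> extractAll(List<File> files, TextExtractor extractor,
                                        List<File> failed) {
        Map<File, String> texts = new LinkedHashMap<>();
        for (File f : files) {
            try {
                String text = extractor.extract(f);
                if (text != null && !text.isEmpty()) {
                    texts.put(f, text);
                } else {
                    failed.add(f);
                }
            } catch (Exception e) {
                // Unparseable file (bad signature, RTF in disguise, ...): skip it.
                failed.add(f);
            }
        }
        return texts;
    }
}
```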
Comment 1 Serge Huber 2003-05-20 14:16:02 UTC
Created attachment 6423 [details]
CVS diff patch to enable text extraction of HDF documents
Comment 2 Thierry Guerin 2003-06-04 15:06:33 UTC
I've been working on the exact same thing, and I came up with different fixes
that lead to the same result, but without having to remove
the "findFormatting" call from the WordDocument class. I have now merged Serge's
patch with mine. The differences between Serge's modifications and mine are:
- Utils.convertBytesToShort: patch to avoid an ArrayIndexOutOfBoundsException
- WordDocument.printTable: patch to avoid a NullPointerException
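
For illustration only, a bounds check of the kind described for the
byte-to-short conversion might look like the sketch below. It is not the code
from the diff: BoundsCheckedBytes.toShort and its fallback parameter are
hypothetical, and returning a fallback value (instead of, say, throwing a more
descriptive exception) is just one possible choice. Little-endian byte order is
assumed, as used by the Word binary format.

```java
class BoundsCheckedBytes {

    /**
     * Reads a little-endian 16-bit value at 'offset'. If the two bytes are not
     * fully inside the array, 'fallback' is returned instead of letting an
     * ArrayIndexOutOfBoundsException escape.
     */
    static short toShort(byte[] data, int offset, short fallback) {
        if (data == null || offset < 0 || offset > data.length - 2) {
            return fallback;
        }
        int lo = data[offset] & 0xFF;      // low byte comes first
        int hi = data[offset + 1] & 0xFF;  // high byte second
        return (short) ((hi << 8) | lo);
    }
}
```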
As of now, the only Word documents that refuse to parse are the ones that
throw the "Invalid header signature" error (see bug 11506 for the files). I
may look into this in the future, but for now I have no time to do so.
Following this message you will find the resulting CVS Diff.
Please bear in mind that my modifications, though working, are based only on
fixes that seemed logical from a programming point of view (tests to avoid
an ArrayIndexOutOfBoundsException, etc.). I have _no_ knowledge of the Word file
format and in the process might have done something stupid.
Comment 3 Thierry Guerin 2003-06-04 15:08:33 UTC
Created attachment 6629 [details]
The new CVS diff, including Serge's modifications
Comment 4 Andy Oliver 2003-07-24 16:37:27 UTC
All dev moved to HWPF