I have been working on integrating POI with Lucene, mostly to get Word file indexing working well enough for my needs. Although I still have problems with some "complex" files, the result is acceptable for now. I must admit that my modifications are quite "hacky", and I'm not sure they are fit for a real patch. Still, they work reasonably well for me, so they might be useful to others (see results below).

The modifications I've made are:
- deactivated formatting parsing. I didn't need it, so I commented out "findFormatting" in the WordDocument class
- small patches here and there to remove exceptions
- modifications to fall back to the main stream document text if parsing the piece tables seemed to give nothing (there seem to be a lot of problems with some files here, but I don't know the format well enough to know what I'm doing). It also seems the binary file format document is not telling us everything that is really going on here :(
- modifications to the writeAllText method of WordDocument
- added @author tags in the modified files to comply with the submission guidelines

The results I got:
- I tested on the 384 Word files I found on my computer
- 1 couldn't be parsed at all because of a signature problem (POIFS problem?)
- 3 were actually RTF files, so they are ignored
- 5 files seemed to have problems with piece tables. If I "Save As..." these files to turn them into "simple" files, the text extraction works fine. The piece table always seemed to point me to text after the value of fib.fcMax. I made a patch that reverts to the main document text stream in this case
- 4 files had piece tables that covered some of the main document stream and some parts outside it, which means I only got part of the text in my extractions
- the rest of the files worked very well!

I'm sorry to say that most of these files are not test cases I could send off as they are, since some of the data is personal and/or not for public eyes.
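To make the piece table fall-back described above concrete, here is a minimal sketch. It is not the code from the patch: the Piece class, the method names, the fcMin parameter and the single-byte (ISO-8859-1) text decoding are all assumptions made for illustration. It falls back to the main text range whenever a piece points outside it or the pieces yield no text.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PieceTableFallback {

    /** Hypothetical stand-in for one piece table entry: a byte range [start, end) in the stream. */
    public static final class Piece {
        final int start;
        final int end;

        public Piece(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Builds the text from the pieces. If any piece lies outside [fcMin, fcMax),
     * or the pieces produce no text at all, returns the main stream text instead.
     */
    public static String extractText(byte[] stream, List<Piece> pieces, int fcMin, int fcMax) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            if (p.start < fcMin || p.end > fcMax || p.start > p.end) {
                // Piece points outside the main text range: do not trust the piece table.
                return mainStreamText(stream, fcMin, fcMax);
            }
            // Assumption: one byte per character; real files may also hold 16-bit text.
            sb.append(new String(stream, p.start, p.end - p.start, StandardCharsets.ISO_8859_1));
        }
        if (sb.length() == 0) {
            return mainStreamText(stream, fcMin, fcMax);
        }
        return sb.toString();
    }

    /** Reads the raw text between fcMin and fcMax, clamped to the stream length. */
    static String mainStreamText(byte[] stream, int fcMin, int fcMax) {
        int from = Math.max(0, Math.min(fcMin, stream.length));
        int to = Math.max(from, Math.min(fcMax, stream.length));
        return new String(stream, from, to - from, StandardCharsets.ISO_8859_1);
    }
}
```

Falling back as soon as one piece is out of range is a deliberate simplification. It would avoid returning only part of the text for files whose piece tables are partly inside and partly outside the main stream, at the cost of ignoring the pieces that were valid.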
I also seemed to have problems with the test case files included in POI, which don't even work in the real MS Word! Basically, what I can do now is this: I have a class with a method that looks like this:

public String HDFExtractor.getHDFContent(File f);

It gives me a String containing all the text of an HDF-encoded file. I then pass this to Lucene to do the text indexing. It doesn't work with every Word file I've encountered, but it's better than nothing for me.
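As a rough illustration of how such a method could be used over a batch of files before indexing, here is a minimal sketch. The TextExtractor interface is a hypothetical stand-in for HDFExtractor.getHDFContent(File), and the class and method names are made up for this example. The Lucene indexing step itself is left out.

```java
import java.io.File;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class BatchExtract {

    /** Hypothetical stand-in for HDFExtractor.getHDFContent(File). */
    public interface TextExtractor {
        String getContent(File f) throws IOException;
    }

    /**
     * Runs the extractor over every file and returns path -> text.
     * Files that fail to parse, or that yield no text, are skipped
     * so that one bad document does not stop the whole run.
     */
    public static Map<String, String> extractAll(File[] files, TextExtractor extractor) {
        Map<String, String> texts = new LinkedHashMap<>();
        for (File f : files) {
            try {
                String text = extractor.getContent(f);
                if (text != null && !text.isEmpty()) {
                    texts.put(f.getPath(), text);
                }
            } catch (IOException | RuntimeException e) {
                // e.g. a file with an invalid header signature
                System.err.println("Skipping " + f + ": " + e);
            }
        }
        return texts;
    }
}
```

Each entry of the returned map could then be handed to whatever indexer is in use, with the path as the document identifier and the String as the indexed text.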
Created attachment 6423
CVS diff patch to enable text extraction of HDF documents
I've been working on the exact same thing, and I came up with different fixes that led to the same result, but without having to remove "findFormatting" from the WordDocument class. I have now merged Serge's patch with mine.

The differences between Serge's modifications and mine are:
- Utils.convertBytesToShort: patch to avoid an ArrayIndexOutOfBoundsException
- WordDocument.printTable: patch to avoid a NullPointerException

As of now, the only Word documents that refuse to parse are the ones that throw the "Invalid header signature" error (see bug 11506 for the files). I may look into this in the future, but for now I have no time to do so. Following this message you will find the resulting CVS diff.

Please bear in mind that my modifications, though working, are based only on fixes that seemed logical from a programming point of view (tests to avoid ArrayIndexOutOfBoundsExceptions, etc.). I have _no_ knowledge of the Word file format and in the process might have done something stupid.
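The kind of defensive test mentioned above can be sketched as follows. This is not the code from the attached diff: the class name, the method name and the choice to return 0 for an out-of-range offset are assumptions for illustration. The only fact relied on is that Word binary files store multi-byte values in little-endian order.

```java
public class SafeBytes {

    /**
     * Reads a little-endian 16-bit value at the given offset.
     * Returns 0 instead of throwing when the two bytes are not
     * fully inside the array (assumed fallback value).
     */
    public static short toShortLE(byte[] data, int offset) {
        if (data == null || offset < 0 || offset > data.length - 2) {
            return 0;
        }
        int low = data[offset] & 0xFF;       // mask to undo sign extension
        int high = data[offset + 1] & 0xFF;
        return (short) (low | (high << 8));
    }
}
```

Returning a default value keeps the extraction going on damaged files, but it also hides the fact that the offset was wrong, so the resulting text may be incomplete without any error being reported.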
Created attachment 6629
The new CVS diff, including Serge's modifications
All dev moved to HWPF