Unable to include a sample document due to its sensitive nature. If there are any pointers to utilities that can further investigate the documents, let me know and I'll see what further information I can supply.

A few of my documents are trying to perform an arraycopy with a length that's greater than the amount remaining in the stream buffer. The files open successfully in Word 2010, and may be in a format older than Word 97. The documents likely have an encoding from the Hong Kong region.

A couple produce the following stack trace (daily build):

Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:108)
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:71)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

More than a handful are caught earlier on and produce this stack trace:

Caused by: java.lang.IllegalStateException: Told we're for characters 0 -> 6385, but actually covers 6373 characters!
    at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73)
    at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:115)
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:71)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
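To illustrate the first failure, here is a minimal sketch of the kind of bounds check that would avoid the ArrayIndexOutOfBoundsException when a piece claims more bytes than the stream buffer has left. This is not POI's actual code: the class PieceBounds and the methods clampedLength and copyPiece are hypothetical names made up for this example, and truncating the piece is only one possible policy (rejecting the file is another).

```java
/** Hypothetical helper for illustration only; not part of Apache POI. */
final class PieceBounds {
    private PieceBounds() {}

    /**
     * Returns how many bytes can really be read from 'stream' starting at
     * 'start', capped at the 'requested' length declared by the document.
     */
    static int clampedLength(byte[] stream, int start, int requested) {
        if (start < 0 || requested < 0 || start > stream.length) {
            throw new IllegalArgumentException(
                "Bad piece bounds: start=" + start + ", requested=" + requested
                    + ", stream length=" + stream.length);
        }
        // Never ask arraycopy for more than what remains in the buffer.
        return Math.min(requested, stream.length - start);
    }

    /** Copies the piece, truncating it if the declared length overruns the stream. */
    static byte[] copyPiece(byte[] stream, int start, int requested) {
        int length = clampedLength(stream, start, requested);
        byte[] buf = new byte[length];
        System.arraycopy(stream, start, buf, 0, length);
        return buf;
    }
}
```

With a 10-byte stream, asking for 20 bytes from offset 4 would return the 6 bytes that remain instead of throwing. Whether a truncated piece is acceptable for text extraction is a judgment call; the sketch only shows where such a check could sit.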
You can use the Binary File Format Validator to check files are valid, see http://poi.apache.org/faq.html#faq-N10109 Also, have you tried with a recent svn checkout / recent nightly build?
I'm currently using a nightly build for pretty much all of my investigation, and have actually had a bit of luck getting improvements submitted.

The problem with many of these documents is that they are from older versions of Word, likely from 1995-2001. They also may have originated in Asian countries. The files aren't so corrupt that Word 2010 can't open them, but that's not saying much. I've encountered numerous header signature issues, which I'm more or less avoiding altogether since the largest percentage are from ~based files, though a few can be opened by Word.

I'll take a look at using the validator on a few of the files and see what I get in the next few days.

BTW, thanks Nick for the help on the Outlook issue #51873 a week ago. If you get a chance, can you revisit my final message there? There was a small bug in the patch you placed into the trunk for me. Thanks again.

(In reply to comment #1)
> You can use the Binary File Format Validator to check files are valid, see
> http://poi.apache.org/faq.html#faq-N10109
> Also, have you tried with a recent svn checkout / recent nightly build?
Added link to bug 52349
This is a duplicate of Bug 50955

*** This bug has been marked as a duplicate of bug 50955 ***