Bug 51946 - [BUG] TextPieceTable <init> ArrayIndexOutOfBoundsException and IllegalStateException - Hong Kong encoding?
Status: RESOLVED DUPLICATE of bug 50955
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.8-dev
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2011-10-04 00:27 UTC by Jeremy
Modified: 2012-11-05 15:54 UTC
Description Jeremy 2011-10-04 00:27:55 UTC
Unable to include sample document due to sensitive nature.

If there are any pointers to utilities that can further investigate the documents, let me know and I'll see what further information I can supply.

A few of my documents trigger an arraycopy whose length is greater than the number of bytes remaining in the stream buffer.  The files open successfully in Word 2010 and may predate the Word 97 format.  The documents likely use an encoding from the Hong Kong region.


A couple produce the following stack trace (daily build):
Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:108)
	at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:71)
	at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
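
For readers following along, the failure mode described above (a copy length larger than what is left in the stream buffer) can be reproduced and guarded against in isolation. The following is only a minimal sketch, not the code in TextPieceTable: the class name SafePieceCopy, the helper copyClamped, and the buffer sizes are hypothetical and chosen purely for illustration.

```java
public class SafePieceCopy {

    /**
     * Copies up to 'length' bytes starting at 'start', clamping the
     * request to the bytes that actually remain in the stream buffer.
     * Hypothetical helper; not part of POI.
     */
    public static byte[] copyClamped(byte[] stream, int start, int length) {
        if (start < 0 || length < 0 || start > stream.length) {
            throw new IllegalArgumentException(
                    "start=" + start + ", length=" + length
                    + ", buffer=" + stream.length);
        }
        int available = stream.length - start;
        int toCopy = Math.min(length, available);
        byte[] piece = new byte[toCopy];
        System.arraycopy(stream, start, piece, 0, toCopy);
        return piece;
    }

    public static void main(String[] args) {
        byte[] stream = new byte[100]; // stand-in for the document stream

        // Unchecked copy: asks for 20 bytes when only 10 remain.
        try {
            System.arraycopy(stream, 90, new byte[20], 0, 20);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked copy failed: " + e);
        }

        // Clamped copy: returns only the 10 bytes that are available.
        System.out.println("clamped length = "
                + copyClamped(stream, 90, 20).length);
    }
}
```

Clamping hides the underlying inconsistency rather than explaining it, so a real fix would still need to work out why the declared piece length disagrees with the stream.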



More than a handful are caught earlier and produce this stack trace:
Caused by: java.lang.IllegalStateException: Told we're for characters 0 -> 6385, but actually covers 6373 characters!
	at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73)
	at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:115)
	at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:71)
	at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
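
The numbers in the exception message can be worked through directly: the piece is declared to cover 6385 - 0 = 6385 characters, but the decoded text has 6373, a shortfall of 12. One possible explanation, assumed here and not confirmed anywhere in this report, is that the piece length is counted in bytes while some characters occupy two bytes in a double-byte code page such as Big5. The sketch below illustrates that effect only; the class PieceLengthCheck, its methods, and the sample bytes are hypothetical.

```java
import java.nio.charset.Charset;

public class PieceLengthCheck {

    /** Number of characters a piece is declared to cover. */
    public static int declaredLength(int start, int end) {
        return end - start;
    }

    /** Number of characters obtained by decoding the raw piece bytes. */
    public static int decodedLength(byte[] raw, Charset charset) {
        return new String(raw, charset).length();
    }

    public static void main(String[] args) {
        // Figures taken from the exception message in this report.
        int shortfall = declaredLength(0, 6385) - 6373;
        System.out.println("shortfall = " + shortfall); // prints "shortfall = 12"

        // Hypothetical sample: 'A', one double-byte Big5 character, 'B'.
        byte[] raw = {0x41, (byte) 0xA4, (byte) 0xA4, 0x42};

        // Big5 is not one of the charsets every Java runtime must provide,
        // so check for it before decoding.
        if (Charset.isSupported("Big5")) {
            int chars = decodedLength(raw, Charset.forName("Big5"));
            System.out.println(raw.length + " bytes -> " + chars + " characters");
        } else {
            System.out.println("Big5 charset not available in this runtime");
        }
    }
}
```

If this assumption holds, a shortfall of 12 would correspond to 12 double-byte characters in the piece, but that is a guess until the files are inspected.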
Comment 1 Nick Burch 2011-10-04 08:52:55 UTC
You can use the Binary File Format Validator to check files are valid, see http://poi.apache.org/faq.html#faq-N10109

Also, have you tried with a recent svn checkout / recent nightly build?
Comment 2 Jeremy 2011-10-04 20:00:26 UTC
I'm currently using a nightly build for pretty much all of my investigation, and have actually had a bit of luck getting improvements submitted.

The problem with many of these documents is that they were created in older versions of Word, likely from 1995-2001.  They may also have originated in Asian countries.

The files aren't corrupt to the point where Word 2010 can't open them... but that's not saying too much.  I've encountered numerous header signature issues, which I'm mostly avoiding altogether since the largest % are from ~based files... though a few can be opened by Word.

I'll take a look at using the validator on a few of the files and see what I get in the next few days.


BTW, thanks Nick for the help on the Outlook issue #51873 a week ago.  If you get a chance, can you revisit my final message there?  There was a small bug in the patch you placed into the trunk for me.


Thanks again.


(In reply to comment #1)
> You can use the Binary File Format Validator to check files are valid, see
> http://poi.apache.org/faq.html#faq-N10109
> Also, have you tried with a recent svn checkout / recent nightly build?
Comment 3 Jeremy 2011-12-27 17:06:08 UTC
Added link to bug 52349
Comment 4 Sergey Vladimirov 2012-11-05 15:54:37 UTC
This is a duplicate of Bug 50955

*** This bug has been marked as a duplicate of bug 50955 ***