Bug 51902 - [PATCH] Picture.fillRawImageContent - ArrayIndexOutOfBoundsException
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.8-dev
Hardware: PC All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
Duplicates: 51890
Depends on:
Blocks: 51974
Reported: 2011-09-27 23:48 UTC by Jeremy
Modified: 2011-10-12 13:00 UTC

Patch for issue (984 bytes, application/octet-stream)
2011-09-27 23:48 UTC, Jeremy
Replaces initial patch.. improved (985 bytes, patch)
2011-09-28 01:53 UTC, Jeremy

Description Jeremy 2011-09-27 23:48:05 UTC
Created attachment 27617
Patch for issue

Found a handful of Word files that cause an ArrayIndexOutOfBoundsException. (Unable to attach a sample due to the sensitive nature of the files.)

Patch included. Essentially, the pictureBytesStartOffset used for a System.arraycopy() call is sometimes set to a negative value.

The fix keeps the existing less-than check but also makes sure the new value is greater than zero before using it in place of the default PICTF1BlockOffset.

Stack Trace: (POI-3.8-beta4)

Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.poi.hwpf.usermodel.Picture.fillRawImageContent(Picture.java:363)
	at org.apache.poi.hwpf.usermodel.Picture.getRawContent(Picture.java:203)
	at org.apache.poi.hwpf.usermodel.Picture.fillImageContent(Picture.java:372)
	at org.apache.poi.hwpf.usermodel.Picture.getContent(Picture.java:191)
	at org.apache.poi.hwpf.usermodel.Picture.suggestPictureType(Picture.java:330)
	at org.apache.poi.hwpf.usermodel.Picture.suggestFileExtension(Picture.java:315)
	at org.apache.poi.hwpf.usermodel.Picture.suggestFullFileName(Picture.java:150)
	at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:504)
	at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:488)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:81)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:196)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 45 more
Comment 1 Jeremy 2011-09-28 01:53:51 UTC
Created attachment 27619
Replaces initial patch.. improved

Changes the check to greater than or equal to 0.
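
For readers without the attachment, the guard described in this report might look roughly like the sketch below. This is not the attached patch and not POI's actual code: the class name, method names, and parameters are hypothetical, and the "less-than check" is assumed here to compare the candidate offset against the length of the data stream. Only pictureBytesStartOffset, PICTF1BlockOffset, and System.arraycopy() come from the report itself.

```java
// Hypothetical sketch of the offset guard; names are illustrative only.
class OffsetGuardSketch {

    /**
     * Returns candidateOffset only when it is a usable index into the data
     * stream (non-negative and below dataLength, which is an assumed upper
     * bound); otherwise falls back to defaultOffset, standing in for the
     * report's PICTF1BlockOffset.
     */
    static int chooseStartOffset(int candidateOffset, int defaultOffset, int dataLength) {
        if (candidateOffset >= 0 && candidateOffset < dataLength) {
            return candidateOffset;
        }
        return defaultOffset;
    }

    /**
     * Copies size bytes starting at startOffset. System.arraycopy throws an
     * IndexOutOfBoundsException if startOffset is negative, which is the
     * failure shown in the stack trace.
     */
    static byte[] copyPictureBytes(byte[] dataStream, int startOffset, int size) {
        byte[] raw = new byte[size];
        System.arraycopy(dataStream, startOffset, raw, 0, size);
        return raw;
    }
}
```

With this guard, a negative candidate such as -4 is rejected in favour of the default, while a candidate of 0 is accepted because 0 is a valid array index.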
Comment 2 Sergey Vladimirov 2011-09-30 13:24:51 UTC
Please provide an example doc. It is possible that we will be able to handle the image correctly. You can send it to my email privately.
Comment 3 Sergey Vladimirov 2011-09-30 13:26:26 UTC
*** Bug 51890 has been marked as a duplicate of this bug. ***
Comment 4 Sergey Vladimirov 2011-09-30 15:50:23 UTC
Image loading is completely rewritten. Please, check r1177710 or later.
Comment 5 pqueixalos 2011-09-30 16:04:05 UTC
Works like a charm.

(In reply to comment #4)
> Image loading is completely rewritten. Please, check r1177710 or later.
Comment 6 Jeremy 2011-10-03 19:24:38 UTC
Agreed, it definitely works for the files I was having an issue with before. Thanks very much for your attention to this matter.
Comment 7 Jeremy 2011-10-12 13:00:50 UTC

Thanks for this rework of the Picture handling logic for MS Word documents. It seems to have fixed many of the random bugs that would pop up across a large and varied data set. I did, however, uncover one bug that was introduced by these fixes, and have supplied a patch. When you get a chance, could you please take a look at Bug 51974 (https://issues.apache.org/bugzilla/show_bug.cgi?id=51974)?

It's essentially a null pointer exception, not present before the rework, that is encountered when parsing text via Tika.

Thanks in advance,


(In reply to comment #4)
> Image loading is completely rewritten. Please, check r1177710 or later.