Bug 60305

Summary: Gracefully handle AIOOBE in Pictures stream in HSLF
Product: POI Reporter: Tim Allison <tallison>
Component: HSLFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Windows NT   

Description Tim Allison 2016-10-25 12:56:17 UTC
On TIKA-2142, Seva Alekseyev shared a PPT that causes an AIOOBE when reading the "Pictures" stream.  When the exception is hit, the parser has read in 81 pictures.  However, when we open the file in PPT and count the number of images and when we count the number of HSLFPictureShapes we get when we iterate through the slides programmatically, there are only 78 images.

If I understand correctly, when we read the "Pictures" stream, we don't know how many records to expect.  We read the Pictures stream before we parse the slides and determine which are most recent etc., and there is no header on the Picture stream (OfficeArtBStoreDelay). 

I propose we check for an AIOOBE in 'readPictures' within HSLFSlideShowImpl and log a warning if we would have hit one.
Comment 1 Tim Allison 2016-10-28 13:40:51 UTC
r1767023

I did a bit of testing, and if the pictstream is truncated on an image that _should_ appear in a slide, the consumer will get a null when retrieving the missing image's pictdata.
Comment 2 Tim Allison 2016-10-28 13:41:17 UTC
r1767023

I did a bit of testing, and if the pictstream is truncated on an image that _should_ appear in a slide, the consumer will get a null when retrieving the missing image's pictdata.
Comment 3 Andreas Beeker 2016-10-28 22:07:49 UTC
The wrong imagesize is based on a premature end of reading the last picture and then the imagesize is just some garbage at the end of that picture.
We shouldn't yet close this issue ...
Comment 4 Tim Allison 2016-10-28 22:14:03 UTC
Doh.  Thank you.
Comment 5 Tim Allison 2016-10-31 20:04:47 UTC
Does that mean that the previous image's length is wrong?

If I understand correctly, there are two faulty png headers at the end of the stream.  The first one says the image should be 414273 (065241) bytes long.  Then ~75 bytes later, there's another PNG signature, and its header says that the image should be 273665 (042D01).  If I copy and paste from the second PNG signature to the end (042D09), I get a png file that I can open.  If I actually truncate it to 273665 (042D01), I get a corrupt file.
Comment 6 Andreas Beeker 2016-11-02 23:12:25 UTC
I can confirm your comment #5, i.e. I was wrong in #3.
I guess this truncated picture is a rare case, and we don't need to try harder  to find the next pictures in the sequence.

If this would happen more often in the corpus, we could try to check, if we point beyond EOF (= the current patch) or if the referenced address contains a invalid signature. In both cases, we could then scan through the byte array to find a valid signature and continue from there.

I guess you don't object, if I close the bug with your resolution.
Comment 7 Tim Allison 2016-11-03 11:33:28 UTC
No objection resolving as fixed.  

I agree that the image stream is truly corrupt in this one file. We will likely run into cases where the image stream is corrupted/truncated _and_ there is actually a reference to the corrupted/truncated in a slide.  In that case, there will be an NPE when the user tries to get the image data. 

Given the craziness in the world, I'm ok with that.  I'd far prefer to prevent an OOM/AIOOBE if the data is corrupt.

Thank you for looking into this!