On TIKA-2142, Seva Alekseyev shared a PPT that causes an AIOOBE when reading the "Pictures" stream. When the exception is hit, the parser has read in 81 pictures. However, when we open the file in PPT and count the number of images and when we count the number of HSLFPictureShapes we get when we iterate through the slides programmatically, there are only 78 images. If I understand correctly, when we read the "Pictures" stream, we don't know how many records to expect. We read the Pictures stream before we parse the slides and determine which are most recent etc., and there is no header on the Picture stream (OfficeArtBStoreDelay). I propose we check for an AIOOBE in 'readPictures' within HSLFSlideShowImpl and log a warning if we would have hit one.
r1767023 I did a bit of testing, and if the pictstream is truncated on an image that _should_ appear in a slide, the consumer will get a null when retrieving the missing image's pictdata.
The wrong imagesize is based on a premature end of reading the last picture and then the imagesize is just some garbage at the end of that picture. We shouldn't yet close this issue ...
Doh. Thank you.
Does that mean that the previous image's length is wrong? If I understand correctly, there are two faulty png headers at the end of the stream. The first one says the image should be 414273 (065241) bytes long. Then ~75 bytes later, there's another PNG signature, and its header says that the image should be 273665 (042D01). If I copy and paste from the second PNG signature to the end (042D09), I get a png file that I can open. If I actually truncate it to 273665 (042D01), I get a corrupt file.
I can confirm your comment #5, i.e. I was wrong in #3. I guess this truncated picture is a rare case, and we don't need to try harder to find the next pictures in the sequence. If this would happen more often in the corpus, we could try to check, if we point beyond EOF (= the current patch) or if the referenced address contains a invalid signature. In both cases, we could then scan through the byte array to find a valid signature and continue from there. I guess you don't object, if I close the bug with your resolution.
No objection resolving as fixed. I agree that the image stream is truly corrupt in this one file. We will likely run into cases where the image stream is corrupted/truncated _and_ there is actually a reference to the corrupted/truncated in a slide. In that case, there will be an NPE when the user tries to get the image data. Given the craziness in the world, I'm ok with that. I'd far prefer to prevent an OOM/AIOOBE if the data is corrupt. Thank you for looking into this!