|Summary:||ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>|
|Product:||POI||Reporter:||Yurii Rashkovskii <yrashk>|
|Component:||POIFS||Assignee:||POI Developers List <dev>|
triggering file from govdocs1
list of files with similar exception in Tika's regression corpus
MPP file triggers error when read with MPXJ 7.0.2 (POI 3.17)
Description Yurii Rashkovskii 2017-10-27 10:49:40 UTC
I have an issue with a particular CFBF MSG file (unfortunately, due to non-disclosure and privacy issues, I can't show this file), I am running into an issue, where NPOIFSFileSytsem's/BlockStore's chain loop detector is failing with this kind of error: ``` java.lang.ArrayIndexOutOfBoundsException: -1557822031 at org.apache.poi.poifs.filesystem.BlockStore$ChainLoopDetector.claim(BlockStore.java:99) at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:441) ``` Upon inspection of the code, my understanding is that what happens here is that since int is signed (and in fact represents sector number within a valid range of 0x00000000 to 0xFFFFFFF9), the algorithm used in ChainLoopDetector is inherently flawed as it assumes that `offset` is always non-negative. However, if you look into `NPOIFSFileSystem.readCoreContents` I can see that the same value is also compared with the END_OF_CHAIN constant (reserved range), which means that the whole range from 0x00000000 to 0xFFFFFFF9 might be passed over to `ChainLoopDetector#claim`. Can anybody confirm if my hunch is correct? I'll gladly send a patch for the chain loop detector that'd work with `int` correctly.
Comment 1 Tim Allison 2017-10-27 13:07:51 UTC
Created attachment 35470 [details] triggering file from govdocs1 The attached file from govdocs1 shows a similar stacktrace. We have on the order of 100-200 ish documents in Tika's public regression set that trigger a similar trace. java.lang.ArrayIndexOutOfBoundsException: -65537 at org.apache.poi.poifs.filesystem.BlockStore$ChainLoopDetector.claim(BlockStore.java:99) at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:168) at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:142) at org.apache.poi.poifs.property.NPropertyTable.buildProperties(NPropertyTable.java:81) at org.apache.poi.poifs.property.NPropertyTable.<init>(NPropertyTable.java:66) at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:440) at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:235) at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:168) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:122) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188) at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74) at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)
Comment 2 Tim Allison 2017-10-27 13:10:03 UTC
I'm not extremely familiar with that chunk of code, but your hunch sounds right. Thank you very much for digging through our code to find the cause. Perhaps Nick might have an idea? Please do submit a patch and feel free to use the attached document for your unit test.
Comment 3 Tim Allison 2017-10-27 13:29:43 UTC
The earlier file I attached and nearly all of the other files that I've manually reviewed fail to open in MSWord. This ppt does open in MSPPT, but there are some issues with the images towards the end. Are you able to open your triggering file in Outlook?  http://220.127.116.11/docs/govdocs1/679/679046.ppt
Comment 4 Tim Allison 2017-10-27 13:41:49 UTC
Created attachment 35471 [details] list of files with similar exception in Tika's regression corpus prefix with http://18.104.22.168/docs to download.
Comment 5 Javen O'Neal 2017-10-27 15:56:48 UTC
Side-note: is anyone familiar with any Java developer tools that can analyze source or bytecode to find where an infinite loop could occur? If such a tool existed, would it be valuable to run this against our codebase as part of our test suite?
Comment 6 Nick Burch 2017-10-27 18:36:35 UTC
(In reply to Javen O'Neal from comment #5) > Side-note: is anyone familiar with any Java developer tools that can analyze > source or bytecode to find where an infinite loop could occur? If such a > tool existed, would it be valuable to run this against our codebase as part > of our test suite? Semmle / LGTM might be able to do this. There's a free version for open source projects - https://lgtm.com/
Comment 7 Yurii Rashkovskii 2017-10-31 04:35:54 UTC
Tim, Thanks a lot for supplying additional files. Indeed, it does seem like the doc files can't be opened by Word, so the point is moot there. The PPT file you attached doesn't seem to fail ChainLoopDetector at all. However, when I run it on my (unfortunately, undisclosable) file, it does trip ChainLoopDetector#claim with a negative index. However, once I replaced the boolean array with a HashSet<Integer> (to prevent the negative indices problem), I ran into a new issue. It looks like the problem is deeper than I originally thought. At this point it trips FileBackedDataSource and its use of FileChannel#position. Both expect longs, but since some of the legitimate offsets NPOIFSFileSystem#getBlockAt() receives are interpreted as negatives (since, again, int is signed, and the actual numbers in the format are unsigned integers) it ends up passing a negative offset to FileBackedDataSource, which, of course, fails. The most reasonable solution, it seems to me, is to rectify the problem at source. Using a signed integer where an unsigned one is expected is asking for trouble (and getting it). As this project should work on Java 1.6+, we can't use Java 8's Integer.toUnsignedLong, but we could use a very simple helper function for that. Following through, I tried to convert the signed ints to unsigned ints (within longs), only to get to further problem of FileBackedDataSource not be able to read any bytes at the given offset (returning its own IndexOutOfBoundsException). So at this point it feels to me as there are two potential reasons for this: 1) The file is corrupt but somehow handled fine by Microsoft Office. 2) POIFS' understanding of the format (could it be a compatibility issue of some kind?) is somehow wrong and the offsets read are not what they should be. Thoughts? --- Another issue is that it looks like the only file that seems to be working perfectly well for Microsoft Office AND using the full range of unsigned int32 values for offsets is just that MSG file I can't disclose. I didn't check ALL of your doc files, but the first sample I had all failed to be opened in Office. With that in mind, if we can't find any other file like this, what would be our course of action?
Comment 8 Yurii Rashkovskii 2017-10-31 05:54:31 UTC
I investigated this issue further. It looks like Outlook is able to render the email, one of the attachment is actually broken (too short). To me it feels like Outlook (or whatever library they use for working with CFBF) is rather lazy (won't read everything in upon initialization) and POIFS is effectively more eager. Do you think there is any truth to this and possible workarounds? I'd rather have a portion of that email recovered than have nothing at all (effectively, minic Outlook's behaviour)
Comment 9 Nick Burch 2017-11-02 07:45:40 UTC
We're moving to Java 8 for the next major release, so using Integer.toUnsignedLong might be possible going forward. However... I think there are some hard-baked assumptions in the code that no valid block will be beyond 2^31 mark. If that were to change, because of the way that Java arrays work (and we generally want to use Arrays not Maps/Sets for performance and memory utilisation), we'd have to change some places to potentially have a second array for the "negative signed ints" offsets For your problematic MSG file, how big is the file itself? And are these negative offsets occurring in the Mini Stream (small blocks) or the Normal / Main Stream (big blocks)? Can other tools (eg OpenOffice, libgsf) read the file ok? The spec <https://msdn.microsoft.com/en-us/library/dd942475.aspx> suggests that the maximum allowed size of a 512-blocksize file (eg MSG or DOC) is 2gb, so we shouldn't be anywhere near exceeding the 2^31 block index limit on big blocks. Even a mini stream (64 byte blocks for a 512 bigblock file) shouldn't, by my calculation, be able to exceed the 2^31 limit whilst still staying under 2gb for the overall filesize. I *think* that only 4096 byte block files should be able to get into the 0x80000000+ sector number range. Even if they did that with just the ministream, that's still a 128gb+ test file to trigger it, unless my maths is off!
Comment 10 Jon Iles 2017-12-21 12:45:57 UTC
I've hit a similar issue: Caused by: java.lang.ArrayIndexOutOfBoundsException: -3 at org.apache.poi.poifs.filesystem.BlockStore$ChainLoopDetector.claim(BlockStore.java:99) at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:168) at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:142) at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully(NDocumentInputStream.java:264) at org.apache.poi.poifs.filesystem.NDocumentInputStream.read(NDocumentInputStream.java:162) at org.apache.poi.poifs.filesystem.DocumentInputStream.read(DocumentInputStream.java:127) at org.apache.poi.poifs.filesystem.DocumentInputStream.read(DocumentInputStream.java:122) Sadly I can't share the MPP file that triggers this.