Created attachment 34282
triggering file

While working on TIKA-2069, I got an ArrayIndexOutOfBoundsException (AIOOBE) on a test file that I generated by taking the .docm that Jeff Swindle submitted and saving it as .doc. I confirmed this AIOOBE in pure POI:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
	at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:144)
	at org.apache.poi.util.RLEDecompressingInputStream.<init>(RLEDecompressingInputStream.java:77)
	at org.apache.poi.poifs.macros.VBAMacroReader.readModule(VBAMacroReader.java:204)
	at org.apache.poi.poifs.macros.VBAMacroReader.readMacros(VBAMacroReader.java:308)
	at org.apache.poi.poifs.macros.VBAMacroReader.findMacros(VBAMacroReader.java:155)
	at org.apache.poi.poifs.macros.VBAMacroReader.findMacros(VBAMacroReader.java:160)
	at org.apache.poi.poifs.macros.VBAMacroReader.findMacros(VBAMacroReader.java:160)
	at org.apache.poi.poifs.macros.VBAMacroReader.readMacros(VBAMacroReader.java:116)
	at org.apache.poi.poifs.macros.VBAMacroExtractor.extract(VBAMacroExtractor.java:83)
	at org.apache.poi.poifs.macros.VBAMacroExtractor.extract(VBAMacroExtractor.java:123)
	at org.apache.poi.poifs.macros.VBAMacroExtractor.main(VBAMacroExtractor.java:54)
The same exception occurs with the original .docm file that Jeff submitted on TIKA-2069.
I added a failing unit test to POI in r1761652 using test-macro-doc.docm from TIKA-2069 [1], submitted by Jeff Swindle.

[1] https://issues.apache.org/jira/browse/TIKA-2069
Slightly less than 50% of the macro exceptions are caused by this. See xlsx reports on https://issues.apache.org/jira/browse/TIKA-2104.
I think this is a problem in RLEDecompressingInputStream. In readChunk(), under

	if ((tokenFlags & POWER2[n]) == 0) {

if the value that is read is 0xFF, it becomes -1 when it gets cast to a byte. When we then call readInt() to get the module offset, read() returns -1 for that first byte, so we think we've hit the end of the stream and readInt() returns -1.
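To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the POI code: UnmaskedStream and this readInt() are hypothetical stand-ins, and I'm assuming a little-endian int reader that treats -1 from read() as end of stream.

```java
import java.io.InputStream;

public class FalseEofDemo {
    // Hypothetical stand-in for a buffer-backed stream; NOT the POI class.
    static class UnmaskedStream extends InputStream {
        private final byte[] buf;
        private int pos;

        UnmaskedStream(byte[] buf) {
            this.buf = buf;
        }

        @Override
        public int read() {
            if (pos >= buf.length) {
                return -1; // genuine end of stream
            }
            return buf[pos++]; // byte widens to int with sign extension: 0xFF becomes -1
        }
    }

    // Assumed reader: little-endian int, returns -1 as soon as read() reports -1.
    static int readInt(UnmaskedStream in) {
        int b0 = in.read();
        if (b0 == -1) return -1;
        int b1 = in.read();
        if (b1 == -1) return -1;
        int b2 = in.read();
        if (b2 == -1) return -1;
        int b3 = in.read();
        if (b3 == -1) return -1;
        return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
    }

    public static void main(String[] args) {
        // Little-endian encoding of 0x000003FF (1023); the first byte is 0xFF.
        byte[] data = { (byte) 0xFF, 0x03, 0x00, 0x00 };
        System.out.println(readInt(new UnmaskedStream(data))); // prints -1, not 1023
    }
}
```

All four bytes are available here, yet the reader reports end of stream because the first one happens to be 0xFF.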
In r1765433, I modified RLEDecompressingInputStream's read() from

	return buf[pos++];

to

	return buf[pos++] & 0xFF;

Let me know if we need to modify anything else in RLEDecompressingInputStream...or if there's a better place to fix this.
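For reference, a small sketch of why the mask helps. MaskedStream is a hypothetical class written for this comment, not the POI implementation; it only shows that masking keeps data bytes in the 0..255 range that InputStream.read() is documented to return, so -1 is left to mean end of stream.

```java
import java.io.InputStream;

public class MaskedReadDemo {
    // Hypothetical buffer-backed stream using the masked return; NOT the POI class.
    static class MaskedStream extends InputStream {
        private final byte[] buf;
        private int pos;

        MaskedStream(byte[] buf) {
            this.buf = buf;
        }

        @Override
        public int read() {
            if (pos >= buf.length) {
                return -1; // only a genuine end of stream yields -1
            }
            return buf[pos++] & 0xFF; // mask drops the sign extension: 0xFF stays 255
        }
    }

    public static void main(String[] args) {
        MaskedStream in = new MaskedStream(new byte[] { (byte) 0xFF, 0x7F });
        System.out.println(in.read()); // 255
        System.out.println(in.read()); // 127
        System.out.println(in.read()); // -1 (end of stream)
    }
}
```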