Summary: | Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88 | ||
---|---|---|---|
Product: | POI | Reporter: | sharathkumarmn |
Component: | HPBF | Assignee: | POI Developers List <dev> |
Status: | NEW --- | ||
Severity: | normal | CC: | gaurav.chd3 |
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | test document |
Knowing nothing about the Compound File Binary Format (is this the same as or a predecessor to OLE2 containers?) [1,2]:

    CHNKINK record offset = 0x8200
    QC Bit offset = 0x8340 - 0x8200 = 0x0140

Annotated contents of data[offset:offset+24]:

             +0    | +2          | +6    | +8    | +10   | +12         | +16         | +20
             recID | thingType   | optA  | optB  | optC  | bitType     | from        | len
    00008340 18 00 | 54 4f 4b 4e | 00 00 | 01 00 | 00 00 | 50 4c 43 20 | 32 62 00 00 | 58 00 00 00
    data     QCBit | "TOKN"      | false | true  | false | "PLC "      | 0x6232      | 0x58 = 88 bytes

    Location     Len  Hex Value                Field Meaning (Little Endian conv, ASCII, hex to dec, etc)
    00008200+00  [8]  43 48 4e 4b 49 4e 4b 20  "CHNKINK "
    ...
    00008340+00  [2]  18 00                    QC Bit recID
    00008340+02  [4]  54 4f 4b 4e              thingType "TOKN"
    00008340+06  [2]  00 00                    optA 0x0000 -> false
    00008340+08  [2]  01 00                    optB 0x0001 -> true
    00008340+10  [2]  00 00                    optC 0x0000 -> false
    00008340+12  [4]  50 4c 43 20              bitType "PLC "
    00008340+16  [4]  32 62 00 00              data from 0x6232, the byte offset from the beginning of the CHNKINK record at 0x8200
    00008340+20  [4]  58 00 00 00              data len 0x58 = 88 bytes
    ...

And the raw QCPLCBit record at 0x8200+0x6232=0xe432:

    0000e430        03 00 00 00 0c 00  00 00 ff ff 01 00 06 01  |..............|
    0000e440  00 00 11 01 00 00 4e 07  00 00 5a 07 00 00 16 00  |......N...Z.....|
    0000e450  00 00 00 22 00 06 00 00  01 22 09 00 00 00 02 22  |..."....."....."|
    0000e460  07 00 00 00 0a 00 00 00  01 22 0f 00 00 00 0a 00  |........."......|
    0000e470  00 00 01 22 0a 00 00 00  0a 00 00 00 00 22 ff ff  |..."........."..|
    0000e480  ff ff 04 00 00 00 04 00  00 00                    |..........|

Interpreting the QCPLCBit:

    0000e432+0  03 00 00 00  3                                       number of PLCs
    0000e432+4  0c 00 00 00  Type12 (holds hyperlinks, complicated)  type of PLCs
    ...

The QC Bit header specifies the QC PLC Bit record has a length of 88 bytes. The QCPLCBit specifies it contains 3 hyperlink PLCs (Type 12, 0x0c). From how I interpret the current code, there's no way that 3 PLC hyperlinks can be specified in 88 bytes.
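To make the annotated layout above easier to check, here is a minimal sketch that decodes the same 24 header bytes. It assumes the field widths and little-endian byte order exactly as annotated in this comment; `QcBitHeaderSketch`, `Header`, `SAMPLE`, and `parse` are hypothetical names for illustration and are not part of POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class QcBitHeaderSketch {
    /** Hypothetical value holder for the annotated fields; not a POI class. */
    static final class Header {
        final int recId;
        final String thingType;
        final boolean optA, optB, optC;
        final String bitType;
        final long dataFrom; // byte offset from the start of the CHNKINK record
        final long dataLen;  // length of the referenced record in bytes

        Header(int recId, String thingType, boolean optA, boolean optB, boolean optC,
               String bitType, long dataFrom, long dataLen) {
            this.recId = recId;
            this.thingType = thingType;
            this.optA = optA;
            this.optB = optB;
            this.optC = optC;
            this.bitType = bitType;
            this.dataFrom = dataFrom;
            this.dataLen = dataLen;
        }
    }

    /** The 24 bytes shown at 0x8340 in the dump above. */
    static final byte[] SAMPLE = {
        0x18, 0x00,             // recID
        0x54, 0x4f, 0x4b, 0x4e, // thingType "TOKN"
        0x00, 0x00,             // optA
        0x01, 0x00,             // optB
        0x00, 0x00,             // optC
        0x50, 0x4c, 0x43, 0x20, // bitType "PLC "
        0x32, 0x62, 0x00, 0x00, // data from
        0x58, 0x00, 0x00, 0x00  // data len
    };

    /** Decodes one header; wrap() throws IndexOutOfBoundsException if fewer than 24 bytes remain. */
    static Header parse(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 24).order(ByteOrder.LITTLE_ENDIAN);
        int recId = buf.getShort() & 0xFFFF;
        String thingType = ascii4(buf);
        boolean optA = buf.getShort() != 0;
        boolean optB = buf.getShort() != 0;
        boolean optC = buf.getShort() != 0;
        String bitType = ascii4(buf);
        long dataFrom = buf.getInt() & 0xFFFFFFFFL;
        long dataLen = buf.getInt() & 0xFFFFFFFFL;
        return new Header(recId, thingType, optA, optB, optC, bitType, dataFrom, dataLen);
    }

    /** Reads the next four bytes as an ASCII tag such as "TOKN" or "PLC ". */
    private static String ascii4(ByteBuffer buf) {
        byte[] raw = new byte[4];
        buf.get(raw);
        return new String(raw, StandardCharsets.US_ASCII);
    }
}
```

Calling `QcBitHeaderSketch.parse(QcBitHeaderSketch.SAMPLE, 0)` yields `dataFrom` 0x6232 and `dataLen` 88; adding the CHNKINK offset 0x8200 to `dataFrom` gives the 0xe432 location of the raw record.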
> oneStartsAt = 0x4c
> twoStartsAt = 0x68
> threePlusIncrement = 22

Therefore three probably starts at 0x68+22=0x7e and ends at 0x68+22*2=0x94. With 0x58=88 bytes, there aren't even enough bytes for a second, let alone a third PLC.

I guess we'd have to consult [MS-CFB][2] to figure out whether this QCPLCBit record really can be 88 bytes long, or whether the file is corrupt and reading these hyperlinks is silently skipped.

[1] https://en.wikipedia.org/wiki/Compound_File_Binary_Format
[2] https://msdn.microsoft.com/en-us/library/dd942138.aspx

The last real change to HPBF hyperlink support was nearly 9 years ago, and even then the commit message indicated only partial hyperlink support. So it's quite likely that we haven't fully implemented all hyperlink variations. I expected to see some hyperlink URL as a string in the hexdump, but perhaps this is a hyperlink to another element within the document. Nonetheless, there are some nuggets of insight in the comments and variable names that help to figure out what's going on in this QC PLC hyperlink bit.
https://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hpbf/model/qcbits/QCPLCBit.java?r1=690729&r2=690534

The Microsoft Publisher binary .pub format is undocumented, as indicated here: https://poi.apache.org/hpbf/index.html
OpenOffice/LibreOffice doesn't have documentation or an open source application that reads this .pub format, to my knowledge, so we'd have to resort to figuring out the format through lots of hard work.

Assuming the file you have provided is valid (opens without warnings or errors in Microsoft Publisher), if you're mostly interested in text extraction, then skipping over this hyperlink is probably preferable to throwing an exception. We can log the error that we catch and move forward with extraction.

Workaround applied in r1801405. Will be included in POI 3.17 beta 2.
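The offset arithmetic and the "log and move forward" idea discussed above can be sketched as follows. This is not the change made in r1801405; it only reuses the three constants quoted from QCPLCBit.java and assumes, as the comment does, that entry n (for n >= 2) starts at twoStartsAt + (n - 2) * threePlusIncrement. `PlcBoundsSketch`, `startOf`, `entriesStartingInside`, and `readEntryHeadsOrSkip` are hypothetical names, and reading a 16-bit value at each entry start is purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class PlcBoundsSketch {
    // Values quoted from QCPLCBit.java in the comment above.
    static final int ONE_STARTS_AT = 0x4c;
    static final int TWO_STARTS_AT = 0x68;
    static final int THREE_PLUS_INCREMENT = 22;

    /** Assumed start offset of the n-th (1-based) hyperlink entry within the record. */
    static int startOf(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        return n == 1 ? ONE_STARTS_AT : TWO_STARTS_AT + (n - 2) * THREE_PLUS_INCREMENT;
    }

    /** Counts how many of the declared entries even begin inside a record of the given length. */
    static int entriesStartingInside(int declaredCount, int recordLength) {
        int inside = 0;
        for (int n = 1; n <= declaredCount; n++) {
            if (startOf(n) < recordLength) {
                inside++;
            }
        }
        return inside;
    }

    /** Little-endian unsigned 16-bit read; throws ArrayIndexOutOfBoundsException past the end. */
    static int getUShort(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /**
     * Hypothetical reader: fetches the first 16-bit value of each declared entry.
     * If any read runs past the record, the problem is logged and an empty list is
     * returned so that the caller can carry on with text extraction.
     */
    static List<Integer> readEntryHeadsOrSkip(byte[] record, int declaredCount) {
        List<Integer> heads = new ArrayList<>();
        try {
            for (int n = 1; n <= declaredCount; n++) {
                heads.add(getUShort(record, startOf(n)));
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            System.err.println("Skipping hyperlink record that is shorter than expected: " + e);
            return new ArrayList<>();
        }
        return heads;
    }
}
```

For the attached file's numbers, `entriesStartingInside(3, 88)` reports that only the first of the three declared entries begins inside the 88-byte record, and `readEntryHeadsOrSkip(new byte[88], 3)` logs the problem and returns an empty list instead of propagating the exception.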
Looking for any volunteers willing to experiment with the .pub format and extend our documented understanding here: https://poi.apache.org/hpbf/file-format.html

Looks like we have ~8500 Publisher files in our regression corpus, if those would be of any interest. Some are likely truncated... so it goes with Common Crawl.
Created attachment 34710: test document

When I try to parse the attached .pub file, it fails with the exception below:

    Caused by: java.lang.ArrayIndexOutOfBoundsException: 88
        at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:343)
        at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:215)
        at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:176)
        at org.apache.poi.hpbf.model.qcbits.QCPLCBit.createQCPLCBit(QCPLCBit.java:90)
        at org.apache.poi.hpbf.model.QuillContents.<init>(QuillContents.java:71)
        at org.apache.poi.hpbf.HPBFDocument.<init>(HPBFDocument.java:67)
        at org.apache.poi.hpbf.extractor.PublisherTextExtractor.<init>(PublisherTextExtractor.java:45)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:141)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 28 more