Bug 60685 - Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HPBF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-03 04:13 UTC by sharathkumarmn
Modified: 2017-07-10 12:53 UTC



Attachments
test document (116.50 KB, application/vnd.ms-publisher)
2017-02-03 04:13 UTC, sharathkumarmn

Description sharathkumarmn 2017-02-03 04:13:08 UTC
Created attachment 34710 [details]
test document

When I try to parse the attached .pub file, it fails with the exception below:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 88
at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:343)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:215)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:176)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit.createQCPLCBit(QCPLCBit.java:90)
at org.apache.poi.hpbf.model.QuillContents.<init>(QuillContents.java:71)
at org.apache.poi.hpbf.HPBFDocument.<init>(HPBFDocument.java:67)
at org.apache.poi.hpbf.extractor.PublisherTextExtractor.<init>(PublisherTextExtractor.java:45)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:141)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 28 more
Comment 1 Javen O'Neal 2017-07-09 23:51:27 UTC
I know nothing about the Compound File Binary Format (is this the same as, or a predecessor to, OLE2 containers?) [1][2], but here is what I found in the attached file:

CHNKINK record offset = 0x8200
QC Bit offset = 0x8340 - 0x8200 = 0x0140
Annotated contents of data[offset:offset+24]:
          +0    | +2          | +6    | +8    | +10   | +12         | +16         | +20
          recID | thingType   | optA  | optB  | optC  | bitType     | from        | len
00008340  18 00 | 54 4f 4b 4e | 00 00 | 01 00 | 00 00 | 50 4c 43 20 | 32 62 00 00 | 58 00 00 00
data      QCBit | "TOKN"      | false | true  | false | "PLC "      | 0x6232      | 0x58 = 88 bytes
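
The annotated header above can be decoded mechanically. The following is a minimal sketch, assuming the 24-byte layout shown in the table (little-endian integers, 4-byte ASCII tags). The class name QcBitHeaderSketch and its field names are hypothetical and are not POI classes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for the 24-byte QC Bit header; not part of POI. */
final class QcBitHeaderSketch {
    final int recId;
    final String thingType;
    final boolean optA;
    final boolean optB;
    final boolean optC;
    final String bitType;
    final long dataFrom; // byte offset from the start of the CHNKINK record
    final long dataLen;  // length of the bit's data in bytes

    private QcBitHeaderSketch(ByteBuffer buf) {
        recId = buf.getShort() & 0xFFFF;          // +0, 2 bytes
        thingType = ascii(buf, 4);                // +2, 4 bytes
        optA = (buf.getShort() & 0xFFFF) != 0;    // +6, 2 bytes
        optB = (buf.getShort() & 0xFFFF) != 0;    // +8, 2 bytes
        optC = (buf.getShort() & 0xFFFF) != 0;    // +10, 2 bytes
        bitType = ascii(buf, 4);                  // +12, 4 bytes
        dataFrom = buf.getInt() & 0xFFFFFFFFL;    // +16, 4 bytes
        dataLen = buf.getInt() & 0xFFFFFFFFL;     // +20, 4 bytes
    }

    /** Reads n bytes at the buffer's current position as an ASCII string. */
    private static String ascii(ByteBuffer buf, int n) {
        byte[] raw = new byte[n];
        buf.get(raw);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    /** Parses the 24 header bytes starting at the given offset. */
    static QcBitHeaderSketch parse(byte[] data, int offset) {
        return new QcBitHeaderSketch(
                ByteBuffer.wrap(data, offset, 24).order(ByteOrder.LITTLE_ENDIAN));
    }

    public static void main(String[] args) {
        // The 24 bytes at 0x8340, copied from the hexdump above.
        byte[] header = {
            0x18, 0x00, 0x54, 0x4f, 0x4b, 0x4e, 0x00, 0x00,
            0x01, 0x00, 0x00, 0x00, 0x50, 0x4c, 0x43, 0x20,
            0x32, 0x62, 0x00, 0x00, 0x58, 0x00, 0x00, 0x00
        };
        QcBitHeaderSketch h = parse(header, 0);
        System.out.printf("thingType=%s bitType=%s from=0x%x len=%d%n",
                h.thingType, h.bitType, h.dataFrom, h.dataLen);
    }
}
```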


Location    Len Hex Value    Field      Meaning (Little Endian conv, ASCII, hex to dec, etc)
00008200+00 [8] 43 48 4e 4b 49 4e 4b 20 "CHNKINK "
...
00008340+00 [2] 18 00        recID      QC Bit
00008340+02 [4] 54 4f 4b 4e  thingType  "TOKN"
00008340+06 [2] 00 00        optA       0x0000 -> false
00008340+08 [2] 01 00        optB       0x0001 -> true
00008340+10 [2] 00 00        optC       0x0000 -> false
00008340+12 [4] 50 4c 43 20  bitType    "PLC "
00008340+16 [4] 32 62 00 00  data from  0x6232, the byte offset from the beginning of the CHNKINK record at 0x8200
00008340+20 [4] 58 00 00 00  data len   0x58 = 88 bytes
...
And the raw QCPLCBit record at 0x8200+0x6232=0xe432:
0000e430        03 00 00 00 0c 00  00 00 ff ff 01 00 06 01    |..............|
0000e440  00 00 11 01 00 00 4e 07  00 00 5a 07 00 00 16 00  |......N...Z.....|
0000e450  00 00 00 22 00 06 00 00  01 22 09 00 00 00 02 22  |..."....."....."|
0000e460  07 00 00 00 0a 00 00 00  01 22 0f 00 00 00 0a 00  |........."......|
0000e470  00 00 01 22 0a 00 00 00  0a 00 00 00 00 22 ff ff  |..."........."..|
0000e480  ff ff 04 00 00 00 04 00  00 00                    |..........|

Interpreting the QCPLCBit:
0000e432+0  03 00 00 00   3       number of PLCs
0000e432+4  0c 00 00 00   12      type of PLCs: Type12 (holds hyperlinks, complicated)
...
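
To make the offset arithmetic and the first two fields concrete, here is a small sketch. It assumes both fields are unsigned 32-bit little-endian integers, as the interpretation above suggests. QcPlcPreambleSketch and its method names are hypothetical helpers, not POI API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for locating and reading the start of a QCPLCBit. */
final class QcPlcPreambleSketch {

    /** The "data from" field is relative to the start of the CHNKINK record. */
    static long absoluteOffset(long chunkStart, long dataFrom) {
        return chunkStart + dataFrom;
    }

    /** Reads an unsigned 32-bit little-endian integer at the given offset. */
    static long readUInt32(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // First 8 bytes of the record at 0xe432, copied from the hexdump.
        byte[] preamble = {0x03, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x00, 0x00};

        long recordAt = absoluteOffset(0x8200, 0x6232);
        long numberOfPlcs = readUInt32(preamble, 0);
        long typeOfPlcs = readUInt32(preamble, 4);

        System.out.printf("record at 0x%x, %d PLCs of type %d%n",
                recordAt, numberOfPlcs, typeOfPlcs);
    }
}
```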

The QC Bit header specifies the QC PLC Bit record has a length of 88 bytes.
The QCPLCBit specifies it contains 3 hyperlink PLCs (Type 12, 0x0c).
From how I interpret the current code, there's no way that 3 PLC hyperlinks can be specified in 88 bytes.
> oneStartsAt = 0x4c
> twoStartsAt = 0x68
> threePlusIncrement = 22
Therefore three probably starts at 0x68+22=0x7e and ends at 0x68+22*2=0x94
With 0x58=88 bytes, there aren't even enough bytes for a second, let alone a third PLC.
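
The arithmetic above can be checked with a short sketch. The three constants are the values quoted from the current code. The rule for where the third and later hyperlinks start (twoStartsAt plus a multiple of the increment) is my reading of that code, so treat it as an assumption. Type12LayoutSketch is a hypothetical name, not a POI class.

```java
/** Hypothetical model of where Type12 hyperlink entries are expected to start. */
final class Type12LayoutSketch {
    static final int ONE_STARTS_AT = 0x4c;
    static final int TWO_STARTS_AT = 0x68;
    static final int THREE_PLUS_INCREMENT = 22;

    /**
     * Expected start offset of the hyperlink with the given zero-based index.
     * Assumption: the third and later entries follow the second at a fixed stride.
     */
    static int startOf(int index) {
        if (index < 0) {
            throw new IllegalArgumentException("index must be >= 0");
        }
        if (index == 0) {
            return ONE_STARTS_AT;
        }
        return TWO_STARTS_AT + (index - 1) * THREE_PLUS_INCREMENT;
    }

    /** Counts how many of the first 'count' hyperlinks begin inside the record. */
    static int hyperlinksStartingWithin(int count, int recordLength) {
        int n = 0;
        for (int i = 0; i < count; i++) {
            if (startOf(i) < recordLength) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int recordLength = 0x58; // 88 bytes, from the QC Bit header
        for (int i = 0; i < 3; i++) {
            System.out.printf("hyperlink %d starts at 0x%x%n", i + 1, startOf(i));
        }
        System.out.printf("bytes needed for 3 hyperlinks: 0x%x, available: 0x%x%n",
                startOf(3), recordLength);
        System.out.println("hyperlinks starting inside the record: "
                + hyperlinksStartingWithin(3, recordLength));
    }
}
```

Under these assumptions only the first hyperlink entry even begins inside the 88-byte record, and the second would start at 0x68, past its end.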

I guess we'd have to consult [MS-CFB][2] to figure out whether this QCPLCBit record really can be 88 bytes long, or whether the file is corrupt and Publisher silently skips over these hyperlinks.

[1] https://en.wikipedia.org/wiki/Compound_File_Binary_Format
[2] https://msdn.microsoft.com/en-us/library/dd942138.aspx
Comment 2 Javen O'Neal 2017-07-10 00:11:18 UTC
The last real change to HPBF hyperlink support was nearly 9 years ago, and even then the commit message indicated only partial hyperlink support. So it's quite likely that we haven't implemented all hyperlink variations.

I expected to see some hyperlink URL as a string in the hexdump, but perhaps this is a hyperlink to another element within the document.

Nonetheless, there are some nuggets of insight in the comments and variable names to figure out what's going on in this QC PLC hyperlink bit.
https://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hpbf/model/qcbits/QCPLCBit.java?r1=690729&r2=690534
Comment 3 Javen O'Neal 2017-07-10 01:28:31 UTC
The Microsoft Publisher binary .pub format is undocumented, as indicated here: https://poi.apache.org/hpbf/index.html

OpenOffice/LibreOffice doesn't have documentation or an open source application that reads this .pub format, to my knowledge, so that means we'd have to resort to figuring out the format through lots of hard work.

Assuming the file you have provided is valid (opens without warnings or errors in Microsoft Publisher), if you're mostly interested in text extraction, then skipping over this hyperlink is probably preferable over throwing an exception.
We can log the error that we catch and move forward with extraction.
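
A rough sketch of that "log and move on" idea follows. It is not the actual POI change, only an illustration of the pattern. LenientBitParsingSketch, parseAll and the parser callback are hypothetical names, and java.util.logging stands in for whatever logger the project uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical illustration of skipping unparseable bits instead of failing. */
final class LenientBitParsingSketch {
    private static final Logger LOG =
            Logger.getLogger(LenientBitParsingSketch.class.getName());

    /**
     * Applies the parser to each raw bit. A bit whose parser runs off the end
     * of its data is logged and skipped so that the remaining bits still parse.
     */
    static <T> List<T> parseAll(List<byte[]> rawBits, Function<byte[], T> parser) {
        List<T> parsed = new ArrayList<>();
        for (int i = 0; i < rawBits.size(); i++) {
            try {
                parsed.add(parser.apply(rawBits.get(i)));
            } catch (ArrayIndexOutOfBoundsException e) {
                LOG.log(Level.WARNING,
                        "Skipping bit " + i + ": data shorter than expected", e);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // An 88-byte bit that is too short for a read at 0x68, and a longer one.
        List<byte[]> bits = Arrays.asList(new byte[88], new byte[200]);
        List<Integer> values = parseAll(bits, b -> Integer.valueOf(b[0x68]));
        System.out.println("parsed " + values.size() + " of " + bits.size() + " bits");
    }
}
```

Catching only ArrayIndexOutOfBoundsException keeps the skip narrow: other failures still surface, and text extraction can continue past a bit like the one in the attached file.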
Comment 4 Javen O'Neal 2017-07-10 01:38:26 UTC
Workaround applied in r1801405. Will be included in POI 3.17 beta 2.

Looking for any volunteers willing to experiment with the .pub format and extend our documented understanding here: https://poi.apache.org/hpbf/file-format.html
Comment 5 Tim Allison 2017-07-10 12:53:16 UTC
Looks like we have ~8500 Publisher files in our regression corpus, if those would be of any interest. Some are likely truncated... so it goes with Common Crawl.