Bug 51891

Summary:	[PATCH] Fix StringIndexOutOfBoundsException : Ole10Native.<init> (parsing word file)
Product:	POI	Reporter:	pqueixalos <pablo.queixalos>
Component:	POIFS	Assignee:	POI Developers List <dev>
Status:	RESOLVED FIXED
Severity:	normal	CC:	bonniot
Priority:	P2	Keywords:	PatchAvailable
Version:	3.8-dev
Target Milestone:	---
Hardware:	PC
OS:	All
Attachments:	Patch adding testcases and proposed fix Updated patch using the storage class ID method

Description pqueixalos 2011-09-26 10:20:59 UTC

Encoutered with poi r1175705 (Through Tika)

java.lang.StringIndexOutOfBoundsException: String index out of range: -57
	at java.lang.String.checkBounds(String.java:397)
	at java.lang.String.<init>(String.java:442)
	at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:133)
	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:85)

Comment 1 pqueixalos 2011-09-26 10:25:16 UTC

Attachment to big (4.6MB) to be added. 
http://icp.ge.ch/sem/f30405/IMG/doc/multimedia.doc

Comment 2 Daniel Bonniot 2013-08-28 11:20:31 UTC

Created attachment 30774 [details]
Patch adding testcases and proposed fix

Comment 3 Daniel Bonniot 2013-08-28 11:20:52 UTC

I also experience Ole10Native failing in various ways. Note that in Bug 46392, which initially introduced the Ole10Native class, Trejkaz attached several Ole streams which also fail with the current code.

What  seems clear is that a \1Ole10Native stream starts with 4 bytes giving the size of the rest of the stream. After that, the current Ole10Native class distinguishes between the "plain" mode where all the rest is simply the attached data, and the "structured" mode where a structured format is present, including several unknown flags, the fileName and the attached data proper. What is also clear is that currently the code fails in the "structured" mode for some inputs.

I attach another failing case (oleObject2.bin) for which I know (given the data) that the "plain" mode should be used, but currently isn't (leading to a failure in the "structured" parsing code).

I could not find a definitive specification of this "structured" mode and when it should be used. Analysis of the examples at hand lead me to tweak the logic (and use the "plain" mode more often), but this is purely speculative at this point. Any further insight would be welcome. Still, this avoids failures in all the cases from Bug 46392, from Bug 51891 and for my own testcase.

I attach a patch that adds all these testcases, and changes the decision to use plain mode or not. Review would be welcome.

Comment 4 Daniel Bonniot 2013-08-28 14:21:12 UTC

I think I might have cracked this nut. The two inputs from Bug 46392 have the same storage class ID:
0003000C-0000-0000-C000-000000000046

It's hard to find much information about this class ID, but it seems to be associated with some kind of "Package" (see for instance http://www.lookas.net/ftp/Software/Tools/LitWin_98/regist~1.reg). This in turn seems to suggest that the parsing done by Ole10Native might actually be valid only for this specific kind of content. If that's indeed the case, we can change the logic to use always "plain", except for content with exactly this storage class ID. This still passes all the known test cases, and feels much more right than the previous attempt.

I'll attach a new patch. It uses this new logic, and also adds one more test case from https://issues.apache.org/jira/browse/TIKA-1072 which is also fixed by this.

Note that this suggest the "structured" parsing done by Ole10Native might not belong here at all, since it is tied to a specific content, but would logically belong to the client of POI. However I might be wrong here, and it also does not cost much to keep providing this feature, instead of breaking it and the corresponding API.

Comment 5 Daniel Bonniot 2013-08-28 14:22:23 UTC

Created attachment 30776 [details]
Updated patch using the storage class ID method

Comment 6 Arjohn Kampman 2014-01-28 13:49:49 UTC

We ran into the same problem with the first couple of bytes being removed from Ole10Native streams. We have checked some 30 to 40 files and can confirm that the check for the storage class ID solves the issue. We've also noticed that all ole10native streams that have the additional info at the start, also have a CompObj entry with the string "Package" in it, like so:

Document: "CompObj" size = 80
   00000000 01 00 FE FF 03 0A 00 00 FF FF FF FF 0C 00 03 00 ................
   00000010 00 00 00 00 C0 00 00 00 00 00 00 46 08 00 00 00 ...........F....
   00000020 50 61 63 6B 61 67 65 00 08 00 00 00 50 61 63 6B Package.....Pack
   00000030 61 67 65 00 08 00 00 00 50 61 63 6B 61 67 65 00 age.....Package.
   00000040 F4 39 B2 71 00 00 00 00 00 00 00 00 00 00 00 00 .9.q............
 
The storage class ID check looks easier and more robust though.

Any idea when this patch will be applied to the trunk?

Comment 7 Andreas Beeker 2014-01-30 00:26:02 UTC

I'll have a look at the patch. When I've implemented the patch for #55578, I haven't looked for information for the "plain" Ole10Native encoding, but at least for the Ole10 package objects, which I've generated by Office 2003/2010, this information was structured (i.e. not plain).

This patch won't go anymore into 3.10 final, but I'm quite sure it will be in 3.11-beta1.

Comment 8 Andreas Beeker 2014-02-01 21:49:33 UTC

Thank you for the patch - applied with r1563483
I've changed quite a bit to the proposed patch.
If you have even more ole10 files, please add them to the bug and I'll try-and-error again ...
But for now I'm closing it.