52863 – java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize

Status:	RESOLVED WONTFIX

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HWPF
Version:	unspecified
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2012-03-09 02:14 UTC by HarrySimons
Modified:	2012-11-05 16:06 UTC
CC List:	3 users

Attachments
The same problem with MS PowerPoint files (41.43 KB, application/x-zip-compressed) 2012-04-06 16:23 UTC, Sepp

Description HarrySimons 2012-03-09 02:14:30 UTC

1. When converting a batch of Microsoft Word documents with the command

    java -jar tika-app-1.1-SNAPSHOT.jar -v -t

I get the following exception. The same happens with the Tika 1.1 release candidate.

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5d3ac0
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI$TikaServer$1.run(TikaCLI.java:735)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 487
    at org.apache.poi.hwpf.sprm.SprmOperation.initSize(SprmOperation.java:174)
    at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:80)
    at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:48)
    at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:67)
    at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:103)
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:943)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:146)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:97)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 4 more

A user, Nick Burch, has advised me to raise this as a POI bug.

2. Here's the output of the BFF Validator tool:

<BFFValidation path="failing.doc" datetime="03/08/12 07:14:27" result="FAILED">
<ParseStack>
<Type builtinType="Docfile" docName="MS-DOC" sectionTitle="File Structure" msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737">
<Info>Built-in type "Docfile": The root storage object of an OLE compound file. For more information, see http://msdn.microsoft.com/en-us/library/dd942138.aspx.</Info>
</Type>
<Type builtinType="Stream" docName="MS-DOC" sectionTitle="File Structure" msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0">
<Info>Built-in type "Stream": Any stream object for OLE compound files. The entire file contents for other files.</Info>
</Type>
<Type docName="MS-DOC" sectionTitle="Fib" sectionNumber="2.5.1" msdnLink="http://msdn.microsoft.com/en-us/library/9AEAA2E7-4A45-468E-AB13-3F6193EB9394" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
<Type docName="MS-DOC" sectionTitle="FibBase" sectionNumber="2.5.2" msdnLink="http://msdn.microsoft.com/en-us/library/26FB6C06-4E5C-4778-AB4E-EDBF26A545BB" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
<Type builtinType="USHORT" streamName="WordDocument" bitfield="True" bitOffsetWithinStruct="84" hexBitOffsetWithinStruct="0x54" bitCount="4" streamOffsetOfStruct="0" hexStreamOffsetOfStruct="0x0" streamOffset="10" hexStreamOffset="0xa" childId="10" hexChildId="0xa">
<Info>Built-in type "USHORT": Unsigned 2-byte integer.</Info>
</Type>
</ParseStack>
<LastData><![CDATA[
EC A5 01 01 4D 20 09 04  00 00 08 12 BF 00 00 00  ....M...........
00 00 00 30 00 00 00 00  00 08 00 00 66 EF 00 00  ...0........f...
]]></LastData>
</BFFValidation>
--------------------------------------------

I would greatly appreciate a timely fix, as I have 2000+ documents that POI/Tika fail on, and I cannot proceed any further.
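While the underlying parser problem is open, a batch over thousands of documents can at least be kept running by isolating each file. The following is a minimal sketch, not part of Tika or POI: `BatchExtract` and its `Extractor` interface are hypothetical names, and the extractor stands in for whatever call actually parses a document. It records the root cause per failing file (as in the "Caused by" line of the trace above) and moves on.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not a Tika or POI class.
class BatchExtract {

    /** Stand-in for the real extraction call; may throw checked exceptions. */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /** Walks the cause chain down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable cur = t;
        while (cur.getCause() != null && cur.getCause() != cur) {
            cur = cur.getCause();
        }
        return cur;
    }

    /**
     * Runs the extractor on every input. A failure on one input is recorded
     * and the loop continues. Returns input name -> root-cause class name
     * for the inputs that failed, in processing order.
     */
    static Map<String, String> run(List<String> inputs, Extractor extractor) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String input : inputs) {
            try {
                extractor.extract(input);
            } catch (Exception e) {
                failures.put(input, rootCause(e).getClass().getName());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stub extractor that simulates a wrapped parser failure for one file.
        Extractor stub = path -> {
            if (path.startsWith("failing")) {
                throw new Exception("Unexpected RuntimeException",
                        new ArrayIndexOutOfBoundsException(487));
            }
            return "extracted text";
        };
        System.out.println(run(Arrays.asList("ok.doc", "failing.doc"), stub));
    }
}
```

Grouping the resulting map by root-cause class gives a quick picture of how many of the failing documents share the same exception.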

Comment 1 Yegor Kozlov 2012-03-09 11:33:23 UTC

Can you upload a failing document? 

Yegor

Comment 2 HarrySimons 2012-03-09 12:14:24 UTC

Unfortunately, this is a classified document.

Comment 3 Yegor Kozlov 2012-03-09 12:17:03 UTC

Do you know the origin of these failing docs? Were they created by MS Word, by OpenOffice, or by something else?

Without a sample file we can't do much.

Yegor

(In reply to comment #2)
> Unfortunately, this is a classified document.

Comment 4 HarrySimons 2012-03-10 01:33:39 UTC

> Do you know the origin of these failing
> docs? Were they created by MS Word or
> by OpenOffice or by what ? 

They were created by a post-2003 and pre-2007 version of MS Word. 


> Without a sample file we can't do much.

The document's name alone is 'Business Intelligence', so you can imagine my difficulty. The other failing documents are sensitive as well. I thought I could remove the sensitive parts of this document and then upload it for the Tika/POI developers, but merely re-saving the document in Word 2007 (i.e., without any edits whatsoever) makes the problem mostly go away. I say 'mostly' because, while Tika/POI can then extract the text, they also append text like this to the output:

_-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]

_-1388203796/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388843352/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388845272/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388297360/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388297680/ole-[42, 4D, D6, 09, 00, 00, 00, 00]

_-1388296720/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388203476/ole-[42, 4D, 66, 09, 00, 00, 00, 00]

_-1382869532/ole-[42, 4D, 36, 0C, 00, 00, 00, 00]

_-1388200596/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388200916/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1383036196/ole-[42, 4D, 12, 09, 00, 00, 00, 00]

_-1382867932/ole-[42, 4D, 86, 0A, 00, 00, 00, 00]

_-1382868252/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1380808936/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]


Being a developer myself, I am fully aware of how hard it can be to fix (certain) bugs without appropriate test input. I will watch out for newer releases.

Comment 5 Sepp 2012-04-06 16:23:22 UTC

Created attachment 28554
The same problem with MS PowerPoint files

Hi *,

I have the same problem with tika-app-1.1.jar and MS PowerPoint files. The zip archive contains 2 PPT files. Tika.ppt is the "old" file, which cannot be converted and fails with this error message:

System.ApplicationException : Extraction of text from the file 'Tika.ppt' failed.
  ----> org.apache.tika.exception.TikaException : Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2a784f5
  ----> java.lang.ArrayIndexOutOfBoundsException : 
at TikaOnDotNet.TextExtractor.Extract(String filePath) in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\TextExtractor.cs:line 63
at TikaOnDotNet.tikadriver_examples.should_extract_from_ppt() in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\tikadriver_examples.cs:line 104
--TikaException
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at TikaOnDotNet.TextExtractor.Extract(String filePath) in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\TextExtractor.cs:line 55
--ArrayIndexOutOfBoundsException
at IKVM.Runtime.ByteCodeHelper.arraycopy_primitive_1(Array src, Int32 srcStart, Array dest, Int32 destStart, Int32 len)
at org.apache.poi.util.LittleEndian.getByteArray(Byte[] data, Int32 offset, Int32 size)
at org.apache.poi.hpsf.UnicodeString..ctor(Byte[] , Int32 )
at org.apache.poi.hpsf.TypedPropertyValue.readValue(Byte[] , Int32 )
at org.apache.poi.hpsf.Vector.read(Byte[] , Int32 )
at org.apache.poi.hpsf.TypedPropertyValue.readValue(Byte[] , Int32 )
at org.apache.poi.hpsf.VariantSupport.read(Byte[] src, Int32 offset, Int32 length, Int64 type, Int32 codepage)
at org.apache.poi.hpsf.Property..ctor(Int64 id, Byte[] src, Int64 offset, Int32 length, Int32 codepage)
at org.apache.poi.hpsf.Section..ctor(Byte[] src, Int32 offset)
at org.apache.poi.hpsf.PropertySet.init(Byte[] , Int32 , Int32 )
at org.apache.poi.hpsf.PropertySet..ctor(InputStream stream)
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(DirectoryNode , String )
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(DirectoryNode )
at org.apache.tika.parser.microsoft.OfficeParser.parse(DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
at org.apache.tika.parser.microsoft.OfficeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

The second file, Tika_new.ppt, is the same file re-saved with MS PowerPoint 2010 (File -> Save as...); it can be converted without any problems.

With tika-app-0.9.jar, Tika.ppt can also be converted, so is the error in the new version, tika-app-1.1.jar?

Thank you
Sepp
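The trace above ends in an array copy called from the UnicodeString constructor, which is what one would expect when a declared string length runs past the end of the property buffer. That is an inference from the trace, not a confirmed diagnosis. As a minimal sketch of the defensive check, assuming a layout of a 4-byte little-endian character count followed by UTF-16LE characters (the class `SafeStringRead` is hypothetical and not POI's code):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the POI HPSF implementation.
class SafeStringRead {

    /** Reads an unsigned 32-bit little-endian integer as a long. */
    static long getUInt(byte[] data, int offset) {
        return (data[offset] & 0xFFL)
                | (data[offset + 1] & 0xFFL) << 8
                | (data[offset + 2] & 0xFFL) << 16
                | (data[offset + 3] & 0xFFL) << 24;
    }

    /**
     * Reads a string stored as a 4-byte character count followed by that many
     * UTF-16LE code units. Rejects lengths that do not fit in the buffer
     * instead of letting the array copy fail.
     */
    static String readUnicodeString(byte[] data, int offset) {
        if (offset < 0 || offset > data.length - 4) {
            throw new IllegalArgumentException("length prefix lies outside the buffer");
        }
        long byteCount = getUInt(data, offset) * 2;
        long available = data.length - (offset + 4L);
        if (byteCount > available) {
            throw new IllegalArgumentException(
                    "declared " + byteCount + " bytes, only " + available + " available");
        }
        String s = new String(data, offset + 4, (int) byteCount, StandardCharsets.UTF_16LE);
        // Drop a single trailing NUL terminator if the count included one.
        return s.endsWith("\u0000") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        byte[] ok = {3, 0, 0, 0, 'H', 0, 'i', 0, 0, 0};
        System.out.println(readUnicodeString(ok, 0));
    }
}
```

A caller can catch the IllegalArgumentException and skip the damaged property rather than abandoning the whole document.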

Comment 6 Sergey Vladimirov 2012-11-05 16:06:44 UTC

Since there is a problem with the original file (i.e., its structure is broken), I'm closing this bug as WONTFIX.

However, a workaround will be added in trunk to skip the problematic SPRMs. I cannot guarantee that the file will be processed correctly after such errors, but it is worth a try.
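To illustrate the kind of guard such a workaround involves, here is a simplified sketch and not the actual change made in POI trunk. The class `SprmScan` is hypothetical. It assumes the usual SPRM layout, in which the top three bits of the 16-bit opcode (spra) select the operand size, and it ignores the opcodes that use special length rules. Because a malformed operand length leaves no reliable way to find the next SPRM, this sketch stops at the first SPRM that does not fit instead of trying to resynchronize.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the POI implementation.
class SprmScan {

    /** Operand size in bytes per spra code; -1 marks a variable-length operand. */
    private static final int[] OPERAND_SIZE = {1, 1, 2, 4, 2, 2, -1, 3};

    /**
     * Returns the start offsets of all SPRMs that fit completely inside the
     * buffer, stopping at the first one whose operand would run past the end.
     */
    static List<Integer> scan(byte[] grpprl) {
        List<Integer> offsets = new ArrayList<>();
        int pos = 0;
        while (pos + 2 <= grpprl.length) {
            // The opcode is a 16-bit little-endian value.
            int opcode = (grpprl[pos] & 0xFF) | ((grpprl[pos + 1] & 0xFF) << 8);
            int spra = (opcode >> 13) & 0x7;
            int size = OPERAND_SIZE[spra];
            int operandStart = pos + 2;
            if (size < 0) {
                if (operandStart >= grpprl.length) {
                    break; // the length byte itself is missing
                }
                // Variable length: one length byte plus that many operand bytes.
                size = 1 + (grpprl[operandStart] & 0xFF);
            }
            if (operandStart + size > grpprl.length) {
                break; // operand runs past the buffer: treat the rest as damaged
            }
            offsets.add(pos);
            pos = operandStart + size;
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] grpprl = {
            0x03, 0x24, 0x01,              // opcode 0x2403, spra 1: 1-byte operand
            0x43, 0x4A, 0x18, 0x00,        // opcode 0x4A43, spra 2: 2-byte operand
            0x0D, (byte) 0xC6, (byte) 0xFF // opcode 0xC60D, spra 6: claims 255 bytes, none present
        };
        System.out.println(scan(grpprl)); // prints [0, 3]
    }
}
```

The unchecked equivalent of the size lookup is where an index such as 487 can fall outside the array, as in the original stack trace.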