| Summary: | java.lang.ArrayIndexOutOfBoundsException in org.apache.poi.hwpf.sprm.SprmOperation.initSize | | |
|---|---|---|---|
| Product: | POI | Reporter: | HarrySimons <simonsharry> |
| Component: | HWPF | Assignee: | POI Developers List <dev> |
| Status: | RESOLVED WONTFIX | | |
| Severity: | normal | CC: | mseele, sepp, simonsharry |
| Priority: | P2 | | |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Linux | | |
| Attachments: | The same problem with MS PowerPoint files | | |
Description
HarrySimons
2012-03-09 02:14:30 UTC
Can you upload a failing document?

Yegor

Unfortunately, this is a classified document.

Do you know the origin of these failing docs? Were they created by MS Word or by OpenOffice or by what?

Without a sample file we can't do much.

Yegor

(In reply to comment #2)
> Unfortunately, this is a classified document.
> Do you know the origin of these failing
> docs? Were they created by MS Word or
> by OpenOffice or by what ?

They were created by a post-2003 and pre-2007 version of MS Word.

> Without a sample file we can't do much.

Just the name itself of the document is 'Business Intelligence', so you can imagine my difficulty. Even the other documents that are failing are sensitive enough.

I thought I should be able to remove the sensitive parts of this document and then upload it for the Tika/POI developers. But even merely re-saving the document in Word 2007 (i.e., without any new edits whatsoever) makes the problem mostly go away. I say 'mostly' because, while Tika/POI are then able to extract the text, they also append text like this to the output:

_-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]
_-1388203796/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388843352/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388845272/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388297360/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388297680/ole-[42, 4D, D6, 09, 00, 00, 00, 00]
_-1388296720/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388203476/ole-[42, 4D, 66, 09, 00, 00, 00, 00]
_-1382869532/ole-[42, 4D, 36, 0C, 00, 00, 00, 00]
_-1388200596/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388200916/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1383036196/ole-[42, 4D, 12, 09, 00, 00, 00, 00]
_-1382867932/ole-[42, 4D, 86, 0A, 00, 00, 00, 00]
_-1382868252/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1380808936/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

Being a developer myself, I am fully aware how hard it can be to fix (certain) bugs without appropriate test input. I will watch out for newer releases.

Created attachment 28554
The same problem with MS PowerPoint files
Hi *,
I have the same problem with tika-app-1.1.jar and MS PowerPoint files. In the zip archive you can find 2 PPT files. The file Tika.ppt is the "old" file, which cannot be converted and fails with this error message:
System.ApplicationException : Extraction of text from the file 'Tika.ppt' failed.
----> org.apache.tika.exception.TikaException : Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2a784f5
----> java.lang.ArrayIndexOutOfBoundsException :
at TikaOnDotNet.TextExtractor.Extract(String filePath) in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\TextExtractor.cs:line 63
at TikaOnDotNet.tikadriver_examples.should_extract_from_ppt() in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\tikadriver_examples.cs:line 104
--TikaException
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at TikaOnDotNet.TextExtractor.Extract(String filePath) in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\TextExtractor.cs:line 55
--ArrayIndexOutOfBoundsException
at IKVM.Runtime.ByteCodeHelper.arraycopy_primitive_1(Array src, Int32 srcStart, Array dest, Int32 destStart, Int32 len)
at org.apache.poi.util.LittleEndian.getByteArray(Byte[] data, Int32 offset, Int32 size)
at org.apache.poi.hpsf.UnicodeString..ctor(Byte[] , Int32 )
at org.apache.poi.hpsf.TypedPropertyValue.readValue(Byte[] , Int32 )
at org.apache.poi.hpsf.Vector.read(Byte[] , Int32 )
at org.apache.poi.hpsf.TypedPropertyValue.readValue(Byte[] , Int32 )
at org.apache.poi.hpsf.VariantSupport.read(Byte[] src, Int32 offset, Int32 length, Int64 type, Int32 codepage)
at org.apache.poi.hpsf.Property..ctor(Int64 id, Byte[] src, Int64 offset, Int32 length, Int32 codepage)
at org.apache.poi.hpsf.Section..ctor(Byte[] src, Int32 offset)
at org.apache.poi.hpsf.PropertySet.init(Byte[] , Int32 , Int32 )
at org.apache.poi.hpsf.PropertySet..ctor(InputStream stream)
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(DirectoryNode , String )
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(DirectoryNode )
at org.apache.tika.parser.microsoft.OfficeParser.parse(DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
at org.apache.tika.parser.microsoft.OfficeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
The second file, Tika_new.ppt, is the same file after being saved with MS PowerPoint 2010 (File -> Save as...); it can be converted without any problems.
With tika-app-0.9.jar the file Tika.ppt can be converted too, so is the error in the new version, tika-app-1.1.jar?
Thank you
Sepp
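The trace above ends in an array copy inside `UnicodeString`, which suggests that a length field in the property set points past the end of the buffer. The following is a minimal sketch of a bounds-checked read, not POI's actual code: the class `SafeUnicodeStringReader` and its `read` method are hypothetical, and the layout (a 4-byte little-endian character count followed by UTF-16LE data) is assumed for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of POI: reads a length-prefixed UTF-16LE string defensively. */
final class SafeUnicodeStringReader {

    /**
     * Reads a string stored as a 4-byte little-endian character count followed by
     * that many UTF-16LE code units. Returns null instead of throwing when the
     * declared length does not fit inside the buffer.
     */
    static String read(byte[] data, int offset) {
        // The 4-byte length field itself must fit.
        if (offset < 0 || offset > data.length - 4) {
            return null;
        }
        // Assemble the count as an unsigned 32-bit value held in a long.
        long chars = (data[offset] & 0xFFL)
                | ((data[offset + 1] & 0xFFL) << 8)
                | ((data[offset + 2] & 0xFFL) << 16)
                | ((data[offset + 3] & 0xFFL) << 24);
        long byteCount = chars * 2;
        // A corrupt count would otherwise trigger ArrayIndexOutOfBoundsException on copy.
        if (byteCount > data.length - (offset + 4)) {
            return null;
        }
        String s = new String(data, offset + 4, (int) byteCount, StandardCharsets.UTF_16LE);
        // Cut at the first NUL, which the count is assumed to include as a terminator.
        int end = s.indexOf('\0');
        return end >= 0 ? s.substring(0, end) : s;
    }
}
```

A caller could treat a `null` result as "property unreadable" and continue with the remaining properties, so that one broken summary entry does not abort text extraction.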
Since there is a problem with the original file (i.e., its structure is broken), I'm closing this bug as WONTFIX. However, a workaround will be added in trunk to skip the problematic SPRMs. I cannot guarantee that the file will be processed correctly after such errors, but it is worth a try.
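To illustrate what skipping problematic SPRMs could look like, here is a rough sketch that walks a buffer of SPRMs and stops at the first one whose declared size runs past the end of the data. It is not the change made in trunk: `SprmWalker` and `readableOffsets` are hypothetical names, and the operand-size table is my reading of the Word binary format (size selected by the top three bits of the 16-bit opcode). Variable-length operands are simplified to a single length byte, which ignores the special cases some opcodes have.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of POI: walks a run of SPRMs without reading past the buffer. */
final class SprmWalker {

    /** Operand size in bytes, indexed by bits 13-15 of the opcode; -1 marks a variable-length operand. */
    private static final int[] OPERAND_SIZE = {1, 1, 2, 4, 2, 2, -1, 3};

    /**
     * Returns the start offsets of all SPRMs that fit entirely inside buf.
     * Stops at the first SPRM that is truncated or declares an impossible size.
     */
    static List<Integer> readableOffsets(byte[] buf) {
        List<Integer> offsets = new ArrayList<>();
        int pos = 0;
        while (pos + 2 <= buf.length) {
            // The 16-bit opcode is stored little-endian.
            int opcode = (buf[pos] & 0xFF) | ((buf[pos + 1] & 0xFF) << 8);
            int operand = OPERAND_SIZE[(opcode >> 13) & 0x7];
            int total;
            if (operand >= 0) {
                total = 2 + operand;
            } else {
                // Variable length (simplified): one length byte follows the opcode.
                if (pos + 3 > buf.length) {
                    break;
                }
                total = 3 + (buf[pos + 2] & 0xFF);
            }
            if (pos + total > buf.length) {
                break; // corrupt or truncated: ignore the rest instead of throwing
            }
            offsets.add(pos);
            pos += total;
        }
        return offsets;
    }
}
```

Stopping at the first bad SPRM rather than trying to resynchronise is a deliberate simplification: once one size is wrong, every later offset is suspect, which matches the caveat that correct processing cannot be guaranteed after such errors.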