File too large to upload (2.5mb), it may be found here...
http://www.cdt.org/files/CDT_Data_Retention-PPT.ppt
It is a Microsoft Powerpoint 97 presentation

[ https://issues.apache.org/jira/browse/TIKA-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810577#comment-13810577 ]

Nick Burch commented on TIKA-1189:
----------------------------------

This looks to be a bug in Apache POI, one of the upstream libraries that Tika uses. Any chance you could open a bug in the POI Bugzilla - http://issues.apache.org/bugzilla/buglist.cgi?product=POI - and attach the file there?

Fails to parse PPT file
-----------------------

Key: TIKA-1189
URL: https://issues.apache.org/jira/browse/TIKA-1189
Project: Tika
Issue Type: Bug
Components: cli, gui
Environment: OSX 10.9, OSX 10.6
Reporter: Aimee Dev
Attachments: CDT_Data_Retention-PPT.ppt

Out of the box tika application when presented with the file results in

Apache Tika was unable to parse the document at /Volumes/FREECOM_HDD/Test/CDT_Data_Retention-PPT.ppt.
The full exception stack trace is included below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@224f9db
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
    at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
    at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
    at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
    at javax.swing.TransferHandler.importData(TransferHandler.java:826)
    at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1536)
    at java.awt.dnd.DropTarget.drop(DropTarget.java:450)
    at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1274)
    at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:537)
    at sun.lwawt.macosx.CDropTargetContextPeer.processDropMessage(CDropTargetContextPeer.java:127)
    at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:851)
    at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:775)
    at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:48)
    at java.awt.Component.dispatchEventImpl(Component.java:4716)
    at java.awt.Container.dispatchEventImpl(Container.java:2287)
    at java.awt.Component.dispatchEvent(Component.java:4687)
    at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832)
    at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4566)
    at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4417)
    at java.awt.Container.dispatchEventImpl(Container.java:2273)
    at java.awt.Window.dispatchEventImpl(Window.java:2719)
    at java.awt.Component.dispatchEvent(Component.java:4687)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735)
    at java.awt.EventQueue.access$200(EventQueue.java:103)
    at java.awt.EventQueue$3.run(EventQueue.java:694)
    at java.awt.EventQueue$3.run(EventQueue.java:692)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87)
    at java.awt.EventQueue$4.run(EventQueue.java:708)
    at java.awt.EventQueue$4.run(EventQueue.java:706)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:705)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:138)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:91)
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class class org.apache.poi.hslf.record.BinaryTagDataBlob : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 56
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:185)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:128)
    at org.apache.poi.hslf.model.SimpleShape.getClientRecords(SimpleShape.java:347)
    at org.apache.poi.hslf.model.SimpleShape.getClientDataRecord(SimpleShape.java:319)
    at org.apache.poi.hslf.model.TextShape.getPlaceholderAtom(TextShape.java:591)
    at org.apache.poi.hslf.model.Sheet.getPlaceholder(Sheet.java:438)
    at org.apache.poi.hslf.model.HeadersFooters.isVisible(HeadersFooters.java:244)
    at org.apache.poi.hslf.model.HeadersFooters.isHeaderVisible(HeadersFooters.java:148)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:62)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:189)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 42 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor12.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:181)
    ... 53 more
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class class org.apache.poi.hslf.record.BinaryTagDataBlob : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 56
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:185)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:128)
    at org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren.<init>(DummyPositionSensitiveRecordWithChildren.java:52)
    ... 57 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor12.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:181)
    ... 59 more
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class class org.apache.poi.hslf.record.BinaryTagDataBlob : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 56
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:185)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:128)
    at org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren.<init>(DummyPositionSensitiveRecordWithChildren.java:52)
    ... 63 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor11.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:181)
    ... 65 more
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 56
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:185)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:128)
    at org.apache.poi.hslf.record.BinaryTagDataBlob.<init>(BinaryTagDataBlob.java:52)
    ... 69 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:181)
    ... 71 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 56
    at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
    at org.apache.poi.hslf.record.StyleTextProp9Atom.<init>(StyleTextProp9Atom.java:70)
    ... 76 more

--
This message was sent by Atlassian JIRA (v6.1#6144)
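The trace above wraps the same failure several times (RuntimeException, then InvocationTargetException, repeated per nested record) before it reaches the ArrayIndexOutOfBoundsException in StyleTextProp9Atom. As a rough illustration only, a caller could unwrap such a chain with plain JDK calls. The class name RootCauseSketch and the method rootCause are made up for this sketch and are not part of Tika or POI.

```java
import java.lang.reflect.InvocationTargetException;

// Hypothetical helper, not part of Tika or POI: walks a "Caused by" chain
// down to the innermost exception.
class RootCauseSketch {

    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // getCause() returns null when there is no further cause; the
        // self-reference test guards against an exception that is its own cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Rebuild a small chain shaped like the one in the trace.
        Throwable inner = new ArrayIndexOutOfBoundsException("56");
        Throwable reflective = new InvocationTargetException(inner);
        Throwable outer = new RuntimeException("Couldn't instantiate the class", reflective);
        System.out.println(rootCause(outer));
    }
}
```

For the trace in this report, the innermost exception is the one that points at org.apache.poi.util.LittleEndian.getInt, which is why the issue belongs to POI rather than Tika.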
Created attachment 31329 [details] erroneous PPT file - just a text area; store in test-data\document
Created attachment 31330 [details] suggested patch store in src\scratchpad\src\org\apache\poi\hslf\record
Created attachment 31331 [details] suggested testcase store in src\testcases\org\apache\poi\hslf\model
I can confirm this bug with my own old '97 PPT, which contains nothing more than an empty text area.

Caused by: java.lang.ArrayIndexOutOfBoundsException: 20
    at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:161)
    at org.apache.poi.hslf.record.StyleTextProp9Atom.<init>(StyleTextProp9Atom.java:70)
    ... 65 more

I made a small test case (attached) and a suggested solution (also attached) as a patch to the class org.apache.poi.hslf.record.StyleTextProp9Atom. Before reading the (unused) fields textCfException9 and textSiException, I check whether the offset is already past the end of the array:

    if (i >= data.length) {
        break;
    }

Since neither field is used anywhere, I think it should be safe to skip reading them in this case. With my patch, two of my files that failed with the same error now parse successfully, and I could extract their text. I would really appreciate it if you could integrate this patch, because I'm using POI/Tika to index a large number of Office files and many of them seem to fail because of the same error.
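For readers without the attachment, here is a minimal sketch of the kind of check described above. It is not the attached patch and not the actual StyleTextProp9Atom code. The class TruncatedRecordSketch, the method readMasks, and the assumed layout (one 4-byte mask followed by two optional 4-byte fields per entry) are invented for illustration, and getInt is a local stand-in for a little-endian read. It also uses a slightly stricter test (at least 4 bytes remaining) than the `i >= data.length` check quoted above, so that a partial trailing field is skipped as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: the layout and names are assumptions,
// not the real POI record format or API.
class TruncatedRecordSketch {

    // Local stand-in for a little-endian 32-bit read; needs 4 bytes at offset.
    static int getInt(byte[] data, int offset) {
        return (data[offset] & 0xFF)
                | ((data[offset + 1] & 0xFF) << 8)
                | ((data[offset + 2] & 0xFF) << 16)
                | ((data[offset + 3] & 0xFF) << 24);
    }

    // Reads entries of the assumed form [mask][optional field][optional field]
    // and returns the masks. Stops quietly when the data ends early instead of
    // indexing past the end of the array.
    static List<Integer> readMasks(byte[] data) {
        List<Integer> masks = new ArrayList<>();
        int i = 0;
        while (data.length - i >= 4) {
            masks.add(getInt(data, i));
            i += 4;
            // Guard before each unused trailing field: skip it if it is absent.
            if (data.length - i < 4) {
                break;
            }
            getInt(data, i); // first unused field, value discarded
            i += 4;
            if (data.length - i < 4) {
                break;
            }
            getInt(data, i); // second unused field, value discarded
            i += 4;
        }
        return masks;
    }

    public static void main(String[] args) {
        // A record that ends right after the mask; an unguarded read of the
        // next field would throw ArrayIndexOutOfBoundsException here.
        byte[] truncated = {1, 0, 0, 0};
        System.out.println(readMasks(truncated));
    }
}
```

The design choice mirrors the reasoning in the comment: because the trailing fields are never used, stopping early loses nothing that the caller relies on.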
Thank you for the patch. It was committed in SVN r1569984. Because the original file still failed after that, another check was necessary, which was added in r1569999.