Created attachment 31572
This PPT is not getting extracted with "poi-3.10.jar"

The attached PPT file is not getting extracted. It fails with the following exception:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2d536558
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class class org.apache.poi.hslf.record.BinaryTagDataBlob : java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 20
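The trace above nests several wrappers around one innermost failure (the ArrayIndexOutOfBoundsException raised while reading the StyleTextProp9Atom record). When triaging reports like this, it can help to print only that innermost exception. Below is a minimal sketch of how that might be done. RootCause is a hypothetical helper, not part of POI or Tika, and it assumes each wrapper exposes its cause through Throwable.getCause(). The "Cause was :" lines in this trace are embedded in the message text, so that assumption may not hold for every layer.

```java
// Hypothetical helper, not part of POI or Tika: follows Throwable.getCause()
// until it reaches the innermost exception.
final class RootCause {
    private RootCause() {}

    static Throwable of(Throwable t) {
        Throwable current = t;
        // Bound the walk so a malformed, cyclic cause chain cannot loop forever.
        for (int depth = 0; depth < 64 && current.getCause() != null; depth++) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in chain shaped like the trace above; the real exceptions
        // come from Tika and POI and are only imitated here.
        Throwable chain = new Exception(
            "Unexpected RuntimeException from OfficeParser",
            new RuntimeException(
                "Couldn't instantiate the class for type with id 4012",
                new ArrayIndexOutOfBoundsException("20")));
        Throwable root = RootCause.of(chain);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In real use, the call would sit in a catch block around the Tika or POI parse call, passing the caught exception to RootCause.of.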
It would be great if someone could:
* Run it through the Microsoft Binary File Format validator, and see if that reports it as valid or invalid?
* Load it in PowerPoint, do a save-as, and see if that fixes it?
* Load it in Open Office, and see if that is happy with it?
Oh, and try it with POI 3.11 beta 2, just to see if we've already fixed it!
I think this bug is a duplicate of bug 55732. Anyhow, using Tika 1.6 (dev build, using POI 3.11b2 or so) I can now extract text from this PPT, which I couldn't before.
I've run PowerPointExtractor on the file, and it now works.