We are using POI to index PPT files for a P2P search engine. While crawling the web we found this document:

http://www.na.unep.net/OnePlanetManyPeople/THEMATIC/ Introduction.ppt

which causes the following exception:

java.lang.ArrayIndexOutOfBoundsException: 57
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
    at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:139)
    at org.apache.poi.hslf.record.StyleTextPropAtom$TextPropCollection.buildTextPropList(StyleTextPropAtom.java:421)
    at org.apache.poi.hslf.record.StyleTextPropAtom.setParentTextSize(StyleTextPropAtom.java:270)
    at org.apache.poi.hslf.model.TextRun.<init>(TextRun.java:91)
    at org.apache.poi.hslf.model.TextRun.<init>(TextRun.java:68)
    at org.apache.poi.hslf.model.Sheet.findTextRuns(Sheet.java:126)
    at org.apache.poi.hslf.model.Sheet.findTextRuns(Sheet.java:88)
    at org.apache.poi.hslf.model.Slide.<init>(Slide.java:66)
    at org.apache.poi.hslf.usermodel.SlideShow.buildSlidesAndNotes(SlideShow.java:394)
    at org.apache.poi.hslf.usermodel.SlideShow.<init>(SlideShow.java:116)
    at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java:85)
Created attachment 19197
Simplest possible testcase showing the ArrayIndexOutOfBoundsError

I use POI through Nutch for parsing Office documents.

Note: No exception is thrown when I do my tests, but a lot of ERROR messages are logged, indicating that something is wrong:

ERROR - ContentReaderListener.extractTextBoxes(322) | extractClientTextBoxes
java.lang.ArrayIndexOutOfBoundsException: -353698944
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
    at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:200)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:110)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
    at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:49)
    at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:77)
    at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:81)
    at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:44)
    at no.creuna.documentparser.DocumentParser.parseDocument(DocumentParser.java:156)
    at test.no.creuna.documentparser.DocumentParserErrorsTest.testArrayIndexOutOfBoundsExceptionErrors(DocumentParserErrorsTest.java:186)
I use POI through Nutch. When opening the attachment, Nutch logs a series of errors from within POI:

ERROR - ContentReaderListener.extractTextBoxes(322) | extractClientTextBoxes
java.lang.ArrayIndexOutOfBoundsException: -353698944
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
    at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:200)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:110)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
    at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:49)
    at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:77)
    at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:81)
    at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:44)
    at no.creuna.documentparser.DocumentParser.parseDocument(DocumentParser.java:156)
    at test.no.creuna.documentparser.DocumentParserErrorsTest.testArrayIndexOutOfBoundsExceptionErrors(DocumentParserErrorsTest.java:186)
I think this problem has now been fixed, thanks to Yegor's new understanding of the ordering of TextProps in StyleTextPropAtom. I can open your test PowerPoint document without any exceptions, so I'm hoping this is now closed. If you still get problems, can you re-open with a new problem file?
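
For anyone reading the traces in this report: both end in a little-endian read whose index lies outside the record's byte array (57 in one case, a large negative number in the other). The sketch below is not POI's code and not the fix that was applied. It only illustrates a bounds-checked read that a caller could use to skip a malformed record instead of hitting ArrayIndexOutOfBoundsException. The class name SafeLittleEndian and its getInt method are hypothetical and written for this example.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; this is NOT part of POI.
// It reads a 32-bit little-endian int only when all four bytes are in range.
final class SafeLittleEndian {
    private SafeLittleEndian() {}

    static OptionalInt getInt(byte[] data, int offset) {
        // A negative offset, or one less than 4 bytes from the end,
        // cannot hold a complete int, so report "no value" instead of throwing.
        if (data == null || offset < 0 || offset > data.length - 4) {
            return OptionalInt.empty();
        }
        // Assemble the value least significant byte first.
        int value = (data[offset] & 0xFF)
                | (data[offset + 1] & 0xFF) << 8
                | (data[offset + 2] & 0xFF) << 16
                | (data[offset + 3] & 0xFF) << 24;
        return OptionalInt.of(value);
    }
}
```

A caller would check isPresent() on the result and treat an empty result as a truncated or misparsed record. Whether to skip the record or abort the whole document is a policy choice for the indexing code.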