Bug 40806

Summary: ArrayIndexOutOfBoundsException while opening PPT file
Product: POI
Reporter: Tim Riemann <triemann>
Component: HSLF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.0-dev   
Target Milestone: ---   
Hardware: Other   
OS: other   
Attachments: Simplest possible testcase showing the ArrayIndexOutOfBoundsError

Description Tim Riemann 2006-10-20 15:17:04 UTC
We are using POI to index PPT files for a P2P search engine. While crawling the
web we found this document:
http://www.na.unep.net/OnePlanetManyPeople/THEMATIC/Introduction.ppt
which causes the following exception:

java.lang.ArrayIndexOutOfBoundsException: 57
	at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
	at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:139)
	at org.apache.poi.hslf.record.StyleTextPropAtom$TextPropCollection.buildTextPropList(StyleTextPropAtom.java:421)
	at org.apache.poi.hslf.record.StyleTextPropAtom.setParentTextSize(StyleTextPropAtom.java:270)
	at org.apache.poi.hslf.model.TextRun.<init>(TextRun.java:91)
	at org.apache.poi.hslf.model.TextRun.<init>(TextRun.java:68)
	at org.apache.poi.hslf.model.Sheet.findTextRuns(Sheet.java:126)
	at org.apache.poi.hslf.model.Sheet.findTextRuns(Sheet.java:88)
	at org.apache.poi.hslf.model.Slide.<init>(Slide.java:66)
	at org.apache.poi.hslf.usermodel.SlideShow.buildSlidesAndNotes(SlideShow.java:394)
	at org.apache.poi.hslf.usermodel.SlideShow.<init>(SlideShow.java:116)
	at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java:85)
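
The number in the exception message is the array index that was out of range. As a rough illustration only (this is not POI's LittleEndian code; the class, the readIntLE and canReadInt helpers, and the 57-byte buffer length are all assumptions made up for the sketch), a 4-byte little-endian read that starts too close to the end of a buffer fails like this:

```java
class LittleEndianOverrun {
    // Hypothetical helper (not POI's LittleEndian): decodes a 4-byte
    // little-endian int at `offset` without validating the offset first.
    static int readIntLE(byte[] data, int offset) {
        int value = 0;
        // Walk from the most significant byte (highest index) down.
        for (int i = 3; i >= 0; i--) {
            value = (value << 8) | (data[offset + i] & 0xFF);
        }
        return value;
    }

    // True when a 4-byte read at `offset` stays inside the array.
    static boolean canReadInt(byte[] data, int offset) {
        return offset >= 0 && offset <= data.length - 4;
    }

    public static void main(String[] args) {
        // Assumed length, chosen only so the bad index matches the trace.
        byte[] record = new byte[57];
        int offset = 54; // bytes 54..57 are needed, but the last valid index is 56

        System.out.println("in bounds? " + canReadInt(record, offset));
        try {
            readIntLE(record, offset);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The message text differs between Java versions.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

A check along the lines of canReadInt before each read would turn the crash into a detectable "record is shorter than expected" condition, though whether that is the right fix depends on why the offset is wrong in the first place.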
Comment 1 Bj 2006-11-29 04:40:58 UTC
Created attachment 19197
Simplest possible testcase showing the ArrayIndexOutOfBoundsError

I use POI through Nutch for parsing Office documents.

Note: No exception is thrown when I do my tests, but a lot of ERROR messages
are logged, indicating that something is wrong:

ERROR - ContentReaderListener.extractTextBoxes(322) | extractClientTextBoxes
java.lang.ArrayIndexOutOfBoundsException: -353698944
	at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
	at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
	at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:200)
	at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:110)
	at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
	at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
	at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:49)
	at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:77)
	at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:81)
	at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:44)
	at no.creuna.documentparser.DocumentParser.parseDocument(DocumentParser.java:156)
	at test.no.creuna.documentparser.DocumentParserErrorsTest.testArrayIndexOutOfBoundsExceptionErrors(DocumentParserErrorsTest.java:186)
Comment 2 Bj 2006-11-29 04:49:00 UTC
I use POI through Nutch.

When opening the attachment Nutch logs a series of errors from within POI:

ERROR - ContentReaderListener.extractTextBoxes(322) | extractClientTextBoxes
java.lang.ArrayIndexOutOfBoundsException: -353698944
	at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
	at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
	at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:200)
	at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:110)
	at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
	at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
	at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:49)
	at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:77)
	at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:81)
	at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:44)
	at no.creuna.documentparser.DocumentParser.parseDocument(DocumentParser.java:156)
	at test.no.creuna.documentparser.DocumentParserErrorsTest.testArrayIndexOutOfBoundsExceptionErrors(DocumentParserErrorsTest.java:186)
Comment 3 Nick Burch 2007-01-16 07:52:08 UTC
I think this problem has now been fixed, thanks to Yegor's new understanding of
the ordering of TextProps in StyleTextPropAtom.

I can open your test PowerPoint document without any exceptions, so I'm hoping
this is now closed. If you still get problems, can you re-open with a new
problem file?