Bug 61265

Summary: AIOOBE when reading doc file (Section table parsing)
Product: POI
Reporter: gaurav.chd3
Component: HWPF
Assignee: POI Developers List <dev>
Status: NEW
Severity: major
CC: gaurav.chd3
Priority: P2    
Version: 3.16-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: unable to parse

Description gaurav.chd3 2017-07-07 20:01:54 UTC
Created attachment 35104
unable to parse

The full exception stack trace is included below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@547eb45
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
	at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
	at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:357)
	at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:308)
	at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
	at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
	at javax.swing.TransferHandler.importData(Unknown Source)
	at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
	at java.awt.dnd.DropTarget.drop(Unknown Source)
	at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
	at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
	at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown Source)
	at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown Source)
	at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
	at java.awt.Component.dispatchEventImpl(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
	at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
	at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Window.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
	at java.awt.EventQueue.access$500(Unknown Source)
	at java.awt.EventQueue$3.run(Unknown Source)
	at java.awt.EventQueue$3.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
	at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
	at java.awt.EventQueue$4.run(Unknown Source)
	at java.awt.EventQueue$4.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
	at java.awt.EventQueue.dispatchEvent(Unknown Source)
	at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
	at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
	at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
	at java.awt.EventDispatchThread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:288)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:157)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:171)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	... 43 more
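
The root cause in the trace is an ArrayIndexOutOfBoundsException thrown by System.arraycopy inside the SectionTable constructor, which typically means an offset or length read from the file points past the end of the buffer being copied from. The following is a minimal sketch of that failure mode and of a bounds-checked copy, not POI's actual code; the class BoundedCopy and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;

public class BoundedCopy {

    /**
     * Copies {@code length} bytes from {@code src} starting at {@code offset}.
     * Validates the range first so that a corrupt offset or length taken from
     * a file produces a descriptive error instead of a bare AIOOBE.
     */
    static byte[] copyChecked(byte[] src, int offset, int length) {
        if (offset < 0 || length < 0 || offset > src.length - length) {
            throw new IllegalArgumentException(
                "Range [" + offset + ", +" + length + ") is outside a buffer of "
                + src.length + " bytes");
        }
        byte[] dest = new byte[length];
        System.arraycopy(src, offset, dest, 0, length);
        return dest;
    }

    /**
     * Performs the same copy without validation and reports whether
     * System.arraycopy rejected the range. Assumes length is non-negative.
     */
    static boolean uncheckedCopyFails(byte[] src, int offset, int length) {
        byte[] dest = new byte[length];
        try {
            System.arraycopy(src, offset, dest, 0, length);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // ArrayIndexOutOfBoundsException is a subclass of this exception.
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] stream = {1, 2, 3, 4, 5, 6, 7, 8};

        // A range inside the buffer copies normally.
        System.out.println(Arrays.toString(copyChecked(stream, 2, 3)));

        // A range that runs past the end fails inside System.arraycopy.
        System.out.println(uncheckedCopyFails(stream, 6, 4));

        // The checked variant reports which range was invalid.
        try {
            copyChecked(stream, 6, 4);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether the offending value in this particular file is an offset or a size cannot be determined from the trace alone; the sketch only shows why an unchecked copy of an out-of-range region surfaces as this exception.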
Comment 1 PJ Fanning 2017-07-07 20:47:56 UTC
The issue also occurs with the latest POI code.
One possible workaround is to convert the file to docx.
I converted the file to docx and it parsed OK.
Comment 2 Javen O'Neal 2017-07-07 22:56:27 UTC
Does POI choke if you re-save as doc in Word? Sometimes different versions of Word or other office software produce different files.
Comment 3 Javen O'Neal 2017-07-07 22:59:30 UTC
L
Comment 4 gaurav.chd3 2017-07-08 04:34:27 UTC
Converting to docx does not seem to be a fix, as the conversion process changes the file metadata.

Is there a way to update POI to handle the original doc file?
Comment 5 PJ Fanning 2017-07-08 07:33:50 UTC
I converted the doc file using MS Word. I don't think that POI can be used to convert the file right now.
Comment 6 PJ Fanning 2017-07-08 07:40:14 UTC
Using MS Word to resave as doc or docx makes the file parseable in POI.
I used MS Word on a Mac (Word v15.29).
Comment 7 gaurav.chd3 2017-07-09 09:33:36 UTC
How can I re-save the file from doc to docx while retaining the original file metadata (especially the content creation date)?