Bug 53380

Summary: ArrayIndexOutOfBoundsException parsing Word 97 document
Product: POI
Reporter: Tim Barrett <tim.barrett>
Component: HWPF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: major
CC: acougarm
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments:
  offending word doc (attachment 28901)
  Blank DOC file that generates the same error in POI. (attachment 29349)
  build 47 fixed some but not all of the errors with old word 97 docs. Attached still throws an exception (array out of bounds) (attachment 29398)
  Bug persists with Word DOC files and latest build (50) (attachment 29416)

Description Tim Barrett 2012-06-07 10:43:27 UTC
Created attachment 28901 [details]
offending word doc

An out-of-bounds exception occurs (stack trace below) when parsing the attached Word 97 document.


Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@393e6226
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:133)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:400)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 18
	at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
	at org.apache.poi.hwpf.model.Colorref.<init>(Colorref.java:81)
	at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
	at org.apache.poi.hwpf.usermodel.ShadingDescriptor.<init>(ShadingDescriptor.java:38)
	at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
	at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
	at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
	at org.apache.poi.hwpf.model.StyleSheet.<init>(StyleSheet.java:121)
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:346)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 5 more
Comment 1 acougarm 2012-09-10 05:40:40 UTC
We're running into this same issue with many of our DOC files. When will it be addressed? Thank you.
Comment 2 acougarm 2012-09-10 06:41:19 UTC
Created attachment 29349 [details]
Blank DOC file that generates the same error in POI.

This file contains no text (completely blank) and it still generates the POI exception: ArrayIndexOutOfBounds
Comment 3 Sergey Vladimirov 2012-09-11 19:50:47 UTC
Fixed in trunk.

The handling of sprmCShd80 (0x4866, operation 0x66) was incorrect: the operand was read as an Shd structure instead of an Shd80.
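To illustrate why reading the wrong structure overruns the buffer, here is a minimal standalone sketch. It is not the POI code; the class and method names are hypothetical. It assumes the sprm opcode layout and the structure sizes described in the MS-DOC specification: the top three bits of the opcode (spra) give the operand size, Shd80 is 2 bytes, and the full Shd is 10 bytes (two 4-byte COLORREFs plus a 2-byte pattern).

```java
public class SprmOperandSketch {
    // Assumed size of a full Shd: cvFore (4 bytes) + cvBack (4 bytes) + ipat (2 bytes).
    static final int FULL_SHD_SIZE = 10;

    /** Bits 0-8 of the opcode: the operation index (ispmd). */
    static int operation(int opcode) {
        return opcode & 0x1FF;
    }

    /** Bits 10-12 of the opcode: the property group (sgc); 2 means character properties. */
    static int group(int opcode) {
        return (opcode >> 10) & 0x7;
    }

    /** Bits 13-15 (spra) select the operand size in bytes; -1 means variable length. */
    static int operandSize(int opcode) {
        switch ((opcode >> 13) & 0x7) {
            case 0: case 1: return 1;
            case 2: case 4: case 5: return 2;
            case 3: return 4;
            case 7: return 3;
            default: return -1; // spra 6: the length is stored in the data itself
        }
    }

    /** True only if the fixed-size operand is large enough to hold a full Shd. */
    static boolean fitsFullShd(int opcode) {
        return operandSize(opcode) >= FULL_SHD_SIZE;
    }

    /** Decodes a 2-byte little-endian Shd80: icoFore (5 bits), icoBack (5 bits), ipat (6 bits). */
    static int[] decodeShd80(byte[] operand, int offset) {
        int value = (operand[offset] & 0xFF) | ((operand[offset + 1] & 0xFF) << 8);
        return new int[] { value & 0x1F, (value >> 5) & 0x1F, (value >> 10) & 0x3F };
    }

    public static void main(String[] args) {
        int opcode = 0x4866; // sprmCShd80
        System.out.printf("operation=0x%02X group=%d operandSize=%d fitsFullShd=%b%n",
                operation(opcode), group(opcode), operandSize(opcode), fitsFullShd(opcode));

        int[] shd80 = decodeShd80(new byte[] { 0x01, 0x09 }, 0);
        System.out.printf("icoFore=%d icoBack=%d ipat=%d%n", shd80[0], shd80[1], shd80[2]);
    }
}
```

Under these assumptions the operand of 0x4866 is only 2 bytes long, so a reader that expects two 4-byte colour values at that position runs past the end of the data, which is consistent with the LittleEndian.getInt frame in the stack trace.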
Comment 4 acougarm 2012-09-11 19:53:41 UTC
Thanks, Sergey, for fixing this :)
Where can we download the latest build? Thanks again!
Comment 5 Sergey Vladimirov 2012-09-11 19:56:32 UTC
I believe it will be here https://builds.apache.org/job/POI/lastSuccessfulBuild/artifact/build/dist/ today (USA time?)
Comment 6 acougarm 2012-09-12 05:41:47 UTC
Thanks, Sergey. Your build with the fix will probably show up today; the one listed there right now is from 10 September.
Comment 7 Tim Barrett 2012-09-13 07:52:13 UTC
(In reply to comment #6)
> Thanks, Sergey. Your build with the fix will probably show up today; the one
> listed there right now is from 10 September.

Hi guys, Looks like the build (46) failed. Any chance of getting one out today? :-)
Comment 8 Yegor Kozlov 2012-09-13 08:49:56 UTC
It was an internal error in Jenkins:

FATAL: Cannot find executable from the choosen Ant installation "Ant (latest)"
Build step 'Invoke Ant' marked build as failure
[WARNINGS] Skipping publisher since build result is FAILURE
Archiving artifacts

Today's rebuild, #47, was successful.

Yegor

(In reply to comment #7)
> (In reply to comment #6)
> > Thanks, Sergey. Your build with the fix will probably show up today; the one
> > listed there right now is from 10 September.
> 
> Hi guys, Looks like the build (46) failed. Any chance of getting one out
> today? :-)
Comment 9 acougarm 2012-09-13 10:49:29 UTC
Thank you, Sergey and Yegor. The issue has been resolved with Build #47: https://builds.apache.org/job/POI/47/
Comment 10 Tim Barrett 2012-09-19 06:55:23 UTC
Created attachment 29398 [details]
build 47 fixed some but not all of the errors with old word 97 docs. Attached still throws an exception (array out of bounds)
Comment 11 Tim Barrett 2012-09-19 06:56:02 UTC
Still some issues with old Word 97 docs.
Comment 12 Sergey Vladimirov 2012-09-21 07:15:19 UTC
Tim,

All three files now open without exceptions. Please try again.

Sergey
Comment 13 acougarm 2012-09-23 09:03:38 UTC
Thanks for the fix. The latest build (49) is broken right now: https://builds.apache.org/job/POI/49/
Comment 14 Sergey Vladimirov 2012-09-23 13:39:37 UTC
There was an additional problem with the third document provided by Tim. It was caused by a broken internal structure of the list information in the document (i.e. the document was not well-formed).

Today I refactored list processing and added a "safe path" for extracting text (HTML, FO) from such documents.

All HWPF tests pass, so we just need to wait for the next build :)
Comment 15 acougarm 2012-09-25 06:59:23 UTC
Created attachment 29416 [details]
Bug persists with Word DOC files and latest build (50)

The ArrayIndexOutOfBounds bug persists with the latest build (#50) of POI. Please test using the attached blank_2.doc Word DOC file to reproduce.
Comment 16 Sergey Vladimirov 2012-09-25 21:41:45 UTC
acougarm, the current code doesn't throw any errors on simple file parsing or text extraction. Could you please attach the stack trace?
Comment 17 acougarm 2012-09-26 06:53:15 UTC
Thanks, Sergey. We downloaded the latest build from here: https://builds.apache.org/job/POI/50/artifact/build/dist/poi-bin-3.9-beta1-20120924.tar.gz

Here is the stack trace from a Curl command against Solr, using the above build files:

curl "http://localhost:8983/solr/update/extract?extractOnly=true&fmap.content=text" -F "myfile=@blank_2.doc"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">356</int></lst><lst name="error"><str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804</str><str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:454)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:275)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:224)
        ... 31 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
        at org.apache.poi.hwpf.model.Colorref.&lt;init&gt;(Colorref.java:81)
        at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
        at org.apache.poi.hwpf.usermodel.ShadingDescriptor.&lt;init&gt;(ShadingDescriptor.java:38)
        at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
        at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
        at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
        at org.apache.poi.hwpf.model.StyleSheet.&lt;init&gt;(StyleSheet.java:121)
        at org.apache.poi.hwpf.HWPFDocument.&lt;init&gt;(HWPFDocument.java:346)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 34 more
</str><int name="code">500</int></lst>
</response>
Comment 18 Sergey Vladimirov 2012-09-26 07:19:00 UTC
acougarm, that stack trace is from an old version. The current SVN code has nothing at CharacterSprmUncompressor.java:582, nor any call to ShadingDescriptor.<init> from CharacterSprmUncompressor::unCompressCHPOperation().
Comment 19 acougarm 2012-09-26 13:40:55 UTC
Sorry about that, Sergey! Please attribute this to operator error :)

I hadn't replaced all the old POI files, so some of the files from the previous build were still lingering around. Once I deleted those, everything worked beautifully!

Thanks again for your patience.
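
For anyone hitting the same stale-jar situation, one way to check which file a class is actually loaded from is sketched below. This is only a diagnostic sketch under assumptions: the ClassOrigin class and its originOf helper are hypothetical names, and the default class name passed to Class.forName is simply the POI class from the stack trace above. It relies on the standard ProtectionDomain and CodeSource APIs of the JDK.

```java
import java.net.URL;
import java.security.CodeSource;

public class ClassOrigin {
    /** Returns the location (jar or directory) a class was loaded from, if the JVM exposes it. */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null) {
            return "(no code source, e.g. a JDK class)";
        }
        URL location = source.getLocation();
        return location == null ? "(unknown location)" : location.toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Assumption: run with the same classpath as the application under test.
        String name = args.length > 0 ? args[0] : "org.apache.poi.hwpf.HWPFDocument";
        Class<?> type = Class.forName(name);
        System.out.println(name + " loaded from " + originOf(type));
    }
}
```

If the printed location points at a jar from an earlier build, that jar is shadowing the new one on the classpath.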