Bug 53380 - ArrayIndexOutOfBounds Exception parsing Word 97 document
Summary: ArrayIndexOutOfBounds Exception parsing Word 97 document
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: PC All
Importance: P2 major with 4 votes
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-07 10:43 UTC by Tim Barrett
Modified: 2012-09-26 13:40 UTC



Attachments
offending word doc (101.00 KB, application/msword)
2012-06-07 10:43 UTC, Tim Barrett
Blank DOC file that generates the same error in POI. (31.00 KB, application/msword)
2012-09-10 06:41 UTC, acougarm
build 47 fixed some but not all of the errors with old word 97 docs. Attached still throws an exception (array out of bounds) (242.50 KB, application/msword)
2012-09-19 06:55 UTC, Tim Barrett
Bug persists with Word DOC files and latest build (50) (33.50 KB, application/msword)
2012-09-25 06:59 UTC, acougarm

Description Tim Barrett 2012-06-07 10:43:27 UTC
Created attachment 28901 [details]
offending word doc

An out-of-bounds exception occurs (stack trace below) when parsing the attached Word 97 document.


Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@393e6226
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:133)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:400)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 18
	at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
	at org.apache.poi.hwpf.model.Colorref.<init>(Colorref.java:81)
	at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
	at org.apache.poi.hwpf.usermodel.ShadingDescriptor.<init>(ShadingDescriptor.java:38)
	at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
	at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
	at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
	at org.apache.poi.hwpf.model.StyleSheet.<init>(StyleSheet.java:121)
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:346)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 5 more
Comment 1 acougarm 2012-09-10 05:40:40 UTC
We're running into this same issue with many of our DOC files. When will it be addressed? Thank you.
Comment 2 acougarm 2012-09-10 06:41:19 UTC
Created attachment 29349 [details]
Blank DOC file that generates the same error in POI.

This file contains no text (it is completely blank), yet it still triggers the POI exception: ArrayIndexOutOfBoundsException
Comment 3 Sergey Vladimirov 2012-09-11 19:50:47 UTC
Fixed in trunk.

We had an incorrect implementation of sprmCShd80 (0x4866, operation 0x66) processing: Shd was used instead of Shd80.
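
For readers unfamiliar with the Sprm layout, here is a minimal Java sketch of why reading an Shd where an Shd80 is stored can run past the end of the buffer. It assumes the opcode layout and structure sizes given in the [MS-DOC] specification: spra value 2 means a 2-byte operand, and an Shd occupies 10 bytes (two 4-byte COLORREFs plus a 2-byte pattern). The class Shd80Sketch, its helper methods, and the sample bytes are hypothetical illustrations, not POI code.

```java
// Illustrative sketch only; not the POI implementation.
class Shd80Sketch {
    // Sprm opcode layout (assumed from [MS-DOC]):
    // bits 0-8 ispmd, bit 9 fSpec, bits 10-12 sgc, bits 13-15 spra.
    static int ispmd(int sprm) { return sprm & 0x01FF; }
    static int sgc(int sprm)   { return (sprm >> 10) & 0x07; }
    static int spra(int sprm)  { return (sprm >> 13) & 0x07; }

    // Hand-rolled little-endian readers (stand-ins for a library helper).
    static int getUShort(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    static int getInt(byte[] data, int offset) {
        return (data[offset] & 0xFF)
             | ((data[offset + 1] & 0xFF) << 8)
             | ((data[offset + 2] & 0xFF) << 16)
             | ((data[offset + 3] & 0xFF) << 24);
    }

    // Shd80: a single 16-bit value, so 2 bytes are consumed.
    static int readShd80(byte[] data, int offset) {
        return getUShort(data, offset);
    }

    // Shd: cvFore (4 bytes), cvBack (4 bytes), ipat (2 bytes), so 10 bytes are consumed.
    static int[] readShd(byte[] data, int offset) {
        return new int[] {
            getInt(data, offset),
            getInt(data, offset + 4),
            getUShort(data, offset + 8)
        };
    }

    public static void main(String[] args) {
        int sprm = 0x4866; // sprmCShd80
        System.out.printf("ispmd=0x%02X sgc=%d spra=%d%n", ispmd(sprm), sgc(sprm), spra(sprm));

        // Opcode (little-endian) followed by a 2-byte operand; the operand value is arbitrary.
        byte[] grpprl = { 0x66, 0x48, 0x1F, 0x00 };

        // Correct: consume exactly the 2 bytes that the operand occupies.
        System.out.println("Shd80 value: " + readShd80(grpprl, 2));

        // Incorrect: treating the operand as a 10-byte Shd reads past the array end.
        try {
            readShd(grpprl, 2);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Overread: " + e.getMessage());
        }
    }
}
```

Decoding 0x4866 this way gives ispmd 0x66 (the operation number mentioned above), sgc 2 (a character property), and spra 2 (a 2-byte operand), which is consistent with the failure appearing in Colorref via LittleEndian.getInt in the stack trace.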
Comment 4 acougarm 2012-09-11 19:53:41 UTC
Thanks, Sergey, for fixing this :)
Where can we download the latest build? Thanks again!
Comment 5 Sergey Vladimirov 2012-09-11 19:56:32 UTC
I believe it will be here https://builds.apache.org/job/POI/lastSuccessfulBuild/artifact/build/dist/ today (USA time?)
Comment 6 acougarm 2012-09-12 05:41:47 UTC
Thanks, Sergey. Your build with the fix will probably show up today; the one listed there right now is from 10 September.
Comment 7 Tim Barrett 2012-09-13 07:52:13 UTC
(In reply to comment #6)
> Thanks, Sergey. Your build with the fix will probably show up today; the one
> listed there right now is from 10 September.

Hi guys, Looks like the build (46) failed. Any chance of getting one out today? :-)
Comment 8 Yegor Kozlov 2012-09-13 08:49:56 UTC
It was an internal error in Jenkins:

FATAL: Cannot find executable from the choosen Ant installation "Ant (latest)"
Build step 'Invoke Ant' marked build as failure
[WARNINGS] Skipping publisher since build result is FAILURE
Archiving artifacts

Today's rebuild #47 was successful.

Yegor

(In reply to comment #7)
> (In reply to comment #6)
> > Thanks, Sergey. Your build with the fix will probably show up today; the one
> > listed there right now is from 10 September.
> 
> Hi guys, Looks like the build (46) failed. Any chance of getting one out
> today? :-)
Comment 9 acougarm 2012-09-13 10:49:29 UTC
Thank you, Sergey and Yegor. The issue has been resolved with Build #47: https://builds.apache.org/job/POI/47/
Comment 10 Tim Barrett 2012-09-19 06:55:23 UTC
Created attachment 29398 [details]
build 47 fixed some but not all of the errors with old word 97 docs. Attached still throws an exception (array out of bounds)
Comment 11 Tim Barrett 2012-09-19 06:56:02 UTC
There are still some issues with old Word 97 docs.
Comment 12 Sergey Vladimirov 2012-09-21 07:15:19 UTC
Tim,

All three files open without exceptions. Please try again.

Sergey
Comment 13 acougarm 2012-09-23 09:03:38 UTC
Thanks for the fix. The latest build (49) is broken right now: https://builds.apache.org/job/POI/49/
Comment 14 Sergey Vladimirov 2012-09-23 13:39:37 UTC
There was an additional problem with the 3rd document provided by Tim. It was caused by a broken internal structure of the list information in the document (i.e. the document was not well-formed).

Today I refactored list processing and added a "safe path" to extract text (HTML, FO) information from such documents.

All HWPF tests passed, so we need to wait for the next build :)
Comment 15 acougarm 2012-09-25 06:59:23 UTC
Created attachment 29416 [details]
Bug persists with Word DOC files and latest build (50)

The ArrayIndexOutOfBounds bug persists with the latest build (#50) of POI. Please test using the attached blank_2.doc Word DOC file to reproduce.
Comment 16 Sergey Vladimirov 2012-09-25 21:41:45 UTC
acougarm, the current code doesn't throw any errors on simple file parsing or text extraction. Could you please attach the stack trace?
Comment 17 acougarm 2012-09-26 06:53:15 UTC
Thanks, Sergey. We downloaded the latest build from here: https://builds.apache.org/job/POI/50/artifact/build/dist/poi-bin-3.9-beta1-20120924.tar.gz

Here is the stack trace from a curl command against Solr, using the above build files:

curl "http://localhost:8983/solr/update/extract?extractOnly=true&fmap.content=text" -F "myfile=@blank_2.doc"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">356</int></lst><lst name="error"><str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804</str><str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:454)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:275)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:224)
        ... 31 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
        at org.apache.poi.hwpf.model.Colorref.&lt;init&gt;(Colorref.java:81)
        at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
        at org.apache.poi.hwpf.usermodel.ShadingDescriptor.&lt;init&gt;(ShadingDescriptor.java:38)
        at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
        at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
        at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
        at org.apache.poi.hwpf.model.StyleSheet.&lt;init&gt;(StyleSheet.java:121)
        at org.apache.poi.hwpf.HWPFDocument.&lt;init&gt;(HWPFDocument.java:346)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 34 more
</str><int name="code">500</int></lst>
</response>
Comment 18 Sergey Vladimirov 2012-09-26 07:19:00 UTC
acougarm, that stack trace is from an old version. The current SVN code has nothing at line CharacterSprmUncompressor.java:582, nor any call to ShadingDescriptor.<init> from CharacterSprmUncompressor::unCompressCHPOperation().
Comment 19 acougarm 2012-09-26 13:40:55 UTC
Sorry about that, Sergey! Please attribute this to operator error :)

I hadn't replaced all the old POI files, so some of the previous build's files were still lingering around. Once I deleted those, everything worked beautifully!

Thanks again for your patience.
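
For anyone hitting a similar stale-jar situation, one way to check which file a class is actually loaded from is to print its code source location. The following is a small sketch under stated assumptions: the class name WhichJar and the method locate are hypothetical, and the POI class names are taken from the stack traces in this report. It relies only on standard JDK calls and must be run with the same classpath as the application being diagnosed.

```java
import java.security.CodeSource;

// Illustrative diagnostic sketch; WhichJar and locate are hypothetical names.
class WhichJar {
    // Returns the location (typically a jar URL) that a class is loaded from.
    static String locate(String className) {
        try {
            // Load without initializing the class; only its origin is needed.
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown source)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack traces in this report.
        String[] names = {
            "org.apache.poi.hwpf.HWPFDocument",
            "org.apache.poi.util.LittleEndian"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the printed location points at a jar from an earlier build, that older copy is the one being loaded, and removing it from the classpath should let the newer build take effect.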