Bug 54823 - Wrong type on Total Time field in org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.CTProperties
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.8-FINAL
Hardware: PC Linux
Importance: P2 trivial
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-10 09:27 UTC by Gjorgji Josifov
Modified: 2015-03-22 22:08 UTC (History)
0 users



Attachments

Description Gjorgji Josifov 2013-04-10 09:27:12 UTC
Hello, Apache POI devs,
I got this error while parsing a Microsoft Word document with the Apache Tika parser:

org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:125)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
    at xxx.yyyy.services.impl.LuceneServiceImpl.fillDocumentFields(LuceneServiceImpl.java:167)
    at xxx.yyyy.services.impl.LuceneServiceImpl.createLuceneDocumentForFile(LuceneServiceImpl.java:624)
    at xxx.yyyy.services.impl.LuceneServiceImpl.indexNewFile(LuceneServiceImpl.java:650)
    at $LuceneService_63044c23b5df.indexNewFile(Unknown Source)
    at $LuceneService_63044c23b5e0.advised$indexNewFile_63044c23b5fa(Unknown Source)
    at $LuceneService_63044c23b5e0$Invocation_indexNewFile_63044c23b5f9.proceedToAdvisedMethod(Unknown Source)
    at org.apache.tapestry5.internal.plastic.AbstractMethodInvocation.proceed(AbstractMethodInvocation.java:84)
    at xxx.yyyy.services.logging.LoggingAdvice.advise(LoggingAdvice.java:29)
    at org.apache.tapestry5.internal.plastic.AbstractMethodInvocation.proceed(AbstractMethodInvocation.java:86)
    at $LuceneService_63044c23b5e0.indexNewFile(Unknown Source)
    at $LuceneService_63044c23b59b.indexNewFile(Unknown Source)
    at xxx.yyyy.services.impl.IndexScheduleServiceImpl.executeDocumentActions(IndexScheduleServiceImpl.java:119)
    at xxx.yyyy.services.impl.IndexScheduleServiceImpl.access$0(IndexScheduleServiceImpl.java:76)
    at xxx.yyyy.services.impl.IndexScheduleServiceImpl$1.run(IndexScheduleServiceImpl.java:50)
    at org.apache.tapestry5.ioc.internal.services.cron.PeriodicExecutorImpl$Job.invoke(PeriodicExecutorImpl.java:178)
    at org.apache.tapestry5.ioc.internal.services.cron.PeriodicExecutorImpl$Job.invoke(PeriodicExecutorImpl.java:48)
    at org.apache.tapestry5.ioc.internal.services.ParallelExecutorImpl$1.call(ParallelExecutorImpl.java:58)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.xmlbeans.impl.values.XmlValueOutOfRangeException: Invalid int value: 4294934530
    at org.apache.xmlbeans.impl.values.JavaIntHolder.set_text(JavaIntHolder.java:43)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.update_from_wscanon_text(XmlObjectBase.java:1135)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1274)
    at org.apache.xmlbeans.impl.values.JavaIntHolder.intValue(JavaIntHolder.java:53)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.getIntValue(XmlObjectBase.java:1500)
    at org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.impl.CTPropertiesImpl.getTotalTime(Unknown Source)
    at org.apache.tika.parser.microsoft.ooxml.MetadataExtractor.extractMetadata(MetadataExtractor.java:123)
    at org.apache.tika.parser.microsoft.ooxml.MetadataExtractor.extract(MetadataExtractor.java:61)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:115)
    ... 27 more

I investigated the problem, and it seems to come from line 123 of the class org.apache.tika.parser.microsoft.ooxml.MetadataExtractor:
    addProperty(metadata, OfficeOpenXMLExtended.TOTAL_TIME, propsHolder.getTotalTime());

At runtime the Total Time value needs a long, but this method accepts only an int.
This bug is not in Apache Tika itself, but in this interface:
org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.CTProperties
which is part of poi-ooxml-schemas version 3.8 and is used by Apache Tika.
The CTProperties interface declares the return type of getTotalTime() as int, but at runtime the value (4294934530 in this document) only fits in a long, so the return type should be changed to long.
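To illustrate the range problem, here is a minimal standalone sketch. It is not part of POI, Tika, or xmlbeans, and the class and method names are hypothetical. It only shows that the value from the exception message is above Integer.MAX_VALUE and what its low 32 bits look like when read as a signed int; it does not claim to explain how Word produced the value.

```java
class TotalTimeRange {
    // The value reported in the XmlValueOutOfRangeException above.
    static final long REPORTED = 4294934530L;

    // True when the value is representable as a 32-bit signed Java int.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        System.out.println("fits in int: " + fitsInInt(REPORTED));
        // Parsing the same text as a long succeeds.
        System.out.println("as long: " + Long.parseLong("4294934530"));
        // Narrowing keeps only the low 32 bits, which gives a negative number here.
        System.out.println("low 32 bits as signed int: " + (int) REPORTED);
    }
}
```

The narrowed value is negative, which may hint that the writer stored a negative 32-bit number as unsigned, but that is only a guess based on the arithmetic.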
My workaround was to copy the classes MetadataExtractor and OOXMLExtractorFactory, override the class OOXMLParser (adding the method getUnsupportedTypes), and remove the parsing of TOTAL_TIME, because I never use this field.
This workaround can be applied when you use Apache Tika to parse .docx documents.
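An alternative to dropping the field entirely would be to read the raw text tolerantly. The sketch below is a hypothetical helper, not Tika or POI code; it assumes the caller has already obtained the TotalTime text from the document properties by some other means, which is not shown here.

```java
import java.util.OptionalLong;

class TotalTimeText {
    // Hypothetical helper: parse the raw TotalTime text as a long,
    // returning an empty result instead of throwing on bad input.
    static OptionalLong parseMinutes(String raw) {
        if (raw == null) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(raw.trim()));
        } catch (NumberFormatException e) {
            // Not a number at all: skip the property rather than fail the parse.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseMinutes("4294934530")); // value from this report
        System.out.println(parseMinutes("not a number"));
    }
}
```

With this approach an out-of-range or malformed value costs only one metadata field, not the whole document extraction.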
Best regards, Gjorgji
P.S. I hope my explanation was detailed enough.
Comment 1 Nick Burch 2013-04-10 10:05:14 UTC
That class is auto-generated by xmlbeans from the official ooxml schema. Can you check the xml schema and see what that defines the type as being? That'll let us know if it's a bug in xmlbeans, or a bug in the published schema files for the file format...
Comment 2 Gjorgji Josifov 2013-04-10 11:43:26 UTC
Thanks, Nick, for pointing me in the right direction.
I found the OOXML schema:
http://www.schemacentral.com/sc/ooxml/ss.html
and here is the schema that includes the Total Time property:
http://www.schemacentral.com/sc/ooxml/s-shared-documentPropertiesExtended.xsd.html
It says that Total Time is defined as int.
Comment 3 Nick Burch 2013-04-10 11:47:07 UTC
Oh joy, that's a file format bug then...

Do you know what software produced the file, and what said software makes of the value in its properties display section?
Comment 4 Gjorgji Josifov 2013-04-10 12:29:37 UTC
It's Microsoft Word 2007, and the property is "Total editing time", which is given in minutes. That is the value that should be parsed.