Hello, devs from Apache POI,

I got this error while parsing a Microsoft Word document using the Apache Tika parser:

org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:125)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
    at xxx.yyyy.services.impl.LuceneServiceImpl.fillDocumentFields(LuceneServiceImpl.java:167)
    at xxx.yyyy.services.impl.LuceneServiceImpl.createLuceneDocumentForFile(LuceneServiceImpl.java:624)
    at xxx.yyyy.services.impl.LuceneServiceImpl.indexNewFile(LuceneServiceImpl.java:650)
    at $LuceneService_63044c23b5df.indexNewFile(Unknown Source)
    at $LuceneService_63044c23b5e0.advised$indexNewFile_63044c23b5fa(Unknown Source)
    at $LuceneService_63044c23b5e0$Invocation_indexNewFile_63044c23b5f9.proceedToAdvisedMethod(Unknown Source)
    at org.apache.tapestry5.internal.plastic.AbstractMethodInvocation.proceed(AbstractMethodInvocation.java:84)
    at xxx.yyyy.services.logging.LoggingAdvice.advise(LoggingAdvice.java:29)
    at org.apache.tapestry5.internal.plastic.AbstractMethodInvocation.proceed(AbstractMethodInvocation.java:86)
    at $LuceneService_63044c23b5e0.indexNewFile(Unknown Source)
    at $LuceneService_63044c23b59b.indexNewFile(Unknown Source)
    at xxx.yyyy.services.impl.IndexScheduleServiceImpl.executeDocumentActions(IndexScheduleServiceImpl.java:119)
    at xxx.yyyy.services.impl.IndexScheduleServiceImpl.access$0(IndexScheduleServiceImpl.java:76)
    at xxx.yyyy.services.impl.IndexScheduleServiceImpl$1.run(IndexScheduleServiceImpl.java:50)
    at org.apache.tapestry5.ioc.internal.services.cron.PeriodicExecutorImpl$Job.invoke(PeriodicExecutorImpl.java:178)
    at org.apache.tapestry5.ioc.internal.services.cron.PeriodicExecutorImpl$Job.invoke(PeriodicExecutorImpl.java:48)
    at org.apache.tapestry5.ioc.internal.services.ParallelExecutorImpl$1.call(ParallelExecutorImpl.java:58)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.xmlbeans.impl.values.XmlValueOutOfRangeException: Invalid int value: 4294934530
    at org.apache.xmlbeans.impl.values.JavaIntHolder.set_text(JavaIntHolder.java:43)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.update_from_wscanon_text(XmlObjectBase.java:1135)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1274)
    at org.apache.xmlbeans.impl.values.JavaIntHolder.intValue(JavaIntHolder.java:53)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.getIntValue(XmlObjectBase.java:1500)
    at org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.impl.CTPropertiesImpl.getTotalTime(Unknown Source)
    at org.apache.tika.parser.microsoft.ooxml.MetadataExtractor.extractMetadata(MetadataExtractor.java:123)
    at org.apache.tika.parser.microsoft.ooxml.MetadataExtractor.extract(MetadataExtractor.java:61)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:115)
    ... 27 more

I investigated the problem, and it seems the failure comes from line 123 of class org.apache.tika.parser.microsoft.ooxml.MetadataExtractor:

    addProperty(metadata, OfficeOpenXMLExtended.TOTAL_TIME, propsHolder.getTotalTime());

The Total Time value in this document (4294934530) needs a long at runtime, but getTotalTime() accepts only an int.
This bug is not in Apache Tika itself but in the interface org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.CTProperties, which is part of poi-ooxml-schemas version 3.8 and is used by Apache Tika. CTProperties declares the return type of getTotalTime() as int, but at runtime the value is too large for an int, so the return type should be changed to long.

My workaround: copy the classes MetadataExtractor and OOXMLExtractorFactory, override class OOXMLParser (adding the method getUnsupportedTypes), and remove the parsing of TOTAL_TIME, because I never use this field. This workaround can be applied when you use Apache Tika to parse .docx documents.

Best regards,
Gjorgji

P.S. I hope my explanation was detailed enough.
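The range problem described above can be reproduced with the JDK alone. The following is only a minimal sketch: the class TotalTimeGuard and its methods parseAsInt and readLeniently are hypothetical names made up for illustration and are not part of Tika, POI, or XMLBeans. It assumes the failing getter throws an unchecked exception, as the stack trace suggests, and shows how a caller could skip the property instead of failing the whole parse.

```java
import java.util.OptionalInt;
import java.util.function.IntSupplier;

// Hypothetical helper for illustration only; not part of Tika, POI, or XMLBeans.
public class TotalTimeGuard {

    // Parses the raw property text as a 32-bit int; returns empty if it does not fit.
    public static OptionalInt parseAsInt(String raw) {
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    // Calls a getter that may throw an unchecked exception on a bad value
    // and reports "absent" instead of propagating the failure.
    public static OptionalInt readLeniently(IntSupplier getter) {
        try {
            return OptionalInt.of(getter.getAsInt());
        } catch (RuntimeException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        String raw = "4294934530"; // value from the stack trace

        // The value fits in a long but is larger than Integer.MAX_VALUE (2147483647).
        System.out.println(Long.parseLong(raw) > Integer.MAX_VALUE); // true

        // Reading it as an int fails, so the guarded parse yields no value.
        System.out.println(parseAsInt(raw).isPresent()); // false

        // Stand-in for a getter such as getTotalTime() that throws at runtime.
        OptionalInt totalTime = readLeniently(() -> {
            throw new IllegalArgumentException("Invalid int value: " + raw);
        });
        System.out.println(totalTime.isPresent()); // false
    }
}
```

With a guard of this kind the other metadata fields would still be extracted and only the total-time property would be dropped, which is the same effect as the workaround above without copying the Tika classes.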
That class is auto-generated by xmlbeans from the official ooxml schema. Can you check the xml schema and see what that defines the type as being? That'll let us know if it's a bug in xmlbeans, or a bug in the published schema files for the file format...
Thanks, Nick, for pointing me in the right direction. I found the OOXML schema at http://www.schemacentral.com/sc/ooxml/ss.html, and here is the schema that includes the Total Time property: http://www.schemacentral.com/sc/ooxml/s-shared-documentPropertiesExtended.xsd.html. It says that Total Time is defined as int.
Oh joy, that's a file format bug then... Do you know what software produced the file, and what said software makes of the value in its properties display section?
It's Microsoft Word 2007, and the property is "Total editing time", which is in minutes. That is the value that should be parsed.
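One arithmetic observation about the value in the stack trace, offered as speculation rather than something the thread confirms: 4294934530 equals 2^32 - 32766, so it is the unsigned 32-bit representation of the signed int -32766. The sketch below only demonstrates that arithmetic; the class TotalTimeWraparound and the method reinterpretAsSigned are hypothetical names, and whether Word actually wrote a negative counter this way is an assumption.

```java
// Hypothetical illustration; the thread does not confirm how Word produced the value.
public class TotalTimeWraparound {

    // Reinterprets a value in the unsigned 32-bit range as a signed 32-bit int.
    public static int reinterpretAsSigned(long unsigned32) {
        if (unsigned32 < 0 || unsigned32 > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("Not an unsigned 32-bit value: " + unsigned32);
        }
        return (int) unsigned32; // keeps the low 32 bits
    }

    public static void main(String[] args) {
        long raw = 4294934530L; // value from the stack trace

        int signed = reinterpretAsSigned(raw);
        System.out.println(signed); // -32766

        // Round trip: masking the signed int back to 32 unsigned bits gives the raw value.
        System.out.println((signed & 0xFFFFFFFFL) == raw); // true
    }
}
```

If that reading is right, the stored number is a wrapped negative counter rather than a real editing time in minutes, so widening getTotalTime() to long would stop the exception but would still hand callers a meaningless value.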