Bug 60973 - can't parse some vsdx files
Summary: can't parse some vsdx files
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XDGF
Version: unspecified
Hardware: All All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-12 07:03 UTC by Gytis
Modified: 2017-04-12 11:13 UTC

Attachments
visio file example (233.12 KB, application/vnd.ms-visio.drawing)
2017-04-12 08:34 UTC, Gytis
Description Gytis 2017-04-12 07:03:16 UTC
Hi,

1. We're using a single-core Solr 6.4 instance on Windows Server 2012 R2 Standard.
2. Java 8 (build 1.8.0_121-b13).
3. ooxml-schemas-1.3.jar, poi-3.15.jar, poi-ooxml-3.15.jar, poi-scratchpad-3.15.jar.

Even so, we get Solr exceptions for about 2000 vsdx files.
It is critical for us to have them indexed.

Any solutions are welcome.

For most of them I see this exception:


org.apache.poi.POIXMLException: Invalid 'Row_Type' name 'PolylineTo'


For example:


{
    "responseHeader": {
        "status": 500, 
        "QTime": 65
    }, 
    "error": {
        "msg": "org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3c9f695c", 
        "code": 500, 
        "trace": "org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3c9f695c\r\n\tat org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)\r\n\tat org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)\r\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)\r\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)\r\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)\r\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)\r\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)\r\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)\r\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)\r\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\r\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\r\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)\r\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\r\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\r\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\r\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\r\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\r\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\r\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\r\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\r\n\tat 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\r\n\tat org.eclipse.jetty.server.Server.handle(Server.java:534)\r\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\r\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\r\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\r\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\r\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\r\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\r\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\r\n\tat java.lang.Thread.run(Unknown Source)\r\nCaused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3c9f695c\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\r\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)\r\n\tat org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)\r\n\t... 
32 more\r\nCaused by: org.apache.poi.POIXMLException: /visio/masters/masters.xml: /visio/masters/master11.xml: <Shape ID=\"11\">: Invalid 'Row_Type' name 'PolylineTo'\r\n\tat org.apache.poi.xdgf.exceptions.XDGFException.wrap(XDGFException.java:43)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:107)\r\n\tat org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)\r\n\tat org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)\r\n\tat org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)\r\n\tat org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)\r\n\tat org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:207)\r\n\tat org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)\r\n\tat org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\r\n\t... 
35 more\r\nCaused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name 'PolylineTo'\r\n\tat org.apache.poi.xdgf.util.ObjectFactory.load(ObjectFactory.java:45)\r\n\tat org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.load(GeometryRowFactory.java:58)\r\n\tat org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:125)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:125)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:125)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101)\r\n\t... 43 more\r\n", 
        "metadata": [
            "error-class", 
            "org.apache.solr.common.SolrException", 
            "root-error-class", 
            "org.apache.poi.POIXMLException"
        ]
    }
}
Comment 1 Nick Burch 2017-04-12 07:20:02 UTC
Can you please attach a small problematic file to this bugzilla ticket, so we can take a look?
Comment 2 Gytis 2017-04-12 08:34:47 UTC
Created attachment 34906
visio file example
Comment 3 Gytis 2017-04-12 08:35:17 UTC
Please see the attached Visio file.
Comment 4 Nick Burch 2017-04-12 09:42:56 UTC
Thanks for that. A failing (disabled) unit test has been added in r1791098.

It looks like someone will need to look up the specs on the PolylineTo row type, then implement a row type class for it. We have a number of other *To row type classes implemented, so hopefully it won't be too much work once someone finds the bit in the spec which details how these should work!
Comment 5 Gytis 2017-04-12 10:30:27 UTC
When can we expect a new release with this fix implemented?

Is there any possibility of getting a beta version earlier?
Comment 6 Nick Burch 2017-04-12 10:35:07 UTC
A nightly build would be available the day after some kind person implements it. A full release is usually done a few times a year.

We're all volunteers here! If this bug matters to you, please help by digging through the public documentation and helping work out what this new row type needs to do/implement/etc!
Comment 7 Nick Burch 2017-04-12 11:13:39 UTC
Actually, this turned out to be easier than expected and no spec reading was required - there seem to be two variants of the name, PolylineTo and PolyLineTo (note the L can be l or L). In r1791108 the other form has been added as an alias.
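
To illustrate the idea of the alias, here is a minimal, self-contained sketch of a name-to-class lookup that accepts both spellings. It is not the POI code from r1791108: RowTypeRegistry, GeometryRow, PolylineRow and withDefaults are hypothetical names made up for this example, and only the two row type names and the error message text come from this ticket.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch, NOT the actual POI implementation.
public class RowTypeRegistry {

    /** Stand-in for a geometry row; an assumption for this example, not a POI type. */
    public interface GeometryRow {
        String kind();
    }

    /** Stand-in for the class that handles polyline rows. */
    public static final class PolylineRow implements GeometryRow {
        @Override
        public String kind() {
            return "polyline";
        }
    }

    private final Map<String, Supplier<? extends GeometryRow>> factories = new HashMap<>();

    /** Registers one factory under one or more names, so aliases share an implementation. */
    public void register(Supplier<? extends GeometryRow> factory, String... names) {
        for (String name : names) {
            factories.put(name, factory);
        }
    }

    /** Creates a row for the given name, failing in the style of the error reported above. */
    public GeometryRow create(String name) {
        Supplier<? extends GeometryRow> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Invalid 'Row_Type' name '" + name + "'");
        }
        return factory.get();
    }

    /** Builds a registry in which both spellings seen in vsdx files map to the same class. */
    public static RowTypeRegistry withDefaults() {
        RowTypeRegistry registry = new RowTypeRegistry();
        registry.register(PolylineRow::new, "PolyLineTo", "PolylineTo");
        return registry;
    }

    public static void main(String[] args) {
        RowTypeRegistry registry = withDefaults();
        System.out.println(registry.create("PolyLineTo").kind());
        System.out.println(registry.create("PolylineTo").kind());
    }
}
```

The lookup stays case-sensitive and only the two known spellings are registered, so any other unknown name still fails loudly instead of being silently mapped to the wrong row type.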

For future reference on missing XDGF features, the public docs on the VSDX file format are linked from http://poi.apache.org/guidelines.html#FileFormatInformation