Bug 56936 - [PATCH] Upload of PPTX causes very high memory usage leading to system instability
Status: RESOLVED WONTFIX
Alias: None
Product: POI
Classification: Unclassified
Component: XSLF
Version: 3.9-FINAL
Hardware: All All
Importance: P2 critical with 1 vote
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords: PatchAvailable
Depends on:
Blocks:
 
Reported: 2014-09-09 15:39 UTC by Dmitry
Modified: 2015-11-16 23:12 UTC
CC List: 0 users

Attachments
The patch for the POI OOXML library (tar.gz archive, created via the 'ant -f patch.xml' command) (1.02 KB, application/gzip)
2014-09-09 15:39 UTC, Dmitry
Thread stack trace dumps of problematic PPTX analyzing (plain text file) (81.20 KB, text/plain)
2014-09-09 15:43 UTC, Dmitry
Screenshot of profile metrics of problematic PPTX analyzing before patching (88.24 KB, image/jpeg)
2014-09-09 15:45 UTC, Dmitry
Screenshot of profile metrics of problematic PPTX analyzing after patching (81.56 KB, image/jpeg)
2014-09-09 15:46 UTC, Dmitry
custom xml options patch (12.79 KB, patch)
2014-10-23 23:19 UTC, Andreas Beeker

Description Dmitry 2014-09-09 15:39:12 UTC
Created attachment 31978 [details]
The patch for the POI OOXML library (tar.gz archive, created via the 'ant -f patch.xml' command)

Hi.

We experience high memory consumption when certain PPTX documents are analyzed with POI 3.9-beta1. Unfortunately, such a problematic document is not easy to create, and we cannot attach ours because they contain valuable information. We also don’t know which version of MS Office these documents were created in.

Analysis has shown that the problem is mainly in the slides with the formula element. Elements of formula contain symbols in ‘UTF-16LE’ encoding such as ‘
Comment 1 Dmitry 2014-09-09 15:43:19 UTC
Created attachment 31979 [details]
Thread stack trace dumps of problematic PPTX analyzing (plain text file)
Comment 2 Dmitry 2014-09-09 15:45:40 UTC
Created attachment 31980 [details]
Screenshot of profile metrics of problematic PPTX analyzing before patching
Comment 3 Dmitry 2014-09-09 15:46:08 UTC
Created attachment 31981 [details]
Screenshot of profile metrics of problematic PPTX analyzing after patching
Comment 4 Dmitry 2014-09-09 15:51:55 UTC
(In reply to Dmitry from comment #0)
> Created attachment 31978 [details]
> The patch for the POI OOXML library (tar.gz archive, created via the 'ant -f
> patch.xml' command)
> 
> Hi.
> 
> We experience high memory consumption during analysis of certain PPTX
> documents by POI 3.9-beta1. Unfortunately, it is not easy to create such a
> problematic document. And we cannot attach our problematic documents, because
> they contain valuable information. Also we don’t know the version of MS
> Office where these documents were created.
> 
> Analysis has shown that the problem is mainly in the slides with the formula
> element. Elements of formula contain symbols in ‘UTF-16LE’ encoding such as ‘

Comment 5 Dmitry 2014-09-09 15:53:02 UTC

    
Comment 6 Dmitry 2014-09-09 15:54:20 UTC

    
Comment 7 Dmitry 2014-09-09 16:04:52 UTC
I apologize for the incomplete description and the empty comments. I tried to include sample characters in the ‘UTF-16LE’ encoding, but all the text starting from those characters was cut off. The full description follows.

Hi.

We experience high memory consumption when certain PPTX documents are analyzed with POI 3.9-beta1. Unfortunately, such a problematic document is not easy to create, and we cannot attach ours because they contain valuable information. We also don’t know which version of MS Office these documents were created in.

Analysis has shown that the problem lies mainly in slides that contain a formula element. The formula elements contain symbols in ‘UTF-16LE’ encoding. A presentation may have several slides with formulas, but only some specific slides cause the high memory consumption.

The cause of the problem is Piccolo, the SAX parser that ‘XmlBeans’ uses by default. Piccolo's 'CharUtil' processes the text of the problematic slides incorrectly and keeps allocating more and more memory until the JVM runs out of memory (OOM). We did not investigate this in depth. Thread stack trace dumps are attached to this ticket as 'pptx-analysis-stack-trace.txt' (plain text file).

We tried to figure out how to set up another XML reader for POI, but we stumbled upon the following limitation in the POI source code (org.apache.poi.xslf.usermodel.XSLFSlide):

    /**
     * Construct a SpreadsheetML slide from a package part
     *
     * @param part the package part holding the slide data,
     * the content type must be <code>application/vnd.openxmlformats-officedocument.slide+xml</code>
     * @param rel  the package relationship holding this slide,
     * the relationship type must be http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide
     */
    XSLFSlide(PackagePart part, PackageRelationship rel) throws IOException, XmlException {
        super(part, rel);

        SldDocument doc =
            SldDocument.Factory.parse(getPackagePart().getInputStream());
        _slide = doc.getSld();
        setCommonSlideData(_slide.getCSld());
    }

This constructor delegates parsing to the ‘XmlBeans’ implementation, which takes its XML API configuration (the ‘SAXParserFactory’ and ‘XMLReader’ implementations and so on) from ‘XmlOptions’. But ‘XSLFSlide’ does not pass an ‘XmlOptions’ argument to the ‘SldDocument.Factory.parse()’ method. Hence, ‘XmlBeans’ always selects the Piccolo parser.

We have performed some tests and have come up with a patch to the ‘poi-ooxml’ lib (specifically the ‘XSLFSlide’ class) that makes the class use the Xerces SAX parser instead of the Piccolo SAX parser. This change dramatically affects system resource usage: no memory spike is seen at all (compare the ‘pptx-analysis-before-patching.jpg’ and ‘pptx-analysis-after-patching.jpg’ profiling screenshots attached to the ticket).

We have attached the patch as a TAR.GZ archive called ‘patch.tar.gz’ (created using the ‘ant -f patch.xml’ command). We very much hope that this patch will be applied in future versions of POI, because POI 3.10-FINAL does not allow the XML API to be configured either.

But the best option would be the ability to customize the XML API implementation used by POI (SAXParserFactory, XMLReader and so on) in accordance with the following documentation:

http://docs.oracle.com/javase/7/docs/api/javax/xml/parsers/SAXParserFactory.html#newInstance%28%29
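
For illustration only, here is a minimal JDK-only sketch of the lookup that documentation describes. It is not taken from the attached patch. 'SAXParserFactory.newInstance()' consults the 'javax.xml.parsers.SAXParserFactory' system property first and falls back to the platform default, so the reader implementation can be chosen without code changes. The class name 'SaxReaderSelection' and the sample XML are assumptions made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of POI or XmlBeans.
public class SaxReaderSelection {

    /** Builds a namespace-aware XMLReader through the standard JAXP lookup. */
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    /** Parses the given XML and returns all character data concatenated. */
    static String textOf(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader = newReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shows which implementation the lookup selected on this JVM.
        System.out.println(newReader().getClass().getName());
        // Arbitrary sample text; the real problematic slides are not available.
        System.out.println(textOf("<a>x<b>y</b></a>"));
    }
}
```

Running it with a different '-Djavax.xml.parsers.SAXParserFactory=...' value would switch the implementation, provided that factory class is on the classpath.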

Please let me know if I should report this kind of improvement as a separate ticket.

Looking forward to your input.

Regards,

Dmitry
Comment 8 Andreas Beeker 2014-10-23 23:19:38 UTC
Created attachment 32141 [details]
custom xml options patch

How about not hard-coding the XMLReader inside XSLFSlide, and instead providing a mechanism to override the XmlOptions used by all classes?
(see TestXSLFBugs.bug56936())

Please give it a try and see if it works for you.
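
The idea of one central override point can be sketched roughly as follows. This is a JDK-only analogue and does not reproduce the attached patch: the class 'XmlReaderDefaults' and its methods are hypothetical names, and it hands out an 'XMLReader' rather than an XmlBeans 'XmlOptions' object so that the example has no external dependencies.

```java
import java.util.Objects;
import java.util.function.Supplier;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

/** Hypothetical central holder; not part of POI or XmlBeans. */
public final class XmlReaderDefaults {

    /** Default behaviour: whatever the standard JAXP lookup returns. */
    private static final Supplier<XMLReader> JAXP_DEFAULT = () -> {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    };

    /** The supplier every parsing class would consult. */
    private static volatile Supplier<XMLReader> current = JAXP_DEFAULT;

    private XmlReaderDefaults() {}

    /** Replaces the reader source for all callers at once. */
    public static void override(Supplier<XMLReader> supplier) {
        current = Objects.requireNonNull(supplier);
    }

    /** Restores the JAXP default. */
    public static void reset() {
        current = JAXP_DEFAULT;
    }

    /** Called by parsing code instead of hard-coding a parser. */
    public static XMLReader newReader() {
        return current.get();
    }
}
```

With such a holder, a class like XSLFSlide would ask the holder for its parser configuration instead of choosing one itself, and an application could swap the parser once at startup. In XmlBeans 2.x the corresponding hook should be 'XmlOptions.setLoadUseXMLReader(XMLReader)', but that is stated from memory and not verified against the patch.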
Comment 9 Andreas Beeker 2015-11-16 23:12:13 UTC
Although I provided a patch over a year ago, there has been no feedback.
As I'm not convinced by either of the patches, I'm closing this now as WONTFIX.

Andi