Bug 54764

Summary: XSSF : Vulnerable to entity expansion attack
Product: POI Reporter: Phil Persad <philip.persad>
Component: XSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: philip.persad
Priority: P2    
Version: 3.9-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Document which demonstrates entity expansion attack. Do not carelessly open.
Patch file

Description Phil Persad 2013-03-28 22:38:14 UTC
Created attachment 30114 [details]
Document which demonstrates entity expansion attack.  Do not carelessly open.

It is possible to cripple a system which is parsing xml based Office docs using an entity expansion attack.  I've attached a simple billion laughs example .xlsx file.  Please note that this document IS MALICIOUS.  Do not try to parse the contained XML files unless you feel like DoSing your own machine.

I'll also include a patch file which contains a possible solution.  It may be a bit controversial as it will tend to discard an entire document if it detects that any part of it violates reasonable security rules.  That being said, it's well suited to my use case and I thought I'd share.  Some kind soul may, perhaps, refactor it into configurable behaviour.
Comment 1 Phil Persad 2013-03-28 22:39:27 UTC
Created attachment 30115 [details]
Patch file

The afore mentioned patch.
Comment 2 Nick Burch 2013-03-31 08:33:59 UTC
Most of the xml parsing done by POI for the OOXML formats is via Apache XMLBeans. There has been some discussions on tweaking how xmlbeans does the xml processing, for areas like this, if you check the list archives you should hopefully find the details (it's not something I've personally been that involved in)

In addition, there are a few bits of xml parsing relating to the underlying ooxml file format, which came over when the openxml4j project was incorporated into Apache POI. This code was written independently, and so uses a different set of xml processing, to handle the low level file format structure (relations, content types etc). I believe that is the part your patch looks at?

From a quick check of the ooxml related code, it looks like there are a few places where sax parsers are created to work on the ooxml structure. I guess we should probably add a utility helper that creates a more robust sax parser, and then update these places to call that to fetch it. Any chance you could take a look at re-working your patch to do that?
Comment 3 Phil Persad 2013-04-03 21:30:53 UTC
After looking into the issue more closely (while trying to safely parse docx files), I found that the openxml4j code does not pre-parse as many of the various .xml files in a .docx as they do in a .xlsx.  As a result, even if openxml4j is parsing securely, an xml bomb can still sneak past un-parsed and blow up in the XWPFDocument construction.

In the end a simpler solution was to implement pre-parsing of all the xml documents in the docx zip stream with a securely configured parser in the calling application.

I'm not too thrilled with the performance implications, but a node failing due to a bad XML file in an openXML doc is the worst case and worth the performance hit to avoid.
Comment 4 Nick Burch 2014-08-04 19:25:34 UTC
As of r1615720, there are no issues with loading files at the OPC level with hand-crafted nested entities, on the default JVM xml parser. The affected part will be skipped, so if it was eg docProps/core.xml then you just won't see any core properties, but if it was ContentTypes then the whole OPCPackage will be rejected as invalid.

XMLBeans is still an issue though. With XMLBeans 2.3 (the default we ship), it'll hang. With XMLBeans 2.6, the built-in entity expansion detection will trigger at 10,240 bytes expanded and halt parsing. The problem is that XMLBeans doesn't reset the state of the SaxHandler it uses, and that seems to be some sort of singleton. That means that once one naughty file has blown up, all subsequent files will fail to parse as it still thinks it's in the same too-large entity even though it isn't!

It would seem that we need to either fix or workaround this xmlbeans 2.6 bug, then upgrade the xmlbeans runtime that we ship + recommend.

(Because XMLBeans runtimes are happy with older generated classes, but not newer ones, and because some of our users are on platforms which ship + depend on older XMLBeans instances like 2.3, we shouldn't just upgrade the version of xmlbeans we use to compile the schemas with. Instead, once the problem is fixed, we should look to ship 2.3 for compiling but 2.6/fixed by default for running)

If someone has some time to work on this, the thing to do would be to bump the xmlbeans runtime in your dev env up to 2.6, attach the xmlbeans sources, enable the commented out part of TestXSSFBugs.bug54764, then try to work out where the caching of the SaxHandler happens. From that, it'll hopefully be possible to find out if we can workout the lack of byte expansion count reset or not
Comment 5 Nick Burch 2014-08-04 20:20:19 UTC
It seems we're not the only ones to have discovered this xmlbeans 2.6 issue, I've just located https://issues.apache.org/jira/browse/XMLBEANS-512 describing the same problem

A call to SystemCache.get().setSaxLoader(null) seems to fix it, so the fix would seem to be to add that near the start of processing each POIXMLDocument to ensure a non-confused one is always available
Comment 6 Nick Burch 2014-08-04 20:51:07 UTC
In r1615781 I've added a call to SystemCache.get().setSaxLoader(null) to near the top of POIXMLDocument. This means that on XMLBeans 2.6, even without a proper fix for XMLBEANS-512, if one file triggers the entity limit then subsequent normal ones can still be parsed fine. (Looks to be a pretty trivial extra overhead)

I've also bumped the default XMLBeans runtime to 2.6 (schema compiler remains 2.3 for compatibility), added notes to the changelog and site, and enabled the unit tests which show everything behaves itself now when using the JVM-default XML Parser and XMLBeans 2.6.