Bug 54916 - [PATCH] POI does not always read all the slides in pptx files
Summary: [PATCH] POI does not always read all the slides in pptx files
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSLF (show other bugs)
Version: 3.10-dev
Hardware: PC All
: P2 major (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords: PatchAvailable
Depends on:
Blocks:
 
Reported: 2013-05-01 02:07 UTC by virtuald
Modified: 2016-01-10 20:48 UTC (History)
2 users (show)



Attachments
Patch for bug (29.11 KB, application/x-gzip)
2013-05-01 02:07 UTC, virtuald
Details
Updated patch with diagnostic warning log messages (29.55 KB, application/x-gzip)
2013-05-01 19:04 UTC, virtuald
Details
delegate-example to handle parent relations (4.01 KB, text/plain)
2014-02-08 21:51 UTC, Andreas Beeker
Details
Patch for deprecating packagerelationship calls (59.69 KB, application/zip)
2015-12-14 00:02 UTC, Andreas Beeker
Details
2nd version of the relation fix (23.58 KB, application/zip)
2016-01-01 20:44 UTC, Andreas Beeker
Details

Note You need to log in before you can comment on or make changes to this bug.
Description virtuald 2013-05-01 02:07:56 UTC
Created attachment 30248 [details]
Patch for bug

I believe this bug may affect all XML-based formats that POI supports, but my specific use case is with pptx files.

For some reason, a number of pptx files that I was given contain slides with rels that contain relationships to other slides in the ppt/slides/_rels/slideX.xml.rels files. Microsoft Office 2010 is able to deal with these files and open them correctly.

When POI encounters such a file, generally it will omit one or more of the slides when you call getSlides() on the XMLSlideShow object. When this occurs, error messages such as "Slide with r:id 256 was defined, but didn't exist in package, skipping" will be logged. 

I believe what is happening is during building of the parse tree, POI treats all references/relationships as equal, and if there are two or more relationships that occur in the file, then POI uses the first occurrence to set the package relationship field. If the first time that POI runs across it is as a child reference, then the package relationship field is set incorrectly, and later on POI sometimes may not be able to resolve it correctly. 

I've determined that the simplest fix is to first parse all elements at a given level, before trying to resolve elements at lower levels. This fixes my problem, and doesn't break any unit tests.

The attached patch contains the fix, and a sample pptx file that exhibits this behavior with a very simple test case.
Comment 1 virtuald 2013-05-01 19:04:53 UTC
Created attachment 30249 [details]
Updated patch with diagnostic warning log messages

Some testing has shown that other element types sometimes get duplicated, particularly slideMaster elements. One of the problems with diagnosing this problem is that POI doesn't try to detect when this problem occurs, which could be useful for trying to figure out why corrupted pptx files aren't working. I've updated my patch with some log messages that can assist when things aren't quite how they should be.
Comment 2 Nick Burch 2013-06-26 00:46:34 UTC
I've added a disabled unit test in r1496696, based on your supplied file

Patch still needs a little bit more thinking, and possibly a deeper fix...
Comment 3 Andreas Beeker 2014-02-02 02:56:49 UTC
That's really a pity ... I've implemented exactly the same patch (apart of the checks in XMLSlideShow ...) and was just before committing it ...

I think, the duplicated relations (but with different parent/ref-id) should be handled through some kind of proxy or delegate, so when the part is called from the outside, it will behave (i.e. return its ref-ids) like it would belong to the embracing part ...

I have a look (and hope) to patch this soon ...
Comment 4 Andreas Beeker 2014-02-08 21:51:37 UTC
Created attachment 31296 [details]
delegate-example to handle parent relations

In the attached example I'd like to demonstrate what I want to change in those many subclasses of POIXMLDocumentPart.

The main problem (as stated in the bug thread) is, that the current implementation doesn't track the (incoming or parent) package relations right, i.e. a referenced child has only one package relation, which is definitely not true for shared objects like slide masters/layouts.

The straightforward implementation would be, to deprecate POIXMLDocumentPart.getPackageRelationship() and introduce a POIXMLDocumentPart.getPackageRelationship(POIXMLDocumentPart parent) method instead, but as the code is already out in the wild, this is probably not the way to go.

So instead I'd like to introduce a delegate mechanism. When the objects are linked together in POIXMLDocumentPart.read() this would take to care to create the necessary delegates.

The delegate itself is a simple inner class extending the outer class so instance-of checks stay valid and the delegate can be used the same way the original class was.

Instead of duplicating all method calls in the inner class and maybe forgetting to insert a call when a new method is introduced, I thought it's better to handle it via the following method prefix ...

public sometype methodname(args ...) {
   if (isDelegate()) {
      return getThis().methodname(args ...);
   }
   // and now the normal method content
}

So what do you think about this?
(... or maybe I missed a point of how these ooxml classes work ...)
Comment 5 vladk 2014-06-05 08:38:32 UTC
I am observing a related issue when loading one specific PPTX file, that I may not post here, unfortunately.

Some slideMasters cannot be retrieved via XmlSlideShow.getSlideMasters(). The problem is in org.apache.poi.xslf.usermodel.XMLSlideShow.onDocumentRead() where the _masters instance variable is being populated.

In my case a possible fix would be to replace:

_masters.put(p.getPackageRelationship().getId(), master);

by

_masters.put(getRelationId(p), master);

This change should fix the problem for slideMasters only.
Comment 6 Nick Burch 2014-06-05 08:52:08 UTC
(In reply to vladk from comment #5)
> I am observing a related issue when loading one specific PPTX file, that I
> may not post here, unfortunately.

Any chance you could unzip the .pptx file, then zip up just the _rels files and post that? That should let us check if it's the same issue, without sharing any of the actual content.
Comment 7 vladk 2014-06-05 09:16:11 UTC
There are two slide masters:
slideMaster1.xml / rId8 and slideMaster2.xml / rId9, but they both have the same rId1 when referencing from their slideLayouts.

presentation.xml.rels:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
	<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideMaster" Target="slideMasters/slideMaster1.xml"/>
	<Relationship Id="rId13" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide4.xml"/>
	<Relationship Id="rId18" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide9.xml"/>
	<Relationship Id="rId26" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide17.xml"/>
	<Relationship Id="rId39" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/tags" Target="tags/tag1.xml"/>
	<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item3.xml"/>
	<Relationship Id="rId21" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide12.xml"/>
	<Relationship Id="rId34" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide25.xml"/>
	<Relationship Id="rId42" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
	<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item7.xml"/>
	<Relationship Id="rId12" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide3.xml"/>
	<Relationship Id="rId17" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide8.xml"/>
	<Relationship Id="rId25" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide16.xml"/>
	<Relationship Id="rId33" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide24.xml"/>
	<Relationship Id="rId38" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/notesMaster" Target="notesMasters/notesMaster1.xml"/>
	<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item2.xml"/>
	<Relationship Id="rId16" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide7.xml"/>
	<Relationship Id="rId20" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide11.xml"/>
	<Relationship Id="rId29" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide20.xml"/>
	<Relationship Id="rId41" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/viewProps" Target="viewProps.xml"/>
	<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item1.xml"/>
	<Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item6.xml"/>
	<Relationship Id="rId11" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide2.xml"/>
	<Relationship Id="rId24" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide15.xml"/>
	<Relationship Id="rId32" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide23.xml"/>
	<Relationship Id="rId37" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide28.xml"/>
	<Relationship Id="rId40" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/presProps" Target="presProps.xml"/>
	<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item5.xml"/>
	<Relationship Id="rId15" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide6.xml"/>
	<Relationship Id="rId23" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide14.xml"/>
	<Relationship Id="rId28" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide19.xml"/>
	<Relationship Id="rId36" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide27.xml"/>
	<Relationship Id="rId10" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide1.xml"/>
	<Relationship Id="rId19" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide10.xml"/>
	<Relationship Id="rId31" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide22.xml"/>
	<Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item4.xml"/>
	<Relationship Id="rId9" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideMaster" Target="slideMasters/slideMaster2.xml"/>
	<Relationship Id="rId14" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide5.xml"/>
	<Relationship Id="rId22" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide13.xml"/>
	<Relationship Id="rId27" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide18.xml"/>
	<Relationship Id="rId30" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide21.xml"/>
	<Relationship Id="rId35" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide26.xml"/>
	<Relationship Id="rId43" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/tableStyles" Target="tableStyles.xml"/>
</Relationships>

slideLayout1.xml.rels:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
	<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image3.emf"/>
	<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image2.jpeg"/>
	<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideMaster" Target="../slideMasters/slideMaster1.xml"/>
</Relationships>

slidelayout17.xml.rels:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
	<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image3.emf"/>
	<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideMaster" Target="../slideMasters/slideMaster2.xml"/>
</Relationships>
Comment 8 Tim Allison 2014-06-06 13:24:48 UTC
http://social.msdn.microsoft.com/Forums/office/en-US/8982d0bb-9469-4f9c-985d-57fadcc9a655/apparent-duplicate-id-in-relationship-office-open-xml-elements?forum=oxmlsdk

Looks like we'll need to store relationship ids per "relationships" object?  Ugh...
Comment 9 Arjohn Kampman 2015-11-20 10:00:05 UTC
I've been running into this bug this week: text extraction from a pptx file is ignoring one sheet entirely. I have debugged the issue and found out that the bug matches this bug report exactly:

/ppt/_rels/presentation.xml.rels has this line:
   <Relationship Id="rId9" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide5.xml"/>

but this mapping of overwritten when a little later this relation is found in /ppt/slides/_rels/slide3.xml:
   <Relationship Id="rId15" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slide5.xml"/>

As a result of this, the text extractor cannot find slide5 anymore.

Are there any plans to fix this bug?
Comment 10 Andreas Beeker 2015-11-20 15:10:34 UTC
Yes, there a plans to fix this bug - I'll take care of it.

I've used to think, that there should be only a minimum of api change, when an error occur, but regarding this patch, I prefer to remove the misleading methods ...

I'll have a look at it, when 3.14-Beta1 is out.

Andi
Comment 11 Andreas Beeker 2015-12-14 00:02:07 UTC
In the relation-patch.zip I've deprecated (hopefully) all calls having a 
PackageRelationship, because (as noted above) there are a few DocumentParts 
which have more than one incoming relationship.

After this has been applied, the next step would be, to remove all calls having 
a parent, which is basically the same as one-incoming-relationship,
but I'm not sure if this cause any side-effects and therefore haven't yet 
considered it.

The deprecation is dated for POI 3.16, where I'd like to remove all those calls.
Not sure if "deprecated by ..." is a good phrase, as we haven't had an 
expiration date on methods yet ... so maybe suggest a different catchy 
tag/phrase.
Maybe something which can be parsed/handled/validated automatically via the ant 
build.
Comment 12 Andreas Beeker 2015-12-14 00:02:10 UTC
Created attachment 33343 [details]
Patch for deprecating packagerelationship calls
Comment 13 Andreas Beeker 2016-01-01 20:44:48 UTC
Created attachment 33392 [details]
2nd version of the relation fix

I've committed already the unrelated changes, i.e.
- removed @SuppressWarnings("deprecation") because of new the xml schemas
- JUnit4 conversion + close resources

This also contains Javens intial comments, i.e.
- javadoc about RelationPart.getRelationship()
- renamed the deprecated tags to "@deprecated in POI 3.14, scheduled for removal in POI 3.16"
Comment 14 Andreas Beeker 2016-01-10 20:48:28 UTC
Patch applied via r1723966