Bug 47668 - OOXML is parsed as tree, but PPTX is a graph
Summary: OOXML is parsed as tree, but PPTX is a graph
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall (show other bugs)
Version: 3.5-dev
Hardware: All All
: P2 critical (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2009-08-09 23:35 UTC by Stefan Stern
Modified: 2009-08-17 03:16 UTC (History)
0 users

Suggested Patch on trunk revision 802651 (3.71 KB, patch)
2009-08-10 00:23 UTC, Stefan Stern
Details | Diff
Eclipse project with a small testcase (204.10 KB, application/zip)
2009-08-10 00:26 UTC, Stefan Stern

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Stern 2009-08-09 23:35:58 UTC
The 'POIXMLDocumentPart' and 'POIXMLDocument' parse an OOXML document by seeking the main part of the OOXML, represented by an instance of a subclass of 'POIXMLDocument', and then invoke the method "read(POIXMLFactory factory)" recursively on all relations to other PackageParts. 

This works fine in Excel and Word files, as these seem to be trees to a far extent. In PowerPoint, the Slide, SlideLayout and SlideMaster form a graph. The Slides have a relationship to the SlideLayout. SlideLayout has a relationship to the SlideMaster. SlideMaster has a relationship to all SlideLayouts. And the presentation has relations to all Slides and SlideMaster classes. 

When using the existing classes in my current XSLF-Implementation, I end up in an endless loop. The only option to avoid this, is to pass a context object to the loading classes, where all loaded PackagePart and their corresponding XSLF-classes are chached. This allows to avoid any loops and every POIXML-instance can be linked with its related parts. 

Storing is analogue, although a Set is sufficient to prevent endless loops. When storing the document, a Set is passed as context. When invoking all related parts recursively, only those not yet referenced by the Set are stored. 

The change in this behavior implies additional changes: Images used in multiple places refernce from several places to the same PackagePart. There is no 1:1 mapping from appearance in the document and existence of PackageParts. Currently, every relation ends up in a separate instance each time an image is referenced, but all images point to the same PackagePart. This may cause weired behavior when saving the document. 

Example: You have an Excel xlsx-file. This contains two sheets. On each sheet, you have the same JPG image inserted. If you look into the file, you will see that both sheet-object refer to the very same image-PackagePart. But in POI, these references end up in two instances of XSSFPicture. But these two instances refer to the same PackagePart and yet represent two different objects in Excel. In case one of the images is altered, depending on the store-sequence you may encounter an effect to the other image as well or your modification is not persisted at all. 

See the patch attached to this bug to enable a 1:1 mapping between PackageParts and POIXMLDocumentPart-objects. I will try to come up with a test-case for the Excel example.
Comment 1 Stefan Stern 2009-08-10 00:23:17 UTC
Created attachment 24120 [details]
Suggested Patch on trunk revision 802651
Comment 2 Stefan Stern 2009-08-10 00:26:02 UTC
Created attachment 24121 [details]
Eclipse project with a small testcase

This Project contains a testcase. As I was not able to find POI API to modify the XSSFPictureData, the testcases purpose is to show that there are several parts pointing to the same Picture, but  below is the very same PackagePart. Once there is API to modify the Package, the underlying code must be aware that manipulating the image might affect other parts.
Comment 3 Yegor Kozlov 2009-08-12 12:03:00 UTC
fixed in r803667 with some tweaks:

 - POIXMLDocument.load is protected and final. User should neither override it nor call via usermodel API, this method should only be called when constructing documents. I made XSSFWorkbook and XWPFDocument use it.
 - entries in the context are keyed by PackageRelationship, not by PackagePart. PackageRelationship is more appropriate as it overrides hashCode() and equals()
 - removed the old context-less onSave() and write()
 - added the unit test to TestXSSFWorkbook and also created org.apache.poi.TestPOIXMLDocument to test generic parsing / saving of OOXML packages

Comment 4 Stefan Stern 2009-08-14 06:31:44 UTC
>  - entries in the context are keyed by PackageRelationship, not by PackagePart.
> PackageRelationship is more appropriate as it overrides hashCode() and equals()

PackageRelationship.hashCode method makes use of the ID of the relationship. This will most likely always result in different hashes for different PackageRelationship-objects, although these refer to the same PackagePart. 

In the context-map and context-set for loading and writing, the hashcode is used when adding and checking whether element is already contained in list. 

Result is a very similiar behavior then before: each PackagePart has several instances in memory, means I can find more than one XSLFSlide object for the same PackagePart '/ppt/slide/slide1.xml', each one referenced by another part (notesPart, slideMasterPart, etc). 

The only benefit now is, that each PackageRelationship-instance is added only once, so there is not an endless-loop any more. 

Is there a major benefit in using PackageRelationship instead of the PackagePart itself? Or is there a need to instantiate more than one XSLFSLide object for the very same Slide-PackagePart?
Comment 5 Yegor Kozlov 2009-08-14 10:31:55 UTC
I see.
Use of PackageRelationship was a wrong decision, I changed it back to use PackagePart as keys. 

I also improved TestPOIXMLDocument to assert that same logical parts correspond to the same physical instances of POIXMLDocumentPart

Comment 6 Stefan Stern 2009-08-17 03:16:46 UTC
ok, works fine now :)