Issue 8810

Summary:	Enhance XMerge to allow access to embedded objects in OpenOffice XML files.
Product:	xml	Reporter:	Unknown <non-migrated>
Component:	smalldevices	Assignee:	Unknown <non-migrated>
Status:	CLOSED FIXED	QA Contact:	issues@xml <issues>
Severity:	Trivial
Priority:	P3	CC:	issues
Version:	current
Target Milestone:	---
Hardware:	All
OS:	All
Issue Type:	ENHANCEMENT	Latest Confirmation in:	---
Developer Difficulty:	---

Description Unknown 2002-10-29 16:10:44 UTC

The XMerge API needs to be enhanced to provide read/write access to embedded 
objects within an OpenOffice.org XML file.  This will allow XMerge to be used 
for conversions of richer documents than a PDA can handle.

For further information, see the thread on dev@xml.openoffice.org started 
by Henrik Just on 18th October 2002, entitled "Using xmerge to convert rich 
document formats."

The changes will take the form of adding an abstract EmbeddedObject class to 
the org.openoffice.xmerge.converter.xml package.  There will be two concrete 
classes, EmbeddedBinaryObject and EmbeddedXMLObject to represent the two types 
of embedded object allowed in an OpenOffice.org XML file (as of XML File Format 
Specification 1.0).
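
As a rough illustration of the proposed hierarchy, a minimal sketch might 
look like the following.  Only the three class names come from this issue; 
the constructor signatures, the method names and the choice of a DOM 
Document for the XML content are assumptions made for the example and are 
not the committed XMerge API.

```java
import org.w3c.dom.Document;

// Hypothetical sketch: only the class names are taken from the issue text.
abstract class EmbeddedObject {
    private final String name;      // name/path as listed in manifest.xml
    private final String mimeType;  // MIME type of the object

    protected EmbeddedObject(String name, String mimeType) {
        this.name = name;
        this.mimeType = mimeType;
    }

    public String getName() { return name; }
    public String getType() { return mimeType; }
}

// An embedded object stored as a single binary stream in the zip file.
class EmbeddedBinaryObject extends EmbeddedObject {
    private byte[] data;

    public EmbeddedBinaryObject(String name, String mimeType, byte[] data) {
        super(name, mimeType);
        this.data = data.clone();
    }

    public byte[] getBinaryData() { return data.clone(); }
    public void setBinaryData(byte[] data) { this.data = data.clone(); }
}

// An embedded object stored as XML inside the zip file.
// Assumption: its content is exposed as a DOM tree.
class EmbeddedXMLObject extends EmbeddedObject {
    private Document content;  // stays null until a caller sets it

    public EmbeddedXMLObject(String name, String mimeType) {
        super(name, mimeType);
    }

    public Document getContentDOM() { return content; }
    public void setContentDOM(Document content) { this.content = content; }
}
```

A converter plugin could then test the concrete type of each object to 
decide whether to work with raw bytes or with an XML tree.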

Comment 1 Unknown 2002-10-29 16:11:49 UTC

Changes are mostly complete.  Will use this bug to track changes made 
to the XMerge API.

Comment 2 Unknown 2002-10-29 16:21:41 UTC

EmbeddedObject defines accessor methods for the data of the embedded 
object as well as the name/path (within the manifest.xml file) and 
MIME type of the object.  

A number of package private methods also exist to interact with the 
OfficeZip and OfficeDocument classes for storage purposes.

Note that flat OpenOffice.org XML files store embedded objects as 
inline tags/data within the document structure.  The EmbeddedObject 
class and its subclasses are intended to represent embedded objects 
as stored in the zipped OpenOffice.org file format.

Comment 3 Unknown 2002-10-29 16:26:39 UTC

Retrieval of both EmbeddedObject information and the data for each 
EmbeddedObject is deferred until specifically called via provided 
methods.

This incurs a performance penalty when first accessing data, but 
ensures that no performance degradation occurs where embedded object 
data is not a concern.


In order to support the retrieval of data, two new public methods have 
been added to OfficeDocument.  The first returns an Iterator of all 
the embedded objects in the document.  The second returns a specific 
EmbeddedObject instance representing a named object.

An object name can be found from the xlink:href attribute for an 
embedded object in a document's content tree.
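
The deferred retrieval described in this comment can be sketched roughly 
as below.  LazyDocument and LazyObject are hypothetical stand-ins written 
for this example, not the real OfficeDocument or EmbeddedObject classes, 
and the two callbacks merely simulate reading the manifest and the zip 
entries.  The method names getEmbeddedObjects and getEmbeddedObject are 
the ones mentioned elsewhere in this issue.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for an embedded object whose data loads on demand.
class LazyObject {
    private final String name;
    private final Function<String, byte[]> reader;
    private byte[] data;  // null until first requested

    LazyObject(String name, Function<String, byte[]> reader) {
        this.name = name;
        this.reader = reader;
    }

    String getName() { return name; }

    byte[] getData() {
        if (data == null) {
            data = reader.apply(name);  // the cost is paid on first access only
        }
        return data;
    }
}

// Hypothetical stand-in for the document-level lookup methods.
class LazyDocument {
    private final Supplier<List<String>> manifestNames;  // simulates manifest.xml
    private final Function<String, byte[]> reader;       // simulates zip access
    private Map<String, LazyObject> objects;             // null until first use

    LazyDocument(Supplier<List<String>> manifestNames,
                 Function<String, byte[]> reader) {
        this.manifestNames = manifestNames;
        this.reader = reader;
    }

    // Builds the name index once, the first time either public method runs.
    private Map<String, LazyObject> index() {
        if (objects == null) {
            objects = new LinkedHashMap<>();
            for (String n : manifestNames.get()) {
                objects.put(n, new LazyObject(n, reader));
            }
        }
        return objects;
    }

    Iterator<LazyObject> getEmbeddedObjects() {
        return index().values().iterator();
    }

    LazyObject getEmbeddedObject(String name) {
        return index().get(name);  // null when no such object exists
    }
}
```

A conversion that never calls either method never triggers the manifest 
scan, which is the trade-off this comment describes.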

Comment 4 Unknown 2002-10-30 16:20:45 UTC

Tested read and write functionality.  Can successfully read and write 
embedded objects when converting.

Tests on existing plugins show no impact on existing XMerge 
functionality.

All changes now committed.

Comment 5 henrikjust 2002-11-03 15:23:21 UTC

There is a small issue: The code to disable processing the DTD doesn't
work with Crimson as a parser.
Here is a simple fix: In the method "getNamedDOM" in EmbeddedXMLDocument,
    return builder.parse(domData);
can be replaced with
    InputSource is = new InputSource(domData);
    is.setSystemId("");
    return builder.parse(is);
Also, OfficeDocument uses another trick to avoid reading the DTD (the
method "hack"). This code doesn't work with non-ASCII characters (it
doesn't translate from utf-8); to fix that, it should be replaced by
the same code as in EmbeddedXMLDocument.

Comment 6 henrikjust 2002-11-03 21:34:31 UTC

Another detail: There is some confusion with trailing "/" for embedded
objects: In manifest.xml an XML object is named with a trailing "/"
(because it is a directory in the zip file). A binary object does not
have a trailing "/" (since it is a file in the zip file).
The method getEmbeddedObject(String name) in OfficeDocument uses the
name from manifest.xml.

But in the xlink:href attributes as well as in EmbeddedObject objects,
there is never a trailing "/" in the name.
So I think the most consistent solution would be not to require the
trailing "/" in getEmbeddedObject.

Comment 7 Unknown 2002-11-04 10:57:11 UTC

The trailing '/' character should not be required for 
getEmbeddedObject.  When the objects are being read in, any trailing 
'/' character is chopped off.  See getEmbeddedObjects().

The documentation for getEmbeddedObject also states that any '/' 
or '#' characters should be stripped.  These are the extras that 
appear in the xlink:href entry.
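
For callers, the stripping described here could be done with a small 
helper along these lines.  ObjectNames.normalize is a hypothetical name, 
and the exact set of prefixes it removes (a leading '#', a leading "./" 
and leading or trailing '/') is an assumption based on the comments in 
this issue rather than on the XMerge source.

```java
// Hypothetical helper: turns a manifest.xml entry or an xlink:href value
// into the bare object name.
final class ObjectNames {
    private ObjectNames() { }

    static String normalize(String raw) {
        String name = raw.trim();
        if (name.startsWith("#")) {
            name = name.substring(1);   // fragment marker from xlink:href
        }
        if (name.startsWith("./")) {
            name = name.substring(2);   // relative-path prefix
        }
        while (name.startsWith("/")) {
            name = name.substring(1);
        }
        while (name.endsWith("/")) {
            // XML objects are directories in the zip file, so manifest.xml
            // lists them with a trailing '/'
            name = name.substring(0, name.length() - 1);
        }
        return name;
    }
}
```

With this, a manifest entry such as "Object 1/" and an href such as 
"#./Object 1" both reduce to the same lookup key.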

Comment 8 Unknown 2002-11-05 16:40:37 UTC

Fixed the problem with the trailing '/' character.  Also amended the 
hack() method of OfficeDocument to read the byte stream as UTF-8.  
This resolves the issue of searching for a DTD.  The previous 
approach, to use an EntityResolver, did not work consistently on all 
parsers.
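
To illustrate why the encoding matters, here is a minimal sketch that 
contrasts decoding a stream as UTF-8 with treating each byte as one 
character.  Utf8Read and both of its methods are hypothetical names; this 
is not the actual hack() code, only an example of the failure mode 
reported in comment 5.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XMerge.
final class Utf8Read {
    private Utf8Read() { }

    // Decodes the whole stream as UTF-8, so multi-byte characters survive.
    static String readAsUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Casts each byte to a char: correct for ASCII only, and it splits
    // every multi-byte UTF-8 sequence into several wrong characters.
    static String readBytewise(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

For pure ASCII input both methods return the same string, which would 
explain why a problem of this kind only shows up with non-ASCII content.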

Comment 9 Unknown 2002-11-06 11:29:42 UTC

Henrik's development and testing indicate that the changes work as 
they should.  Internal testing shows no regressions.

Henrik's e-mail:

Hi Mark

I've tested the latest version of OfficeDocument and 
EmbeddedXMLDocument.  Everything seems to be perfect! - I have no 
trouble extracting formulas and graphics from a Writer document.

Thanks again!
Henrik


Closing this bug.