81829 – Document layout incomplete after loadComponentFromURL()

Summary: Document layout incomplete after loadComponentFromURL()

Status:	CLOSED FIXED

Alias:	None

Product:	App Dev
Classification:	Unclassified
Component:	api
Version:	3.3.0 or older (OOo)
Hardware:	All All

Importance:	P3 Trivial
Target Milestone:	---
Assignee:	michael.brauer
QA Contact:	issues@api

URL:
Keywords:

Depends on:
Blocks:

Reported:	2007-09-21 16:56 UTC by mnasato
Modified:	2013-02-24 21:09 UTC
CC List:	4 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
Python script to do programmatic conversions (4.46 KB, text/x-python) 2008-03-01 19:30 UTC, mnasato
HTML input file to reproduce issue (150.35 KB, text/html) 2008-03-01 19:31 UTC, mnasato

Description mnasato 2007-09-21 16:56:09 UTC

Currently after a loadComponentFromURL() the document is loaded but cannot be
trusted to be properly laid out, and there is no way to get notified when the
layout is complete either.

One example when this causes problems is when using OOo as a conversion service,
i.e. loading a document and storing it in a different format.

On the JODConverter forum people frequently complain about a particular document
not being converted correctly when OOo is used programmatically, while
everything works fine when using the OOo GUI manually.

The latest reporter (yesterday) suggested downloading the
http://en.wikipedia.org/wiki/Wiki page as HTML and then converting it to Word
format as a way to reproduce the problem.

The simple Python script at
http://www.artofsolving.com/files/DocumentConverter.py can also be used for testing.

Also a link to a related mailing list discussion:
http://api.openoffice.org/servlets/ReadMsg?list=dev&msgNo=18274

Comment 1 sander_marechal 2007-09-22 07:48:50 UTC

I am suffering from the same issue. When I load a large document using
loadComponentFromURL() and then call update() on all the indexes, the page
numbers come out wrong. In my case, all Table of Contents entries that refer to
an item after page 10 are wrong. If I have my script sleep for 10 seconds
between loadComponentFromURL() and updating the indexes, the Table of Contents
is correct.
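
A minimal sketch of the workaround described in this comment, assuming a PyUNO script that already holds the loaded Writer document. The helper name update_all_indexes and the settle_seconds parameter are hypothetical; getDocumentIndexes(), getCount(), getByIndex() and update() are the calls offered by the XDocumentIndexesSupplier, XIndexAccess and XDocumentIndex interfaces. The fixed delay is only the stopgap mentioned above, not a real fix.

```python
import time


def update_all_indexes(document, settle_seconds=0.0):
    """Optionally wait, then update every index of a loaded Writer document.

    `document` is assumed to be the object returned by
    loadComponentFromURL(). Returns the number of indexes updated.
    """
    if settle_seconds > 0:
        # Stopgap from the comment above: give the layout time to finish.
        time.sleep(settle_seconds)

    indexes = document.getDocumentIndexes()
    count = indexes.getCount()
    for position in range(count):
        # Each entry is a document index (e.g. a Table of Contents).
        indexes.getByIndex(position).update()
    return count
```

With the 10 second pause reported above, the call would be update_all_indexes(document, settle_seconds=10).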

Comment 2 Mathias_Bauer 2007-09-27 15:48:24 UTC

I'm not sure that burdening loadComponentFromURL() with layout-related tasks
is a good idea. Actually the fix is easy. Once you have received the
document, query for the css.util.XRefreshable interface and call refresh() on
it. At least in Writer this will make sure that the layout is complete.
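
A short sketch of this suggestion for PyUNO scripts, under the assumption that PyUNO exposes the methods of all supported interfaces directly on the object, so no explicit interface query is written. The helper name refresh_layout is hypothetical; refresh() is the XRefreshable method named above.

```python
def refresh_layout(document):
    """Call refresh() on the document if it offers one.

    Returns True when refresh() was called, False when the object does
    not expose it (i.e. it presumably does not support XRefreshable).
    """
    refresh = getattr(document, "refresh", None)
    if refresh is None:
        return False
    refresh()
    return True
```

Returning a flag instead of raising lets a conversion script use the same code path for document types that cannot be refreshed.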

Comment 3 cno 2008-02-05 13:35:45 UTC

@ mnasato / sander:
does mba's suggestion work for you?

Comment 4 mnasato 2008-02-17 17:42:35 UTC

> does mba's suggestion work for you?
>
Not really, JODConverter already does a refresh() after loadComponentFromURL()
and has been doing that for a long time, but users still complain about layout
issues.

But I realise that this issue is a bit vague at the moment. I'll try to find and
attach a document that can be used to reproduce the problem.

Comment 5 mnasato 2008-03-01 18:25:10 UTC

Ok, here's an example reported today on the JODConverter forum

  http://sourceforge.net/forum/forum.php?thread_id=1955854&forum_id=317001

When converting this HTML file

  http://static.springframework.org/spring/docs/1.2.x/reference/beans.html

to PDF with JODConverter, the generated PDF is sometimes incomplete, e.g. it
contains only 9 pages out of 40. When opening the HTML file manually in Writer
and exporting it to PDF, it contains all the pages.

The number of pages in the incomplete PDF can vary from one attempt to another.
The problem seems to occur more frequently on slower hardware (in fact on faster
machines it may not occur at all). Also it seems to affect Windows more than Linux.

The problem has been discussed on oooforum.org as well, where it has been
confirmed by other people

 http://www.oooforum.org/forum/viewtopic.phtml?p=275942

Comment 6 mnasato 2008-03-01 19:30:16 UTC

Created attachment 51841
Python script to do programmatic conversions

Comment 7 mnasato 2008-03-01 19:31:29 UTC

Created attachment 51842
HTML input file to reproduce issue

Comment 8 mnasato 2008-03-01 19:35:44 UTC

Attached both the Python converter script and the input HTML file to this issue
- in case the external URLs disappear.

To reproduce the problem, execute e.g. on Windows

 "C:\Program Files\OpenOffice.org 2.3\program\python.bat" DocumentConverter.py beans.html beans.pdf

Comment 9 Mathias_Bauer 2008-03-01 20:24:19 UTC

HTML is a special case; the HTML filter unfortunately still works
asynchronously. It seems the original error case was also about HTML. The
answer of sander_marechal confused me as he talked about "long documents"; I
should have spotted that before.

HTML documents indeed may not be loaded completely after loadComponentFromURL(),
and calling refresh() won't help as it does for all other documents.

So the only fix is making the HTML import filter synchronous. Michael, for the
time being I assign this issue to you as you had created that filter a long time
ago. We have to find out who might be able to fix that.

Comment 10 Mathias_Bauer 2008-03-01 20:24:50 UTC

Forgot to confirm. :-)

Comment 11 sander_marechal 2008-03-01 23:49:12 UTC

> the answer of sander_marechal confused me as he talked about "long documents"

Sorry about that :-) But indeed, my issue is with long documents. I also use the
same DocumentConverter.py script for headless conversion. My documents aren't
HTML files but ODT files that were programmatically altered. I'm using ODF-XSLT
[1] for that, which is similar to mnasato's JODReports [2]. In essence I create an
ODT file in OOo, unzip it and open content.xml, repeat a certain piece a couple of
times (making the document longer) and then re-zip the whole bunch to create an
ODT file again. Then I use DocumentConverter.py to convert it to PDF.

[1] http://www.jejik.com/odf-xslt
[2] http://www.artofsolving.com/opensource/jodreports

I made one small change to the DocumentConverter.py script. After the refresh()
call I update the indexes to update the page numbering in the table of contents.
As I described in comment 1, these page numbers sometimes come out wrong. They
come out with the page numbers of the original, short ODT file and not the new,
longer ODT file.

Should I open a new bug for it, or is it still similar enough to be in here?

Comment 12 Mathias_Bauer 2008-03-02 18:44:10 UTC

We have two problems here; one is with the HTML filter and it seems that this
issue started with this case.

So if you have an odt document that doesn't load completely though refresh() was
called *and* DocumentUpdateMode was set to FULL_UPDATE you should open another
issue.
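
A hedged sketch of how the load arguments for that check might be assembled. It assumes the setting referred to here is the media descriptor property UpdateDocMode, and that the FULL_UPDATE constant of com.sun.star.document.UpdateDocMode has the value 3. The helper name build_load_arguments and its make_property parameter are hypothetical; under PyUNO the callable passed in would be com.sun.star.beans.PropertyValue.

```python
# Assumed value of com.sun.star.document.UpdateDocMode.FULL_UPDATE;
# in a real PyUNO script, prefer importing the constant itself.
FULL_UPDATE = 3


def build_load_arguments(make_property, hidden=True):
    """Return a tuple of property objects for loadComponentFromURL().

    `make_property` is any zero-argument callable that returns an object
    with writable `Name` and `Value` attributes.
    """
    settings = (("Hidden", hidden), ("UpdateDocMode", FULL_UPDATE))
    arguments = []
    for name, value in settings:
        prop = make_property()
        prop.Name = name
        prop.Value = value
        arguments.append(prop)
    return tuple(arguments)
```

The resulting tuple would be passed as the last argument of loadComponentFromURL(), followed by a refresh() call on the returned document.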

Comment 13 Mathias_Bauer 2008-07-21 11:02:39 UTC

Meanwhile the asynchronous loading problem was fixed in issue 47763. I assume that
this issue should now be fixed as well.

@mnasato: the fix was integrated into dev300 m24 but *not* into beta 2. If
you want, you can test it with a recent developer build >= m24.

Comment 14 mnasato 2009-05-13 22:32:59 UTC

If the HTML filter is no longer asynchronous I think this issue can be closed. I
haven't had any users complaining about HTML conversions for a while.

If there are other issues with other formats it's probably better to report them
separately.

Comment 15 Mathias_Bauer 2009-05-14 08:58:43 UTC

I agree. 
Michael, you can decide: is it "WFM", "Duplicate" or "FIXED"? :-)

Comment 16 michael.brauer 2009-05-15 08:22:30 UTC

I would say "fixed", because the HTML code is synchronous now.

Comment 17 michael.brauer 2009-05-15 08:25:48 UTC

Closed