Apache OpenOffice (AOO) Bugzilla – Issue 81829
Document layout incomplete after loadComponentFromURL()
Last modified: 2013-02-24 21:09:29 UTC
Currently, after a loadComponentFromURL() call the document is loaded but cannot be trusted to be properly laid out, and there is no way to get notified when the layout is complete either. One example where this causes problems is using OOo as a conversion service, i.e. loading a document and storing it in a different format. On the JODConverter forum people frequently complain about a particular document not being converted correctly when OOo is used programmatically, while everything works fine when using the OOo GUI manually. The latest reporter (yesterday) suggested downloading the http://en.wikipedia.org/wiki/Wiki page as HTML and then converting it to Word format as a way to reproduce the problem. The simple Python script at http://www.artofsolving.com/files/DocumentConverter.py can also be used for testing. A related mailing list discussion: http://api.openoffice.org/servlets/ReadMsg?list=dev&msgNo=18274
I am suffering from the same issue. When I load a large document using loadComponentFromURL() and then call update() on all the indexes, the page numbers come out wrong. In my case, all Table of Contents entries that refer to an item after page 10 are wrong. If I have my script sleep for 10 seconds between loadComponentFromURL() and updating the indexes, the Table of Contents is correct.
I'm not sure that burdening loadComponentFromURL() with layout-related tasks is a good idea. Actually the fix is easy. Once you have received the document, query for the css.util.XRefreshable interface and call refresh() on it. At least in Writer this will make sure that the layout is complete.
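A minimal sketch of this suggestion in Python (PyUNO), in the style of the DocumentConverter.py script mentioned above. It assumes `document` is the object returned by loadComponentFromURL() and that PyUNO exposes the XRefreshable methods directly on it, so a component without that interface simply lacks a `refresh` attribute. The helper name `refresh_layout` is hypothetical.

```python
def refresh_layout(document):
    """Call refresh() if the component exposes css.util.XRefreshable.

    Returns True when refresh() was called, False when the component
    does not offer a refresh() method.
    """
    # Assumption: PyUNO maps interface methods to attributes, so a missing
    # XRefreshable shows up as a missing attribute.
    refresh = getattr(document, "refresh", None)
    if refresh is None:
        return False
    refresh()
    return True
```

Typical use would be `refresh_layout(document)` right after loading and before storeToURL().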
@ mnasato / sander: does mba's suggestion work for you?
> does mba's suggestion work for you?
Not really. JODConverter already does a refresh() after loadComponentFromURL() and has been doing that for a long time, but users still complain about layout issues. But I realise that this issue is a bit vague at the moment, so I'll try to find and attach a document that can be used to reproduce the problem.
Ok, here's an example reported today on the JODConverter forum http://sourceforge.net/forum/forum.php?thread_id=1955854&forum_id=317001 When converting this HTML file http://static.springframework.org/spring/docs/1.2.x/reference/beans.html to PDF with JODConverter, the generated PDF is sometimes incomplete, i.e. contains e.g. 9 pages out of 40. When opening the HTML file manually in Writer and exporting it to PDF it contains all the pages. The number of pages in the incomplete PDF can vary from one attempt to another. The problem seems to occur more frequently on slower hardware (in fact on faster machines it may not occur at all). Also it seems to affect Windows more than Linux. The problem has been discussed on oooforum.org as well, where it has been confirmed by other people http://www.oooforum.org/forum/viewtopic.phtml?p=275942
Created attachment 51841 [details] Python script to do programmatic conversions
Created attachment 51842 [details] HTML input file to reproduce issue
Attached both the Python converter script and the input HTML file to this issue, in case the external URLs disappear. To reproduce the problem on Windows, execute e.g.
"C:\Program Files\OpenOffice.org 2.3\program\python.bat" DocumentConverter.py beans.html beans.pdf
HTML is a special case: the HTML filter unfortunately still works asynchronously. It seems the original error case was also about HTML; the answer of sander_marechal confused me because he talked about "long documents". I should have spotted that earlier. HTML documents may indeed not be loaded completely after loadComponentFromURL(), and calling refresh() won't help there as it does for all other documents. So the only fix is to make the HTML import filter synchronous. Michael, for the time being I assign this issue to you as you created that filter a long time ago. We have to find out who might be able to fix that.
Forgot to confirm. :-)
> the answer of sander_marechal confused me as he talked about "long documents"
Sorry about that :-) But indeed, my issue is with long documents. I also use the same DocumentConverter.py script for headless conversion. My documents aren't HTML files but ODT files that were programmatically altered. I'm using ODF-XSLT [1] for that, which is similar to mnasato's JODReports [2]. In essence I create an ODT file in OOo, unzip it and open content.xml, repeat a certain piece a couple of times (making the document longer) and then re-zip the whole bunch to create an ODT file again. Then I use DocumentConverter.py to convert it to PDF.
[1] http://www.jejik.com/odf-xslt
[2] http://www.artofsolving.com/opensource/jodreports
I made one small change to the DocumentConverter.py script. After the refresh() call I update the indexes to update the page numbering in the table of contents. As I described in comment #2, these page numbers sometimes come out wrong. They come out with the page numbers of the original, short ODT file and not the new, longer ODT file. Should I open a new bug for it, or is it still similar enough to be in here?
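For reference, the change described here (refresh, then update every index) could look roughly like the sketch below. It is not the reporter's actual code: the helper name `update_all_indexes` is hypothetical, and it assumes a Writer document that offers getDocumentIndexes() (css.text.XDocumentIndexesSupplier) and indexes that offer update() (css.text.XDocumentIndex).

```python
def update_all_indexes(document):
    """Refresh the layout, then update every index in a Writer document.

    Returns the number of indexes that were updated.
    """
    # Refresh first so the indexes are built against the current layout.
    refresh = getattr(document, "refresh", None)
    if refresh is not None:
        refresh()

    # getDocumentIndexes() returns an indexed container of document indexes.
    indexes = document.getDocumentIndexes()
    count = indexes.getCount()
    for i in range(count):
        indexes.getByIndex(i).update()
    return count
```

As the comment above reports, this sequence alone did not always produce correct page numbers for long documents, so the sketch only illustrates the call order under discussion.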
We have two problems here; one is with the HTML filter and it seems that this issue started with this case. So if you have an odt document that doesn't load completely though refresh() was called *and* DocumentUpdateMode was set to FULL_UPDATE you should open another issue.
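A hedged sketch of loading with that setting from Python. It assumes the "DocumentUpdateMode" mentioned above corresponds to the `UpdateDocMode` media descriptor property and the css.document.UpdateDocMode constants; the helper names `media_descriptor` and `load_document` are hypothetical, and `desktop` is assumed to be a css.frame.Desktop obtained elsewhere.

```python
def media_descriptor(full_update_mode, hidden=True):
    """Return the load options as plain (name, value) pairs."""
    return (("Hidden", hidden), ("UpdateDocMode", full_update_mode))


def load_document(desktop, url, options):
    """Load url through desktop.loadComponentFromURL() with the given options."""
    # Imported lazily: these modules only exist inside an office Python.
    import uno  # noqa: F401  (enables the com.sun.star import hook)
    from com.sun.star.beans import PropertyValue

    args = []
    for name, value in options:
        prop = PropertyValue()
        prop.Name = name
        prop.Value = value
        args.append(prop)
    return desktop.loadComponentFromURL(url, "_blank", 0, tuple(args))
```

The mode value would normally be looked up rather than hard-coded, e.g. `uno.getConstantByName("com.sun.star.document.UpdateDocMode.FULL_UPDATE")`, and passed to `media_descriptor()` before calling `load_document()`.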
Meanwhile, the asynchronous loading problem was fixed in issue 47763. I assume that this issue is now fixed as well. @mnasato: the fix was integrated into dev300 m24 but *not* into the beta 2. If you want, you can test it with a recent developer build >=m24.
If the HTML filter is no longer asynchronous I think this issue can be closed. I haven't had any users complaining about HTML conversions for a while. If there are other issues with other formats it's probably better to report them separately.
I agree. Michael, you can decide: is it "WFM", "Duplicate" or "FIXED"? :-)
I would say "fixed", because the HTML code is synchronous now.
Closed