Issue 86650 - Slow loading of very large HTML files with WebQuery filter
Summary: Slow loading of very large HTML files with WebQuery filter
Alias: None
Product: Calc
Classification: Application
Component: open-import
Version: DEV300m1
Hardware: All All
Importance: P3 Trivial with 4 votes
Target Milestone: ---
Assignee: joerg.skottke
QA Contact: issues@sc
Depends on:
Reported: 2008-03-03 15:33 UTC by niklas.nebel
Modified: 2013-08-07 15:14 UTC
CC List: 2 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---

Large HTML file (zipped) (532.50 KB, application/x-compressed)
2008-03-03 15:34 UTC, niklas.nebel

Description niklas.nebel 2008-03-03 15:33:05 UTC
Loading the attached file takes several hours. Apparently, much of the time is
spent in ScHTMLTable::InsertNewCell / ScRangeList::Find, which can probably be
Comment 1 niklas.nebel 2008-03-03 15:34:56 UTC
Created attachment 51866
Large HTML file (zipped)
Comment 2 daniel.rentz 2008-03-03 15:53:45 UTC
Comment 3 froyshov 2009-04-16 21:15:49 UTC
This has been a defect since OOo 1.x until now. Both my wife and I are
hindered in our work because of this. My wife just spent 5 hours trying to open
a file of 8,500 KB before she gave up! And the processor runs flat out at 100%,
so you cannot use the computer for anything else. After 5 hours soffice.exe was
using over 200MB of RAM plus over 200MB of virtual RAM.

This file is from a database with 6000 members that she wanted to open.

Workaround: Use Microsoft Office Excel - opens in 7 seconds!
Comment 4 daniel.rentz 2009-04-29 19:32:37 UTC
Any document containing more than 65535 text paragraphs (e.g. table cells) will
be cropped at this limit. This limitation is caused by the underlying EditEngine
used to import HTML. The filter will import 65535 cells and silently drop
anything else. The attached document contains 64000 rows and 15 cells per row.
This means that only the first ~4370 rows of the document will be loaded.
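The row estimate above can be checked with a few lines of arithmetic. This is only an illustrative sketch, not code from the filter: the function names are hypothetical, and it assumes each table cell consumes exactly one EditEngine paragraph.

```cpp
#include <cassert>

// Hypothetical helper: number of complete rows that fit within the
// paragraph limit, assuming one paragraph per table cell.
inline long completeRowsWithinLimit(long maxParagraphs, long cellsPerRow) {
    assert(cellsPerRow > 0);
    return maxParagraphs / cellsPerRow;
}

// Hypothetical helper: number of cells dropped once the limit is reached.
inline long droppedCells(long rows, long cellsPerRow, long maxParagraphs) {
    const long total = rows * cellsPerRow;
    return total > maxParagraphs ? total - maxParagraphs : 0;
}
```

With the figures from this comment, `completeRowsWithinLimit(65535, 15)` gives 4369 rows, and `droppedCells(64000, 15, 65535)` gives 894465 cells that would not be imported.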
Performance can be improved significantly by using several range lists to track
merged cells and used table area, instead of one range list that is used for all
(member ScHTMLTable::maLockList).
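A minimal sketch of the idea of splitting the tracking by purpose, for illustration only. The types and names below are hypothetical and are not the actual ScHTMLTable or ScRangeList code; the sketch assumes merged cells are rare compared to ordinary cells and that the used area can be kept as a single bounding box.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; not the OpenOffice classes.
struct CellRange {
    int col1, row1, col2, row2;  // inclusive bounds
    bool contains(int col, int row) const {
        return col1 <= col && col <= col2 && row1 <= row && row <= row2;
    }
};

class TableLayoutTracker {
public:
    // Registers a cell at (col, row) spanning colSpan x rowSpan positions.
    void addCell(int col, int row, int colSpan, int rowSpan) {
        assert(colSpan >= 1 && rowSpan >= 1);
        const CellRange r{col, row, col + colSpan - 1, row + rowSpan - 1};
        // Only merged cells enter the list that later lookups have to scan.
        if (colSpan > 1 || rowSpan > 1)
            merged_.push_back(r);
        // The used area is one bounding box, updated in constant time.
        if (empty_) {
            used_ = r;
            empty_ = false;
        } else {
            used_.col1 = std::min(used_.col1, r.col1);
            used_.row1 = std::min(used_.row1, r.row1);
            used_.col2 = std::max(used_.col2, r.col2);
            used_.row2 = std::max(used_.row2, r.row2);
        }
    }

    // True if (col, row) lies inside any merged cell, including its
    // top-left position. Scans merged cells only, not every cell.
    bool isCoveredByMergedCell(int col, int row) const {
        return std::any_of(merged_.begin(), merged_.end(),
                           [col, row](const CellRange& r) {
                               return r.contains(col, row);
                           });
    }

    // True if (col, row) lies inside the bounding box of all cells so far.
    bool isInUsedArea(int col, int row) const {
        return !empty_ && used_.contains(col, row);
    }

    std::size_t mergedCount() const { return merged_.size(); }

private:
    std::vector<CellRange> merged_;
    CellRange used_{0, 0, 0, 0};
    bool empty_ = true;
};
```

In a table with few merged cells, the list scanned per lookup stays short regardless of how many ordinary cells have been inserted, which is the effect the comment describes.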
Comment 5 daniel.rentz 2009-05-06 15:27:27 UTC
Loading time reduced from several hours to 7 minutes. All O(n^2) algorithms have
been replaced by something more appropriate. See also issue 100827.
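The comment does not say which data structures replaced the quadratic code, so the following is only one possible sketch of the general technique: replacing a linear scan over all stored ranges with an ordered lookup. The class `RowOccupancy` is hypothetical and assumes occupied column intervals within a row never overlap.

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Hypothetical helper: occupied column intervals of one table row,
// keyed by first column, value is the last column (inclusive).
class RowOccupancy {
public:
    // Marks columns [first, last] as occupied. Assumes the interval
    // does not overlap any interval that is already stored.
    void occupy(int first, int last) {
        assert(first <= last);
        intervals_[first] = last;
    }

    // True if col lies inside an occupied interval; O(log n) per call.
    bool isOccupied(int col) const {
        auto it = intervals_.upper_bound(col);  // first interval starting after col
        if (it == intervals_.begin())
            return false;
        --it;  // interval with the greatest start <= col
        return col <= it->second;
    }

    // Returns the first free column at or after col.
    int nextFree(int col) const {
        auto it = intervals_.upper_bound(col);
        if (it != intervals_.begin()) {
            auto prev = std::prev(it);
            if (col <= prev->second)
                col = prev->second + 1;
        }
        // Skip intervals that start exactly at the candidate column.
        while (it != intervals_.end() && it->first == col) {
            col = it->second + 1;
            ++it;
        }
        return col;
    }

private:
    std::map<int, int> intervals_;
};
```

Checking a position against n stored ranges with a plain list costs O(n) per cell and O(n^2) over a whole import. With an ordered map the membership test is logarithmic, so the total stays near O(n log n).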
Comment 6 daniel.rentz 2009-06-09 09:46:43 UTC
back to QA
Comment 7 joerg.skottke 2009-07-01 08:39:35 UTC
I measured even better times in a VM - around 70 seconds for the 48k_rows
document using the web query filter. Using the HTML filter takes a lifetime.

Nice improvement!
Comment 8 amy2008 2009-07-17 07:08:52 UTC
I have tried to open the attachment in DEV300m52 on WinXP, but failed to open
the file.
Comment 9 joerg.skottke 2009-10-09 08:11:34 UTC