Issue 96958 - Problem in importing big HTML files in Writer 3.0
Summary: Problem in importing big HTML files in Writer 3.0
Status: CONFIRMED
Alias: None
Product: Writer
Classification: Application
Component: open-import
Version: OOo 3.0
Hardware: PC All
Importance: P3 Trivial with 7 votes
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: oooqa, performance
Depends on:
Blocks:
 
Reported: 2008-12-05 16:06 UTC by rbattistoni
Modified: 2023-01-03 13:31 UTC
CC List: 6 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: 4.2.0-dev
Developer Difficulty: ---


Attachments
HTML file with table (3.27 KB, text/plain)
2010-03-04 09:08 UTC, pawo509

Description rbattistoni 2008-12-05 16:06:39 UTC
Writer in OpenOffice 3.0 on Windows Vista has serious problems opening big HTML
files (> ~10 MB). Sometimes Writer hangs, and opening takes far too long!

Writer in OpenOffice 2.4.1 doesn't have this kind of problem.
Comment 1 eric.savary 2008-12-05 16:53:42 UTC
10 Mb of pure (text) HTML????!!!

Please upload your document to a free Internet storage site and post the link here.
Comment 2 rbattistoni 2008-12-05 17:20:23 UTC
OK. This is a rebuilt HTML file of 15 MB.

http:\\www.robertobattistoni.it\test.rar

Note: with this test file, Writer 3.0 doesn't hang, but it takes a long time to
open it. Once the file is open, navigating through it
is very difficult. With another HTML test file Writer hangs, but I cannot give
you that file because it contains confidential information.

Comment 3 eric.savary 2008-12-05 23:36:28 UTC
Well, indeed it's a pure HTML document which is bigger than 16 Mb and has 338339
lines! No wonder that it takes time to load!

In my tests, Firefox also takes very long to load it (I interrupted the load
after 9'30), and MS Word displays an error about a *.css reference and
loads only 6 pages of text.

I wanted to close as WONTFIX because it's an extreme case which simply reaches
the limit of the software's capacity. But, yes, OOo 2.4.1 loads it at least as
"fast" as Firefox.

@AMA: what have we changed between 2.4.1 and 3.0 in this area?
Comment 4 rbattistoni 2008-12-06 07:28:34 UTC
The reliability of OOo 3.0 in opening this kind of file could be a serious
problem when you want to use OOo as a conversion server (via UNO). It's not
only a problem of loading time, because in some cases OOo 3.0 halts and kills
itself (only 3.0; it works in 2.4.1 and 2.4.2).

My test file has some problems in its HTML because it's not well formed,
and it doesn't perfectly reproduce the problem I had. I'll change the file and
resend it to you ASAP.

I think it's normal that Firefox, as a browser, has some difficulty opening very
big files. But this shouldn't happen in a text editor. The first is a
browser and has timeout limits; the second should load big files too.

The message Word 2007 shows about the missing CSS is not a problem in a conversion
process using the Word API. I don't want to start a Word vs. OOo battle, but in this
case Word 2007 works perfectly as a conversion server, and the previous
version of OOo loads big HTML files very fast (but I'd like to use OSS).

It seems that the older versions of OOo (< 3.0) load the file step by step, because
after a few seconds you can see the file in Writer. Writer
3.0 instead seems to load the *entire* file into memory before showing it.


Comment 5 pawo509 2010-03-04 09:07:26 UTC
I confirm the problem with importing large HTML files. I use OpenOffice 3.2
under Windows XP as a conversion service to PDF.

My file isn't as big as the file mentioned above. It has about 10000 lines and its
size is about 388 KB. The file contains a table with 97 rows and 33 columns.
1. When I try to load this HTML file using UNO, OpenOffice hangs.
2. When I try to load this HTML file in Writer using File -> Open - OpenOffice
also hangs.

By "hangs" I mean using about 50% CPU time and approximately 100 MB of memory
(memory usage keeps growing over time) for several minutes. After about 7
minutes I killed the OpenOffice process.
Comment 6 pawo509 2010-03-04 09:08:23 UTC
Created attachment 68141
HTML file with table
Comment 7 pyrix 2010-06-25 00:19:35 UTC
I have experimented with this problem (using OO 3.2.0 Build 9483 on a
Windows XP SP3 box) and found that HTML files with large tables up to around
200 KB seem to be OK. Over this size they may load, but when the Writer
window is resized or maximized it hangs, using up to 100% CPU. Large HTML tables
seem to be the problem. (If the file does not have a large table it seems to be
fine.)
Comment 8 damjan 2023-01-02 19:10:13 UTC
Calc opens the file perfectly, but Writer hangs in an infinite loop. Attaching a debugger and backtracing a few times, I saw it's often running code in this function from main/sw/source/core/doc/docbm.cxx:

---snip---
::rtl::OUString MarkManager::getUniqueMarkName(const ::rtl::OUString& rName) const
{
    OSL_ENSURE(rName.getLength(),
        "<MarkManager::getUniqueMarkName(..)> - a name should be proposed");
    if ( findMark(rName) == getAllMarksEnd() )
    {
        return rName;
    }

    ::rtl::OUStringBuffer sBuf;
    ::rtl::OUString sTmp;
    for(sal_Int32 nCnt = 1; nCnt < SAL_MAX_INT32; nCnt++)
    {
        sTmp = sBuf.append(rName).append(nCnt).makeStringAndClear();
        if ( findMark(sTmp) == getAllMarksEnd() )
        {
            break;
        }
    }
    return sTmp;
}
---snip---

That "for" loop has a limit of SAL_MAX_INT32 (over 2 billion), and the condition that would cause it to "break" seems to never be met, thus it just spins there.

Putting a breakpoint on that "if" statement within the "for" loop and printing the contents of "sTmp" on each loop run, I get:

__tmpTD1547
__tmpTD1548
__tmpTD1549
...

and the "break" is never reached.
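
A minimal sketch of why a loop of this shape can look like a hang even though it does terminate. It is a model, not the AOO code: NaiveMarks, contains, uniqueName, add and comparisonsFor are hypothetical names, and it assumes that findMark scans the container linearly and that every caller proposes the same base name (as the "__tmpTD" output above suggests).

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-in for the mark container: names only, kept in a vector.
struct NaiveMarks {
    std::vector<std::string> names;
    std::size_t comparisons = 0;  // string comparisons performed, for illustration

    // Assumed behaviour of findMark: a linear scan over all stored names.
    bool contains(const std::string& name) {
        for (const std::string& existing : names) {
            ++comparisons;
            if (existing == name) return true;
        }
        return false;
    }

    // Same shape as getUniqueMarkName: the suffix search restarts at 1 on every call.
    std::string uniqueName(const std::string& proposed) {
        assert(!proposed.empty());
        if (!contains(proposed)) return proposed;
        std::string candidate;
        for (int n = 1; n < std::numeric_limits<int>::max(); ++n) {
            candidate = proposed + std::to_string(n);
            if (!contains(candidate)) break;
        }
        return candidate;
    }

    // Picks a unique name and stores it, as a caller creating a mark would.
    std::string add(const std::string& proposed) {
        std::string name = uniqueName(proposed);
        names.push_back(name);
        return name;
    }
};

// Adds `count` marks that all propose the same name; returns the comparisons made.
std::size_t comparisonsFor(int count) {
    NaiveMarks marks;
    for (int i = 0; i < count; ++i) marks.add("__tmpTD");
    return marks.comparisons;
}
```

In this model comparisonsFor(10) is 210 and comparisonsFor(100) is 171600, so ten times as many marks cost roughly 800 times as many comparisons. Each new mark re-tests every suffix already taken, and each test is itself a scan of the whole container.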
Comment 9 damjan 2023-01-02 19:12:48 UTC
Reproduced on FreeBSD PC so changing hardware to "All", setting latest confirmation version, and clearing "needmoreinfo".
Comment 10 Peter 2023-01-02 21:05:26 UTC
what is the return value of getAllMarksEnd()?
Comment 11 damjan 2023-01-03 02:29:38 UTC
(In reply to Peter from comment #10)
> what is the return value of getAllMarksEnd()?

    IDocumentMarkAccess::const_iterator_t MarkManager::getAllMarksEnd() const
        { return m_vAllMarks.end(); }

where OpenGrok tells us ::sw::mark::MarkManager's m_vAllMarks is defined in main/sw/source/core/inc/MarkManager.hxx as:

    // container for all marks
    container_t m_vAllMarks;

and container_t is defined in main/sw/inc/IDocumentMarkAccess.hxx as:

    typedef ::std::vector< pMark_t > container_t;
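
In other words, getAllMarksEnd() is the vector's past-the-end iterator, so comparing a lookup result against it is the usual "not found" test. A small sketch of that idiom, using hypothetical stand-ins (Mark, MarkPtr, Container, findByName, nameIsFree) and not the real pMark_t or findMark, whose implementation is not shown here:

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Mark { std::string name; };        // hypothetical stand-in for a mark
using MarkPtr = std::shared_ptr<Mark>;    // hypothetical stand-in for pMark_t
using Container = std::vector<MarkPtr>;   // same shape as container_t

// Linear search by name; yields marks.end() when no mark has that name.
Container::const_iterator findByName(const Container& marks, const std::string& name) {
    return std::find_if(marks.begin(), marks.end(),
                        [&name](const MarkPtr& m) { return m->name == name; });
}

// Mirrors the test `findMark(x) == getAllMarksEnd()`: true when the name is unused.
bool nameIsFree(const Container& marks, const std::string& name) {
    return findByName(marks, name) == marks.end();
}
```

If findMark works like findByName here, every call walks the vector from the start, which is cheap for a handful of marks and costly for thousands.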
Comment 12 Czesław Wolański 2023-01-03 11:23:01 UTC
(In reply to damjan from comment #8)
> Calc opens the file perfectly, but Writer hangs in an infinite loop.

If "the file" above means a file from the archive
attached to this report ("html_table.html"),
Writer opens it on my laptop after about 4 to 5 minutes.
Comment 13 damjan 2023-01-03 12:50:10 UTC
(In reply to Czesław Wolański from comment #12)
> (In reply to damjan from comment #8)
> > Calc opens the file perfectly, but Writer hangs in an infinite loop.
> 
> If "the file" above means a file from the archive
> attached to this report ("html_table.html"),
> Writer opens it on my laptop after about 4 to 5 minutes.

That must be an awesome laptop then, it takes closer to 25 minutes on my PC.

Nice find, thank you.

So it's a performance bug then. Maybe we should replace that ::std::vector with a dictionary of some kind, ::std::map or whatever? Or it could also be something in the calling code that does too many lookups.
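
A rough sketch of the dictionary idea, under stated assumptions: IndexedMarkNames, used, nextSuffix and makeUnique are hypothetical names, it tracks names only, and it is not a patch against MarkManager. It combines a hash set for name lookups with a per-name counter so the suffix search does not restart at 1.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical name index kept alongside the mark container.
struct IndexedMarkNames {
    std::unordered_set<std::string> used;             // every name currently in use
    std::unordered_map<std::string, int> nextSuffix;  // next suffix to try per proposed name

    // Returns a unique name and reserves it in the same step.
    std::string makeUnique(const std::string& proposed) {
        assert(!proposed.empty());
        if (used.insert(proposed).second) return proposed;  // name was free
        // emplace leaves an existing counter untouched and returns it.
        int& n = nextSuffix.emplace(proposed, 1).first->second;
        for (;;) {
            std::string candidate = proposed + std::to_string(n++);
            if (used.insert(candidate).second) return candidate;
        }
    }
};
```

Two behaviours differ from getUniqueMarkName and would need checking before anything like this is used: the name is reserved immediately rather than just proposed, and suffixes freed by deleted marks are not reused. The index would also have to be kept in sync whenever marks are renamed or removed.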
Comment 14 Czesław Wolański 2023-01-03 13:31:23 UTC
(In reply to damjan from comment #13)
>
> That must be an awesome laptop then, it takes closer to 25 minutes on my PC.
> 
Acer Aspire A317-53 (Intel Core i7 @ 2.80 GHz, 16 GB RAM, 512 GB SSD)
Windows 11 Home

Nothing out of the ordinary, I guess.