Apache OpenOffice (AOO) Bugzilla – Issue 96958
Problem in importing big HTML files in Writer 3.0
Last modified: 2023-01-03 13:31:23 UTC
Writer in OpenOffice 3.0 on Windows Vista has serious problems opening big HTML files (> ~10 MB). Sometimes Writer hangs, and the opening time is far too long! Writer in OpenOffice 2.4.1 doesn't have these problems.
10 Mb of pure (text) HTML????!!! Please upload your document to a free Internet storage site and post the link here.
OK. This is a rebuilt HTML file of 15 MB. http:\\www.robertobattistoni.it\test.rar Note: with this test file, Writer 3.0 doesn't hang, but it takes a long time to open it. Once the file is open, navigating through it is very difficult. With another HTML test file Writer hangs, but I cannot give you that file because it contains confidential information.
Well, indeed it's a pure HTML document which is bigger than 16 MB and has 338339 lines! No wonder that it takes time to load! My tests showed that Firefox also takes very long to load it (after 9'30 I interrupted the load process), and MS Word displays an error about a *.css reference and loads only 6 pages of text. I wanted to close this as WONTFIX because it's an extreme case which simply reaches the limit of the software's capacity. But, yes, OOo 2.4.1 loads it at least as "fast" as Firefox. @AMA: what have we changed between 2.4.1 and 3.0 in this area?
The reliability of OOo 3.0 in opening this kind of file could be a serious problem when you want to use OOo as a conversion server (via UNO). It's not only a problem of loading time, because in some cases OOo 3.0 halts and kills itself (only 3.0; in 2.4.1 and 2.4.2 it works). My test file has some problems in the HTML because it's not well formed, and it doesn't perfectly reproduce the problem I had. I'll change the file and resend it to you ASAP. I think it's normal that Firefox, as a browser, has some difficulty opening very big files. But this shouldn't happen in a text editor. The first is a browser and has timeout limits; the second should be able to load big files too. The message Word 2007 shows about the missing CSS is not a problem in a conversion process using the Word API. I don't want to start a Word vs. OOo battle, but in this case Word 2007 works perfectly as a conversion server, and the previous version of OOo loads big HTML files very fast (and I'd like to use OSS). It seems that older versions of OOo (< 3.0) load the file step by step, because after a few seconds you can see the file in Writer. Writer 3.0 instead seems to load the *entire* file into memory before showing it to you.
I confirm the problem with importing large HTML files. I use OpenOffice 3.2 under Windows XP as a conversion service to PDF. My file isn't as big as the file mentioned above. It has about 10000 lines and its size is about 388 KB. The file contains a table with 97 rows and 33 columns. 1. When I try to load this HTML file using UNO, OpenOffice hangs. 2. When I try to load this HTML file in Writer using File -> Open, OpenOffice also hangs. By "hangs" I mean it uses about 50% CPU time and approximately 100 MB of memory (memory usage keeps growing while it works) for several minutes. After about 7 minutes I killed the OpenOffice process.
Created attachment 68141 [details] HTML file with table
I have experimented with this problem (using OO 3.2.0 Build 9483 on a Windows XP SP3 box) and found that HTML files with large tables seem to be OK up to around 200 KB. Above this size they may load, but when the Writer window is resized or maximized it hangs, using up to 100% CPU. Large HTML tables seem to be the problem. (If the file does not have a large table, it seems to be fine.)
Calc opens the file perfectly, but Writer hangs in an infinite loop. Attaching a debugger and backtracing a few times, I saw it's often running code in this function from main/sw/source/core/doc/docbm.cxx:
---snip---
::rtl::OUString MarkManager::getUniqueMarkName(const ::rtl::OUString& rName) const
{
    OSL_ENSURE(rName.getLength(),
        "<MarkManager::getUniqueMarkName(..)> - a name should be proposed");
    if ( findMark(rName) == getAllMarksEnd() )
    {
        return rName;
    }
    ::rtl::OUStringBuffer sBuf;
    ::rtl::OUString sTmp;
    for(sal_Int32 nCnt = 1; nCnt < SAL_MAX_INT32; nCnt++)
    {
        sTmp = sBuf.append(rName).append(nCnt).makeStringAndClear();
        if ( findMark(sTmp) == getAllMarksEnd() )
        {
            break;
        }
    }
    return sTmp;
}
---snip---
That "for" loop has a limit of SAL_MAX_INT32 (over 2 billion), and the condition that would cause it to "break" seems to never be met, thus it just spins there. Putting a breakpoint on that "if" statement within the "for" loop and printing the contents of "sTmp" on each loop run, I get:
__tmpTD1547
__tmpTD1548
__tmpTD1549
...
and the "break" is never reached.
Reproduced on FreeBSD PC so changing hardware to "All", setting latest confirmation version, and clearing "needmoreinfo".
what is the return value of getAllMarksEnd()?
(In reply to Peter from comment #10)
> what is the return value of getAllMarksEnd()?
IDocumentMarkAccess::const_iterator_t MarkManager::getAllMarksEnd() const
{
    return m_vAllMarks.end();
}
where OpenGrok tells us ::sw::mark::MarkManager's m_vAllMarks is defined in main/sw/source/core/inc/MarkManager.hxx as:
    // container for all marks
    container_t m_vAllMarks;
and container_t is defined in main/sw/inc/IDocumentMarkAccess.hxx as:
    typedef ::std::vector< pMark_t > container_t;
(In reply to damjan from comment #8)
> Calc opens the file perfectly, but Writer hangs in an infinite loop.
If "the file" above means a file from the archive attached to this report ("html_table.html"), Writer opens it on my laptop after about 4 to 5 minutes.
(In reply to Czesław Wolański from comment #12)
> (In reply to damjan from comment #8)
> > Calc opens the file perfectly, but Writer hangs in an infinite loop.
>
> If "the file" above means a file from the archive
> attached to this report ("html_table.html"),
> Writer opens it on my laptop after about 4 to 5 minutes.
That must be an awesome laptop then, it takes closer to 25 minutes on my PC.
Nice find, thank you. So it's a performance bug then. Maybe we should replace that ::std::vector with a dictionary of some kind, ::std::map or whatever? Or it could also be something in the calling code that does too many lookups.
(In reply to damjan from comment #13)
> That must be an awesome laptop then, it takes closer to 25 minutes on my PC.
Acer Aspire A317-53 (Intel Core i7 @ 2.80 GHz, 16 GB RAM, 512 GB SSD), Windows 11 Home. Nothing out of the ordinary, I guess.