Issue 110486 - Calc/Writer: data loss on import table from .html file
Summary: Calc/Writer: data loss on import table from .html file
Status: CLOSED DUPLICATE of issue 57176
Alias: None
Product: Calc
Classification: Application
Component: ui
Version: OOO320m14
Hardware: All All
Importance: P2 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-30 10:09 UTC by bormant
Modified: 2023-01-04 20:34 UTC
CC List: 4 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
test file (82.74 KB, application/x-compressed)
2010-03-30 10:10 UTC, bormant

Description bormant 2010-03-30 10:09:41 UTC
Experiment 1:
1) start OOo
2) File - Open - test.html, HTML Document (OpenOffice.org Calc)
3) the document opens in Calc (be patient)
4) navigate to F column, press Ctrl+Down arrow

In my case, column F breaks off at row 10923 and column A at row 10924 (error).
Columns B and C are complete.

Experiment 2:
1) start OOo
2) File - Open - test.html, All files
3) the document opens in Writer/Web (be patient)
4) navigate to F:10667 in the table

In my case, the table breaks off in column F at row 10667, and the remaining
data follows the table (error).
Comment 1 bormant 2010-03-30 10:10:53 UTC
Created attachment 68632
test file
Comment 2 bormant 2010-03-30 10:28:20 UTC
Experiment 3:
1) open test.html in browser
2) select and copy whole table
3) open Calc
4) paste (Ctrl+V)

In my case, column F breaks off at row 10923 and column A at row 10924 (error).
Columns B and C are complete.

If step 4 uses Paste Special (Ctrl+Shift+V) - Unformatted text instead, the
complete table is pasted into the document (without formatting, of course).
Comment 3 helenrussian 2010-03-30 11:32:21 UTC
I confirm experiments 1 and 2 with OOo 3.2 on Linux as described.

In experiment 3 my browser hangs. :(
Comment 4 gquigs 2010-05-21 07:01:55 UTC
Another example file is available on the Launchpad bug:
https://bugs.launchpad.net/ubuntu/+source/openoffice.org/+bug/480130

In this file it cuts off at 6555. This bug appears in Ubuntu's OOo, in Go-oo on
Windows, and in OpenOffice 3.2.1rc.

I tested all files with both Excel and Firefox; nothing is wrong with the HTML
in this bug report or in Launchpad's.
Comment 5 gquigs 2010-05-21 07:40:19 UTC
You can create a new document that exhibits this behavior by using Excel:
Fill columns 1-4 with data down to row 18,000. The data can simply be 1-17,999.
Save it as a Web Page from Excel.

Open it in OpenOffice and scroll down to the end. You will notice that instead
of a table with multiple columns, there is only one column at the bottom
(it's cut off, but the data looks like it has been merged into it).

Bug still exists in DEV300m77 on Windows.
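
For anyone without Excel, the recipe above can be approximated with a short script. This is only a sketch: it writes a plain HTML table of 4 columns and 18,000 rows, not the richer markup that Excel's "Save as Web Page" produces, so it is an assumption that it triggers the same cut-off. The function names and the output file name are hypothetical.

```python
from pathlib import Path


def build_table_html(rows: int = 18000, cols: int = 4) -> str:
    """Return an HTML page holding one table of rows x cols cells.

    Every cell in a row simply contains the row number.
    """
    lines = ["<html><body><table>"]
    for r in range(1, rows + 1):
        cells = "".join(f"<td>{r}</td>" for _ in range(cols))
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table></body></html>")
    return "\n".join(lines)


def write_table_html(path: str, rows: int = 18000, cols: int = 4) -> None:
    """Write the generated table to the given path (hypothetical helper)."""
    Path(path).write_text(build_table_html(rows, cols), encoding="utf-8")
```

Calling write_table_html("big_table.html") would produce a file to open via File - Open with the "HTML Document (OpenOffice.org Calc)" filter, as in experiment 1.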
Comment 6 Olaf Felka 2010-05-21 07:50:19 UTC
This would be better reviewed by the Spreadsheet folks.
Comment 7 damjan 2022-12-27 16:58:07 UTC
Of the pasted cells, only the first 65534 have colored text; the rest are black. So it looks like some 16-bit limit (65534 = 2^16 - 2).

Here is a partial stack trace from what looks like a useful point in the HTML parsing code, with the modules annotated:

sc:
#0  ScHTMLLayoutParser::NewActEntry(ScEEParseEntry*) (this=this@entry=0x80fcc61d0, pE=0x80e3ea930) at source/filter/html/htmlpars.cxx:202
#1  0x0000000814878d79 in ScHTMLLayoutParser::CloseEntry(ImportInfo*) (this=0x80fcc61d0, pInfo=<optimized out>) at source/filter/html/htmlpars.cxx:743
#2  0x000000081487b14b in ScHTMLLayoutParser::TableDataOff(ImportInfo*) (this=this@entry=0x80fcc61d0, pInfo=0x80e3ea930, pInfo@entry=0x7fffffffc9e0) at source/filter/html/htmlpars.cxx:1032
#3  0x0000000814879aa7 in ScHTMLLayoutParser::ProcToken(ImportInfo*) (this=this@entry=0x80fcc61d0, pInfo=pInfo@entry=0x7fffffffc9e0) at source/filter/html/htmlpars.cxx:1594
#4  0x0000000814879981 in ScHTMLLayoutParser::HTMLImportHdl(ImportInfo*) (this=0x80fcc61d0, pInfo=0x7fffffffc9e0) at source/filter/html/htmlpars.cxx:848
#5  0x0000000814878607 in ScHTMLLayoutParser::LinkStubHTMLImportHdl(void*, void*) (pThis=0x80fcc61d0, pCaller=0x80e3ea930) at source/filter/html/htmlpars.cxx:747

editeng:
#6  0x000000080443dff7 in EditHTMLParser::NextToken(int) (this=0x80e33af10, nToken=695) at source/editeng/eehtml.cxx:514

svtools:
#7  0x0000000801c699e9 in HTMLParser::Continue(int) (this=0x80e33af10, nToken=1450261705) at source/svhtml/parhtml.cxx:362
#8  0x0000000801c69983 in HTMLParser::CallParser() (this=0x80e33af10) at source/svhtml/parhtml.cxx:344

editeng:
#9  0x000000080443db30 in EditHTMLParser::CallParser(ImpEditEngine*, EditPaM const&) (this=0x80e33af10, pImpEE=<optimized out>, rPaM=<optimized out>) at source/editeng/eehtml.cxx:101
#10 0x000000080446f039 in ImpEditEngine::ReadHTML(SvStream&, String const&, EditSelection, SvKeyValueIterator*) (this=this@entry=0x80fb0f810, rInput=..., rBaseURL=..., aSel=..., pHTTPHeaderAttrs=pHTTPHeaderAttrs@entry=0x80fd8ba50)
    at source/editeng/impedit4.cxx:209
#11 0x000000080446eb30 in ImpEditEngine::Read(SvStream&, String const&, EETextFormat, EditSelection, SvKeyValueIterator*) (this=0x80fb0f810, rInput=..., rBaseURL=..., eFormat=EE_FORMAT_HTML, aSel=..., pHTTPHeaderAttrs=0x80fd8ba50)
    at source/editeng/impedit4.cxx:111
#12 0x0000000804425cfb in EditEngine::Read(SvStream&, String const&, EETextFormat, SvKeyValueIterator*) (this=0x80e3762f0, rInput=..., rBaseURL=..., eFormat=EE_FORMAT_HTML, pHTTPHeaderAttrs=0x80fd8ba50)
    at source/editeng/editeng.cxx:1352

sc:
#13 0x0000000814878498 in ScHTMLLayoutParser::Read(SvStream&, String const&) (this=0x80fcc61d0, rStream=..., rBaseURL=...) at source/filter/html/htmlpars.cxx:173
#14 0x0000000814882c37 in ScEEImport::Read(SvStream&, String const&) (this=0x80fec9f00, rStream=..., rBaseURL=...) at source/filter/rtf/eeimpars.cxx:98
#15 0x000000081081d623 in ScImportExport::HTML2Doc(SvStream&, String const&) (this=this@entry=0x7fffffffce40, rStrm=..., rBaseURL=...) at source/ui/docshell/impex.cxx:2013
#16 0x000000081081b543 in ScImportExport::ImportStream(SvStream&, String const&, unsigned long) (this=0x7fffffffce40, rStrm=..., rBaseURL=..., nFmt=<optimized out>) at source/ui/docshell/impex.cxx:472
#17 0x00000008108626c1 in ScViewFunc::PasteDataFormat(unsigned long, com::sun::star::uno::Reference<com::sun::star::datatransfer::XTransferable> const&, short, int, Point*, unsigned char, unsigned char)
    (this=0x80fe36890, nFormatId=51, rxTransferable=..., nPosX=0, nPosY=<optimized out>, pLogicPos=0x0, bLink=0 '\000', bAllowDialogs=1 '\001') at source/ui/view/viewfun5.cxx:319
#18 0x000000081085d144 in ScViewFunc::PasteFromSystem(unsigned long, unsigned char) (this=this@entry=0x80fe36890, nFormatId=nFormatId@entry=51, bApi=bApi@entry=0 '\000') at source/ui/view/viewfun3.cxx:809
#19 0x000000081085b1ee in ScViewFunc::PasteFromSystem() (this=0x80fe36890) at source/ui/view/viewfun3.cxx:657
#20 0x00000008108f39a9 in ScCellShell::PasteFromClipboard(ScViewData*, ScTabViewShell*, bool) (pViewData=0x80fe36898, pTabViewShell=pTabViewShell@entry=0x80fe36810, bShowDialog=<optimized out>) at source/ui/view/cellsh1.cxx:2184
Comment 8 damjan 2022-12-27 18:52:29 UTC
Furthermore, in a debug build, pressing Ctrl+Alt+Shift+D and enabling "MessageBox" for errors gives the following when pasting:

Error: Can't process more than 64K paragraphs!
From File main/editeng/source/editeng/impedit2.cxx at Line 2941

So it's a 16-bit paragraph limit we're hitting.
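
As a rough sanity check of that limit against the reported symptoms, the sketch below estimates how many complete table rows fit under 64K paragraphs. It assumes that every table cell costs exactly one paragraph and that the test table has six columns (the description mentions columns A through F). Both are assumptions, and the real accounting inside the edit engine may differ.

```python
PARAGRAPH_LIMIT = 2 ** 16  # "64K paragraphs" from the debug message


def last_complete_row(columns: int, limit: int = PARAGRAPH_LIMIT) -> int:
    """Number of full rows that fit if each cell uses one paragraph (assumed)."""
    return limit // columns


print(last_complete_row(6))  # 10922
```

Under those assumptions the table runs out of paragraphs just after row 10922, which is close to the break at rows 10923/10924 reported in experiments 1 and 3.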

Attaching gdb and backtracing:

#0  0x000000080080c63a in _poll () at /lib/libc.so.7
#1  0x00000008006c2776 in  () at /lib/libthr.so.3
#2  0x0000000806afd431 in  () at /usr/local/lib/libglib-2.0.so.0
#3  0x0000000806afd558 in g_main_context_iteration () at /usr/local/lib/libglib-2.0.so.0
#4  0x000000080665015b in GtkXLib::Yield(bool, bool) (this=0x8060ca010, bWait=true, bHandleAllCurrentEvents=<optimized out>) at unx/gtk/app/gtkdata.cxx:874
#5  0x00000008029b56ca in ImplYield(bool, bool) (i_bWait=true, i_bAllEvents=false) at source/app/svapp.cxx:476
#6  0x0000000802bc5828 in Dialog::Execute() (this=0x7fffffff8420) at source/window/dialog.cxx:701
#7  0x00000008029a4d99 in SolarMessageBoxExecutor::doIt() (this=<optimized out>) at source/app/dbggui.cxx:1862
#8  0x0000000802bae2a2 in vcl::SolarThreadExecutor::impl_execute(TimeValue const*) (this=0x7fffffff8798, _pTimeout=0x7fffffff87c8) at source/helper/threadex.cxx:103
#9  0x00000008029a4eab in DbgPrintMsgBox(char const*)
    (pLine=0x7fffffff8820 "Error: Can't process more than 64K paragraphs!\nFrom File main/editeng/source/editeng/impedit2.cxx at Line 2941") at source/app/dbggui.cxx:1900
#10 0x0000000800f7984b in DbgOut(char const*, unsigned short, char const*, unsigned short) (pMsg=<optimized out>, nDbgOut=nDbgOut@entry=3, pFile=<optimized out>, nLine=<optimized out>, nLine@entry=2941) at source/debug/debug.cxx:1753
#11 0x0000000804454a7d in ImpEditEngine::ImpInsertParaBreak(EditPaM const&, unsigned char) (this=this@entry=0x81d5b9010, rPaM=..., bKeepEndingAttribs=1 '\001') at source/editeng/impedit2.cxx:2941
#12 0x0000000804455154 in ImpEditEngine::ImpInsertParaBreak(EditSelection const&, unsigned char) (this=0x81d5b9010, rCurSel=..., bKeepEndingAttribs=1 '\001') at source/editeng/impedit2.cxx:2934
#13 0x000000080443e6b3 in EditHTMLParser::ImpInsertParaBreak() (this=this@entry=0x80dc99510) at source/editeng/eehtml.cxx:526
#14 0x000000080443ede0 in EditHTMLParser::EndPara(unsigned char) (this=this@entry=0x80dc99510) at source/editeng/eehtml.cxx:761
#15 0x000000080443e321 in EditHTMLParser::NextToken(int) (this=0x80dc99510, nToken=695) at source/editeng/eehtml.cxx:249
#16 0x0000000801c699e9 in HTMLParser::Continue(int) (this=0x80dc99510, nToken=4) at source/svhtml/parhtml.cxx:362
#17 0x0000000801c69983 in HTMLParser::CallParser() (this=0x80dc99510) at source/svhtml/parhtml.cxx:344
#18 0x000000080443db30 in EditHTMLParser::CallParser(ImpEditEngine*, EditPaM const&) (this=0x80dc99510, pImpEE=<optimized out>, rPaM=<optimized out>) at source/editeng/eehtml.cxx:101
#19 0x000000080446f039 in ImpEditEngine::ReadHTML(SvStream&, String const&, EditSelection, SvKeyValueIterator*) (this=this@entry=0x81d5b9010, rInput=..., rBaseURL=..., aSel=..., pHTTPHeaderAttrs=pHTTPHeaderAttrs@entry=0x813c519e0)
    at source/editeng/impedit4.cxx:209
#20 0x000000080446eb30 in ImpEditEngine::Read(SvStream&, String const&, EETextFormat, EditSelection, SvKeyValueIterator*) (this=0x81d5b9010, rInput=..., rBaseURL=..., eFormat=EE_FORMAT_HTML, aSel=..., pHTTPHeaderAttrs=0x813c519e0)
    at source/editeng/impedit4.cxx:111
#21 0x0000000804425cfb in EditEngine::Read(SvStream&, String const&, EETextFormat, SvKeyValueIterator*) (this=0x813d7acf0, rInput=..., rBaseURL=..., eFormat=EE_FORMAT_HTML, pHTTPHeaderAttrs=0x813c519e0)
    at source/editeng/editeng.cxx:1352
#22 0x000000081328d498 in ScHTMLLayoutParser::Read(SvStream&, String const&) (this=0x81d62ba50, rStream=..., rBaseURL=...) at source/filter/html/htmlpars.cxx:173
#23 0x0000000813297c37 in ScEEImport::Read(SvStream&, String const&) (this=0x813d10500, rStream=..., rBaseURL=...) at source/filter/rtf/eeimpars.cxx:98
#24 0x000000080ea1d623 in ScImportExport::HTML2Doc(SvStream&, String const&) (this=this@entry=0x7fffffffce40, rStrm=..., rBaseURL=...) at source/ui/docshell/impex.cxx:2013
#25 0x000000080ea1b543 in ScImportExport::ImportStream(SvStream&, String const&, unsigned long) (this=0x7fffffffce40, rStrm=..., rBaseURL=..., nFmt=<optimized out>) at source/ui/docshell/impex.cxx:472
#26 0x000000080ea626c1 in ScViewFunc::PasteDataFormat(unsigned long, com::sun::star::uno::Reference<com::sun::star::datatransfer::XTransferable> const&, short, int, Point*, unsigned char, unsigned char)
    (this=0x80af4d890, nFormatId=51, rxTransferable=..., nPosX=0, nPosY=<optimized out>, pLogicPos=0x0, bLink=0 '\000', bAllowDialogs=1 '\001') at source/ui/view/viewfun5.cxx:319
#27 0x000000080ea5d144 in ScViewFunc::PasteFromSystem(unsigned long, unsigned char) (this=this@entry=0x80af4d890, nFormatId=nFormatId@entry=51, bApi=bApi@entry=0 '\000') at source/ui/view/viewfun3.cxx:809
#28 0x000000080ea5b1ee in ScViewFunc::PasteFromSystem() (this=0x80af4d890) at source/ui/view/viewfun3.cxx:657
#29 0x000000080eaf39a9 in ScCellShell::PasteFromClipboard(ScViewData*, ScTabViewShell*, bool) (pViewData=0x80af4d898, pTabViewShell=pTabViewShell@entry=0x80af4d810, bShowDialog=<optimized out>) at source/ui/view/cellsh1.cxx:2184
Comment 9 damjan 2022-12-29 04:46:40 UTC
Even the container classes, like ContentList and its child class EditDoc, are capped to 16-bit indexes and a maximum size of 2^16 elements, as they are svl's "PTRARR" type classes, which impose those limitations.
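
To illustrate why a 16-bit index cannot address anything past element 65535, here is a minimal sketch of the arithmetic only. It is not the PTRARR implementation. The helper name is hypothetical, and it relies on ctypes integer types doing no overflow checking.

```python
import ctypes

MAX_INDEX = 2 ** 16 - 1  # largest value a 16-bit unsigned index can hold


def as_uint16(value: int) -> int:
    """Store value in a 16-bit unsigned integer and read it back."""
    return ctypes.c_uint16(value).value


print(as_uint16(MAX_INDEX))      # 65535
print(as_uint16(MAX_INDEX + 1))  # 0, the index wraps around
```

Any container that keeps its positions in such a type has to refuse or mishandle the 65537th element, whatever the caller asks for.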
Comment 10 damjan 2023-01-03 03:14:26 UTC
This was originally reported as bug 57176, so closing as DUPLICATE. Thank you for your bug report and useful sample file. A fix is in progress.

*** This issue has been marked as a duplicate of issue 57176 ***