Issue 107867 - Plain CSV import erroneous because of incorrect row termination
Summary: Plain CSV import erroneous because of incorrect row termination
Status: CLOSED DUPLICATE of issue 78926
Alias: None
Product: Calc
Classification: Application
Component: open-import
Version: OOo 3.1.1
Hardware: PC Windows XP
Importance: P3 Trivial
Target Milestone: ---
Assignee: spreadsheet
QA Contact: issues@sc
URL:
Keywords: oooqa
Depends on:
Blocks:
 
Reported: 2009-12-24 03:27 UTC by tim_c
Modified: 2010-01-07 15:53 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
zip of csv file (211.24 KB, application/octet-stream)
2009-12-24 13:20 UTC, tim_c

Description tim_c 2009-12-24 03:27:05 UTC
A commonly reported problem is strange CSV import trouble that elicits a "too 
many rows to import" error when the number of rows is far short of the actual 
limit.

I created a large CSV file, very simple straight CSV, comma separated, any text 
field " delimited and with a " quoted header line. EOL is crlf, same if I use 
just lf. No strangeness, no character encoding, old fashioned upper case plain 
ascii.
Tried various import character encodings, no effect.
Exclude header row, no effect.

I have discovered a major clue on what is going on. I did a CSV import, it read 
maybe 5,000 out of 7,500 rows. I then exported the CSV to try and discover what 
it thought CSV ought to look like.

On looking with a binary capable editor I discovered it thought there was a 
massive number of **columns**. The exported data is fine, but appended to each 
row are many ,,,,,,,,,,,,,,,,,,,,,,,,,,,,
On looking I then saw it had adjusted many column widths on CSV import.

As a hint the import CSV was 675k, mostly imported. Exporting that to a new CSV 
produced a 6.92M file, mostly commas. 

Gnumeric imports the original CSV without bothering with import dialog. Various 
other software imports without problem. Even if it is bad CSV it ought to be 
handled gracefully.

If a developer wants a test file, please email me. (don't want the file public)
Comment 1 Rainer Bielefeld 2009-12-24 07:47:00 UTC
@tim_c:
Please attach sample documents, you can also send to me a document by personal
mail, if it's too big to be attached!
Comment 2 tim_c 2009-12-24 13:20:09 UTC
Created attachment 66794
zip of csv file
Comment 3 tim_c 2009-12-24 13:31:34 UTC
Test CSV file sent. 

7281 rows including header row
13 columns, he says superstitiously

If this does not induce the problem I expect I can produce variants because it 
was programmatically created here. 
Comment 4 Rainer Bielefeld 2009-12-24 15:58:39 UTC
Reproducible with sample document and "Ooo 3.2.0 RC1 WIN XP DE-multilingual
version German UI activated [OOo320m8 (Build 9472)]", also with "2.4.1 
Multilingual version English UI WIN XP: [680m17(Build9310)]"!

I checked the sample document and I believe the message is caused by Issue 75199.

When I scroll down in the open csv dialog, I see a problem in row 1347, where
column M seems to contain an incorrect line feed ('<cntrl>+<enter>'?, which
repeats several times), so that here we have an incorrect row termination and
this might be only a "damaged document" problem. 

But: 1.1.4 imports the sample document without problems.

Looks like Issue 834, but that one should have been fixed?
Comment 5 tim_c 2009-12-24 17:59:51 UTC
Not line ends but looks like you have found a problem line, and this raises 
some CSV parser issues.

Row 1347 shows a normal cr/lf pair. Previously tried single lf version of file, 
same effect. (It is actually being created by Lua f:write(blah, '\n').)

What row 1347 does contain is a quoted field whose text includes periods and a 
quote: ,"CMS"VICE.DO.M",   So this spins off into the problem of defining what 
exactly CSV is, including whether and how escapes are done.

The original fixed field width data being translated into CSV contain both ' 
and " within text fields. I was unaware of the " within text fields, sorry.

The solution for OO is likely to be firming up the design of the CSV parser.
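
For illustration only, here is a minimal Python sketch of the usual escaping 
convention (a " inside a quoted field is written as ""). It uses Python's 
standard csv module, not Calc's importer, and says nothing about how Calc 
itself parses. The helper names write_row and read_strict and the values 
around the field from row 1347 are made up for the example.

```python
import csv
import io


def write_row(fields):
    """Serialize one row, quoting every field and doubling embedded quotes."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerow(fields)
    return buf.getvalue()


def read_strict(text):
    """Parse CSV text; strict=True raises csv.Error on malformed quoting."""
    return list(csv.reader(io.StringIO(text, newline=""), strict=True))


# Hypothetical row built around the field quoted in this comment.
row = ["1347", 'CMS"VICE.DO.M', "X"]

escaped = write_row(row)            # embedded quote is doubled on output
round_trip = read_strict(escaped)   # and restored on input

# The same row with the quote left unescaped, as in the attached file.
unescaped = '"1347","CMS"VICE.DO.M","X"\r\n'
try:
    read_strict(unescaped)
    rejected = False
except csv.Error:
    rejected = True
```

A generator that doubles the embedded " this way should produce a file that 
any parser following the same convention reads back unchanged.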

Comment 6 tim_c 2009-12-24 18:22:15 UTC
Confirmed the data is the cause by replacing each embedded " with ' within the 
" quoted CSV fields.

Import then works correctly.
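
A rough way to locate rows like 1347 without scrolling through the import 
dialog is to run each physical line through a strict parser. This is only a 
sketch using Python's csv module; find_bad_lines is a hypothetical helper, the 
sample lines are invented, and it assumes every record sits on exactly one 
line (true for this file), so a legitimately multi-line quoted field would 
also be flagged.

```python
import csv


def find_bad_lines(lines):
    """Return (line_number, message) for each line a strict parser rejects.

    Assumes one record per physical line; each line is parsed on its own.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            next(csv.reader([line], strict=True))
        except csv.Error as exc:
            bad.append((number, str(exc)))
    return bad


# Invented sample: only the third line has an unescaped quote.
sample = [
    '"ID","NAME","CODE"\r\n',
    '"1","ALPHA","OK"\r\n',
    '"2","CMS"VICE.DO.M","X"\r\n',
]

problems = find_bad_lines(sample)
for number, message in problems:
    print(f"line {number}: {message}")
```

For a real file, passing an object opened with open(path, newline="") in 
place of the sample list should work the same way, since iterating a text 
file yields one line at a time.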
Comment 7 ooo 2010-01-07 15:52:48 UTC
Data contains unescaped " delimiters.

*** This issue has been marked as a duplicate of 78926 ***
Comment 8 ooo 2010-01-07 15:53:28 UTC
Closing dup.