Issue 78926

Summary: Might apply some magic to CSV import with broken non-escaped quotes in fields
Product: Calc Reporter: aronchi <alessandro.ronchi>
Component: open-importAssignee: AOO issues mailing list <issues>
Status: CONFIRMED --- QA Contact:
Severity: Trivial    
Priority: P3 CC: chris, gbpacheco, issues, kozodaevroman, kyoshida, mh.hh, npardington, vulcain
Version: OOo 2.2   
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: TASK Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description Flags
Input csv file
none
Output file
none
Truncates data without warning (comma field separator, double-quote text delimiter)
none
Inconsistent text delimiter behavior (comma field separator, double-quote text delimiter)
none
Confusing "maximum rows exceeded" error (comma field separator, double-quote text delimiter) none

Description aronchi 2007-06-27 10:20:19 UTC
I import a CSV file (input) with some columns, make some changes in calc
and then export the files as csv again.

In the second file (output) I have a lot of empty columns at the end of the
real columns. it seems calc exports also the emty columns of my table.

Also if I made no changes on the file, when I export the file to csv
again it has a great number of empty column.
Comment 1 aronchi 2007-06-27 10:21:52 UTC
Created attachment 46282 [details]
Input csv file
Comment 2 aronchi 2007-06-27 10:22:42 UTC
Created attachment 46284 [details]
Output file
Comment 3 ooo 2007-06-27 19:35:17 UTC
1. Your input file is broken.
   If a quoted field contains a quote the software creating the file
   escaped the quote using a backslash (\"), which is wrong and may
   result in data fields being misaligned. Contained quotes must be
   escaped by doubling them ("").

2. In case an odd number of wrongly escaped quotes is contained within
   a field it depends on the context how the line is read.
   For example see row 774 where data from line 774 (starting with
   "773";"1";"BIGODINI RISCALDANTI \"BA";...) is followed by data from
   line 774 and following lines until a pairing quote is matched,
   resulting in a long row up to cell IQ774, which upon export results
   in that many columns.

3. However, this special case should not result in a line concatenation
   because the preceding fields up to AI774 were recognized and aligned
   correctly the same as in other lines that contain an even number of
   broken quotes. [hey, good piece of software, isn't it? Your data is
   broken and still it is read almost correctly ;-) except that
   non-doubled quotes are lost in all fields]

4. I take this issue only to investigate what the special case here
   exactly is and maybe find some magic to apply.

5. Remember, your input file is broken.
Comment 4 ooo 2007-07-11 11:59:07 UTC
*** Issue 76874 has been marked as a duplicate of this issue. ***
Comment 5 mhatheoo 2007-07-15 20:58:07 UTC
as the reporter to the issue 76874 - which is closed for being a dublicate of
this issue (see the number:cool stuff!) - I want to find out, whether it is
really the same defect or not

1. The feature that the textdelimiter is constructed from
quote/doublequote+colon/semicolon as a must, is a new invention to OO.o
That way is ISO- or anythingelse standard? Okay, but it is not very professional
to invent something against the data, that can be found in the real world, one
has to deal with that.

2. however, that feature is misbehaving anyway.
In case of broken import, OO.o should detect the error for the very
record/import that is incomplete, and continue after that.
But it seem OO.o has not any error-routine for CSV-import. 

hallo er:
right now the main problem is the import-function, and that is defect - I hope
you will deal it with not as an academic but as a real problem.


Martin
Comment 6 ooo 2007-07-16 11:06:51 UTC
> 1. The feature that the textdelimiter is constructed from
> quote/doublequote+colon/semicolon as a must, is a new invention to OO.o
> That way is ISO- or anythingelse standard? Okay, but it is not very professional
> to invent something against the data, that can be found in the real world, one
> has to deal with that.

This is simply not true. The behavior regarding quotes is not an
invention of OOo, nor is a quote a must if the delimiter or a quote is
not present in the field content. However, an embedded quote _must_ be
escaped by doubling it. Again, this is not an invention of OOo, this is
common CSV handling, and no it's not ISO or any standard, just as CSV is
not standardized at all. But there is an RFC 4180 now. You may want to
read more about the CSV format at
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
http://tools.ietf.org/html/rfc4180
Comment 7 mhatheoo 2007-07-17 20:27:40 UTC
well er, this is a sort of discussion ..
and maybe it is all sort of a misunderstanding.

I just learnd, that the CSV-import for my "wrong-build" file is working.
The trick/the solution:
delete the default text-delimiter from the list-box.  

In that case the field-delimiter is the ;-semicolon only and all records will be
imported. The "-doublequote will not be detected (as expected !) and can be
searched and deleted later.

So the misunderstanding is, that the behavior changed and that former existing
"no Text-delimiter" from the listbox has to be removed manually.

Regarding the "start-stop"-function of the text-delimiter ' or " you are right.
But that was not the problem. The problem I see is, that the error-handling is
not functioning in a proper manner. OO.o should not stop import csv-lines when
it sees a logical error (colum-length, delimiter-error) until the file is
finished - and than return a error-message. This would help with this issue.

to solve my issue 76874 a selection "no delimiter" on the import-dialog would
had helped the most.

Martin


 
Comment 8 frank 2007-08-07 13:43:08 UTC
*** Issue 80385 has been marked as a duplicate of this issue. ***
Comment 9 ooo 2007-08-08 16:56:00 UTC
*** Issue 80385 has been marked as a duplicate of this issue. ***
Comment 10 ooo 2007-08-08 17:06:17 UTC
Another test case from issue 80385:
,"abc" d "ef",
currently results in
'abc d "ef"'
To not lose data it should result in
'abc" d "ef'

Doing so would also lead to
,"a"b, "a",
resulting in _one_ field
'a"b, "a'
and not two, 'ab' and ' "a"' like it is currently the case. This would then
differ from how Excel treats it, but would be more consistent.

See also test case documents attached to issue 80385:
http://www.openoffice.org/nonav/issues/showattachment.cgi/47340/test1.csv
http://www.openoffice.org/nonav/issues/showattachment.cgi/47341/test2.csv
http://www.openoffice.org/nonav/issues/showattachment.cgi/47358/test3.csv
http://www.openoffice.org/nonav/issues/showattachment.cgi/47359/test4.csv
Comment 11 discoleo 2007-08-08 20:50:51 UTC
How should this be handled:
...,"abc" d "ef",...
IS the FIELD actually:
 1. 'abc" d "ef' AND the application forgot to double the QUOTES
    (valid assumption)

I believe, this is indeed the best solution. Only quotes surrounding the full
string (i.e. preceded or followed by the FIELD-Delimiter) should be viewed as
ESCAPING-Quotes, everything else should be considered as an ordinary string.

[Matching all quotes, as discussed in issue 80385 is another consistent way to
recover these broken csv-files.]
Comment 12 frank 2007-08-15 08:23:46 UTC
*** Issue 80683 has been marked as a duplicate of this issue. ***
Comment 13 mhatheoo 2007-09-24 00:59:56 UTC
@ er

I treat dataloss as not acceptable in no way

even if "," is rather outdated, you should trust, programs which formed that did
know well, how to deal with ' and "
If you find new files, it must not be that the new programs work with or even
ment to use that behaviour. You need to understand that.

one of the ' or " can be the starting/ending delimiter, the other one can be in
those braketts in an uncounted number. that is correct. 

in the example , colon is th only delimiter:
,"abc" d "ef",
so for that the result needs to be:
,'"abc" d "ef"',
or the delimiter should be substituted be something more reliable like
;"abc" d "ef";

Martin
Comment 14 frank 2008-07-21 11:30:30 UTC
*** Issue 91825 has been marked as a duplicate of this issue. ***
Comment 15 niklas.nebel 2009-08-04 17:49:49 UTC
*** Issue 103939 has been marked as a duplicate of this issue. ***
Comment 16 niklas.nebel 2009-08-04 17:51:16 UTC
Continuing the discussion from issue 103939: Actually implementing something
like the description of the bEmbeddedLineBreak parameter to ReadCsvLine might
already be enough. The current implementation is simpler than described.
Comment 17 ooo 2009-10-02 15:28:04 UTC
*** Issue 84048 has been marked as a duplicate of this issue. ***
Comment 18 ooo 2010-01-07 15:52:27 UTC
*** Issue 107867 has been marked as a duplicate of this issue. ***
Comment 19 christopher_schultz 2010-11-30 19:54:36 UTC
I'm observing a bunch of things that I was able to isolate into some sample
files, which I'll attach.

I'm running OOo 3.2.1 (OOO320m18 Build:9502) on Microsoft Windows 7.

I'm using comma (,) as my field separator and double-quote (") as my text delimiter.

Here is a summary of the things I discovered:

1. Using an unescaped text delimiter within a field causes the CSV parser to
  a) include the " in the field
  b) start searching for second unpaired " later in the file
  c) include the closing " in the field
2. #1 only occurs if a field before it on the same line was quoted
3. After #1 occurs, it's possible OOo will attempt to read /lots/ of data into a
single field which results in the error "the maximum number of rows has been
exceeded"
Comment 20 christopher_schultz 2010-11-30 19:56:14 UTC
Created attachment 75189 [details]
Truncates data without warning (comma field separator, double-quote text delimiter)
Comment 21 christopher_schultz 2010-11-30 19:56:51 UTC
Created attachment 75191 [details]
Inconsistent text delimiter behavior (comma field separator, double-quote text delimiter)
Comment 22 christopher_schultz 2010-11-30 19:57:42 UTC
Created attachment 75192 [details]
Confusing "maximum rows exceeded" error  (comma field separator, double-quote text delimiter)
Comment 23 ooo 2011-03-31 14:56:16 UTC
*** Issue 117612 has been marked as a duplicate of this issue. ***
Comment 24 Jay Hannah 2012-01-16 04:12:47 UTC
Comment on attachment 75192 [details]
Confusing "maximum rows exceeded" error  (comma field separator, double-quote text delimiter)

Attachment 75192 [details] is exactly the bogus error message I just expended lots of energy  tracking down. An error message pointing out "unbalanced single/double quote on line 858" would have saved me hours of troubleshooting. Since my file was tab-delimited, unbalanced quote somewhere in the file had not occurred to me. Thank you for your time.
Comment 25 vulcain 2013-01-28 09:47:56 UTC
Resolved in LibreOffice 3.6:
https://bugs.freedesktop.org/show_bug.cgi?id=48621
Comment 26 Christopher Schultz 2013-01-28 14:40:13 UTC
(In reply to comment #25)
> Resolved in LibreOffice 3.6:
> https://bugs.freedesktop.org/show_bug.cgi?id=48621

Confirmed fixed in 3.6.4.2 on Mac OS X:
- Attachment #75189 [details]
- Attachment #75191 [details]
- Attachment #75192 [details]

I still get an error loading the original "input" (attachment #46282 [details]): "The data could not be loaded completely because the maximum number of characters per cell was exceeded".
Comment 27 Christopher Schultz 2013-01-28 14:44:34 UTC
(In reply to comment #26)
> Confirmed fixed in 3.6.4.2 on Mac OS X:

Sorry, I meant to say "fixed in LibreOffice.org 3.6.4.2 on Mac OS X". I have not yet tested OOo.
Comment 28 Marcus 2017-05-20 11:13:13 UTC
Reset assigne to the default "issues@openoffice.apache.org".