Issue 80385 - CSV import of quoted data gives unpredictable results
Summary: CSV import of quoted data gives unpredictable results
Status: CLOSED DUPLICATE of issue 78926
Alias: None
Product: Calc
Classification: Application
Component: open-import (show other issues)
Version: OOo 2.2.1
Hardware: PC Windows XP
: P3 Trivial (vote)
Target Milestone: ---
Assignee: spreadsheet
QA Contact: issues@sc
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-08-06 16:27 UTC by nirendram
Modified: 2007-08-08 18:50 UTC (History)
1 user (show)

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
First test case (203 bytes, text/plain)
2007-08-06 16:28 UTC, nirendram
no flags Details
Second test case (202 bytes, text/plain)
2007-08-06 16:29 UTC, nirendram
no flags Details
Excel - first test case (33.88 KB, image/png)
2007-08-06 16:30 UTC, nirendram
no flags Details
Excel - second test case (33.37 KB, image/png)
2007-08-06 16:31 UTC, nirendram
no flags Details
Calc - first test case (29.28 KB, image/png)
2007-08-06 16:31 UTC, nirendram
no flags Details
Calc - second test case (31.83 KB, image/png)
2007-08-06 16:32 UTC, nirendram
no flags Details
Third test case - valid file (204 bytes, text/plain)
2007-08-07 14:15 UTC, nirendram
no flags Details
Fourth test case - valid file (203 bytes, text/plain)
2007-08-07 14:16 UTC, nirendram
no flags Details
Excel - third test case (32.48 KB, image/png)
2007-08-07 14:17 UTC, nirendram
no flags Details
Excel - fourth test case (32.48 KB, image/png)
2007-08-07 14:18 UTC, nirendram
no flags Details
Calc - third test case (27.01 KB, image/png)
2007-08-07 14:19 UTC, nirendram
no flags Details
Calc - fourth test case (29.77 KB, image/png)
2007-08-07 14:20 UTC, nirendram
no flags Details

Note You need to log in before you can comment on or make changes to this issue.
Description nirendram 2007-08-06 16:27:54 UTC
I was trying to figure out how Excel and Calc handle quoted strings in CSV
files, so I created a file with some nice confusing data. This loaded almost
perfectly in Excel (the exception being the last line of the file), but produced
funny results in Calc, to say the least.

I have two test cases, and the relevant screen shots of Excel and Calc. The only
difference between the two test cases is the extra quote at the end of the first
line.
Comment 1 nirendram 2007-08-06 16:28:56 UTC
Created attachment 47340 [details]
First test case
Comment 2 nirendram 2007-08-06 16:29:40 UTC
Created attachment 47341 [details]
Second test case
Comment 3 nirendram 2007-08-06 16:30:29 UTC
Created attachment 47342 [details]
Excel - first test case
Comment 4 nirendram 2007-08-06 16:31:13 UTC
Created attachment 47343 [details]
Excel - second test case
Comment 5 nirendram 2007-08-06 16:31:37 UTC
Created attachment 47344 [details]
Calc - first test case
Comment 6 nirendram 2007-08-06 16:32:06 UTC
Created attachment 47345 [details]
Calc - second test case
Comment 7 discoleo 2007-08-06 21:43:11 UTC
Those are most probably broken csv-files. Therefore this is a duplicate of issue
78926.

Please note, that a *line break* (CRLF) is allowed inside a field and this is
accomplished by including the break within double quotes, e.g.:
"this is a single field
that continues on the 2nd line"

Therefore, IF a quote is NOT closed, the field will continue with the next line.
This is valid behaviour! (And IF changed, one would NOT be able to include line
breaks inside fields!)

See also http://tools.ietf.org/html/rfc4180 for further details.
Comment 8 nirendram 2007-08-07 08:50:46 UTC
Excuse me? Did you even look at anything that I attached?

The files I attached both have the following structure:
Line 1: 13 fields
Line 2:  3 fields
Line 3:  3 fields, 2 of which have multiple lines
Line 4:  1 field
Line 5:  2 fields
Line 6:  2 fields
Line 7:  1 field with multiple lines, no ending quote

The only thing funny about my input files is that all the lines are not the same
length. However, Calc attempts to read it in and does not give an error or
warning message, so I assume that this is allowed.

Now let me describe the field format when Calc reads in the first file:
Line 1: 13 fields
Line 2:  3 fields
Line 3:  3 fields
Line 4:  1 field
Line 5:  *3 fields*
The 3 fields in line 5 are made up of the contents of the rest of the file. The
first field is fine. The second field contains the last field of line 5, and the
first field of line 6. The third field contains the last field of line 6, and
the field in line 7. Is this right?

When Calc reads in the second file (which is exactly the same apart from a quote
at the end of the last field in line 1, which should not make a difference), it
suddenly comes out with only *4* lines, because the last field of the first line
has been combined with the first field of the second line, and the other fields
in the second line have been appended to the first line!

I know I am stating in words what my screenshots are supposed to describe, but
at least now one can see my problem without looking at the screenshots or
attachments which I specifically created to describe the problem.
Comment 9 frank 2007-08-07 13:43:08 UTC
Hi,

discoleo is right. This one is a double to Issue 78926 .

For now a starting quote or double quote has to have an ending quote. All
between these quotes or better text delimiters goes to one cell. Mixing up these
quotes will result in bad imported cells, but this is not a bug of OOo but the
creator of such a file is to blame for it.

Frank

*** This issue has been marked as a duplicate of 78926 ***
Comment 10 frank 2007-08-07 13:43:49 UTC
closed double but I tend to close it as invalid.
Comment 11 nirendram 2007-08-07 14:14:04 UTC
Dear friends discoleo and fst,

It still appears that nobody has looked at my data, or even read my nice problem
description. Instead you seem to have become fixated on the idea that I have
invalid data in my file. Well, I will accept that the last line of data is
incorrect, as it should have a closing quote. So I decided to add a closing
quote to the last line of data and try again.

Wow! Now Calc has silently dropped the last three lines of data in my CSV file,
although the lines are visible in the text import dialog. Is this still correct
behaviour?
Comment 12 nirendram 2007-08-07 14:15:34 UTC
Created attachment 47358 [details]
Third test case - valid file
Comment 13 nirendram 2007-08-07 14:16:14 UTC
Created attachment 47359 [details]
Fourth test case - valid file
Comment 14 nirendram 2007-08-07 14:17:17 UTC
Created attachment 47360 [details]
Excel - third test case
Comment 15 nirendram 2007-08-07 14:18:58 UTC
Created attachment 47361 [details]
Excel - fourth test case
Comment 16 nirendram 2007-08-07 14:19:55 UTC
Created attachment 47362 [details]
Calc - third test case
Comment 17 nirendram 2007-08-07 14:20:45 UTC
Created attachment 47363 [details]
Calc - fourth test case
Comment 18 discoleo 2007-08-07 19:24:52 UTC
Please note that all your files are broken csv-files. There are a set of simple
rules in csv-files:

1. text-containing quotes shall be enclosed in quotes
   and
   the text-quote shall be doubled
   e.g.  this "-quote must be doubled
=> csv: "this ""-quote must be doubled"

2. the whole field shall be enclosed in quotes
=> therefore "a"b is undefined
   I would interpret it as enclosing quotes => ab
  [OOo differs here => "a"b]

3. breaks inside a quoted field are considered as part of the field

Therefore, Excel is really broken, but your file is the primary culprit:

1. |test| , |->"a"<-b| , |->"a"<-| , |d->"a"<-c| , |m| , |->"j"<- | ,
   |d| , |b->"A"<-| , |D->"E,"<-f| , |1->","<-a| , |b->","<-de->"b,a"<-|
2. |->"abcdef"<- test ->"abc"<-| , |->"def"<-g->"h"<-| , |def->"gh"<-|
3. |->"this is
    a test"<-| , |->"yes
    it
    works"<-| ,
4. |No it doesn't|
5. |->"yes it,"","<-does->","<-no->"
    """"a,cd""e
    "<-a->""<-???BIG PROBLEM UDEFINED???<-b->"""???<-NO ENDING QUOTE
   |<== ??? DOES THE FIELD REALLY END

So, understood?

[OOo has some quirks, too, but it is really the undefined behaviour of your
broken file that is the problem!]
Comment 19 discoleo 2007-08-07 19:31:46 UTC
Seems that OOo does NOT import last quoted field, IF the quote is NOT closed.

Eventually, we should rename this issue to reflect that fact. And that should be
fixed. [Still, the csv is badly broken.]
Comment 20 nirendram 2007-08-08 12:14:36 UTC
Sir,

Here is what I have observed regarding string fields in CSV files in Calc (and
Excel, for that matter):
- A field is considered a delimited string ONLY IF there is a quote
  IMMEDIATELY after a field separator.
- Leading and trailing space in a field IS SIGNIFICANT. This means
  that if a field begins with a space and then a quote, it is not
  interpreted as a string, but is stored as is.
- A delimited string may continue with other data (that may contain
  quotes) until the end of a field is reached. This subsequent data
  is stored AS IS, i.e. no quote translation.
- As a result of the preceding point, ONLY THE FIRST PROPERLY QUOTED
  STRING IS RECOGNISED AND TRANSLATED.

According to the rules I described above, here is how the fields in test case 3
would have been read in.

1. |test| , |->"a"<-b| , | "a"| , | d"a"c | , | m | , | "j" | , |d| ,
   |b"A"| , |D"E| , |->"f,1"<-| , |->"a,b"<-| , |->"de"<-b| , |a"|
2. |->"abcdef"<- test "abc"| , |->"def"<-g"h"| , |def"gh"|
3. |->"this is
    a test"<-| , |->"yes
    it
    works"<-| , ||
4. |No it doesn't|
5. |->"yes it,"","<-does"| , |->"no"<-|
6. |->""""<-a| , |cd""e|
7. |->"a""b"""<-|

This is consistent with how Excel reads in the file.

Whilst I appreciate your attempt at explaining how Calc reads in the file, I
must disagree with it because you interpret strings wherever they occur in a
field, whereas Calc interprets the first 4 lines of test case 1 in the exact
same way as I described above. Therefore, I assumed its behaviour would be
consistent throughout the file.

And in response to your statements that my CSV file is "broken", it may be that
it is so, but let me quote some of what is written under "Interoperability
considerations" in RFC 4180:

     "Due to lack of a single specification, there are considerable
      differences among implementations.  Implementors should "be
      conservative in what you do, be liberal in what you accept from
      others" (RFC 793 [8]) when processing CSV files."
Comment 21 discoleo 2007-08-08 13:51:57 UTC
From RFC4180:

6.  Fields containing line breaks (CRLF), *double quotes*, and commas
    should be enclosed in double-quotes.

7.  If double-quotes are used to enclose fields, then a double-quote
    appearing inside a field must be escaped by preceding it with
    another double quote.

Therefore, you CAN'T MIX QUOTES as TEXT-Delimiters AND as unescaped QUOTES, like in:
   ...,"this is enclosed in quotes" BUT "THIS IS NOT",...

A. So, applying these rules, something like: d"a"c, or "abc"de"f" is ILLEGAL and
undefined. [When selecting '"' as a TEXT-Delimiter.]

B. IF the user selects TO USE '"' AS THE TEXT DELIMITER, then he acknowledges
that every occurrence of '"' that is NOT escaped is actually a FIELD-Quote.
Otherwise, he (and the application saving this) would have escaped it properly.

Therefore, IF you really want:
   d"a"c
  then code it properly:
   "d""a""c"

Otherwise, TREAT ALL QUOTES AS STRINGS. You have the option to select this in
the import dialog.


My handling of the error
========================
IF such a broken csv-file is encountered, and the user insists on '"' as the
TEXT-Delimiter, then:
   EVERY occurrence of '"' that is NOT escaped SHALL be considered
      as a TEXT-Delimiter;
   the USER opted for this, so you can't just consider some of them are
      TEXT-delimiters, some are NOT - this is INCONSISTENT!

Therefore,
   d"a"c becomes really
   dac
Similarly,
   "abcdef"test"abc" becomes
   abcdeftestabc

This is consistent!

Else, how should this be handled:
 ,"abc" d "ef",
IS the FIELD actually:
 1. 'abc" d "ef' AND the application forgot to double the QUOTES
    (valid assumption)
 2. 'abc d ef' - my view - CONSISTENT HANDLING
 3. 'abc d "ef"' - your view - the least consistent of these 4 methods
 4. "abc" d "ef" - NO ESCAPING QUOTES at all, application does NOT use them
    (valid assumption, but user has the freedom to select NOT to use them)
Comment 22 nirendram 2007-08-08 14:27:41 UTC
Sir,

Let us be clear on one thing. I am discussing the existing behaviour of Calc.
What I want is not an enhancement or new way to handle CSV files, I merely want
the handling to be consistent, and to reflect what I have in the file reasonably
accurately. I discovered an inconsistency in the current behaviour, and reported it.

The examples and screenshots I have given are not wishful thinking on my part as
to how CSV files should be read in. Calc actually does read these files in in
this way, up to a point. Therefore my observations about quoted string behaviour
(only the first string gets converted) is ACTUALLY WHAT HAPPENS, and not
something I made up in my head.

> Else, how should this be handled:
>  ,"abc" d "ef",
> IS the FIELD actually:
>  1. 'abc" d "ef' AND the application forgot to double the QUOTES
>     (valid assumption)
>  2. 'abc d ef' - my view - CONSISTENT HANDLING
>  3. 'abc d "ef"' - your view - the least consistent of these 4 methods

This is not my view, it is actually what happens in Calc and Excel. If you don't
believe me, please try it yourself.
Comment 23 ooo 2007-08-08 16:56:01 UTC
> Let us be clear on one thing. I am discussing the existing behaviour of Calc.
> What I want is not an enhancement or new way to handle CSV files, I merely want
> the handling to be consistent, and to reflect what I have in the file reasonably
> accurately.

Isn't that somewhat contradicting? You have broken data in your file and
want it reflected reasonably accurately? Yes, we can add some more magic
to try to guess broken data correctly. Yes, we will do eventually, this
is what issue 78926 is about, so this issue here in fact is a duplicate.

> > Else, how should this be handled:
> >  ,"abc" d "ef",
> > IS the FIELD actually:
> >  1. 'abc" d "ef' AND the application forgot to double the QUOTES
> >     (valid assumption)

This is how it should be treated, IMHO. Doing so would also lead to
,"a"b, "a",
resulting in _one_ field
'a"b, "a'
and not two, 'ab' and ' "a"' like it is currently the case. I could bet
sooner or later someone will show up and complain that we don't do it
like Excel ...

> >  2. 'abc d ef' - my view - CONSISTENT HANDLING

Consistent, but also loses information.

> >  3. 'abc d "ef"' - your view - the least consistent of these 4 methods
> 
> This is not my view, it is actually what happens in Calc and Excel. If you don't
> believe me, please try it yourself.

Yes, that's how this case of broken data that doesn't follow the escaped
quotes rule currently may be treated.


*** This issue has been marked as a duplicate of 78926 ***
Comment 24 ooo 2007-08-08 17:06:56 UTC
Closing dup.
Comment 25 nirendram 2007-08-08 18:50:15 UTC
Friend er,

> Isn't that somewhat contradicting? You have broken data in your file and
> want it reflected reasonably accurately? Yes, we can add some more magic
> to try to guess broken data correctly. Yes, we will do eventually, this
> is what issue 78926 is about, so this issue here in fact is a duplicate.

I do not ask for any magic or guessing of data. All I ask is for the treatment
of data to be consistent. I thought I knew what Calc was doing, and that it
would do the same or similar thing as Excel, but it looks like I was wrong there.

To be honest with you, I was writing a C routine to read in CSV files, and I
wanted to find out how they were handled in production applications, hence my
creation of the "broken" CSV files. Excel appears to handle "broken" data rather
gracefully. Calc... rather less gracefully, losing some data in the process.
Surely it is important not to lose data? Surely it is important to apply rules
consistently? 

I guess I might as well give up. At the very least, could you put something in
the documentation somewhere that specifies exactly how Calc reads in CSV files?
At least then, people like me wouldn't be unpleasantly surprised.