Issue 113421

Summary: CSV import could ignore leading spaces if the field content without them is quoted.
Product: Calc
Reporter: guidogam <g.gambardella>
Component: open-import
Assignee: AOO issues mailing list <issues>
Status: CONFIRMED ---
QA Contact:
Severity: Trivial
Priority: P4
CC: issues, oliver.brinzing, ooo
Version: OOo 3.2.1
Target Milestone: ---
Hardware: All
OS: All
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description: The second field cannot be imported, with European settings
Flags: none

Description guidogam 2010-07-26 10:54:43 UTC
A CSV file is made up of fields separated by commas and records separated by 
newlines. If a field contains spaces, *commas*, newlines or quotation marks it 
must be enclosed in quotation marks, and any quotation marks inside it doubled.

That's it. If a quoted field contains commas, spaces or newlines it must be 
imported *as is* in a single cell.
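
The quoting rules described above can be sketched with Python's standard csv module. This is only an illustration: the module is a stand-in chosen for the example, not the code Calc uses, and `write_row` is a hypothetical helper name. Note that with minimal quoting the module quotes fields containing the delimiter, a quotation mark or a line break, but not fields that merely contain spaces.

```python
import csv
import io


def write_row(fields):
    """Serialize one record, quoting only the fields that need it."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(fields)
    return buf.getvalue()


# The comma and the embedded quotes force quoting; the quotes are doubled.
line = write_row(["pi", "3,14159", 'say "hi"'])
print(line, end="")

# Reading the record back yields the original field values unchanged.
print(next(csv.reader([line])))
```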

This problem, absent in version 2, is fairly important in Europe, where we use 
the comma as the decimal separator. Having read about similar issues, I also 
checked (although it makes little sense) that the error shows up with the 
default character set, here "Europa Occidentale (Windows-1252/winLatin1)".

By the way, the import dialog box for CSV files is useless, as the CSV 
delimiters are well known.
Comment 1 guidogam 2010-07-26 10:59:01 UTC
Created attachment 70816 [details]
The second field cannot be imported, with European settings
Comment 2 ooo 2010-07-26 12:12:22 UTC
The behavior is caused by the leading space:  , "3,14159",
If that is changed to  ,"3,14159",  instead it works as expected.

It is not true that if a field contains spaces it must be enclosed in quotation
marks. Actually the import _does_ follow the standard
http://tools.ietf.org/html/rfc4180 that in 2.4. says "Spaces are considered part
of a field and should not be ignored." and hence the field _does not_ start
quoted. In the first place, it is the file content that does not follow the
standard.
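
A minimal sketch of the difference, using Python's standard csv module as a stand-in (an assumption made for illustration; Calc's importer is separate code, and `parse` is a hypothetical helper). With its default settings the module also keeps the leading space as field content, so the quote is no longer the first character and the embedded comma splits the field.

```python
import csv


def parse(line):
    """Parse a single CSV record with the module's default dialect."""
    return next(csv.reader([line]))


# Leading space before the quote: the field does not start quoted,
# so the comma inside the quotes acts as a field separator.
print(parse('1, "3,14159",x'))

# No leading space: the field starts with a quote and is read whole.
print(parse('1,"3,14159",x'))
```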

I agree that for convenience the import could check whether the field content
would be quoted once leading spaces are ignored, and treat it as such. However,
this is of low priority.
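
The convenience rule proposed here could look roughly like the sketch below. It is a hypothetical illustration, not Calc's implementation: `split_record` is an invented name, it handles a single record on one line only (no embedded newlines), and it silently drops anything between a closing quote and the next delimiter.

```python
def split_record(line, delimiter=",", quote='"'):
    """Split one CSV record; a field is treated as quoted if its first
    non-space character is a quote (the lenient rule sketched here)."""
    fields = []
    i, n = 0, len(line)
    while True:
        # Look past leading spaces to see whether a quote follows.
        j = i
        while j < n and line[j] == " ":
            j += 1
        if j < n and line[j] == quote:
            # Quoted field: read to the closing quote, undoubling "" pairs.
            j += 1
            chars = []
            while j < n:
                if line[j] == quote:
                    if j + 1 < n and line[j + 1] == quote:
                        chars.append(quote)
                        j += 2
                        continue
                    j += 1  # closing quote
                    break
                chars.append(line[j])
                j += 1
            # Ignore whatever remains before the next delimiter.
            while j < n and line[j] != delimiter:
                j += 1
            fields.append("".join(chars))
        else:
            # Unquoted field: keep every character, leading spaces included.
            j = i
            while j < n and line[j] != delimiter:
                j += 1
            fields.append(line[i:j])
        if j >= n:
            break
        i = j + 1  # step over the delimiter
    return fields


print(split_record('1, "3,14159",x'))
print(split_record('a, b'))
```

Unlike a blanket "skip initial spaces" option, this keeps leading spaces in unquoted fields, which stays closer to the RFC 4180 statement that spaces are part of a field.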

And no, the import dialog is not useless, as it is used for any type of text file
import: apart from comma separated there are also tab separated, semicolon
separated, fixed field width files and others.
Comment 3 niklas.nebel 2010-07-26 12:46:18 UTC
For now, the example file can be loaded if you select both comma and space as
delimiter, and "Merge delimiters" in the dialog.
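
The effect of this workaround can be approximated as follows. This is a rough sketch under the assumption that selecting comma and space with "Merge delimiters" means any run of commas and spaces counts as one separator while quoted text is still read whole; `TOKEN` and `split_merged` are hypothetical names, and the code is not taken from Calc.

```python
import re

# A token is either a quoted string (with "" as an escaped quote)
# or a run of characters that are neither comma nor space.
TOKEN = re.compile(r'"((?:[^"]|"")*)"|([^, ]+)')


def split_merged(line):
    """Split on runs of commas and spaces, keeping quoted text intact."""
    fields = []
    for m in TOKEN.finditer(line):
        if m.group(1) is not None:
            fields.append(m.group(1).replace('""', '"'))
        else:
            fields.append(m.group(2))
    return fields


print(split_merged('1, "3,14159",x'))
```

The trade-off is visible in the sketch: empty fields collapse and unquoted fields containing spaces are split, so this only suits files where neither occurs.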
Comment 4 guidogam 2010-07-26 13:52:33 UTC
Thanks for letting me know about RFC4180. It's weird that the original spec 
clearly stated that leading and trailing blanks (I should have specified) were 
going to be trimmed, while the RFC states the opposite. But, at least, it's a 
formal spec (although with that RFC name...) and so it's better to stick to it.

But your implementation not only does not follow RFC 793 [8], but violates also 
RFC4180 [5]: "If fields are not enclosed with double quotes, then double quotes 
may not appear inside the fields." It means that the program should have taken 
everything between the quotes and treated it as a single field.

The import dialog for CSV files is useless, since a CSV file is what we are 
talking about. Other formats, that use tabs, semicolon or other field delimiters 
are not CSV.

Thanks for your prompt and informative answer.
Comment 5 ooo 2010-07-26 14:46:48 UTC
@guidogam:
> But your implementation not only does not follow RFC 793 [8],

How should this be related to RFC 793 TCP?

> but violates also RFC4180 [5]: "If fields are not enclosed with double
> quotes, then double quotes may not appear inside the fields." It means
> that the program should have taken everything between the quotes and
> treated it as a single field.

No. It means that this is file content that is not defined by the
standard, as the field is not enclosed with double quotes. Field content
starts right after the comma, and there is a space. In the first place it
was the generator that didn't follow standards, not our import
implementation. Now, as we take the entire content between commas as one
field and do not detect quoted content in this case, we take all
characters up to the next delimiter, including any quotes encountered.
We do so because there are too many implementations that write broken
files that have quotes in non-quoted field content, and even unescaped
quotes in quoted content. This is a long ongoing discussion and I won't
repeat it here. See issue 78926 for details.
Comment 6 guidogam 2010-07-26 15:26:10 UTC
In the "Interoperability considerations:" of RFC4180 there is a quotation of 
RFC793 about being liberal in what you accept.

In fact you are more liberal than what the standard says, since you are able to 
handle the relatively common broken files with single quotation marks.

Thanks for your time.
Comment 7 niklas.nebel 2011-01-03 12:51:36 UTC
*** Issue 116209 has been marked as a duplicate of this issue. ***
Comment 8 Oliver Brinzing 2011-01-03 17:03:45 UTC
.
Comment 9 niklas.nebel 2011-02-11 09:15:59 UTC
*** Issue 116919 has been marked as a duplicate of this issue. ***
Comment 10 schufty 2011-02-11 16:36:40 UTC
The summary for this issue does not match the content.  Having read through the 
comments, I see why my bug (116919) was marked as a duplicate, but when I 
originally searched for the problem I missed this issue because the summary 
doesn't directly mention Calc ignoring text delimiters in the csv import.

I recommend the summary be changed to: "CSV import ignores text delimiters when 
there's a leading space"

Thanks for the prompt response, knowing about the leading space resolves my 
issue.