Issue 126805 - Importing CSV where apostrophe is separator and contained in text causes remaining rows to not import
Summary: Importing CSV where apostrophe is separator and contained in text causes rema...
Status: RESOLVED FIXED
Alias: None
Product: Calc
Classification: Application
Component: open-import (show other issues)
Version: 4.1.2
Hardware: All All
: P5 (lowest) Normal (vote)
Target Milestone: 4.2.0
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
: 123831 (view as issue list)
Depends on:
Blocks:
 
Reported: 2016-01-22 22:52 UTC by Tom Hall
Modified: 2016-05-08 16:44 UTC (History)
4 users (show)

See Also:
Issue Type: DEFECT
Latest Confirmation in: 4.2.0-dev
Developer Difficulty: ---


Attachments
CSV file that can not be completely imported. (26.31 KB, text/plain)
2016-01-22 22:52 UTC, Tom Hall
no flags Details

Note You need to log in before you can comment on or make changes to this issue.
Description Tom Hall 2016-01-22 22:52:39 UTC
Created attachment 85264 [details]
CSV file that can not be completely imported.

When importing a csv file with an apostrophe as text and numeric delimiters and one of the text fields has an apostrophe in it, OpenOffice will stop importing the remainder of the file when the Separator Options are made to eliminate the apostrophes around the text and numeric values. Attached is a CSV file that can not be imported past row 124. This file imports correctly into Microsoft Excel.

When the apostrophes are not showing in the preview window on the bottom of the Text Import window, scroll down to row 124 then scroll to the right. You will see the remainder of the rows are continuing across on row 124.
Comment 1 mroe 2016-01-23 11:51:33 UTC
Can confirm the issue with the attached sample file. (row 124)

Same problem if the CSV is used as Base datasource.
Comment 2 bmarcelly 2016-01-30 13:43:57 UTC
If apostrophe is used as text delimiter, a single apostrophe must not appear within a text field. This is a matter of logic.

Within a text any apostrophe must be doubled.

Line 124 should be :
'ARLINGTON IN','ARLINGTON INT''L AVIATION','150227','12/18/2015','470','35','02/02/2016'

With this modification the entire file is imported correctly.

See also https://tools.ietf.org/html/rfc4180 section 2 point 7 :
  If double-quotes are used to enclose fields, then a double-quote
  appearing inside a field must be escaped by preceding it with
  another double quote.  For example:
       "aaa","b""bb","ccc"
Comment 3 damjan 2016-04-03 18:14:28 UTC
Excel 2007 imports the fields surrounded with '. Apache Commons CSV can also parse all the lines in this format. AOO parses this too as long as the "Text delimiter" is not ', although it strips the leading ' which seems like a separate bug.

What is very impressive is Excel's behaviour when ' is replaced with ". It correctly imports all lines, with the extraneous " inside the text removed. I have no idea how it does this - it must use advanced column and field auto-detection heuristics or something. Neither AOO nor Apache Commons CSV are able to get past line 124.
Comment 4 damjan 2016-04-18 17:36:20 UTC
I didn't intend to fix this bug with the patch in r1739628, but I did. The problem was that the CSV line parser was trying to read and concatenate lines until it gets an even number of quote characters, which never happened. It now parses lines using the same logic as the CSV field parsers. The behaviour also matches Excel's.

Resolving fixed. Thank you for your bug report!
Comment 5 damjan 2016-04-19 01:38:55 UTC
*** Issue 123831 has been marked as a duplicate of this issue. ***
Comment 6 Kay 2016-04-27 21:10:40 UTC
Results of my test with Linux-32 r. 1739631

The file imports completely. However line 124, second field from the original file as:

'ARLINGTON INT'L AVIATION'

gets imported as:

ARLINGTON INTL AVIATION'

when I use the following as CSV import specifications:

* text separator -- comma
* text deliminitor -- ' (single quote)

So the internal quote in INT'L gets dropped.
Comment 7 damjan 2016-05-07 08:27:38 UTC
(In reply to Kay from comment #6)
> Results of my test with Linux-32 r. 1739631
> 
> The file imports completely. However line 124, second field from the
> original file as:
> 
> 'ARLINGTON INT'L AVIATION'
> 
> gets imported as:
> 
> ARLINGTON INTL AVIATION'
> 
> when I use the following as CSV import specifications:
> 
> * text separator -- comma
> * text deliminitor -- ' (single quote)
> 
> So the internal quote in INT'L gets dropped.

Which is exactly what Excel does in its "Text to columns" with that line, and which is how it generally works: fields starting with the quote character continue until the first quote character without an adjacent quote character, and then any subsequent text is appended as it is, without considering quoting at all.
Comment 8 Kay 2016-05-08 16:44:14 UTC
(In reply to damjan from comment #7)
> (In reply to Kay from comment #6)
> > Results of my test with Linux-32 r. 1739631
> > 
> > The file imports completely. However line 124, second field from the
> > original file as:
> > 
> > 'ARLINGTON INT'L AVIATION'
> > 
> > gets imported as:
> > 
> > ARLINGTON INTL AVIATION'
> > 
> > when I use the following as CSV import specifications:
> > 
> > * text separator -- comma
> > * text deliminitor -- ' (single quote)
> > 
> > So the internal quote in INT'L gets dropped.
> 
> Which is exactly what Excel does in its "Text to columns" with that line,
> and which is how it generally works: fields starting with the quote
> character continue until the first quote character without an adjacent quote
> character, and then any subsequent text is appended as it is, without
> considering quoting at all.

Ah. OK. I haven't used Excel in many many years.