Apache OpenOffice (AOO) Bugzilla – Issue 126805
Importing CSV where apostrophe is separator and contained in text causes remaining rows to not import
Last modified: 2016-05-08 16:44:14 UTC
Created attachment 85264 [details] CSV file that can not be completely imported. When importing a csv file with an apostrophe as text and numeric delimiters and one of the text fields has an apostrophe in it, OpenOffice will stop importing the remainder of the file when the Separator Options are made to eliminate the apostrophes around the text and numeric values. Attached is a CSV file that can not be imported past row 124. This file imports correctly into Microsoft Excel. When the apostrophes are not showing in the preview window on the bottom of the Text Import window, scroll down to row 124 then scroll to the right. You will see the remainder of the rows are continuing across on row 124.
Can confirm the issue with the attached sample file. (row 124) Same problem if the CSV is used as Base datasource.
If apostrophe is used as text delimiter, a single apostrophe must not appear within a text field. This is a matter of logic. Within a text any apostrophe must be doubled. Line 124 should be : 'ARLINGTON IN','ARLINGTON INT''L AVIATION','150227','12/18/2015','470','35','02/02/2016' With this modification the entire file is imported correctly. See also https://tools.ietf.org/html/rfc4180 section 2 point 7 : If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example: "aaa","b""bb","ccc"
Excel 2007 imports the fields surrounded with '. Apache Commons CSV can also parse all the lines in this format. AOO parses this too as long as the "Text delimiter" is not ', although it strips the leading ' which seems like a separate bug. What is very impressive is Excel's behaviour when ' is replaced with ". It correctly imports all lines, with the extraneous " inside the text removed. I have no idea how it does this - it must use advanced column and field auto-detection heuristics or something. Neither AOO nor Apache Commons CSV are able to get past line 124.
I didn't intend to fix this bug with the patch in r1739628, but I did. The problem was that the CSV line parser was trying to read and concatenate lines until it gets an even number of quote characters, which never happened. It now parses lines using the same logic as the CSV field parsers. The behaviour also matches Excel's. Resolving fixed. Thank you for your bug report!
*** Issue 123831 has been marked as a duplicate of this issue. ***
Results of my test with Linux-32 r. 1739631 The file imports completely. However line 124, second field from the original file as: 'ARLINGTON INT'L AVIATION' gets imported as: ARLINGTON INTL AVIATION' when I use the following as CSV import specifications: * text separator -- comma * text deliminitor -- ' (single quote) So the internal quote in INT'L gets dropped.
(In reply to Kay from comment #6) > Results of my test with Linux-32 r. 1739631 > > The file imports completely. However line 124, second field from the > original file as: > > 'ARLINGTON INT'L AVIATION' > > gets imported as: > > ARLINGTON INTL AVIATION' > > when I use the following as CSV import specifications: > > * text separator -- comma > * text deliminitor -- ' (single quote) > > So the internal quote in INT'L gets dropped. Which is exactly what Excel does in its "Text to columns" with that line, and which is how it generally works: fields starting with the quote character continue until the first quote character without an adjacent quote character, and then any subsequent text is appended as it is, without considering quoting at all.
(In reply to damjan from comment #7) > (In reply to Kay from comment #6) > > Results of my test with Linux-32 r. 1739631 > > > > The file imports completely. However line 124, second field from the > > original file as: > > > > 'ARLINGTON INT'L AVIATION' > > > > gets imported as: > > > > ARLINGTON INTL AVIATION' > > > > when I use the following as CSV import specifications: > > > > * text separator -- comma > > * text deliminitor -- ' (single quote) > > > > So the internal quote in INT'L gets dropped. > > Which is exactly what Excel does in its "Text to columns" with that line, > and which is how it generally works: fields starting with the quote > character continue until the first quote character without an adjacent quote > character, and then any subsequent text is appended as it is, without > considering quoting at all. Ah. OK. I haven't used Excel in many many years.