Issue 85946 - CSV import - auto-detect delimiters
Summary: CSV import - auto-detect delimiters
Status: CLOSED WONT_FIX
Alias: None
Product: Calc
Classification: Application
Component: open-import (show other issues)
Version: OOo 1.0.0
Hardware: All All
: P4 Trivial with 5 votes (vote)
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-02-07 17:00 UTC by fabianvss
Modified: 2017-05-20 09:55 UTC (History)
3 users (show)

See Also:
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description fabianvss 2008-02-07 17:00:46 UTC
The text import dialog which appears when opening a CSV file, should analyze the
text and automatically select most likely delimiter options.

By default, a semicolon is selected for separating fields and a double quote for
text. But often, f.e. in regions where decimal separators are commas instead of
dots, semicolons are used for separating the fields and it is very annoying to
change this every time when opening a file.

It should do the following:

- search for one tabulator
- if found then use tabulators as field delimiter
- else
  - count all commas and semicolons
  - if semicolons > commas then use semicolons as field delimiter
  - else use commas as field delimiter

Now, just counting text delimiters isn't sufficient as the following example of
a CSV row shows:
I'm here,I'm there,"I'm nowhere, but aware"

- count all double and single quote pairs around fields (field delimiter or
start of line on the left, and field delimiter or end of line on the right)
(a raw search could lead initiating the complex search when 2 occurences are
found, otherwise the number of pairs is treated as 0)
- if single quote pairs > double quote pairs then use single quotes as text
delimiter
- else use double quotes as text delimiter
Comment 1 oc 2008-07-15 10:45:01 UTC
reassigning features and enhancements to user requirements@openoffice.org which
will be the default owner for those tasks (was introduced some time ago)
Comment 2 ali_b 2009-12-11 11:28:23 UTC
Some thoughts about this issue:

Autodetect delimiters would be great.
So when the import dialog opens, the detected delimiters should be already set.
So all I have to do is clicking ok after I checked whether the autodetection was
correct.

Also there should be a commandline option to provide all necessary csv import
options, so in this case no dialog would appear.

The last part is very important for automating some processes in companies.

Related issues are:
72981, 97416 and others
Comment 3 jnothman 2010-03-25 06:49:12 UTC
I think the first approach should be:
1) For Text Import on Open, use the last used settings for the given filename! 
I.e. keep a cached mapping with the document history. This may be impractical.
2) Otherwise, detect.

The detection of text delimiter should perhaps precede column delimiters, 
because there is much greater variability in the latter, and it is hard to 
determine without having first stripped out quoted/escaped portions.

Assuming there is no quoting (i.e. it has been stripped, or it does not exist 
in the first place), one approach to finding column delimiters would be to find 
all non-alphanumeric characters which have constant frequency >= 1 on all lines 
of the file.

Or if we can't strip quoted text, find all non-alphanumeric, non-quote 
characters which have a constant minimum value >= 1 on each line.

Heuristics or a machine learning approach might then select which result is 
most appropriate in case of conflict.


Alternatively, brute force it: determine which (common?) pairs of delimiters 
give the text integrity (same number of cols per line), and then use heuristics 
to decide (e.g. prefer quoted over unquoted; tab over semicolon, comma or 
space; more columns over fewer?).

If OOo could acquire a collection of test text spreadsheets it might be helpful!
Comment 4 msaum 2011-06-17 16:33:53 UTC
No changes to this in over a year? This is a sticking point since Excel can open CSV's without import utility....?  It is big for my workplace...
Comment 5 Edwin Sharp 2013-12-18 13:29:15 UTC
Text Import dialog window remembers user settings. A company will have a certain CSV format so IMHO no need for guess work algorithm.