Apache OpenOffice (AOO) Bugzilla – Issue 85946
CSV import - auto-detect delimiters
Last modified: 2017-05-20 09:55:39 UTC
The text import dialog which appears when opening a CSV file, should analyze the text and automatically select most likely delimiter options. By default, a semicolon is selected for separating fields and a double quote for text. But often, f.e. in regions where decimal separators are commas instead of dots, semicolons are used for separating the fields and it is very annoying to change this every time when opening a file. It should do the following: - search for one tabulator - if found then use tabulators as field delimiter - else - count all commas and semicolons - if semicolons > commas then use semicolons as field delimiter - else use commas as field delimiter Now, just counting text delimiters isn't sufficient as the following example of a CSV row shows: I'm here,I'm there,"I'm nowhere, but aware" - count all double and single quote pairs around fields (field delimiter or start of line on the left, and field delimiter or end of line on the right) (a raw search could lead initiating the complex search when 2 occurences are found, otherwise the number of pairs is treated as 0) - if single quote pairs > double quote pairs then use single quotes as text delimiter - else use double quotes as text delimiter
reassigning features and enhancements to user requirements@openoffice.org which will be the default owner for those tasks (was introduced some time ago)
Some thoughts about this issue: Autodetect delimiters would be great. So when the import dialog opens, the detected delimiters should be already set. So all I have to do is clicking ok after I checked whether the autodetection was correct. Also there should be a commandline option to provide all necessary csv import options, so in this case no dialog would appear. The last part is very important for automating some processes in companies. Related issues are: 72981, 97416 and others
I think the first approach should be: 1) For Text Import on Open, use the last used settings for the given filename! I.e. keep a cached mapping with the document history. This may be impractical. 2) Otherwise, detect. The detection of text delimiter should perhaps precede column delimiters, because there is much greater variability in the latter, and it is hard to determine without having first stripped out quoted/escaped portions. Assuming there is no quoting (i.e. it has been stripped, or it does not exist in the first place), one approach to finding column delimiters would be to find all non-alphanumeric characters which have constant frequency >= 1 on all lines of the file. Or if we can't strip quoted text, find all non-alphanumeric, non-quote characters which have a constant minimum value >= 1 on each line. Heuristics or a machine learning approach might then select which result is most appropriate in case of conflict. Alternatively, brute force it: determine which (common?) pairs of delimiters give the text integrity (same number of cols per line), and then use heuristics to decide (e.g. prefer quoted over unquoted; tab over semicolon, comma or space; more columns over fewer?). If OOo could acquire a collection of test text spreadsheets it might be helpful!
No changes to this in over a year? This is a sticking point since Excel can open CSV's without import utility....? It is big for my workplace...
Text Import dialog window remembers user settings. A company will have a certain CSV format so IMHO no need for guess work algorithm.