Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing
|Summary:||CSV Imports limits row to 65535 characters|
|Component:||open-import||Assignee:||AOO issues mailing list <issues>|
|Status:||CONFIRMED ---||QA Contact:|
|Version:||OOo 3.2.1||Keywords:||needhelp, oooqa|
|Issue Type:||DEFECT||Latest Confirmation in:||---|
Description adamsei 2010-08-04 20:02:08 UTC
When importing a CSV file OpenOffice will truncate an entry after 65536 characters. Discovered after random survey responses were found to be missing data. I selectivly modified records to determine 65536 was the limit. The included file gives a sample of the of our dataset which was exported from a third-party survey software. You will notice Harvard's entry truncates prematurely and actually causes the next entry to begin on the same line.
Comment 2 adamsei 2010-08-04 20:06:16 UTC
I failed to mention the problem has presented itself on Mac OS X (unsure version), Windows (XP through 7), and Arch Linux. In all cases OOo was 3.2.1.
Comment 3 Rainer Bielefeld 2010-08-05 07:47:50 UTC
Reproducible with "Ooo-Dev 3.3.0 multilingual version English UI WIN XP: [DEV330m1 (Build 9514)]"! Opening sample.csv wich separator "comma" in cell 'W4' of new CALC spreadsheet text ends with string "systems and to learning through advanced research". When I copy cell contents to a new WRITER document, I count 466 characters in the cell. Opening sample.csv wich with "Ooo 3.1.1 WIN XP DE[OOO310m19 (Build 9420)]" in cell 'W4' of new CALC spreadsheet text ends with string "systems and to learning through advanced research". When I copy cell contents to a new WRITER document, I count 3220 characters in the cell. Additionally I see that in 3.3 document only shows 4 rows "ResponseID", "1879", "1881", "1886", With 3.1.1: ResponseID, 1879, 1881, 1886, 1889, 1892 what might be related, but also that might be a different problem. adamsei: How did you check that 65536 characters limitation? Please always contribute a clear instruction how to find the problem! At least you have to mention your csv import settings and an explicit description what exactly your problem is. May I ask you to read our guidelines on <http://qa.openoffice.org/issue_handling/pre_submission.html> and <http://www.openoffice.org/bugs/bug_writing_guidelines.html> before you file further issues or post again here? Then please contribute a clear step by step instruction containing all observations (error messages ...), EVERY key press and EVERY mouse click how to reproduce the problem, and explain why you believe that your results are unexpected. That means (for example): do not write something like "I am not able to ...", but 6. left mouse click on … expected: …, color of … changes, … actual: no …., color remains white, no …
Comment 4 Rainer Bielefeld 2010-08-05 09:45:10 UTC
I'm sorry, there is a mistake concerning final string in my latest comment. Correct comment must be: "Opening sample.csv wich with "Ooo 3.1.1 WIN XP DE[OOO310m19 (Build 9420)]" in cell 'W4' of new CALC spreadsheet text ends with string "/oe/HU_Temp_Policy_July09.pdf". Some of the differences I saw between 3.1.1 and 3.3. were caused by using different languages. The truncating problem seems to be caused by text delimiter ' Using this one with 3.1.1, I get message "The maximum number of rows has been exceeded" (Issue 75199?), and Text in 'W4' will be truncated. 3.3 will show 'W4' contents until end "/oe/HU_Temp_Policy_July09.pdf" if I use " as delimiter. Currently I can't see any bug.
Comment 5 adamsei 2010-08-29 05:39:48 UTC
It appears I was wrong, and 65535 is the number one is limited to, 65536 is the number that breaks it. This is not a particulary surpising number since it is the maximum value of an unsigned short int. I have updated the summary accordingly. To be clear the issue is not the number of characters in a cell but rather the number of characters in a line of the CSV file. A script similar to count_public.py was used to determine character count of a line, recognize it does count non-printing characters (CR, LF). I had used python to generate psudo-random characters to flush one of our survey responses that was below the 65536 limit to the limit, and then manually added and removed characters using a text editor (mousepad and wordpad) to confirm that 65536 was the problem number. CSV files were imported with what I understand to be the settings for standard CSV file according to RFC 4180. Namely, Seperated by Comma, Text delimiter " nothing else selected. All intelligent encoding setting for the file preset the same problem, though UTF-8 has been most thoroughly tested. Since I am unable to make availabe much of our working data set and the complexity of our survey resposes seem to create some confusion, I wrote a python script to generate a suitable CSV file to illustrate the issue. These latest files were only tested on OpenOffice.org 3.2.1 OOO329m19 (Build:9505) on ArchLinux (x86_64) To reproduce open dummy_data.csv with the standard CSV settings and UTF-8 encoding. Note cell "OX1" reads "IF" where if should read "IFNOT" dummy_data.py can be used to produce additional testing files. I also noticed this limitation affects the Text database option in Base. Should a seperate Issue be opened for that?
Comment 6 adamsei 2010-08-29 05:40:46 UTC
Created attachment 71457 [details] python3 script to count characters in a line
Comment 7 adamsei 2010-08-29 05:41:45 UTC
Created attachment 71458 [details] CSV file that illustrates; Cell "OX2" is truncated
Comment 8 adamsei 2010-08-29 05:42:26 UTC
Created attachment 71459 [details] python3 script to generate psudo-random CSV files
Comment 9 Rainer Bielefeld 2010-08-29 09:06:52 UTC
Reproducible with "Ooo 3.1.1 WIN XP DE[OOO310m19 (Build 9420)]" and with "Ooo-Dev 3.3.0 multilingual version English UI WIN XP: [OOo330m1 (Build 9514)]", not in all details (because of my simplyfied test method), but for the general effect. I see a limitation opening "dummy_data.csv" with 3.1.1. In final cells 'OW' / 'OX' I see contents "TQLXHAIMWFCYKRBQWQREKAOAM" / "IF" opening sample "dummy_data.csv" Latest cell should contain "IFNOT", as I see with Editpadlite. I did an additional test with "dummy_data.csv": 1. opened document 2. marked 'OW' with contents "TQLXHAIMWFCYKRBQWQREKAOAM" 3. go to 'OY' 4. <ntrl>+<v>, contents was pasted as expected 5. marked cell moving mouse pointer with pressed left button to 'TG' (exact position does not matter) 7. Menu 'Edit -> Fill -> right' (cells will be filled with contents) 8. save as "dummy_data_new.csv" and "dummy_data_new.ods" 9. close "dummy_data_new.ods" 10. reopen both documents with CALC expected: both with new additional contents actual: "dummy_data_new.csv" adds some new contents until 'PB', further contents is truncated only "dummy_data_new.ods" with new contents until 'TG' It's an "open-import" problem, I see added cell contents in "dummy_data_new.csv" with Editpadlite. Because the limitation does not exist for .ods documents, I believe it's a bug.