Bug 48872 - Issues with poi.ss.usermodel.DataFormatter and DateUtil
Summary: Issues with poi.ss.usermodel.DataFormatter and DateUtil
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall (show other bugs)
Version: 3.6-FINAL
Hardware: PC Windows XP
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2010-03-08 20:47 UTC by Robert Kish
Modified: 2010-06-01 11:33 UTC (History)
1 user (show)

Notes, code examples, and sample data. (27.80 KB, application/x-zip-compressed)
2010-03-08 20:47 UTC, Robert Kish

Note You need to log in before you can comment on or make changes to this bug.
Description Robert Kish 2010-03-08 20:47:37 UTC
Created attachment 25104 [details]
Notes, code examples, and sample data.

I'm working on a project where I try to take an Excel 97 or an Excel 07 file and export it as a CSV file using Java code. For testing, we use Excel's save as CSV converter. I have noted many discrepancies and have provided some corrections/additions/and hacks to address them. I would like to share my findings with someone here so that someone who is willing and able can merge them properly into the POI project.

DateUtil issue
1. The method isADateFormat does not know how to handle quoted values, for example mm"/"dd"/"yyyy.

DataFormatter issues

2. Format strings for zero values are not properly handled. They should use the third component (delimited by ;) if available.

3. Color checking is too simple - it removes all text in [] when it should probably limit itself to the color words. This has an effect if trying to implement elapsed time [h], [m], and [s].

4. Some stupid date formats are not handled by the createFormat method:
a. m\\/d\\/yyyy 
b. mm"/"dd"/"yyyy

5. Elapsed time processing is missing.

6. It appears that _( and _) processing does not match with Excel. Trying to format using built in formats does not work as  expected when _( and _) are corrected. (In other words, the built in format Strings have to be modified too)

7. * padding is not working as expected.

8. Fractions are not being processed, and I've tried and it's really ugly to implement the same as Excel, but they could be replaced with spaces instead of left as ?.

9. If the resulting format does not contain any # or 0, then the number still gets appended to the format string. This happens most often when trying to print a 0 value using a third position format - many formats just print - or blank. and never "-0".

10. Excel appears to use half_up rounding scheme, and not the DecimalFormat default.

11. formatRawCellContents needs a way to send it the 1904 windowing parameter.

12. An invalid date in Excel processing should be returned as 255 #s (when doing CSV conversion).

Attached items:
Shareable Notes.txt
Has some of these notes with a little more wording and some cut/paste code examples.

is my example of this DataFormatter class coded with my hacks. It should be used as a reference.

All Excel Formats.xls
An Excel 97 workbook containing 1 example of all formats available under "custom" option. This is just a sampling, because I know more can be generated if I go into individual options.

All Excel Formats.csv
The above file saved as a CSV.

All Excel Formats.xlsx
The above file saved as a 2007 Excel.

All Excel Formats.0.csv
My program's output from processing the XLS or XLSX file.
I custom formatted dates to show 4 digit years and 2 digit hours, etc.
The discrepancies are:
	There's no 1/0/1900 date but instead 12/31/1899.
	Exponent processing just doesn't work the same way - Java doesn't round digits after decimal.
	Fractions... I tried using BigFraction class, but pulled it out from this release - Excel is weird - it rounds down except when it gets close to next whole number.
Comment 1 Nick Burch 2010-06-01 11:33:21 UTC
Thanks for all the helpful investigations and explanations in your bug report and attached notes

I've added several new features to DataFormatter inspired by your work

I believe that almost everything you mentioned is now properly supported, except for two bits. Firstly the padding / whitespace bit, which should now be properly ignored (because we don't know about the fonts of column widths, I didn't think we could really do anything else). Secondly fractions still need support, probably via lazy loading of commons math if available.

If you take a look at src/testcases/org/apache/poi/ss/usermodel/TestDataFormatter.java you should see tests for much of the new support

If you spot anything that we're still not doing correctly, please do open a new bug to report it. If possible, please also upload a failing unit test (probably a new test method of TestDataFormatter.java), so we've got something to test against!