Bug 60422 - DataFormatter.formatCellValue returns incorrect value for German 'Buchhaltung' format
Summary: DataFormatter.formatCellValue returns incorrect value for German 'Buchhaltung...
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: SS Common (show other bugs)
Version: 3.15-FINAL
Hardware: PC All
: P2 regression with 5 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-27 09:57 UTC by RBRi
Modified: 2017-07-11 11:15 UTC (History)
0 users



Attachments
the input file (15.00 KB, application/vnd.ms-excel)
2016-11-27 09:57 UTC, RBRi
Details
Screenshot of the format dialog (57.54 KB, image/png)
2016-11-27 09:58 UTC, RBRi
Details
patch adding tests for german locale to TestDataFormatter (1.01 KB, patch)
2017-07-01 12:30 UTC, RBRi
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description RBRi 2016-11-27 09:57:26 UTC
Created attachment 34481 [details]
the input file

I have some code that tries to reproduce the visible values of a spreadsheet. So i made a small test sheet with cells using different formats.
Works fine so far.

But starting with version 3.14 i got strange results for one cell.

The cell is formatted as 'Buchhaltung' (see screenshot).
The (visible) value of the cell is '4,33 €' in excel and '0,004,,33.00 €' when using poi.

My code is this / the locale used is Locale.GERMAN:

  /**
   * Reads the content of an excel cell and converts it into the string
   * visible in the excel sheet.
   *
   * @param aRow the row
   * @param aColumnsNo the column
   * @param aFormulaEvaluator the formula Evaluator
   * @param aLocale used for parsing and formating
   * @return the display string
   */
  public static String readCellContentAsString(final Row aRow, final int aColumnsNo,
      final FormulaEvaluator aFormulaEvaluator, final Locale aLocale) {
    final Cell tmpCell = aRow.getCell(aColumnsNo);
    if (null == tmpCell) {
      return null;
    }

    final DataFormatter tmpDataFormatter = new DataFormatter(aLocale);
    try {
      final String tmpResult = tmpDataFormatter.formatCellValue(tmpCell, aFormulaEvaluator);
      return tmpResult;
    } catch (final NotImplementedException e) {
      final StringBuilder tmpMsg = new StringBuilder(e.getMessage());
      if (null != e.getCause()) {
        tmpMsg.append(" (");
        tmpMsg.append(e.getCause().toString());
        tmpMsg.append(')');
      }
      LOG.error(tmpMsg.toString());
      final String tmpResult = tmpDataFormatter.formatCellValue(tmpCell, null);
      return tmpResult;
    }
  }
Comment 1 RBRi 2016-11-27 09:58:19 UTC
Created attachment 34482 [details]
Screenshot of the format dialog
Comment 2 WI 2017-06-22 12:06:43 UTC
C'mon guys, this library is useless if it only works for US/English. 8 months passed and not a SINGLE response? Is this project dead and we should look for alternatives?
Comment 3 Mark Murphy 2017-06-22 12:45:03 UTC
(In reply to WI from comment #2)
> C'mon guys, this library is useless if it only works for US/English. 8
> months passed and not a SINGLE response? Is this project dead and we should
> look for alternatives?

I'm sorry, I don't know German, so it is difficult for me to help you. 

This project is certainly not dead. The devs here all have jobs other than POI, and work on this in their free time, or provide patches that pertain directly to their jobs. So our priorities may not be the same as yours. However, we are happy to review patches submitted from the community. 

If you decide to contribute to the project, be sure to read the contribution guidelines here https://poi.apache.org/guidelines.html, and include unit tests, examples, and updates to the 'How To' document with your patch.
Comment 4 PJ Fanning 2017-06-22 17:04:09 UTC
I added a test case and potential fix in https://github.com/apache/poi/pull/60
Comment 5 RBRi 2017-06-22 17:35:00 UTC
> I'm sorry, I don't know German, so it is difficult for me to help you.
Maybe yes but that was the reason why i spend some time to make a reproducible case for you. I have attached the spreadsheet, have added the complete source code for reproducing the problem and switching the default locale of the JVM is no rocket science.
I think for open source software we have to take special care of different user locales - that is part of the story.
And maybe this is a hint for your ci test suite also; maybe it is a good idea to do the regression test with different locales.
And yes i know that your (we - because i spend also a lot of time doing open source software development) are short of time. But maybe a little hint or response will be great or even better that you have tried to reproduce my case. 

Thanks for working on the this great project - it is a real help.
Comment 6 Andreas Beeker 2017-06-22 21:19:53 UTC
Although I understand German ... I don't understand the potential fix :)

How about not using Locale.US directly, but setting it via LocaleUtil.setUserLocale() before calling the formatter.

 ... so it looks like, the format strings are always parsed in a Locale.US-manner?

Btw. it's ok to remind us, if a bug entry is important to you ... but we are simply too few people to catch up with the 488 bugs.
Comment 7 PJ Fanning 2017-06-22 22:08:13 UTC
Andreas - it seems like the existing code works if the user sets the POI UserLocale to US.

import org.apache.poi.util.LocaleUtil;
LocaleUtil.setUserLocale(Locale.US);

Going forward, I think we should not require the user to make this LocaleUtil call.

Would it make sense to have LocaleUtil user locale default to Locale.US?
Comment 8 Andreas Beeker 2017-06-22 23:19:50 UTC
First of all, I know that DataFormatter is not trivial and we also had some discussions about it [1]

When using an unspecific default locale, we usually use Locale.ROOT - but I'm not sure, if using .ROOT or .US as the default for LocaleUtil.getUserLocale() (instead of Locale.getDefault()) would be less pain, especially since there a quite a few LocaleUtil invocations.

So how to proceed from here? ...
- check if Locale.US is working for other localized excel formats
- or what's wrong with the "Buchhaltung" format "_-* #.##0,00 "€"_-;-* #.##0,00 "€"_-;_-* "-"?? "€"_-;_-@_-" ... it looks like custom formats need to be processed with the Locale of the file - where is it defined? (HPSF has sometimes a Locale entry)



[1] http://apache-poi.1045710.n5.nabble.com/DataFormatter-vs-org-apache-poi-ss-format-classes-td5721100.html
Comment 9 Javen O'Neal 2017-06-23 02:13:30 UTC
The changes in https://github.com/apache/poi/pull/60 look fine to me. Should we continue the discussion til after the 3.17 beta 1 release?
Comment 10 Javen O'Neal 2017-06-23 02:22:26 UTC
Calling LocaleUtil.setUserLocale inside the CellNumberFormatter.formatValue means that the function has a side effect, which may be surprising to a user.
If we restore the UserLocale at the exit of formatValue, then there are still side effects from parallel execution.

Rather than picking between ROOT and US as the default Locale, we could alias the "Microsoft Excel default locale", using that locale throughout our code, which would mean the only place we'd have to change it is where that alias is assigned. We could also let the user change that alias if that's needed for any reason.
setUserLocale and setDefaultLocale...
Comment 11 Javen O'Neal 2017-06-23 02:36:46 UTC
Calling LocaleUtil.setUserLocale inside the CellNumberFormatter.formatValue means that the function has a side effect, which may be surprising to a user.
If we restore the UserLocale at the exit of formatValue, then there are still side effects from parallel execution.

Are there parts of the Microsoft Office file format spec that use the user's OS language and locale, some that use language and locale that are set in Microsoft Office by the user or by the installation medium, and some that are hard-coded to a particular value?

If these are distinct concepts, perhaps we should have different getters and setters in LocaleUtil, and we can use the appropriate kind of locale (user, os, hard-coded) throughout POI according to how Microsoft Office interprets the files.

Rather than picking between ROOT and US as the default Locale, we could alias the "Microsoft Excel default locale", using that locale throughout our code, which would mean the only place we'd have to change it is where that alias is assigned. We could also let the user change that alias if that's needed for any reason.
setUserLocale and setDefaultLocale...
Comment 12 RBRi 2017-06-23 05:41:19 UTC
>- check if Locale.US is working for other localized excel formats
I think that is the base line of this problem. We have to understand if the format description is locale dependent or not. If you need someone to provide more german excel files you can count on me.

And regarding the locale. My program processes the excel file on a machine different from the one in was created on (i guess this is the usual use case for POI). This implies from my point of view, that i have to know the locale setting of the creating machine if i like to reproduce the same presentations of the cell values. That is the reason, for the locale i hand over to the method and in the end to the DataFormatter constructor. From the user point of view this is all i can do and all i like to do.
Comment 13 PJ Fanning 2017-06-24 00:29:55 UTC
I like Javen's idea of adding something like a LocaleUtil set/getDefaultLocale.

I'm wondering if it would be useful to add extra tests like this sample I just produced:

    @Test
    public void testDateFormattingWithLocales() {
        // 2017-12-01 09:54:33 which is 42747.412892397523 as double
        DataFormatter dfDE = new DataFormatter(Locale.GERMANY);
        DataFormatter dfZH = new DataFormatter(Locale.PRC);
        DataFormatter dfIE = new DataFormatter(new Locale("GA", "IE"));
        double date = 42747.412892397523;
        String format = "dd MMMM yyyy HH:mm:ss";
        assertEquals("12 Januar 2017 09:54:33", dfDE.formatRawCellContents(date, -1, format));
        assertEquals("12 \u4E00\u6708 2017 09:54:33", dfZH.formatRawCellContents(date, -1, format));
        assertEquals("12 Ean\u00E1ir 2017 09:54:33", dfIE.formatRawCellContents(date, -1, format));
    }
Comment 14 PJ Fanning 2017-06-28 12:03:36 UTC
I've been doing a bit more experimentation.
It seems that simpler formats are handled correctly, locale wise.
I think some of the main issues are:
* DataFormatter should set the LocaleUtils user locale to match the DataFormatter locale if set (and if LocaleUtils does not have an explicit locale set already)
* org.apache.poi.ss.format.CellNumberFormatter has a lot of hardcoding about commas for grouping separators and dots for decimal separators - should use DecimalFormatSymbols class
* org.apache.poi.ss.format.CellNumberFormatter also applies the separator from the format string but my understanding is that "#,##0.00" means to apply locale specific separators, that the '.' is assumed to mean apply locale specific decimal separator as oppposed to specifically apply '.' as the decimal separator.
* need to check if org.apache.poi.ss.format.CellNumberFormatter assumes 3 digits in each group - in India, they have lakhs and write 1 million (10 lakh) as 10,00,000
Comment 15 PJ Fanning 2017-06-28 12:25:26 UTC
Actally, it seems that I'm wrong about the '.' in the format string - it does seem that this should be a '.' in the result and that if you want ',' in the output then your format must be something like '#.##0,00'.
I've changed my Mac's locale to Germany and the Accounting/Buchhaltung format does appear to be '#.##0,00'. I'll need to double check the parsing code that I used to extract the formatString from attached xls file because that seems to give me a US locale formatString instead of the Germany equivalent.
Comment 16 PJ Fanning 2017-06-28 18:24:21 UTC
Looking at wet_test.xls, the 4.33 cell has this format.

 CELL_RECORD_TYPE
    .formatindex     = aa

This cell format is added as a custom format in FormatTrackingHSSFListener.
The Format String is:
_-* #,##0.00\ "€"_-;\-* #,##0.00\ "€"_-;_-* "-"??\ "€"_-;_-@_-

This is not what Excel run in Germany locale on My Mac shows - it shows this for Accounting/Buchhaltung:
_-* #.##0,00 €_-;-* #.##0,00 €_-;_-* "-"?? €_-;_-@_-

Note that the decimal and grouping separators do match up.

I don't know much about xls data format so I saved the file as an xlsx and checked the xl/styles.xml and the format string seems to match up to the #,##0.00 format as opposed to #.##0,00.

Does anyone have any insight on this?
Comment 17 RBRi 2017-06-30 16:57:42 UTC
In general Excel respects the locale settings (at least on Win it is possible to configure this independently from the locale to whatever you like). But i have not seen any problems when exchanging documents between different settings. So my good-feeling is:
The format string stored inside the file uses chars with a fixed semantic (means there are some special chars like -,#,0, the dot, the comma and so on).

I would expect that the format string parser has a list of constants (chars) acting as tokens for the parser.

> This cell format is added as a custom format in FormatTrackingHSSFListener.
> The Format String is:
> _-* #,##0.00\ "€"_-;\-* #,##0.00\ "€"_-;_-* "-"??\ "€"_-;_-@_-

The next step is the display. MS seems to have made the decision to always use a localized display. If this is clever or not (at least when showing the format string might be a different story).

> This is not what Excel run in Germany locale on My Mac shows - it shows this for > Accounting/Buchhaltung:
> _-* #.##0,00 €_-;-* #.##0,00 €_-;_-* "-"?? €_-;_-@_-

So i can confirm the effect here. If you like to proof this you can change the decimal delimiter to e.g. 'y' and the display of the number in excel itself and the display of the format string changes in my case it looks like this
_-* #.##0y00 €_-;-* #.##0y00 €_-;_-* "-"?? €_-;_-@_-

From the POI API point of view i like to see this behavior:
   final DataFormatter tmpDataFormatter = new DataFormatter(aLocale);
The locale provided here represents the windows system settings, if i change eg.g the decimal separator here the following call of 
    final String tmpResult = tmpDataFormatter.formatCellValue(tmpCell, aFormulaEvaluator);
should respect my separator.
Comment 18 PJ Fanning 2017-07-01 10:33:58 UTC
I've published a new attempt at fixing this at https://github.com/apache/poi/pull/62
Comment 19 RBRi 2017-07-01 12:28:42 UTC
Made a first patch for adding tests for german locale to the test. Please give me a sign, if you like more of this.

Looks like the expectations for testPaddingSpaces() for us are also not correct.
Comment 20 RBRi 2017-07-01 12:30:36 UTC
Created attachment 35092 [details]
patch adding tests for german locale to TestDataFormatter

patch adding tests for german locale to TestDataFormatter
Comment 21 PJ Fanning 2017-07-01 19:02:48 UTC
RBRi - would you be able to try the proposed fix that I have posted to https://github.com/apache/poi/pull/62 ?
Comment 22 RBRi 2017-07-02 07:57:02 UTC
(In reply to PJ Fanning from comment #21)
> RBRi - would you be able to try the proposed fix that I have posted to
> https://github.com/apache/poi/pull/62 ?

Looks good, the new test passing.
Comment 23 PJ Fanning 2017-07-03 21:53:56 UTC
https://svn.apache.org/viewvc?view=revision&revision=1800713

I will follow up with new issues to cover extra cases, eg Indian locale number formatting, trying to get some of the commented out test cases to work in TestDataFormatter and taking a look at reducing the amount the formatting code.
Comment 24 WI 2017-07-11 11:12:36 UTC
FWIW:I think this bug/issue was introduced with the below commit 

https://bz.apache.org/bugzilla/show_bug.cgi?id=58326 (Commit 1701688 in the SVN)

Index: CellNumberFormatter.java
===================================================================
--- CellNumberFormatter.java	(revision 1701687)
+++ CellNumberFormatter.java	(revision 1701688)
@@ -17,6 +17,7 @@
 package org.apache.poi.ss.format;
 
 import java.text.DecimalFormat;
+import java.text.DecimalFormatSymbols;
 import java.text.FieldPosition;
 import java.util.BitSet;
 import java.util.Collections;
@@ -30,6 +31,7 @@
 import java.util.regex.Matcher;
 
 import org.apache.poi.ss.format.CellFormatPart.PartHandler;
+import org.apache.poi.util.LocaleUtil;
 
 /**
  * This class implements printing out a value using a number format.
@@ -182,8 +184,8 @@
         private char insertSignForExponent;
 
         public String handlePart(Matcher m, String part, CellFormatType type,
-                StringBuffer desc) {
-            int pos = desc.length();
+                StringBuffer descBuf) {
+            int pos = descBuf.length();
             char firstCh = part.charAt(0);
             switch (firstCh) {
             case 'e':
@@ -203,7 +205,7 @@
             case '#':
                 if (insertSignForExponent != '\0') {
                     specials.add(new Special(insertSignForExponent, pos));
-                    desc.append(insertSignForExponent);
+                    descBuf.append(insertSignForExponent);
                     insertSignForExponent = '\0';
                     pos++;
                 }
@@ -354,7 +356,8 @@
             fmtBuf.append('E');
             placeZeros(fmtBuf, exponentSpecials.subList(2,
                     exponentSpecials.size()));
-            decimalFmt = new DecimalFormat(fmtBuf.toString());
+            DecimalFormatSymbols dfs = DecimalFormatSymbols.getInstance(LocaleUtil.getUserLocale());
+            decimalFmt = new DecimalFormat(fmtBuf.toString(), dfs);
         }
 
         if (exponent != null)
@@ -594,7 +597,7 @@
             writeFraction(value, null, fractional, output, mods);
         } else {
             StringBuffer result = new StringBuffer();
-            Formatter f = new Formatter(result);
+            Formatter f = new Formatter(result, LOCALE);
             try {
                 f.format(LOCALE, printfFmt, value);
             } finally {
@@ -767,6 +770,7 @@
         writeInteger(exponentNum, output, exponentDigitSpecials, mods, false);
     }
 
+    @SuppressWarnings("unchecked")
     private void writeFraction(double value, StringBuffer result,
             double fractional, StringBuffer output, Set<StringMod> mods) {
 
@@ -869,7 +873,7 @@
             List<Special> numSpecials, Set<StringMod> mods) {
 
         StringBuffer sb = new StringBuffer();
-        Formatter formatter = new Formatter(sb);
+        Formatter formatter = new Formatter(sb, LOCALE);
         try {
             formatter.format(LOCALE, fmt, num);
         } finally {


Which change the way the locale is obtained. This I think brought the "123.456,00" into the format, which then has brings new issues in CellNumberFormatter.java inside the "writeInteger" method, because the "." and "," are hardcoded there. 

As far as I get it, this method removes leading and trailing zeros from format (i.e. 0000123,567.00" and adds/removes as well as replace the "," and "." with locale specific symbols
Comment 25 WI 2017-07-11 11:15:19 UTC
And in writeFractional as well.