Created attachment 34173 [details] example file On https://issues.apache.org/jira/browse/TIKA-2013, I noticed a regression in currency formatting in at least some xls files when comparing 3.15-beta1 with 3.15-beta3-rc1. I confirmed that this is a poi-level issue, not something that we're botching in Tika. XLS2CSVmra results with trunk: ,"Apple Day",$963.01,$7,61,$12,96,$8,99,$6,71,$8,09,$4,27,$3,98,$2,80, ,"Bingo",$1,184.00,$8,67,$18,69,,,,,,,
Could you provide a diff of the beta1 and beta3-rc1 output or a unit test to make it easier to find the breaking change? There were some changes to DataFormatter between these releases that improved POI's locale-specific number formatting (comma is used as decimal separator rather than a period). My guess is that this commit [1] caused the regression. Can't quote the numbers as that would give a string. Either change Xls2Csv to set the locale to format with a period separator when the document's locale or the machine's locale use comma decimal separator, or escape the commas and hope the CSV consumer knows how to unescape it. [1] https://svn.apache.org/viewvc?view=revision&revision=1747139
> $1,184.00 Looks like thousands separator commas need to be striped as well. Question for folks who are used to seeing numbers with comma decimal separators: how are CSV files usually read and written? Are numbers represented with period decimal separators?
For CSV output, wouldn't it be better not to use thousands separators at all? That is, $1,184.00 could be written as $1184.00. Otherwise, to keep the comma separator for the thousands would require the value to appear in double quotes ("$1,184.00").
Looks like this is caused by a typo in http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/ss/usermodel/BuiltinFormats.java?r1=1721901&r2=1747325 -putFormat(m, 7, "\"$\"#,##0.00_);(\"$\"#,##0.00)"); +"\"$\"#,##,00_);(\"$\"#,##0.00)",
r1760102 I didn't bother adding a unit test because this seems like such a small issue. Please reopen if we need a test.