Bug 61053

Summary: Add missing built-in numFmt
Product: POI Reporter: Tim Allison <tallison>
Component: XSSFAssignee: POI Developers List <dev>
Status: NEW ---    
Severity: normal    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: triggering doc from common crawl

Description Tim Allison 2017-04-27 19:50:50 UTC
Created attachment 34962 [details]
triggering doc from common crawl

I found a handful of NPEs caused by missing built-in numfmts in some xlsb files.  According to: http://stackoverflow.com/questions/4730152/what-indicates-an-office-open-xml-cell-contains-a-date-time-value, it looks like there are some built in East-Asian formats we're missing... up through 164?!

The example file requires 57 and 58 (at least).  We currently go up to 49.  

Any idea how we can find the rest?
Comment 1 Greg Woolsey 2017-04-27 21:15:30 UTC
In the OOXML specification document starting on page 1767 (I have the 4th edition downloaded, I see the 5th edition is available [1]):

18.8.30 numFmt (Number Format)
This element specifies number format properties which indicate how to format and render the numeric value of a cell.

Following is a listing of number formats whose formatCode value is implied rather than explicitly saved in the file. In this case a numFmtId value is written on the xf record, but no corresponding numFmt element is written. Some of these Ids can be interpreted differently, depending on the UI language of the implementing
application.

Ids not specified in the listing, such as 5, 6, 7, and 8, shall follow the number format specified by the formatCode attribute. 

So it looks like for example numFmtId=164 should have the formatCode attribute set in the XML?

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-376,%20Fourth%20Edition,%20Part%201%20-%20Fundamentals%20And%20Markup%20Language%20Reference.zip
Comment 2 Tim Allison 2017-04-28 11:29:56 UTC
Thank you, Greg!

In that section of the spec, they include 57 and 58 which are the ones the attached file needs.  Problem solved!!!  

Wait... :)

It looks like our built in formats should include language-country variants.

e.g. 33...

zh-tw: hh"時"mm"分"ss"秒" 
zh-cn:h"时"mm"分"ss"秒"
ja-jp: h"時"mm"分"ss"秒"
ko-kr: h"시" mm"분" ss"초"
Comment 3 Tim Allison 2017-04-28 12:53:56 UTC
In a test xlsx,  I manually set a styleid=58.  We are currently hiding this problem for xlsx:

line 247 of XSSFExcelExtractor
            if (cs != null && cs.getDataFormatString() != null) {

line 364 of XSSFSheetXMLHandler
            if (this.formatString != null && n.length() > 0) {

We could identify files for testing with a slightly modified version of XSSFSheetXMLHandler in our Common Crawl corpus.


For now, I'll update the xlsb parser to have the same "silently hide" behavior, but I'll leave this issue open.
Comment 4 Tim Allison 2017-04-28 12:57:33 UTC
>So it looks like for example numFmtId=164 should have the formatCode attribute set in the XML?

Y, agreed.  The first problematic numFmtId for the triggering file was 58, which is built in, according to the spec.


> depending on the UI language of the implementing application.

When I have some time (after Tika 1.15 is out), I'll try to find xlsx examples in our corpus numFmtId < 164 and > 49 .  With any luck the language code is somewhere in the document.
Comment 5 Greg Woolsey 2017-04-28 18:56:49 UTC
(In reply to Tim Allison from comment #4)
> With any luck the language code is somewhere in the document.

The way the spec is worded, it sounds like the format is based on the language of the evaluating environment, not the document.  Perhaps like how Excel handles number separators in Windows at least, using the system settings.

If the document has a language somewhere, I think POI should at least offer an option to use it, though, since I know I often don't do any checking or setting of the default language in Java, and other users probably don't either.  I can see some applications offering a user language preference choices, and controlling display formatting based on that, however, so that may be the desired behavior.

Perhaps a formatter flag for "use document language if present" or something, defaulting to Java system language?
Comment 6 Nick Burch 2017-04-28 22:18:48 UTC
We've previously talked about capturing all these localised formats, and providing a way to have things rendered in them, but thus far no volunteers to help work out what they'd all be. Note that these would also apply for formats like [$$-409]#,###.00 where an explicit locale is set as part of the format. CellFormatPart has a few notes on this
Comment 7 Tim Allison 2017-05-01 16:37:00 UTC
Thank you, Nick!

Y, this looks like a non-trivial undertaking...

>The way the spec is worded, it sounds like the format is based on the language of the evaluating environment, not the document.  

Greg, I completely agree. It simply boggled my mind that MS wouldn't include the locale information that was used to generate the document _somewhere_ in the document, given what decade we're in.  However, I retain enough wariness to believe this might happen.  If I find some time, we could easily enough find .xlsm/.xlsx/.xlsb in our regression corpus to see what happens in practice.
Comment 8 Tim Allison 2017-06-20 13:00:22 UTC
As of r1799345, I've aligned XSSFBSheetHandler's behavior with that of the two xlsx extractors.