Summary: | Add missing built-in numFmt | ||
---|---|---|---|
Product: | POI | Reporter: | Tim Allison <tallison> |
Component: | XSSF | Assignee: | POI Developers List <dev> |
Status: | NEW --- | ||
Severity: | normal | ||
Priority: | P2 | ||
Version: | 3.16-dev | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | triggering doc from common crawl |
Description
Tim Allison
2017-04-27 19:50:50 UTC
In the OOXML specification document starting on page 1767 (I have the 4th edition downloaded, I see the 5th edition is available [1]): 18.8.30 numFmt (Number Format) This element specifies number format properties which indicate how to format and render the numeric value of a cell. Following is a listing of number formats whose formatCode value is implied rather than explicitly saved in the file. In this case a numFmtId value is written on the xf record, but no corresponding numFmt element is written. Some of these Ids can be interpreted differently, depending on the UI language of the implementing application. Ids not specified in the listing, such as 5, 6, 7, and 8, shall follow the number format specified by the formatCode attribute. So it looks like for example numFmtId=164 should have the formatCode attribute set in the XML? [1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-376,%20Fourth%20Edition,%20Part%201%20-%20Fundamentals%20And%20Markup%20Language%20Reference.zip Thank you, Greg! In that section of the spec, they include 57 and 58 which are the ones the attached file needs. Problem solved!!! Wait... :) It looks like our built in formats should include language-country variants. e.g. 33... zh-tw: hh"時"mm"分"ss"秒" zh-cn:h"时"mm"分"ss"秒" ja-jp: h"時"mm"分"ss"秒" ko-kr: h"시" mm"분" ss"초" In a test xlsx, I manually set a styleid=58. We are currently hiding this problem for xlsx: line 247 of XSSFExcelExtractor if (cs != null && cs.getDataFormatString() != null) { line 364 of XSSFSheetXMLHandler if (this.formatString != null && n.length() > 0) { We could identify files for testing with a slightly modified version of XSSFSheetXMLHandler in our Common Crawl corpus. For now, I'll update the xlsb parser to have the same "silently hide" behavior, but I'll leave this issue open. >So it looks like for example numFmtId=164 should have the formatCode attribute set in the XML? Y, agreed. The first problematic numFmtId for the triggering file was 58, which is built in, according to the spec. > depending on the UI language of the implementing application. When I have some time (after Tika 1.15 is out), I'll try to find xlsx examples in our corpus numFmtId < 164 and > 49 . With any luck the language code is somewhere in the document. (In reply to Tim Allison from comment #4) > With any luck the language code is somewhere in the document. The way the spec is worded, it sounds like the format is based on the language of the evaluating environment, not the document. Perhaps like how Excel handles number separators in Windows at least, using the system settings. If the document has a language somewhere, I think POI should at least offer an option to use it, though, since I know I often don't do any checking or setting of the default language in Java, and other users probably don't either. I can see some applications offering a user language preference choices, and controlling display formatting based on that, however, so that may be the desired behavior. Perhaps a formatter flag for "use document language if present" or something, defaulting to Java system language? We've previously talked about capturing all these localised formats, and providing a way to have things rendered in them, but thus far no volunteers to help work out what they'd all be. Note that these would also apply for formats like [$$-409]#,###.00 where an explicit locale is set as part of the format. CellFormatPart has a few notes on this Thank you, Nick!
Y, this looks like a non-trivial undertaking...
>The way the spec is worded, it sounds like the format is based on the language of the evaluating environment, not the document.
Greg, I completely agree. It simply boggled my mind that MS wouldn't include the locale information that was used to generate the document _somewhere_ in the document, given what decade we're in. However, I retain enough wariness to believe this might happen. If I find some time, we could easily enough find .xlsm/.xlsx/.xlsb in our regression corpus to see what happens in practice.
|