Bug 60517 - Incorrect parsing multiple sheet in formula
Summary: Incorrect parsing multiple sheet in formula
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF (show other bugs)
Version: unspecified
Hardware: All All
: P2 major (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-12-25 16:10 UTC by Ricky
Modified: 2017-01-04 21:50 UTC (History)
0 users



Attachments
test file (84.00 KB, application/vnd.ms-excel)
2016-12-25 16:10 UTC, Ricky
Details
Simplified test case workbook XLSX (12.08 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2017-01-04 21:46 UTC, Javen O'Neal
Details
Simplified test case workbook XLS (30.00 KB, application/vnd.ms-excel)
2017-01-04 21:50 UTC, Javen O'Neal
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ricky 2016-12-25 16:10:33 UTC
Created attachment 34554 [details]
test file

Summary:

In my Excel research project from EUSES data base, I found that HSSF in POI 3.15 final release didn't parse multiple sheet formula correctly.

Details:

When the multiple sheet name quoted by single quote, HSSF put single quote between every sheet name. (See at sheet: Summary in cell: D17 of formula: SUM('1003':'1856'!D28))

For example,

Formula: = SUM('1003:1856'!D28)

Here the sheet name are 1003 and 1856. When HSSF parse string, it shows:

SUM('1003':'1856'!D28)

I should expect the formula string should be :

SUM('1003:1856'!D28)

If I convert XLS into XLSX file, the new XSSF parse it correctly as SUM('1003:1856'!D28)
Comment 1 Javen O'Neal 2017-01-03 07:30:13 UTC
Could you include a code snippet of what you're doing, including any manipulations you are doing to the stored formula string and how you're getting the formula string.

FWIW, when I open your attached XLS file in LibreOffice, the following formula appears in D17:
=SUM($'1003'.D28:$'1856'.D28)

When I use LibreOffice to save the file as XLSX, I get:
=SUM('1003':$'1856'.D28:D28)

And after converting to XLSX, the value saved in sheet1.xml is:
<c r="D17" s="19" t="n">
<f aca="false">
SUM(&apos;1003&apos;:&apos;1856&apos;!D28:D28)</f>
<v>
102320067</v>
</c>

It may be possible that both versions are correct.
For example, 'Sheet1'!A1 and Sheet1!A1 are both correct. In general, single quotations are needed whenever a character in the sheet name could cause the formula to be parsed differently (whitespace, punctuation, sheet name looks like a cell reference). I would assume that '1003:1856' would not be correct as that would imply that there is a sheet named "1003:1856".
Comment 2 Ricky 2017-01-03 13:06:34 UTC
It is incorrect in your interpretation. What I refer in my previous post is multiple sheet reference, ie two sheet 1003 and 1856 of the same cells. What you did is two different cells =SUM($'1003'.D28:$'1856'.D28)

I tried to type =SUM('1003':'1856'!D28)) in Microsoft Excel 2016 Mac, but it pops up an error dialog.

I only read/parse Excel by POI no update/write in my test.

Here is part of incomplete code that is for demo purpose:

	private static HSSFWorkbook createHSSFWorkBook(String fileName) throws IOException {
	    FileInputStream fis = new FileInputStream(fileName);
	    try {
	        return new HSSFWorkbook(fis);
	    } catch (Exception e) {
	    	logger.warn("Failed to parse Excel file: " + fileName, e);
	    	return null;
	    } finally {
	        fis.close();
	    }
	}

	public static void generateJson(String sheetFilePath, String jsonFilePath) throws IOException {

		// load external Excel file by POI
		Workbook wb = null;
		if (sheetFilePath.endsWith(".xlsx"))
			wb = new XSSFWorkbook(sheetFilePath);
		else {
			wb = XlParser.createHSSFWorkBook(sheetFilePath);
			if (wb == null)
				return;
		}

		// loop through each sheet
        for(int sheetIndex = 0; sheetIndex < wb.getNumberOfSheets(); sheetIndex++) {
          	Sheet sheet = wb.getSheetAt(sheetIndex);
        	// loop through each row
	        for (Row row : sheet) {
	        	// loop through each cell
	            for (Cell cell : row) {
	            	// retrieve cell reference
	                CellReference cellRef = new CellReference(row.getRowNum(), cell.getColumnIndex());
	                switch(cell.getCellTypeEnum()) {
	                	case FORMULA:
	                		String cellRefString = cellRef.formatAsString();
	                		String formulaString = null;
	                		try {
	                			formulaString = cell.getCellFormula();
	                		} catch (Exception e) {
	                			logger.fatal("Failed to get formula in cell " + cellRefString);
	                			throw e;
	                		}

	                        break;
	                	case BOOLEAN:
	                		
	                		break;
	                	case ERROR:
	                		
	                		break;
	                	case NUMERIC:
	                		
	                		break;
	                	case STRING:
	                		
	                		break;
	                	default:
	                		break;
	                }
	            }// end of cell
	        }// end of row
        }//end of sheet
 }
Comment 3 Mark Murphy 2017-01-03 13:47:20 UTC
(In reply to Javen O'Neal from comment #1)
> Could you include a code snippet of what you're doing, including any
> manipulations you are doing to the stored formula string and how you're
> getting the formula string.
> 
> FWIW, when I open your attached XLS file in LibreOffice, the following
> formula appears in D17:
> =SUM($'1003'.D28:$'1856'.D28)
> 
> When I use LibreOffice to save the file as XLSX, I get:
> =SUM('1003':$'1856'.D28:D28)
> 
> And after converting to XLSX, the value saved in sheet1.xml is:
> <c r="D17" s="19" t="n">
> <f aca="false">
> SUM(&apos;1003&apos;:&apos;1856&apos;!D28:D28)</f>
> <v>
> 102320067</v>
> </c>
> 
> It may be possible that both versions are correct.
> For example, 'Sheet1'!A1 and Sheet1!A1 are both correct. In general, single
> quotations are needed whenever a character in the sheet name could cause the
> formula to be parsed differently (whitespace, punctuation, sheet name looks
> like a cell reference). I would assume that '1003:1856' would not be correct
> as that would imply that there is a sheet named "1003:1856".

I did not know a range could look like this.

I did some testing in Excel 2016 (Windows), and it appears that =SUM('1003:1856'!A1) is a valid range syntax where there is a tab named 1003 and another tab named 1856. All tabs physically between tabs 1003 and 1856 will be included in the sum. So if I created a tab named Sheet4 and put a value in cell A1, that cell is included in the sum only if I move tab Sheet4 to a position between tabs 1003 and 1856. This formula is maintained, including single quotes, if I save and reopen in XLS or XLSX format. If I add another tab Sheet5, and then make the formula =SUM('Sheet4:Sheet5'!A1), Excel changes it to =SUM(Sheet4:Sheet5!A1)
Comment 4 Ricky 2017-01-03 14:42:24 UTC
As you found it out, single quote is very important in multiple sheet of formula string,

1. The single quote of sheet name is kept if any characters such as #"=<>&+-%,SINGLE_SPACE exist in sheet name. eg, if you put # sign inside sheet name, ie 'Sheet#4', the single quote is preserved

2. The parsed formula string of multiple sheet from POI 3.15 is incorrect. At any time, it is illegal to quote sheet name twice =SUM('1003':'1856'!D28). The correct parse result should be SUM('1003:1856'!D28)

I tried to figure it out how POI generate parse result and patch it by myself. But IMHO the formula parsing code is not quite readable. In the absence of development document, it is better to leave maintainer to fix it.
Comment 5 Javen O'Neal 2017-01-04 21:46:09 UTC
Created attachment 34588 [details]
Simplified test case workbook XLSX
Comment 6 Javen O'Neal 2017-01-04 21:50:22 UTC
Created attachment 34589 [details]
Simplified test case workbook XLS

It looks like Excel 2013 allows sheet ranges to be unquoted if the first and last sheet do not contain special characters that would require quoting for a single sheet cell reference.

Otherwise, Excel 2013 quotes the sheet range instead of quoting the sheets individually.

Valid: Sheet1:Sheet2!A1
Valid: 'Sheet1:Sheet2'!A1
Valid: 'Sheet 1:Sheet 2'!A1
Invalid: 'Sheet1':'Sheet2'!A1
Invalid: 'Sheet1'!A1:'Sheet2'!A1
Invalid: Sheet 1:Sheet 2!A1