Bug 56106

Summary: End of Line character in formula throws exception in XSSFFormulaEvaluator.evaluate()
Product: POI Reporter: David Crocker <david.crocker>
Component: XSSFAssignee: POI Developers List <dev>
Status: NEW ---    
Severity: normal CC: david.crocker
Priority: P2    
Version: 3.9-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: this file has lots of hidden line breaks.
Simple .xlsx for testing bug 56106
POIBug56106.xls

Description David Crocker 2014-02-04 22:12:47 UTC
Created attachment 31282 [details]
this file has lots of hidden line breaks.

Our spreadsheets have long formulas in them, so I spent some effort formatting them to be able to make sense out of them.  I used Shift+Enter to produce the soft line break.  See below for an example:

IF(
    INDEX('TRB Record'!C$2:C$61,ROW()*2-5,,1)="",
    "",
    INDEX('TRB Record'!C$2:C$61,ROW()*2-5,,1)
   )

Microsoft Excel handles the line breaks without any trouble, and they are easy to read and debug.  (Okay, I'm uptight.) But hidden in these formulae are the end of line characters ['\n'].

And unfortunately, the POI formula parser breaks when it reaches one of these special EOL characters.  I'm using Eclipse, and I can see the string array reporting the offending EOLs.  Here's the text of the Exception (with Eclipse using the formatting when it reports):

org.apache.poi.ss.formula.FormulaParseException: Parse error near char 3 '
' in specified formula 'IF(
       INDEX('TRB Record'!C$2:C$61,ROW()*2-5,,1)="",
       "",
       INDEX('TRB Record'!C$2:C$61,ROW()*2-5,,1)
      )'. Expected cell ref or constant literal

Here's a code snippet that produces the Exception:

                XSSFSheet sheet = workbook.getSheet(worksheet);
                XSSFFormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();
...
                int ctype = 0;
                Cell cell;
...
                        Row row = sheet.getRow(rowCtr);//rows.next();
...
                                        cell = row.getCell(col,Row.RETURN_BLANK_AS_NULL);
...
                                                ctype = evaluator.evaluateFormulaCell(cell);


My planned work-around for this is to wrap the call in a function that strips out the offending characters, builds a temporary cell with the new formula string, and return it.  Then I'll be able to run the evaluate() function against it for processing.
Comment 1 David Crocker 2014-02-04 22:21:22 UTC
This starts evaluating on 'Digestion'!A3 and breaks at 'Average whole mass closure'!B3.
Comment 2 David Crocker 2014-02-05 16:01:43 UTC
Created attachment 31286 [details]
Simple .xlsx for testing bug 56106

This simple file has a formula with '\n' EOL breaks.  Use it for building unit tests.
Comment 3 Nick Burch 2014-02-05 16:05:31 UTC
Are you able to create a .xls file with the line breaks in it too? Ideally the same .xlsx file, saved as .xls

(Depending on if .xls supports it, and if so how, may affect the overall fix)
Comment 4 David Crocker 2014-02-05 16:09:14 UTC
Created attachment 31287 [details]
POIBug56106.xls

Use this simple XLS file to test bug 56106.
Comment 5 Nick Burch 2014-02-05 17:05:19 UTC
Thanks for that!

Next two questions:
 * Can POI parse the formula in the .xls version? Without error? And correctly?
 * Can you use BiffViewer to see how excel encoded the newline into the .xls file? (.xls doesn't store formulas as strings, but as Ptgs, while .xlsx stores the formula string)
Comment 6 David Crocker 2014-02-05 19:23:30 UTC
This might be vapor.  I discovered that tika was taking control of the evaluation and I had an older version referenced.  I updated to tika 1.4, and it still doesn't work, but for a different reason.  The exception, now, is "Incomplete code - don't know how to support the 'area_num' parameter yet".

The stack trace goes like this:

org.apache.poi.ss.formula.functions.Index line 87
Index line 118
OperationEvaluatorFactory line 132
WorkbookEvaluator line 525
...
Comment 7 David Crocker 2014-02-05 20:38:28 UTC
Here is the code for the incomplete function in Index:

	public ValueEval evaluate(int srcRowIndex, int srcColumnIndex, ValueEval arg0, ValueEval arg1,
			ValueEval arg2, ValueEval arg3) {
		throw new RuntimeException("Incomplete code"
				+ " - don't know how to support the 'area_num' parameter yet)");
		// Excel expression might look like this "INDEX( (A1:B4, C3:D6, D2:E5 ), 1, 2, 3)
		// In this example, the 3rd area would be used i.e. D2:E5, and the overall result would be E2
		// Token array might be encoded like this: MemAreaPtg, AreaPtg, AreaPtg, UnionPtg, UnionPtg, ParenthesesPtg
		// The formula parser doesn't seem to support this yet. Not sure if the evaluator does either
	}

Here is some pseudocode that starts to address the problem:

if arg3 is an integer {
   ValueEval _arg0SubRange = arg0[arg3]
   if arg1 not null and arg1 != 0 {
      if arg2 is null or arg2 == 0 {
         if row[arg1] not out of bounds in _arg0SubRange {
            return cell range _arg0SubRange[row[arg1]]
         } else {
            return #REF
         }
      } else {
         if cell[arg1,arg2] not out of bounds in _arg0SubRange {
            return cell reference _arg0SubRange[arg1,arg2]
         } else {
            return #REF
         }
      }
   } else {
      if arg2 is null or arg2 == 0 {
         return #REF
      } else {
         if col[arg2] not out of bounds in _arg0SubRange {
            return cell range _arg0SubRange[col[arg2]]
         } else {
            return #REF
         }
      }
   }
} else {
   return #REF
}

Here's the help detail from Excel:

Reference form
Description
Returns the reference of the cell at the intersection of a particular row and column. If the reference is made up of nonadjacent selections, you can pick the selection to look in.

Syntax
INDEX(reference, row_num, [column_num], [area_num])The INDEX function syntax has the following arguments (argument: A value that provides information to an action, an event, a method, a property, a function, or a procedure.):

Reference  Required. A reference to one or more cell ranges.
If you are entering a nonadjacent range for the reference, enclose reference in parentheses.

If each area in reference contains only one row or column, the row_num or column_num argument, respectively, is optional. For example, for a single row reference, use INDEX(reference,,column_num).
Row_num  Required. The number of the row in reference from which to return a reference.
Column_num  Optional. The number of the column in reference from which to return a reference.
Area_num  Optional. Selects a range in reference from which to return the intersection of row_num and column_num. The first area selected or entered is numbered 1, the second is 2, and so on. If area_num is omitted, INDEX uses area 1.
For example, if reference describes the cells (A1:B4,D1:E4,G1:H4), then area_num 1 is the range A1:B4, area_num 2 is the range D1:E4, and area_num 3 is the range G1:H4.
Remark
After reference and area_num have selected a particular range, row_num and column_num select a particular cell: row_num 1 is the first row in the range, column_num 1 is the first column, and so on. The reference returned by INDEX is the intersection of row_num and column_num. 
If you set row_num or column_num to 0 (zero), INDEX returns the reference for the entire column or row, respectively.

Row_num, column_num, and area_num must point to a cell within reference; otherwise, INDEX returns the #REF! error value. If row_num and column_num are omitted, INDEX returns the area in reference specified by area_num. 
The result of the INDEX function is a reference and is interpreted as such by other formulas. Depending on the formula, the return value of INDEX may be used as a reference or as a value. For example, the formula CELL("width",INDEX(A1:B2,1,2)) is equivalent to CELL("width",B1). The CELL function uses the return value of INDEX as a cell reference. On the other hand, a formula such as 2*INDEX(A1:B2,1,2) translates the return value of INDEX into the number in cell B1.