Bug 62904 - Simple Excel Array Formula Fails in POI
Summary: Simple Excel Array Formula Fails in POI
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 4.0.0-FINAL
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-12 11:13 UTC by stephen.friedrich
Modified: 2019-01-14 14:58 UTC (History)
0 users



Attachments
Test Excel File (7.93 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2018-11-12 11:13 UTC, stephen.friedrich
Details
patch to fix evaluation of IF with array arguments (10.16 KB, patch)
2019-01-01 17:27 UTC, Yegor Kozlov
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description stephen.friedrich 2018-11-12 11:13:12 UTC
Created attachment 36259 [details]
Test Excel File

In our project we use POI to "calculate" an Excel workbook that is managed by a business analyst.

In the last revision we had very strange results and I tracked it down to POI not really supporting array formulas.
I made a test excel and project where it fails for this simple formula

    { =MIN(IF(A1:A6>=C1;A1:A6)) }

The strange thing is that it works sometimes, but fails for other values.
If this is not supported I would be OK with a hard failure as soon as the formula is evaluated.
But here there are no exceptions at all, but the formula just gives wrong results for some values.

See the example project including an automated test at https://github.com/eekboom/poi-minif

----

I also attached the test excel file to this issue.

Here's a groovy spock test that shows the problem:

    import org.apache.poi.ss.usermodel.Cell
    import org.apache.poi.ss.usermodel.FormulaEvaluator
    import org.apache.poi.ss.usermodel.Row
    import org.apache.poi.ss.usermodel.Sheet
    import org.apache.poi.ss.usermodel.Workbook
    import org.apache.poi.ss.usermodel.WorkbookFactory
    import spock.lang.Specification
    import spock.lang.Unroll
    
    
    class PoiSpec extends Specification {
    
        @Unroll
        def 'test array formula: #value -> #expectedValue'() {
            given:
            Workbook workbook = WorkbookFactory.create(Thread.currentThread().getContextClassLoader().getResourceAsStream("workbook.xlsx"))
            Sheet sheet = workbook.getSheet("Sheet1")
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            Row inputRow = sheet.getRow(0)
            Cell inputCell = inputRow.getCell(2)
    
            Row outputRow = sheet.getRow(2)
            Cell outputCell = outputRow.getCell(2)
    
            when:
            inputCell.setCellValue(value)
            formulaEvaluator.notifyUpdateCell(inputCell)
    
            then:
            formulaEvaluator.evaluate(outputCell).getNumberValue() == expectedValue
    
            where:
            value || expectedValue
            3     || 5
            5     || 5
            6     || 10
            27    || 30
    
        }
    }
Comment 1 gallon.fizik@gmail.com 2018-12-22 10:58:20 UTC
Hello Steven,
thanks for the report.

I encountered the same problem yesterday.

Reasonably enough yet unfortunately, the IF construct is being treated as "lazy if", and the first argument is evaluated as a boolean scalar, never an array, and determines the following evaluation flow. If you are interested, I can post a dirty and almost untested 'works for me' patch.

Right now I am overloaded with work but hopefully I'll address the issue properly during the holidays.
Comment 2 Yegor Kozlov 2019-01-01 17:27:41 UTC
Created attachment 36356 [details]
patch to fix evaluation of IF with array arguments

With the attached patch POI evaluates it right.

Summary of changes:
1. IF needs to implement ArrayFunction
2. moved RelationalOperationEval#evaluateArray into ArrayFunction. The logic to transpose and evaluate matrix arguments is common for all array functions and IFFunc is an example.
3. changed WorkbookEvaluator to evaluate and put the first argument on the stack in array mode. As Gallon noted, the "lazy if" logic  evaluates the first arg as a boolean scalar which further results in a wrong result. 

Unit tests are coming soon
Comment 3 Yegor Kozlov 2019-01-07 14:58:13 UTC
I checked in the fix in r1850646. The fix *improves* support for evaluation of array formulas, but full coverage is a long way ahead.  

What is included in the fix:

1. support for array arguments in the IF function. 
In a array context IF is rarely used by its own, it is usually nested in another function like {=MIN(IF(logical_test, values))} as in your example. Tests with MIN and MAX all pass. For SUM I have at least one failing test,  but it can be related to another issue. 
 
2. Support for array arguments in logical IS*** functions and unary operators. This makes it possible to evaluate formulas like {=SUM( --ISERROR(E6:E11)) }. Previously IS*** and unary operators evaluated to a scalar and the result was wrong.

Please open a new ticket or re-open this one if you find more issues with evaluation of array formulas. 

Regards,
Yegor
Comment 4 gallon.fizik@gmail.com 2019-01-07 22:03:22 UTC
@Yegor could you please look into this issue https://bz.apache.org/bugzilla/show_bug.cgi?id=63054 ?

The topic seems very close, and the divine vision of the code may still be fresh in your cache.
Comment 5 Yegor Kozlov 2019-01-08 13:08:43 UTC
Thanks for the pointer. This commit fixed bug #63054 as well. 
The exact change that fixed it was in TwoOperandNumericOperation#evaluateArray() 

We had duplicate pieces of code that evaluated array arguments:
1. ArrayEval.evaluate(srcRowIndex, srcColumnIndex, rag1, arg2)
2. RelationalOperationEval.evaluateArray(args, srcRowIndex, srcColumnIndex)

both iterate over the input arrays and evaluate (i,j)-th element from the first range with (i,j)-th element from the second range.  
The logic in RelationalOperationEval was more mature and I refactored it into a base class. ArrayEval is obsolete and can be deleted.

The test cases are in TwoOperandNumericFunctionTestCaseData.xls

Regards,
Yegor
Comment 6 Yegor Kozlov 2019-01-14 14:58:13 UTC
I checked in a follow-up to my previous commit.
Now all tests pass including the commented ones, that failed in my previous commit.

Excel has an optimization when evaluating the IF formula: depending on whether the logical condition is true/false, the PTGs for the 2nd or 3rd argument can be skipped altogether without evaluation. I had a hard time figuring out how it works in array mode until I realized that the "optimized if" logic is not applicable in array context: in array mode all arguments should be evaluated and put onto the stack and the 'skip' instructions should be ignored. Once I realized this and changed the code all failing tests started to pass.

Regards,
Yegor