Bug 60848 - SUMPRODUCT fails when first arg is of form --(...)
Summary: SUMPRODUCT fails when first arg is of form --(...)
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: SS Common (show other bugs)
Version: 3.16-dev
Hardware: All All
: P2 normal with 2 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
: 64865 (view as bug list)
Depends on:
Blocks:
 
Reported: 2017-03-12 18:33 UTC by mewalig
Modified: 2023-10-13 12:27 UTC (History)
3 users (show)



Attachments
bad.xlsx (29.56 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2017-03-21 15:31 UTC, mewalig
Details
Test cases for intersection, sumproduct and index. (13.78 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2023-10-13 12:10 UTC, Jakub
Details

Note You need to log in before you can comment on or make changes to this bug.
Description mewalig 2017-03-12 18:33:59 UTC
The following formula works fine in POI (using version 3.15):

=SUMPRODUCT(X17:X1048576,--(O17:O1048576>=30))

However, reversing the order of the args, it fails with "Invalid arg type
for SUMPRODUCT: (org.apache.poi.ss.formula.eval.ErrorEval)":

=SUMPRODUCT(--(O17:O1048576>=30),X17:X1048576)

Either way works fine in Excel.

Also described at https://lists.apache.org/list.html?dev@poi.apache.org:2017-03
Comment 1 Dominik Stadler 2017-03-19 20:20:48 UTC
Do you have a sample file/sample code or unit-test that reproduces the problem? This would make it a lot easier for developers to reproduce and work on a fix.
Comment 2 mewalig 2017-03-21 15:29:30 UTC
It's as simple as evaluating a sumproduct whose first arg starts with --.

--- Sample files ---
 bad.xlsx: in A3, set formula to
     =SUMPRODUCT(--(B5:B20))

--- Test results ---
Running the sample code below, you will get:
  java sampleApp good.xlsx
    <prints "No error">
  java sampleApp bad.xlsx
    <prints "Invalid arg type for SUMPRODUCT:(org.apache.poi.ss.formula.eval.ErrorEval)">

--- Sample code ---

    private static final String FILE_NAME = "test.xlsx";

    public static void main(String[] args) {
        try {
            FileInputStream excelFile = new FileInputStream(new File(FILE_NAME));
            Workbook workbook = new XSSFWorkbook(excelFile);

            // get the cell with the offending formula (cell A3 in sheet 0)
            Cell c = workbook.getSheetAt(0).getRow(2).getCell(0);

            FormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();

            evaluator.evaluateFormulaCell(c);

            System.out.println("No error");
        } catch(Exception e) {
            System.out.println("There was an error");
            System.out.println(e.getMessage());
        }
    }
Comment 3 mewalig 2017-03-21 15:31:07 UTC
Created attachment 34860 [details]
bad.xlsx

bad.xlsx as described in prior comment. It has no contents other than one formula in cell A3: =SUMPRODUCT(--(B5:B20))
Comment 4 mewalig 2017-03-21 15:33:50 UTC
Please ignore the reference to "good.xlsx" in comment 2. It was meant to be removed before it was sent.
Comment 5 Greg Woolsey 2017-03-21 17:57:50 UTC
Thanks for the simple test case.  This is deeper than just the SUMPRODUCT() function.  By the time it gets to evaluating the function, the argument has already evaluated to an error.

The problem is that the "Unary Minus" operator, "-", doesn't properly handle an array argument.  This is actually a problem with the underlying utility class OperandResolver, which needs to return an array instead of a single value when the input is an array for cases like this, but perhaps not for some other functions.

This will require some digging to fully define.  To start with, the unary functions should at least call a new operand static method that can return an array of values.  UnaryPlusEval has the same bug as UnaryMinusEval, although the plus and minus evaluation logic is slightly different according to the comments in the code.

It will take someone digging through the Excel function specs to see what other functions can silently operate on an array and return an array for use by array functions like SUMPRODUCT, and testing/checking the POI handling of each.

Patches are always welcome.  If one of my users indicates we need this it will end up in my queue, but otherwise it's a bit too big for me to tackle at the moment.
Comment 6 mewalig 2017-03-21 18:20:50 UTC
Thank you for the comment

Out of curiosity, why is the processing wrt the first argument different from that wrt subsequent arguments? Note that if we do:
   =SUMPRODUCT(A5:A20, --(B5:B20))

then even though the problematic argument is still there, it works just be relegating it to the position of argument 2 instead of argument 1.

Is it possible to take whatever process is being applied to arguments 2+ and apply that to argument 1?
Comment 7 Greg Woolsey 2017-03-21 18:49:28 UTC
The current logic is completely dependent on the internal class type of of the first argument.  As I mentioned, the problem is with the unary negation operator (minus).  I haven't tested it, but I suspect the second argument is actually evaluating to a single value instead of an array, and projecting that across the first argument array, rather than taking the pairwise product of arrays of equal size.  There won't be a simple fix.  

You are welcome to check out the source code and play with it.  If you come up with a fix and unit test, we'd love to incorporate a patch.
Comment 8 mewalig 2017-03-21 20:03:30 UTC
I will try to take a look. the main (at least initial) limitation on my ability to do so is that I'm pretty unfamiliar with the Java build process (I'm mostly a C programmer and stick with emacs + cc + make). 

Will have to do some research on how to get code cycle times down for this kind of endeavor. Any suggestions much appreciated.
Comment 9 mewalig 2017-03-22 02:51:17 UTC
If any developers are interested in fixing this for a $ bounty, please send me a proposal at mewalig@gmail.com
Comment 10 mewalig 2017-04-21 21:43:21 UTC
the plot thickens.

Using the below example, when cell A3 is changed to be:
    =SUMPRODUCT(--(B4:B6))

then, as before, evaluation of that throws an error.

However, when cell A3 contains
    =SUMPRODUCT(--(B3:B5))

then, unlike before, no error is thrown, but the resulting value is not correct and only appears to consider the cell B3.

Not sure if this indicates two separate issues that need to be fixed, but it seems odd that one throws an error and the other does not just by moving the referenced range up or down a row.
Comment 11 mewalig 2017-04-21 21:57:56 UTC
Looking at the syntax in https://msdn.microsoft.com/en-us/library/dd906358(v=office.12).aspx, then, assuming that "Unary Minus" is equivalent in the POI grammer as "sign" in the MSFT site's grammar, it seems possible that the grammar in POI should not be classifying the leading "-" as "Unary Minus", and instead should be classifying it as whatever in the POI grammar is equivalent to "prefix-operator".

See in particular the line that reads:

  nospace-argument-expression = "("  expression  ")" / constant / prefix-operator argument-expression ...


in other words, "--(xx)" should be parsed as


  prefix-operator argument-expression
                    |
                  prefix-operator argument-expression
                                    |
                                  "("  expression  ")"


(note that I've ignored the difference between nospace-argument-expression and argument-expression and treated them as equivalent for brevity)

Would you agree that the expression should parsed differently, as shown above? I'm not familiar enough with how POI implemented the grammar to know whether there is some other way to deal with this which is different from how the MSFT site's grammar works... but I would think that the formula parsing is hairy enough that it would be optimal to use the exact same grammar as specified in the MSFT site. It might be a fundamental change to make, but there are a number of other formula-parsing bugs in POI that would likely be more easily addressed taking that approach.
Comment 12 mewalig 2018-02-02 17:54:41 UTC
Calling all coders who want more money: who is willing to fix this bug for a bounty? 

I have to admit I am surprised that there hasn't been more attention being paid to what appears to be a fundamental problem in how this library parses formulas, which not only doesn't work, but also doesn't notify the user that it doesn't work, and instead just sends back bad data. If nothing else, there should be raised some Exception raised saying "I can't calculate this correctly, so I'm returning an error". Otherwise how can anyone take this library seriously for any "real" work?

Also, why is this bug's status set to "NEEDINFO"? We already have all the info needed to reproduce. I get it that the solution isn't found yet, but "NEEDINFO" makes it sound like it's not ready for a developer to look at, which is not the case.
Comment 13 PJ Fanning 2022-02-21 10:13:38 UTC
*** Bug 64865 has been marked as a duplicate of this bug. ***
Comment 14 Jakub 2023-10-13 12:10:52 UTC
Created attachment 39124 [details]
Test cases for intersection, sumproduct and index.

I've been able to reproduce the bug, looked at the code and played a bit with similar related cases. The problem is not in the SUMPRODUCT function nor in the unary minus. It is about implementation of implicit intersections in POI. Here is some introduction to the intersections:
https://www.ablebits.com/office-addins-blog/excel-implicit-intersection/

The logic when implicit intersection happens and when not is quite complex. Here's what I've found out using Excel 2010:
- In a "normal mode" common arithmetic operations causes implicit intersections. For example assume there are values 1 to 5 in cells A1 to A5. Then the formula "=2*A1:A5" evaluated in the B2 cell returns the value of A2 multiplied by 2, i.e. 4.
- If a formula is saved using Ctrl+Shift+Enter it is stored as array formula (aka CSE formula, in excel it is wrapped in curly brackets). Implicit intersection is suppressed in CSE formulas. Let's call it an "array mode". The whole CSE formula is evaluated in array mode.
- Some functions changes the mode in which their arguments are evaluated, in particular:
-- SUMPRODUCT will force the array mode on all its arguments. In above example the formula "=SUM(2*A1:A5)" in the B2 cell returns 4 (implicit intersection happens) but "=SUMPRODUCT(2*A1:A5)" returns 30 as it doesn't intersects and sums the range multiplied by 2.
-- INDEX function will force the array mode on it's first argument. Interestingly it also switches the evaluation of its second and third argument to the normal mode. That is true if INDEX is used inside SUMPRODUCT. If INDEX is used in a CSE formula, no switching happens. Implicit intersection is suppressed on all evaluations in CSE formula.
- Some functions which also accept array arguments just keep the mode as it was, for example functions MIN or MAX functions. 

Now, POI seems to work well in the array mode (in CSE formulas) as well as in the normal mode. What is not implemented is switching of the mode by certain functions (SUMPRODUCT in this case). Note there is a piece of code attempting to do this. The org.apache.poi.ss.formula.functions.ArrayMode interface is clearly intended for that. See where it is used in the org.apache.poi.ss.formula.WorkbookEvaluator. However doesn't implement the logic above correctly. Also the INDEX is the only function implementing the ArrayMode interface.

This is what I've found out so far. It is demonstrated in the attached workbook. There is a column in the workbook showing where POI would give wrong results. It is not just the SUMPRODUCT, INDEX function alone also can give errors.

If anyone thinks it works differently or knows about some formal specification or documentation please comment.

I have some ideas of how to fix it. Seems to resolve the bugs in the attached workbook however there is lot more to investigate and test...
Comment 15 PJ Fanning 2023-10-13 12:27:59 UTC
I think it was a big mistake for POI to start evaluating formulas. It is too complicated and there are not enough devs interested. It doesn't just take someone to submit a change. It takes multiple people to review it.


Releases are irregular.

Does anyone need POI to calculate formulas? The answer is no. Excel can be made to recalculate the formulas when you load the xlsx into Excel.

See https://poi.apache.org/apidocs/dev/org/apache/poi/xssf/usermodel/XSSFWorkbook.html#setForceFormulaRecalculation-boolean-