Bug 48292 - [PATCH] Support of array formulas
Summary: [PATCH] Support of array formulas
Status: RESOLVED WONTFIX
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall (show other bugs)
Version: unspecified
Hardware: PC Windows XP
: P2 normal with 6 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks: 46989
  Show dependency tree
 
Reported: 2009-11-26 07:10 UTC by Petr.Udalau
Modified: 2015-03-22 14:32 UTC (History)
1 user (show)



Attachments
Set/remove array formulas patch. (29.82 KB, text/plain)
2009-11-26 07:10 UTC, Petr.Udalau
Details
reworked patch (27.77 KB, text/plain)
2009-11-30 18:39 UTC, Josh Micich
Details
Final patch (139.15 KB, patch)
2009-12-02 05:54 UTC, Petr.Udalau
Details | Diff
Tests for array formulas (66.60 KB, application/zip)
2009-12-02 05:55 UTC, Petr.Udalau
Details
another draft of the patch (176.11 KB, patch)
2009-12-09 00:16 UTC, Josh Micich
Details | Diff
added test files (51.96 KB, application/gzip)
2009-12-09 00:20 UTC, Josh Micich
Details
last reworked patch (322.32 KB, text/plain)
2009-12-18 06:38 UTC, Petr.Udalau
Details
new test data (52.26 KB, application/zip)
2009-12-18 06:39 UTC, Petr.Udalau
Details
last reworked patch with test data (129.66 KB, application/zip)
2009-12-21 07:36 UTC, Petr.Udalau
Details
File from Excel 2003 that can not be re-save using POI due to Array Formula In Cell B37 (88.00 KB, application/vnd.ms-excel)
2011-12-08 15:20 UTC, Dugas du Villard
Details
28053: File from Excel 2003 that can be re-save using POI No Array Formula in this one (89.00 KB, application/vnd.ms-excel)
2011-12-08 15:21 UTC, Dugas du Villard
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Petr.Udalau 2009-11-26 07:10:27 UTC
Created attachment 24623 [details]
Set/remove array formulas patch.

There is first patch with implementation of set/remove array formulas.

In the nearest future I will prepare patches with evaluation of array formulas and modifications for functions that supports array formulas and tests for all.
Comment 1 Josh Micich 2009-11-30 16:25:38 UTC
This was originally attachment ID 24632 that was lost as part of the issues.apache.org data loss on 2009-11-26/27.

Note that this attachment ID has since been re-used.

The original attachment comment was:
Some improvements to the original patch

Thanks for your work on array formulas.  This will be an important addition,
since this area is barely covered by POI at the moment (as I'm sure you're
aware).

There are a few issues with the rest of the patch, so I have re-worked what I
can and attached a new draft of the patch (based on the latest svn r885007).
Some of my fixes might be wrong, so feel free to change anything to improve the
patch.

A very small amount of your first patch was applied in svn r885006.

The biggest outstanding concern is the lack of test cases.  Please write
something that calls the new top-level APIs.  Try to make the tests execute the
newly changed/added code in all the lower classes.

In XSSFCell you wrote "if (isPartOfArrayFormulaGroup() && f == null)"
suggesting that it is possible for "isPartOfArrayFormulaGroup() to return true
with f!=null.  It seems more logical that this condition would represent a
corrupted spreadsheet.

According to OOO docs the formula parsing rules are slightly different for
array formulas.  You should use FormulaType.ARRAY, and we'll probably need to
update FormulaParser accordingly.

ConstantValueParser - it seems that there are now two alternate ways to encode
array element values of type text (with String and UnicodeString).  It would be
much better if such an ambiguity was not introduced into the code.  In
addition, UnicodeStrings can contain a list of FormattingRuns which is
something that is not applicable to array formulas (as far as I understand).
According to the current junit tests, this code is not executed anyway.

FormulaRecordAggregate - field _sharedValueManager can never be null so there
is no need to check for that.  In general don't write null checks for things
that are asserted or documented as being never null - it raises doubts as to
whether the check might really be required.

In CellRangeAddress.valueOf() you added code which presumably allows one to
write valueOf("A") instead of valueOf("A:A") but it is not clear how this is
useful or relevant.

Try to keep classes encapsulated - for example the ArrayRecords list should not
spill out of SharedValueManager, and SharedValueManager should not spill out of
FormulaRecordAggregate.

Please don't use your own internal user-id for @author tags.  Full name in
plain text is best.

I also changed some method names and javadoc on the public API for clarity.
Comment 2 Josh Micich 2009-11-30 18:39:51 UTC
Created attachment 24645 [details]
reworked patch

The patch I previously attached doesn't seem to be linked here anymore, so I am re-submitting it.

This new version of the patch has been rebased to svn r885585.
Comment 3 Petr.Udalau 2009-12-02 05:53:11 UTC
Thanks for reworked patch.
We have made final patch of array formula support and we will waiting for your review.

Notes:

1) About "if (isPartOfArrayFormulaGroup() && f == null)" in XSSFCell:
I saved Excel document with array formula and when I had opened the xml of sheet I saw that formula was set only for the first cell of array formula range and other cells contained only calculated result.

2) About parsing of array formulas:
I changed OperandClassTransformer. But may be It is not solution for this problem.

3) About CellRangeAddress.valueOf():
This is the case when the array formula is set into one cell. Then the Excel save the "ref" param of formula like a "A1"(It is only for one-cell array formula, when we set array formula into several cells "ref" param will contain range like "A1:B1").

4) About evaluation of array formulas:
There are two types of array formulas:
  1.Some formulas (for example "SQRT({1,2;3,4})") can take array arguments and have to be evaluated in loop with different args like a simple function.
  2.Functions that have to be evaluated in special mode when it is in array formula.

5) About optimized CHOOSE and IF:
Optimization IF/ Choose  was made by analyze of  first argument and evaluated only expression, represented by second argument in true case or by third argument in false case.
In Array Formula Context first argument may be Boolean array which contains:
-only true values
-only false values
-mixed true and false values.
Method checkBooleanContent() provides analyze, which case we have.
Fist two cases can be optimized in the same manner as usual IF/Choose function.
Third case does not allow such optimization and requires evaluating both expression and only then  choosing right value from evaluated arrays according  Boolean array value. It means that we need to restore “non optimized” way to evaluate such IF/Choose function. Restore of “non optimized” way lead to ignoring of AttrPtg elements, which provide skipping no needed evaluation during optimized way.

Due to  that IF/Choose function may be nested,   decision of optimized/non optimized  is concern only for current IF/Choose function, need to keep in such function stack and be restored after  finishing of  evaluating current IF/Choose.

6) There will suitable to have common interface for Function and FreeRefFunction.

Do you have any ideas and proposals?

P.S. "test/resources" must be a source folder to run tests. TestArrays - main test.
Comment 4 Petr.Udalau 2009-12-02 05:54:26 UTC
Created attachment 24658 [details]
Final patch
Comment 5 Petr.Udalau 2009-12-02 05:55:21 UTC
Created attachment 24659 [details]
Tests for array formulas
Comment 6 Josh Micich 2009-12-09 00:16:06 UTC
Created attachment 24682 [details]
another draft of the patch

Thanks for all your work so far but I'm afraid it's going to take a few more iterations before we are able to commit this patch.  Please take a look at my latest reworked version. I have made several changes which are described below.  There are still outstanding issues that need to be addressed.  A very small amount of work has been submitted (svn r888577, svn r888582, svn r888665).  These changes don't add any functionality; they just resolve existing issues and make the pending patch simpler.

Firstly, I've added Apache License notices to each new file.  It's important that you agree to this, because we cannot take code contributions without it.
The tests you provided were in a package "com.exigen...". These have been moved to the appropriate POI packages.
The tests needed to be converted to junit 3.8 because that's what POI uses (changing junit versions is probably something we'd decide to do project wide).  The test classes were re-named to follow POI conventions, though the names could still be more descriptive.
POI expects tests to be silent (not write to std-out or any log) when successful.
Test code should not throw exceptions to be caught by other test code during successful tests.  The only exceptions that should be thrown are ones deliberately caused in the production code being tested.
One test method for every example spreadsheet cell is probably too many.  Some simplification was done, but consider using techniques like those in TestFormulasFromSpreadsheet and TestIndexFunctionFromSpreadsheet etc.

General coding/formatting guidelines:
  - please use unix line endings (LF instead of CRLF)
  - don't mix indent style (tabs or spaces) in the same file
  - always use {} for the bodies of if/else for/while statements
  - don't assign to parameters
  - turn on IDE warnings for unnecessary code

I made significant simplifications to ArrayEval (while keeping all your junit tests working).
I got rid of ArrayEval.illegalForAggregation because there is a much simpler way to get #N/A out of formulas with array size errors:  The #N/A value originates in ArrayFormulaEvaluatorHelper.getArrayValue() (new method), and just flows up the evaluation stack through standard techniques.
The method ArrayEval.arrayAsArea() is also not needed (due to introduction of TwoDEval).  TwoDEval simplifies a lot of the code you needed to touch in existing function implementations.

There are also some problems in your interpretation of how array formulas work.  I am pretty sure that at least one of the test cases is asserting the wrong result (testOffsetArray() - ConstantArray.xls C149).  I have added a new test class TestArrayEvaluationExtra which shows what the correct behaviour should be.  The new failing test cases are currently disabled.  While investigating, I have found out that array eval elements cannot be cell references or areas.  So in this example, OFFSET cannot return an array of cell references.  I've disabled ArrayEval from doing this, and made other fixes to keep all test cases working.  The solution will probably involve getting the RVA types of the tokens correct during the formula parsing phase.  Read the OOO documentation for an introduction to this.  Unfortunately I think that information is incomplete, but it is the best resource I know of.  We will probably have to discover all the missing details for ourselves.

I also found that using the "Evaluate Formula" button (available in Excel 2007) helped me understand how some of the example formulas should be evaluated.  

The short-circuit-if evaluation stuff is probably not needed.  The first parameter to IF has type 'V' and I have a feeling that any parameter of type 'V' should always have a scalar ValueEval passed at runtime.  If you pass an array for a parameter of type 'V' then the whole function gets invoked once for each item in the array.  You have already written some code (prepareArgsForLoop etc) that does this.  It's just a question of having it be invoked at the right time.

There are a few other problems in the code:  Don't instantiate ArrayEval until you have its contents calculated.   (i.e. don't create half-initialised objects).  FunctionWithArraySupport.supportArray() looks like it is needed for answering questions about function parameters which are probably resolvable by the existing FunctionMetadata.  Similar problems with ArrayMode - it's probably providing an incomplete solution to something that can perhaps be solved better using proper RVA token classification.


Please review the re-worked patch and address these issues as best you can.
Comment 7 Josh Micich 2009-12-09 00:20:25 UTC
Created attachment 24683 [details]
added test files

The zip contains 6 files destined for test-data/spreadsheet/ (the standard location for test sample spreadsheet files).
Comment 8 Petr.Udalau 2009-12-18 06:36:34 UTC
Josh,
We appreciate your time and effort to integrate our stuff.

>Thanks for all your work so far but I'm afraid it's going to take a few more
>iterations before we are able to commit this patch.
We understand it and ready to continue our work to match your requirements.


>The tests needed to be converted to junit 3.8 because that's what POI uses
>(changing junit versions is probably something we'd decide to do project wide).
> The test classes were re-named to follow POI conventions, though the names
>could still be more descriptive.
>POI expects tests to be silent (not write to std-out or any log) when
>successful.
>Test code should not throw exceptions to be caught by other test code during
>successful tests.  The only exceptions that should be thrown are ones
>deliberately caused in the production code being tested.
>One test method for every example spreadsheet cell is probably too many.  Some
>simplification was done, but consider using techniques like those in
>TestFormulasFromSpreadsheet and TestIndexFunctionFromSpreadsheet etc.
We reworked our tests to fit your requirements.

>General coding/formatting guidelines:
>  - please use unix line endings (LF instead of CRLF)
>  - don't mix indent style (tabs or spaces) in the same file
>  - always use {} for the bodies of if/else for/while statements
>  - don't assign to parameters
>  - turn on IDE warnings for unnecessary code
We are in process of improving our code.

>I made significant simplifications to ArrayEval (while keeping all your junit
>tests working).
>I got rid of ArrayEval.illegalForAggregation because there is a much simpler
>way to get #N/A out of formulas with array size errors:  The #N/A value
>originates in ArrayFormulaEvaluatorHelper.getArrayValue() (new method), and
>just flows up the evaluation stack through standard techniques.
>The method ArrayEval.arrayAsArea() is also not needed (due to introduction of
>TwoDEval).  TwoDEval simplifies a lot of the code you needed to touch in
>existing function implementations.
The above is reasonable improvement. Thanks.

>There are also some problems in your interpretation of how array formulas work.
> I am pretty sure that at least one of the test cases is asserting the wrong
>result (testOffsetArray() - ConstantArray.xls C149).  I have added a new test
>class TestArrayEvaluationExtra which shows what the correct behaviour should
>be.  The new failing test cases are currently disabled.  While investigating, I
>have found out that array eval elements cannot be cell references or areas.  So
>in this example, OFFSET cannot return an array of cell references.  I've
>disabled ArrayEval from doing this, and made other fixes to keep all test cases
>working.  The solution will probably involve getting the RVA types of the
>tokens correct during the formula parsing phase.  Read the OOO documentation
>for an introduction to this.  Unfortunately I think that information is
>incomplete, but it is the best resource I know of.  We will probably have to
>discover all the missing details for ourselves.
The issue you have raised is really exists. We spent quite a time researching it. 
1.	Behavior of function chain based on array of references (could be return by function offset(), index() etc.) is different in Excel 2003 and Excel 2007 (I mean engines, not formats). For example =SUM(SUM((OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1}))) returns 5 in Excel 2003 and #N/A in Excel 2007. In excel 2003 =average((OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1})) returns 4 but =sum(average((OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1}))) returns 2.5. We think that reason is in inconsistency of excel concepts. We decided to stick to Excel 2007 logic. First of all we assume that offset() returns array of references – if create an array formula {=OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1})} on two cells, it would fill it with 4 and 1 correspondingly. The behavior we should match is the following one: 
a.	=SUM(OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1}))   ---->    4
b.	=SUM(SUM(OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1})))   ---->    #N/A!
(It is very strange result, result of functions have to depend on only args)
c.	=SQRT(OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1}))  ---->    2
d.	=SUM(SQRT(OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1})))  ------->  #Value!
e.	=SQRT(SUM(OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1}))) --------->  2
f.	=SUM(SQRT(SUM(OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1})))) --------> #N/A!
2.	The logic above mathematically seems strange but need to emulate it. To do this we add componentError as element on ArrayEval which contains „future” error to be exposed by aggregation function used second time.
3.	We consider ArrayEval as „universal” container which is transfered via evaluation stack and need to contain results of any types including references. Thus OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1})) returns array of references (ArrayEval contains AreaEvals) and {OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1}))} on two cells returns 4 and 1. We have removed your validation from ArrayEval.
 

>I also found that using the "Evaluate Formula" button (available in Excel 2007)
>helped me understand how some of the example formulas should be evaluated.  
Thanks for the hint

>The short-circuit-if evaluation stuff is probably not needed.  The first
>parameter to IF has type 'V' and I have a feeling that any parameter of type
>'V' should always have a scalar ValueEval passed at runtime.  If you pass an
>array for a parameter of type 'V' then the whole function gets invoked once for
>each item in the array.  You have already written some code (prepareArgsForLoop
>etc) that does this.  It's just a question of having it be invoked at the right
>time.
We need this change because IF first parameter is array. In this case IF optimization not always accepatble. We do iteration over array but in non-optimized manner.

>There are a few other problems in the code:  Don't instantiate ArrayEval until
you have its contents calculated.   (i.e. don't create half-initialised
objects).
We’ll look at it.
>  FunctionWithArraySupport.supportArray() looks like it is needed for
>answering questions about function parameters which are probably resolvable by
>the existing FunctionMetadata.  Similar problems with ArrayMode - it's probably
>providing an incomplete solution to something that can perhaps be solved better
>using proper RVA token classification.
Our investigation showed that RVA types from FunctionMetadata is not enough for decision if function accept array as argument or we need to iterate over it. V and A types are OK but problem is with R type. For example for IF function is
 1      IF      2       3       R       V R R          
But function logic requires second/third parameters needed to be iterated (as scalar values). The same is true for COLUMN, CHOOSE, VLOOKUP, HLOOKUP. Thus, for the moment we stick to our approach.
There is some unnecessary code which is commented. This code is represents that specificaiotns differs from our interfaces.

There is also problem with determining return type of formula for example for formulas:
=SUM(IF(A67:B68>2;A67:B68;1))
=SUM(OFFSET(A149:B150,{1,0},{1,0},{1,1},{1,1}))

OFFSET	metadata
78	OFFSET	3	5	R	R V V V V	x

According to the specification: arguments that marked "V" in specification represented like an array and "R"-arguments represented like an area eval. Return type of functions is "R" in specification but OFFSET returns single value for SUM(by preparing args for one itaration of loop) and IF returns array for SUM.

So, we make one-week investigation and you can see result. Our patch is raw but set/remove functionality is done, there is some problems with evaluation but we can't find right specification how to evaluate formulas(and even Excel have problems with evaluations of array formulas)

We will waiting for your response and your point of view on the problem of array formulas evaluation in Excel. We have to determine way of evaluation in POI: we have to imitate Excel calculations or find some rules of calculations of formulas(may be using RVA specification).

Regards.
Comment 9 Petr.Udalau 2009-12-18 06:38:37 UTC
Created attachment 24731 [details]
last reworked patch
Comment 10 Petr.Udalau 2009-12-18 06:39:18 UTC
Created attachment 24732 [details]
new test data
Comment 11 Petr.Udalau 2009-12-21 07:36:15 UTC
Created attachment 24746 [details]
last reworked patch with test data

Updated INDEX function and restructured tests.
Comment 12 Petr.Udalau 2010-08-06 09:59:48 UTC
Hi,
Our team have made investigation and implement new functionality before. And we need response about our patch with array formulas in POI. We are waiting since 2009 year. So what about that?
Comment 13 Yegor Kozlov 2010-08-09 06:10:05 UTC
Petr,

Thanks again for your contribution. The patch was partially applied in Dec 2009 but there are still issues to address. I think we will need at least one iteration to sort them out. 

The patch consists of two big parts that were splitted between me and Josh Micich.

 (1) usermodel API for arrays formulas (me). It is way more complex than just  setter methods. The API should be consistent and mimic Excel - it should block changing cells included in multi-cell arrays, prevent data corruption, etc.  This part is mostly done, see the history in r894469, r893870 and r893625. 

Most of usermodel tests are consolidated in org.apache.poi.ss.usermodel.BaseTestSheetUpdateArrayFormulas which is used both by HSSF and XSSF. 

 (2) evaluation of array formulas (Josh). This is the hardest part and it is still in progress. Let's wait feedback from Josh on how much remains to be done. 

Regards,
Yegor
Comment 14 Dugas du Villard 2011-12-08 15:20:07 UTC
Created attachment 28053 [details]
File from Excel 2003 that can not be re-save using POI due to Array Formula In Cell B37
Comment 15 Dugas du Villard 2011-12-08 15:21:25 UTC
Created attachment 28054 [details]
28053: File from Excel 2003 that can be re-save using POI No Array Formula in this one
Comment 16 Dugas du Villard 2011-12-09 07:31:41 UTC
Comment on attachment 28053 [details]
File from Excel 2003 that can not be re-save using POI due to Array Formula In Cell B37

Hi,

I have a problem with POI due probably to array formula. It's why I'm posting
the probleme in this bug.

I have attach 2 xls files. One contain array formula (in cell B37..), the other
not. 
I get an error "java.lang.RuntimeException: Failed to find a matching shared
formula record" when I save the file containing array formula using POI
(method: workbook.write) If I save the file using Excel 2010, and then pass it
to POI, it's ok. But if it is saved from Excel 2003, there is a problem.

Is there a fix for this? Is it link to the bug describe here?

Thanks.
Christophe.
Comment 17 Dugas du Villard 2012-03-08 09:06:06 UTC
Hi,

What's up about this bug?

Thanks.
Christophe.
Comment 18 Yegor Kozlov 2012-03-11 07:26:55 UTC
Unfortunately this patch is still not applied. I hope to find time to check it in in the next POI-3.9 cycle.  

Yegor

(In reply to comment #17)
> Hi,
> 
> What's up about this bug?
> 
> Thanks.
> Christophe.
Comment 19 Dugas du Villard 2012-07-26 09:48:28 UTC
Hi,

I'm sorry to bother you with this functionnality.
But I realy need it in my project.

What's the news?
Is there a patch or some code I can add in my soft to fix it without waiting for a new release?

Thanks.
Christophe
Comment 20 Yegor Kozlov 2012-07-26 14:28:04 UTC
> 
> What's the news?

there is no news which means that the patch is not applied.

The patch needs some work and no one yet volunteered to work on it. 

> Is there a patch or some code I can add in my soft to fix it without waiting
> for a new release?
> 

The patch is attached, go ahead and apply it to trunk. 

Yegor
Comment 21 coppertan.au 2012-12-17 04:22:16 UTC
Hi Dugas,

  Did you try applying the patch successfully?  If so, which version of POI did you use?
Comment 22 Dan Peebles 2014-05-08 16:30:17 UTC
Any updates? Not having this prevents me from using POI usefully, and it seems like a pity to let all this effort languish for so long (~5 years now?)
Comment 23 Yegor Kozlov 2014-05-12 07:33:49 UTC
(In reply to Dan Peebles from comment #22)
> Any updates? Not having this prevents me from using POI usefully, and it
> seems like a pity to let all this effort languish for so long (~5 years now?)

unfortunately it is not that easy.

The patch in its current form is not applicable. It breaks unit tests and violates the API. I tried to adapt it but had to give up because of lack of time. We need someone to work it, otherwise it will stay in Bugzilla for good.  If you are willing to contribute then please do work on it!
Comment 24 Dominik Stadler 2015-03-22 14:32:47 UTC
No update on this for a long time, likely some things already work nowadays, so I think it best to report fresh bugs for anything that is still missing.