Bug 61841 - Unnecessary long computation when evaluating VLOOKUP on all column reference
Summary: Unnecessary long computation when evaluating VLOOKUP on all column reference
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: SS Common (show other bugs)
Version: 3.15-FINAL
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-01 14:00 UTC by Luca Martini
Modified: 2019-03-30 19:39 UTC (History)
4 users (show)



Attachments
Simple testcase (12.82 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2017-12-01 14:00 UTC, Luca Martini
Details
PoiFormulaChecker class demonstrating problem (2.24 KB, text/x-java)
2019-03-25 18:45 UTC, Travis Burtrum
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Luca Martini 2017-12-01 14:00:29 UTC
Created attachment 35573 [details]
Simple testcase

Hi all,
I spotted a problem in the evaluation of VLOOKUP function when the table area is defined as all column reference.

Basically, when a "all column" reference is used, the evaluation code treats the reference as if it's applied to the maximum number of rows available for the workbook version.

In practice every reference as a $C:$D is translated as a $C$1:$D$1048576.
When such references are used a table argument for a VLOOKUP (exact match) and the value to be searched is not found, the evaluation code loops for 2^20 times.

A proposed fix could be to translate the "all column" reference to a reference that goes from row 1 to the maximum row of the considered sheet.

I attached a simple xlsx workbook that takes a *very long* time to evaluate about 300 VLOOKUPs.

A fix could be to enhance the logic of OperationEvaluationContext.getDynamicReference to translate the reference to a narrow area.
I am available for any clarification.

Best regards,
    Luca
Comment 1 Nick Burch 2017-12-01 14:05:27 UTC
Would you be able to put together a patch with the improved logic in?

See http://poi.apache.org/guidelines.html and http://poi.apache.org/howtobuild.html and http://poi.apache.org/subversion.html for more on building and contributing!
Comment 2 Luca Martini 2017-12-01 15:35:34 UTC
Nick,
I have already contributed some patches in the past. 
However I am not sure to have the time to work to this fix at the moment.
I opened the bug for future reference and to check if someone with more knowledge about evaluators could help.
In fact, at a first glance, the current workbook is not easy recoverable from the WorkbookEvaluator without changing its interface and I am not sure that is safe and sound to do so.
Comment 3 Javen O'Neal 2017-12-01 18:05:44 UTC
Looking at the POI implementation of Vlookup.java, the class currently doesn't have a reference to the sheet that contains the lookup table. This will make a fix non-trivial.

The function that creates the AreaEval that is passed to the evaluate function might have a reference to the lookup sheet.
Caveats: lookup table may be contained in a different workbook or sheet than the cell that contains the formula that is being evaluated.
Comment 4 Greg Woolsey 2017-12-05 19:53:35 UTC
It looks to me like the sheet is available:

Vlookup.evalutate() receives a ValueEval that is converted to a TwoDEval via LookupUtils.resolveTableArrayArg().

TwoDEval interface has only one base class implementing it, AreaEvalBase.

AreaEvalBase abstract class has 2 non-test implementations, cached and lazy.  In both cases actual cell ValueEval references are available, which means sheet/row/column eventually.  Seems to me we could augment the TwoDEval interface to indicate the last physical values for rows and columns along with the existing getWidth() and getHeight() methods.  These could be used when creating ValueVector objects for the lookup iterations to reduce the vector size to the maximum defined values, since no matches would exist outside those anyway.
Comment 5 Greg Woolsey 2017-12-06 00:19:12 UTC
Changes in r1817252

Interesting.  In a local test with the attached sample file, I found these results:

45s (second run) 
with current codebase issuing FormulaEvaluator.evaluateAll() on the workbook.

19s 
By just changing XSSFEvaluationSheet.getCell(row, col) to immediately return null if the row index > sheet.getLastRowNum() 

14.4s
when XSSFEvaluationSheet caches the value of getLastRowNum(), since it comes from a TreeMap.lastKey() which has to navigate the tree each time to find the last key.

12.1s
after optimizing the blank cell tracking a bit to know about the last row with data.

That's all without changing anything int he VLOOKUP evaluation and still iterating over the max # of rows per column.

Of this remaining time, about 2/3 is taken up in the formula evaluation caching and tracking mechanism.  Bypassing it for null cells causes test failures, which shows it is necessary, but relatively expensive.  It appears to try to optimize and minimize the "empty cell" rectangular regions it holds. but assumes processing by row then column.  That may be a memory/time optimization we want to consider allowing additional strategies for.

Note that this shortcut logic doesn't change the result of any methods, only avoids busywork that didn't apply to the "nonexistent cell" cases.

This doesn't optimize VLOOKUP directly, but is about 70% improvement sufficient?

Changing the VLOOKUP code itself is actually significantly more complex, because POI handles sheets by row internally, and columns are second-class constructs.  There is no easy way to determine the last row with data in a column other than iterating over all defined rows.  With these optimizations, the extra iterations should fail fast.
Comment 6 Javen O'Neal 2017-12-06 09:56:36 UTC
(In reply to Greg Woolsey from comment #5)
> 14.4s
> when XSSFEvaluationSheet caches the value of getLastRowNum(), since it comes
> from a TreeMap.lastKey() which has to navigate the tree each time to find
> the last key.

Side-note: should XSSFSheet cache the last row rather than iterating over the TreeMap for every call? Hopefully we could drop in a better Sorted map implementation, one that keeps a pointer to the first and last keys.
Comment 7 PJ Fanning 2017-12-06 10:52:36 UTC
Can we call the new method getLastRowNum instead of getlastRowNum?
Also, a lot of public methods have had their signatures changed. Do we need to keep the existing signatures too and deprecate the old versions of the methods?
Comment 8 Luca Martini 2017-12-06 14:28:19 UTC
(In reply to Greg Woolsey from comment #5)
> Changes in r1817252
> 

Thank you very much Greg.


> Of this remaining time, about 2/3 is taken up in the formula evaluation
> caching and tracking mechanism.  Bypassing it for null cells causes test
> failures, which shows it is necessary, but relatively expensive.  It appears
> to try to optimize and minimize the "empty cell" rectangular regions it
> holds. but assumes processing by row then column.  That may be a memory/time
> optimization we want to consider allowing additional strategies for.
> 
> Note that this shortcut logic doesn't change the result of any methods, only
> avoids busywork that didn't apply to the "nonexistent cell" cases.
> 
> This doesn't optimize VLOOKUP directly, but is about 70% improvement
> sufficient?

I think so. That's more or less the fix I had in mind.

> 
> Changing the VLOOKUP code itself is actually significantly more complex,
> because POI handles sheets by row internally, and columns are second-class
> constructs.  There is no easy way to determine the last row with data in a
> column other than iterating over all defined rows.  With these
> optimizations, the extra iterations should fail fast.

I know, and I think that with current state of the data structure, iterating over every defined row could be worse than your current solution.
Here we are still on POI 3.x, but it should not be difficult to integrate your changes in our forked version.

For me the bug is considered as resolved. I still do not change its status because others have still pending comments.

Best regards,
    Luca
Comment 9 Greg Woolsey 2017-12-06 23:55:05 UTC
(In reply to PJ Fanning from comment #7)
> Can we call the new method getLastRowNum instead of getlastRowNum?
> Also, a lot of public methods have had their signatures changed. Do we need
> to keep the existing signatures too and deprecate the old versions of the
> methods?

Argh, hate it when I miss things like that.  Yes, fixed it in r1817325.

The "public" methods here are either in classes marked in docs as "POI internal" or called only from methods similarly documented, so I think changing their signatures is fine.  I'm not aware of anyone trying to hook into formula evaluation down at the internal WorkbookEvaluator level.

I'm not opposed to duplicating and deprecating, but in this case I don't think it's needed.
Comment 10 Javen O'Neal 2017-12-07 01:30:10 UTC
Throw in some @Override's for good measure
Comment 11 Travis Burtrum 2019-03-25 15:38:49 UTC
Hello all,

We are upgrading from poi 3.17 to 4.0.1 and formula evaluation was broken, I used git bisect on poi and tracked it down to this line specifically:

https://github.com/apache/poi/commit/6390202491c3c77a77df3a1b4f2dc205ebc5b307#diff-e3bae629e723fe5bac4ba5d2e85451cbR69

It goes back to working if I comment that line out, or change _lastDefinedRow to _xs.getLastRowNum(), I changed it to instead in that if block do this to attempt to track the problem down:

throw new RuntimeException(String.format("XSSFBUG: would shortcircuit wrongly: rowIndex: %d, _lastDefinedRow: %d%n", rowIndex, _lastDefinedRow));

And the stack trace was:

Caused by: java.lang.RuntimeException: XSSFBUG: would shortcircuit wrongly: rowIndex: 40, _lastDefinedRow: 39

        at org.apache.poi.xssf.usermodel.XSSFEvaluationSheet.getCell(XSSFEvaluationSheet.java:72)
        at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateReference(WorkbookEvaluator.java:782)
        at org.apache.poi.ss.formula.SheetRefEvaluator.getEvalForCell(SheetRefEvaluator.java:48)
        at org.apache.poi.ss.formula.SheetRangeEvaluator.getEvalForCell(SheetRangeEvaluator.java:74)
        at org.apache.poi.ss.formula.LazyAreaEval.getRelativeValue(LazyAreaEval.java:51)
        at org.apache.poi.ss.formula.LazyAreaEval.getRelativeValue(LazyAreaEval.java:45)
        at org.apache.poi.ss.formula.eval.AreaEvalBase.getValue(AreaEvalBase.java:128)
        at org.apache.poi.ss.formula.functions.LookupUtils$ColumnVector.getItem(LookupUtils.java:100)
        at org.apache.poi.ss.formula.functions.LookupUtils.lookupIndexOfExactValue(LookupUtils.java:504)
        at org.apache.poi.ss.formula.functions.LookupUtils.lookupIndexOfValue(LookupUtils.java:483)
        at org.apache.poi.ss.formula.functions.Vlookup.evaluate(Vlookup.java:61)
        at org.apache.poi.ss.formula.functions.Var3or4ArgFunction.evaluate(Var3or4ArgFunction.java:36)
        at org.apache.poi.ss.formula.OperationEvaluatorFactory.evaluate(OperationEvaluatorFactory.java:144)
        at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateFormula(WorkbookEvaluator.java:534)
        at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateAny(WorkbookEvaluator.java:275)
        at org.apache.poi.ss.formula.WorkbookEvaluator.evaluate(WorkbookEvaluator.java:216)
        at org.apache.poi.xssf.usermodel.BaseXSSFFormulaEvaluator.evaluateFormulaCellValue(BaseXSSFFormulaEvaluator.java:56)
        at org.apache.poi.ss.formula.BaseFormulaEvaluator.evaluate(BaseFormulaEvaluator.java:110)

Clearly the number of rows in the spreadsheet is increasing without _lastDefinedRow being updated (as it's only set once in the constructor, and only updated on a call to clearAllCachedResultValues() which never happens).  I have tried and failed to reproduce this with a small standalone example though, I was hoping maybe someone else can spot the problem.

Thanks much,
Travis
Comment 12 Travis Burtrum 2019-03-25 18:45:48 UTC
Created attachment 36497 [details]
PoiFormulaChecker class demonstrating problem

Here is a minimal testcase showing the bug which seems to only affect XSSF.

when ran with HSSFWorkbook this is printed:

evaluatedCell: org.apache.poi.ss.usermodel.CellValue [39.57]
impl: HSSFWorkbook, maxRow: 4, elapsed: 209ms, 0s, 0min

when ran with XSSFWorkbook:

evaluatedCell: org.apache.poi.ss.usermodel.CellValue [#N/A]
impl: XSSFWorkbook, maxRow: 4, elapsed: 248ms, 0s, 0min

or with my patch to throw the exception instead of return null:

Exception in thread "main" java.lang.RuntimeException: XSSFBUG: would shortcircuit wrongly: rowIndex: 2, _lastDefinedRow: 1

	at org.apache.poi.xssf.usermodel.XSSFEvaluationSheet.getCell(XSSFEvaluationSheet.java:71)
	at org.apache.poi.ss.formula.OperationEvaluatorFactory.evaluate(OperationEvaluatorFactory.java:139)
	at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateFormula(WorkbookEvaluator.java:534)
	at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateAny(WorkbookEvaluator.java:275)
	at org.apache.poi.ss.formula.WorkbookEvaluator.evaluate(WorkbookEvaluator.java:216)
	at org.apache.poi.xssf.usermodel.BaseXSSFFormulaEvaluator.evaluateFormulaCellValue(BaseXSSFFormulaEvaluator.java:56)
	at org.apache.poi.ss.formula.BaseFormulaEvaluator.evaluate(BaseFormulaEvaluator.java:110)
	at PoiFormulaChecker.main(PoiFormulaChecker.java:47)
Comment 13 Dominik Stadler 2019-03-26 14:47:46 UTC
Thanks for the note, but commenting on a resolved issue has a high chance of being lost, better report a new issue with a mention of the previous issue and the information about how to rerproduce your problem.
Comment 14 Greg Woolsey 2019-03-30 19:39:07 UTC
This was fixed in trunk for bug #62993.  Not marking this as a duplicate, as the above comment should have been filed as a new issue referencing this issue.