|Summary:||[PATCH] Support for structured references with Excel tables.|
|Component:||XSSF||Assignee:||POI Developers List <dev>|
|Bug Depends on:|
|Bug Blocks:||58227, 55413, 57721, 61722|
Structured reference support for POI
Trunk patch adding support for Structured References in formulas, with unit tests
fix to reuse XSSFEvaluationWorkbook inside itself
avoid auto-boxing ints for row/column TreeTable lookups
re-use XSSFEvaluationWorkbook when expanding a shared formula
cache XSSFCellEvaluator instances and look up by hash key - much faster
adds a unit test file and method that make heavy use of structured reference formulas, for performance testing
lazily cache the cell cache key hash
Description daniel.livshen 2015-04-21 11:25:11 UTC
Created attachment 32671 [details] Structured reference support for POI Added new support for the Structured Reference syntax that was introduced in MS Excel 2007. You can find more information about structured references in : https://support.office.com/en-us/article/Using-structured-references-with-Excel-tables-F5ED2452-2337-4F71-BED3-C8AE6D2B276E The new patch adds the ability of parsing structured references (Only works in XSSF as in HSSF you don't have tables) and converting it to normal (and well implemented) Area Reference. Also added support for Indirect evaluations with structured references.
Comment 1 Greg Woolsey 2016-06-06 20:22:18 UTC
This is great, just what I needed for https://bz.apache.org/bugzilla/show_bug.cgi?id=57721
Comment 2 Javen O'Neal 2016-06-07 08:21:29 UTC
attachment 32671 [details] has merge conflicts against the trunk. Additional work will be needed to bring this back to something that can be committed to the trunk. A few more unit tests are probably also needed for the new functionality that were added to make sure we don't break structured references in the future.
Comment 3 Greg Woolsey 2016-06-07 16:23:43 UTC
I have this working against trunk, all tests pass, but no new tests yet. I'll work on a patch today and tomorrow with tests and some tweaks to the original patch. For one, it had company network specific changes to build.xml that leaked into the patch (I've removed them). There is some code that could benefit from turning a series of constants into an Enum, and it performs very slowly for large workbooks/large tables. I can see a couple of things that would be quick, small changes for some big gains. Again, with tests. First, though, I have to test against 3.12, as a library I use (Vaadin Spreadsheet) doesn't like something that changed in the API between 3.12 and trunk, as that's where my POC has to work for my day job :)
Comment 4 Nick Burch 2016-06-08 12:41:06 UTC
(In reply to GW from comment #3) > First, though, I have to test against 3.12, as a library I use (Vaadin > Spreadsheet) doesn't like something that changed in the API between 3.12 and > trunk, as that's where my POC has to work for my day job :) We've tried reaching out to the Vaadin Spreadsheet folks a few times now, but never had any response. If you have contacts there, please ping them to get in touch! (IIRC it was conditional formatting stuff we were most recently interested in collaborating on)
Comment 5 Greg Woolsey 2016-06-08 15:55:00 UTC
(In reply to Nick Burch from comment #4) > (In reply to GW from comment #3) > > First, though, I have to test against 3.12, as a library I use (Vaadin > > Spreadsheet) doesn't like something that changed in the API between 3.12 and > > trunk, as that's where my POC has to work for my day job :) > > We've tried reaching out to the Vaadin Spreadsheet folks a few times now, > but never had any response. If you have contacts there, please ping them to > get in touch! (IIRC it was conditional formatting stuff we were most > recently interested in collaborating on) I will do that. I'm lobbying to go all-in with POI and Vaadin Spreadsheet for the next big thing for my new gig. I've been a user of POI for over 10 years, and Vaadin for 5. If I get my way, we will be paying Vaadin customers and can have some say in product direction. For example, I found a bug in their conditional formatting stuff yesterday :) After I finish the patch unit tests, I'll be filing a bug for them about it. It may relate to Excel Table structured reference syntax in the conditional formatting formula, in which case they definitely should be talking to this team!
Comment 6 Greg Woolsey 2016-06-08 20:27:42 UTC
Created attachment 33927 [details] Trunk patch adding support for Structured References in formulas, with unit tests The new patch (same date as comment) is against TRUNK, and contains new passing unit tests to exercise the new code and notice new bugs. The original patch was missing some corner cases around special characters and escapes in table names, now covered by tests and handled by the code. (why does Excel allow named ranges and tables to start with a backslash!?) There are new comments for some existing and some new code that is there to improve execution performance at the cost of static cached metadata - changes to the underlying OOXML objects may not be seen unless the cached data is reset. There was already some of that, but it wasn't called out in the JavaDoc comments. Not likely in normal use, as most uses are either building or reading, but not building, reading, modifying, then calculating again, but since the underlying objects are exposed through public setters, it can happen.
Comment 7 Javen O'Neal 2016-06-09 02:33:41 UTC
Thanks for the quick turn around Greg! I'm slowly reviewing what you have. I committed StructuredReferences.xlsx in r1747482. I also added svn:eol-style=native to all the files in your patch so the diffs are shorter when reviewing these changes on a different OS. The import change on Match.java isn't needed. 1. Could you change String Table.isStructuredReference to a compiled Pattern for performance? 2. We're trying to move all junit3 tests to junit4 tests. Could you update TestStructuredReferences.java to use junit4 (org.junit.Test with @Test decorators, see TestXSSFFormulaParser.java for an example) 3. The changes to XSSFRowShifter looks like formulas with structured references cannot be row-shifted. Is that correct? If so, could you either fix that in your patch or open a new bug that depends on bug 57840? 4. Add a javadoc to XSSFWorkbook.getTable. This method rebuilds the table cache and returns a single item. If the cache isn't used elsewhere, then a non-caching linear search would be faster and use less memory.
Comment 9 Greg Woolsey 2016-06-09 16:37:31 UTC
(In reply to Javen O'Neal from comment #7) > Thanks for the quick turn around Greg! > I'm slowly reviewing what you have. I committed StructuredReferences.xlsx in > r1747482. > I also added svn:eol-style=native to all the files in your patch so the > diffs are shorter when reviewing these changes on a different OS. > The import change on Match.java isn't needed. > > 1. Could you change String Table.isStructuredReference to a compiled Pattern > for performance? > 2. We're trying to move all junit3 tests to junit4 tests. Could you update > TestStructuredReferences.java to use junit4 (org.junit.Test with @Test > decorators, see TestXSSFFormulaParser.java for an example) > 3. The changes to XSSFRowShifter looks like formulas with structured > references cannot be row-shifted. Is that correct? If so, could you either > fix that in your patch or open a new bug that depends on bug 57840? > 4. Add a javadoc to XSSFWorkbook.getTable. This method rebuilds the table > cache and returns a single item. If the cache isn't used elsewhere, then a > non-caching linear search would be faster and use less memory. Updated attachment has the requested changes. Thanks for the feedback, I've been a long time user but this is my first patch. Here are responses to your questions and comments: 1. Yes, changed. I missed this from the original patch I started from, thanks. 2. Done. The example test I picked turned out to be one not yet updated :) 3. Structured Reference syntax doesn't reference individual rows directly, so shifting things doesn't affect these. See the MS documentation link in the initial description. All structured reference syntax is to table columns and parts (#Data, #Headers, etc.) Formula evaluation dynamically constructs a 3DArea for the reference based on the current Table definition (I'm only caching references to the XSSFTable objects, not their start/end cell definitions). Since those objects already existed, my assumption was that inserting/deleting rows above a XSSFTable already updates the table start/end. If that's not the case, that's an existing bug - I haven't tried it. 4. Added JavaDoc
Comment 10 Javen O'Neal 2016-06-10 00:22:36 UTC
To make it easier to get this to a trunk-eligible state, I've created xssf_structured_references branch  in r1747607 so you and others can contribute unit tests and improvements with smaller diffs against this branch. I've applied your changes from attachment 33932 [details] in r1747612. When this is trunk-eligible, I'll merge this back with trunk.  https://svn.apache.org/viewvc/poi/branches/xssf_structured_references/
Comment 11 Javen O'Neal 2016-06-10 01:09:59 UTC
Moved getTable cache from XSSFWorkbook to XSSFEvaluationWorkbook to reduce the frequency of problems with a stale cache in r1747615-r1747616.
Comment 12 Greg Woolsey 2016-06-10 01:24:02 UTC
(In reply to Javen O'Neal from comment #11) > Moved getTable cache from XSSFWorkbook to XSSFEvaluationWorkbook to reduce > the frequency of problems with a stale cache in r1747615-r1747616. This kills performance - the reason I moved it to XSSFWorkbook in the first place was performance. Parsing and evaluating formulas on a moderately sized workbook (one table with 6 columns and 50,00 rows, one computed column) and a sheet with a small table of formulas referencing the data table (VLOOKUPs mostly) took over an hour on a P5 with 2GB allocated to the VM. Moving it to XSSFWorkbook reduced this to 6 minutes, which is still awful, the the problem at that point was the auto-boxing of rownums in all the row methods in XSSFSheet (15% of total time spent in Integer.compareTo). Lots of references via Google for performance tests with auto-boxing vs. primitives vs. explicitly referenced and retained primitive wrappers (Integer/Long). The table lookup was performed hundreds of thousands of times, as every cell evaluation created a new instance of XSSFEvaluationWorkbook, causing the cache to be rebuilt. Even just doing a linear search, if there are 3 or 4 tables, regardless of size, is hugely expensive because the XMLBeans package is terribly slow for repeated lookups. I tried to balance the inconvenience/problem of a stale cache with performance. I think, but haven't verified yet, that the real performance issue is that when evaluating all the cell formulas in a workbook, if multiple cells reference the same expression, the expression isn't cached, but recalculated over and over for all those cells. This is probably worst with range functions, when a column of cells do the same VLOOKUP over and over with different inputs against the same range. Especially if that range contains formula cells. All that to say, moving this back is going to have an extreme performance cost, and the real issues are deep. They probably deserve their own separate issues: 1. formula evaluations that require access to XML elements/attributes 2. XSSFSheet rownum auto-boxing 3. Formula evaluation intermediate result caching Eventually I'll try to capture just how many table lookups by name are done vs. how many formulas reference the table, to see how bad it is.
Comment 13 Javen O'Neal 2016-06-10 04:14:50 UTC
Could you generate and upload a workbook with either a lot of tables or large tables that demonstrates the performance problems? Unfortunately we don't have a meaningful way to track execution time in our continuous integration tests because of differing resources provided by the executors. Nonetheless, you could write a unit test that calculates the time spent evaluating structured references for A/B comparisons (such as the change from comment 11). Are tables being looked up on the XSSFWorkbook or the XSSFEvaluationWorkbook?
Comment 14 Javen O'Neal 2016-06-10 07:55:24 UTC
Fixed behavior of [#Totals] when Table does not have a Totals row and [#This Row] and [@] when rowIndex is outside the table start and stop rows. Added remaining unit tests for new functionality in r1747655. reintegrated xssf_structured_references branch to trunk in r1747657 and deleted branch in r1747657. What's remaining before this bug can be closed: * unit test for XSSFTable.findColumnIndex and getters added in  * unit test that demonstrates performance need to cache XSSFWorkbook.getTable(String name), either by decreasing the visibility of this method with @Internal or protected/package-private accessibility or by adding a method so that users can force a refresh of the cache if needed. * performance evaluation of any explicit boxing/unboxing to see if any speed increase can be gained without adding complexity to the code. * any other unit tests that I missed  https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFTable.java?r1=1747657&r2=1747656&pathrev=1747657
Comment 15 Nick Burch 2016-06-10 08:45:49 UTC
We do have a stated policy that an evaluator will cache for performance, and it's up to you as a user to notify it / clear it / create a new one if you go changing cells after you create the evaluator - https://poi.apache.org/spreadsheet/eval.html#Performance Could we extend / follow that principle to help with the caching/performance?
Comment 16 Javen O'Neal 2016-06-10 08:54:22 UTC
(In reply to Nick Burch from comment #15) > We do have a stated policy that an evaluator will cache for performance The problem is that Greg stated in comment 12 that there's a performance hit if caching isn't done in XSSFWorkbook, not XSSFEvaluationWorkbook. I am comfortable with single-use caching in XSSFEvaluationWorkbook, but uncomfortable with the same code in XSSFWorkbook.
Comment 17 Greg Woolsey 2016-06-10 16:54:00 UTC
(In reply to Javen O'Neal from comment #16) > (In reply to Nick Burch from comment #15) > > We do have a stated policy that an evaluator will cache for performance > The problem is that Greg stated in comment 12 that there's a performance > hit if caching isn't done in XSSFWorkbook, not XSSFEvaluationWorkbook. I am > comfortable with single-use caching in XSSFEvaluationWorkbook, but > uncomfortable with the same code in XSSFWorkbook. I would also prefer not to cache in XSSFWorkbook, but a new XSSFEvaluationWorkbook is created every time a cell is evaluated, rather than reusing the same instance for all cells in a calculation/evaluation run. In fact, the main method in XSSFEvaluationWorkbook, getFormulaTokens(EvaluationCell), creates a new instance rather than using itself! I don't see any shared state in that class or the parent class, so that seems a likely easy fix - just use itself in the call to FormulaParser.parse() rather than a new instance. See attached patch-57840-reuse-XSSFEvaluationWorkbook.txt for the tiny fix. Unit tests run a little faster with it for me, too.
Comment 18 Greg Woolsey 2016-06-10 16:55:11 UTC
Created attachment 33937 [details] fix to reuse XSSFEvaluationWorkbook inside itself
Comment 19 Javen O'Neal 2016-06-10 17:14:56 UTC
(In reply to Greg Woolsey from comment #18) > Created attachment 33937 [details] > fix to reuse XSSFEvaluationWorkbook inside itself Applied in r1747754.
Comment 20 Javen O'Neal 2016-06-10 18:52:04 UTC
(In reply to Javen O'Neal from comment #14) > What's remaining before this bug can be closed: > * unit test for XSSFTable.findColumnIndex and getters added in  Applied in r1747762 and r1747771.
Comment 21 Greg Woolsey 2016-06-10 20:37:36 UTC
(In reply to Javen O'Neal from comment #14) > What's remaining before this bug can be closed: > * performance evaluation of any explicit boxing/unboxing to see if any speed > increase can be gained without adding complexity to the code. I found adding the VM parameter -XX:AutoBoxCacheMax=1048704 improved performance of my slow test by 12% (can't share it yet, waiting for approval of my sanitized version of proprietary data). That's SpreadsheetVersion.EXCEL2007.getMaxRows() + 128 (-127>0 are also cached by Integer). That forces Integer to cache all the values up to max rows. Obviously don't do this unless are are also setting the VM heap size larger too. Since POI already uses lots of memory for documents, this won't be a problem for people who need it. I think with that documentation note somewhere the autoboxing issue can be considered dealt with for now.
Comment 22 Javen O'Neal 2016-06-10 20:59:34 UTC
Doing some Google searching, it looks like using new Integer(int) rather than Integer.valueOf (the explicit version of implicit auto-boxing) may use more memory due to duplicated objects wrapping primitives, but is faster because it doesn't look up lookup cached values to save this memory. Did you notice any performance difference changing the boxing on the code relating to this bug?
Comment 23 Greg Woolsey 2016-06-11 00:47:53 UTC
(In reply to Javen O'Neal from comment #22) > Doing some Google searching, it looks like using new Integer(int) rather > than Integer.valueOf (the explicit version of implicit auto-boxing) may use > more memory due to duplicated objects wrapping primitives, but is faster > because it doesn't look up lookup cached values to save this memory. > > Did you notice any performance difference changing the boxing on the code > relating to this bug? Setting the Integer cache size, as noted above, helped significantly. Attached is a patch to XSSFSheet and XSSFRow, to explicitly create new Integer instances everywhere ints were autoboxed for calls to the TreeMap instances in those classes. This adds another 4% boost over just the cache setting, it turns out. Then there is the big winner so far, reducing my test time by > 30%: cache XSSFCEvaluationSheet refs so we can cache XSSFEvaluationCell refs in a HashMap, using a custom lightweight key object (int row/col fields, using good hash practice). HashMap lookups by direct row/column (which is what we have down in the function evaluators) is way, way faster than TreeMap lookups. For my sample, with 7,274 formula cells, most of which include one or more VLOOKUP calls to a Structured Reference with 3,726 rows, this was a huge improvement.
Comment 24 Greg Woolsey 2016-06-11 00:48:56 UTC
Created attachment 33938 [details] avoid auto-boxing ints for row/column TreeTable lookups
Comment 25 Greg Woolsey 2016-06-11 00:50:54 UTC
Created attachment 33939 [details] re-use XSSFEvaluationWorkbook when expanding a shared formula
Comment 26 Greg Woolsey 2016-06-11 00:52:03 UTC
Created attachment 33940 [details] cache XSSFCellEvaluator instances and look up by hash key - much faster
Comment 27 Greg Woolsey 2016-06-11 01:05:39 UTC
I broke out my improvements today into individual patch diffs for easier inspection and consumption. The last is the biggest improvement, none are huge code changes. I'm still waiting for approval on my sanitized test data file. Hopefully I can upload that tomorrow. Currently takes 93 seconds still to evaluate all 7,274 formula cells. I wonder if it is worth having VLOOKUP and similar range functions track previously evaluated ranges, and if the same range shows up multiple times index or hash its values for future reference? Hmm.
Comment 28 Javen O'Neal 2016-06-11 02:31:50 UTC
Applied attachments from comment 24, 25, and 26 in r1747837, r1747840, and r1747838, respectively to trunk. Thanks for the thorough profiling and simple, standalone patches! Very easy to review.
Comment 29 Javen O'Neal 2016-06-11 03:19:33 UTC
Findbugs complained  about explicit boxing of the Integers and not typechecking before casting. Any way we can suppress these? This SO post  suggests either a edu.umd.cs.findbugs.annotations.SuppressWarnings annotation or adding filters to the findbugs configuration in build.xml.  https://builds.apache.org/job/POI/1336/findbugsResult/new/  http://stackoverflow.com/questions/1829904/is-there-a-way-to-ignore-a-single-findbugs-warning
Comment 30 Javen O'Neal 2016-06-11 03:52:16 UTC
Nevermind. I found https://svn.apache.org/viewvc/poi/trunk/src/resources/devtools/findbugs-filters.xml Looks like this is the place to do it.
Comment 31 Greg Woolsey 2016-06-11 06:30:48 UTC
Created attachment 33941 [details] adds a unit test file and method that make heavy use of structured reference formulas, for performance testing My latest patch attachment, patch-57840-heavy-structured-ref-formula-use-test.tar.gz, has my sample file (got approval for the sanitized version) and a test method to loop through and evaluate all the formula cells. It runs successfully, but still takes a while to process the 7,274 formula cells. most have IF() calls that execute VLOOKUP() using a structured reference to a named table with over 3,700 rows in 8 columns. This exercises things enough it can be profiled using Java Mission Control or other tools to help find the remaining code hot spots. The test could be augmented with an upper bound timing test if desired, to fail if the eval run takes longer than say 5 minutes (my last run was 90 sec.). That would flag big performance regressions, like introducing a code path that misses the evaluator caches.
Comment 32 Greg Woolsey 2016-06-11 07:58:56 UTC
Created attachment 33942 [details] lazily cache the cell cache key hash Last one for now - I'm still unclear exactly why, but lazy calculating then storing the hash for the evaluation cell cache key speeds things up quite a bit (10-20%). It looks like for some uses only equals() is used on a newly created key, so avoiding the hash calculation helps, while on others (the ones stored in the map?) hashcode() is called multiple times, so caching it once calculated saves time.
Comment 33 Javen O'Neal 2016-06-11 11:46:20 UTC
Applied to trunk. comment 31 attachment 33941 [details] r1747878 comment 32 attachment 33942 [details] r1747881 I disabled the stress test since it's a demonstration of performance, not functionality. Perhaps we should have an ant target for testing the performance of POI.
Comment 34 Nick Burch 2016-06-11 12:11:45 UTC
I may still be missing something, but why not just put the table cache on the XSSFWorkbook, provide an internal "clear cache" method, then have the table related methods on XSSFSheet and XSSFTable trigger the cache-clear when they've done something that would invalidate it?
Comment 35 Javen O'Neal 2016-06-11 12:38:46 UTC
I think Greg's attachment 33939 [details] patch, applied in r1747840, solve most of the caching problems.
Comment 36 Andreas Beeker 2016-06-12 01:45:46 UTC
TL;DR: please comment non obvious performance hacks As I usually delete all *ssf bug notifications, I wasn't aware of those autoboxing changes and after findbugs complaint about them, I changed them back to the state before the changes (i.e. as there were quite a lot of svn changes and I haven't checked the history, when going through the findbugs issues) After I've committed the autoboxing, I thought there might be something wrong, otherwise such simple issues would have been already removed. So now the code for XSSFRow is reverted again and I've added the necessary ignore rules to the findbugs-filters, but for future modifications, it would be good if you add a comment, when doing something on purpose against usual optimizations, otherwise the next patch might throw out those lines again ...
Comment 37 Greg Woolsey 2016-06-12 05:01:32 UTC
(In reply to Andreas Beeker from comment #36) > TL;DR: please comment non obvious performance hacks > Yes, I agree, I should have added comments. I'm typically verbose, but got ahead of myself this time. I didn't notice FindBugs in the mix, but regardless, I agree, especially on large distributed projects commenting every change is a good default rule. Thanks for taking care of this one, and the reminder.
Comment 38 Greg Woolsey 2016-06-14 22:51:45 UTC
Anything else needed I can help with before this issue is resolved? Table reference syntax works for all my samples, and performance is now vastly improved (although still not where I think it should be, but that's a different set of bugs and patches I'll file eventually). The FindBugs issues have been addressed, as have unit tests. Perhaps we still need some comments as to why the explicit boxing is happening, so they don't get undone?
Comment 39 Javen O'Neal 2016-06-15 02:16:01 UTC
(In reply to Greg Woolsey from comment #38) > Perhaps we still need some comments as to why the explicit boxing is happening, > so they don't get undone? Applied in r1748479. I kept this bug open in case you had anything else you discovered from your testing. I'll close it now. These changes will be available in POI 3.15 beta 2, to be released in the next couple months. I'm assuming that general formula evaluation is faster as a whole with r1747840, not just structured references. In future bugs, we should look at: * comment 5: get Vaadin Spreadsheet working with POI versions higher than 3.12. * comment 12: When evaluating multiple formulas in a sheet, cache the formula result of cells with the same expression * comment 21: add to POI documentation that "-XX:AutoBoxCacheMax=1048704" may improve performance of working with large worksheets (perhaps this is no longer true now that we're using explicit boxing). * comment 23: use HashMap instead of TreeMap to hold XSSFRows and XSSFCells in XSSFSheet and XSSFRows, respectively. Obviously this would make moving rows or getting all rows in a range extremely expensive. * comment 34: trigger a cache reset, either with an explicit public method or implicitly with an internal method (this could make regular workbook operations slower since any cell that is part of a table or any cell in a table that references a value that changes in the workbook). Several methods in XSSFTable still use caches that may contain stale data. * comment 38: continue to improve formula evaluation performance
Comment 40 Greg Woolsey 2016-06-15 02:54:41 UTC
(In reply to Javen O'Neal from comment #39) Thanks. I'm moving on to the Vaadin Spreadsheet side of things for my POC, as the POI feature set and performance are acceptable at this stage. I'll likely file new issues/patches in the future if I get the project accepted. I've submitted Vaadin bug https://dev.vaadin.com/ticket/19952 with a very significant improvement to their current formula performance, especially related to conditional formatting. They were creating a new evaluation context for every _cell_ evaluated. And to do it, they were inserting a new row in an existing sheet, and modifying the first cell's formula. The row # is in the middle of the data for one of my sample workbooks, and that whole scheme was an ugly hack. I showed them how they could evaluate a formula without a Cell using an existing WorkbookEvaluator, using code I found in the unit test framework. That, along with the cached results it could now leverage, took my sample workbook from HOURS to marginally more time than the unit test I added to POI. I also have a patch to use POI 3.15 that I hope they take. But that has more in it, as they have some ugly code I couldn't bear to leave alone. I want a real-looking API along with the updated compatibility.