Bug 62836 - [PATCH] Implementation of Excel TREND function
Summary: [PATCH] Implementation of Excel TREND function
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: SS Common (show other bugs)
Version: 4.0.x-dev
Hardware: All All
: P2 enhancement (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-19 06:17 UTC by Matthias Becht
Modified: 2018-11-02 16:23 UTC (History)
0 users



Attachments
TREND patch (15.63 KB, patch)
2018-10-19 06:17 UTC, Matthias Becht
Details | Diff
Test cases for TREND (46.50 KB, application/octet-stream)
2018-10-19 06:19 UTC, Matthias Becht
Details
Trend handling all X's being the same (18.24 KB, patch)
2018-10-27 15:11 UTC, Yegor Kozlov
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Matthias Becht 2018-10-19 06:17:30 UTC
Created attachment 36200 [details]
TREND patch

This is an incomplete implementation of the Excel TREND function. Most cases do work already, but some don't work yet due to the way the math library handles multiple linear regression. Specifically, cases where there are more x variables than the sample size and some cases where there are duplicate x values fail. The former seems to have been fixed since in the math library (https://github.com/Hipparchus-Math/hipparchus/issues/13), but the fix isn't in the currently used version yet. I'm not sure how to fix the latter because having duplicate x values doesn't seem to make too much sense, so I'm not entirely sure what Excel actually does.
Some of the code (especially for handling all the different shapes excel allows) is pretty rough, but I hope it's acceptable as a first implementation.
Comment 1 Matthias Becht 2018-10-19 06:19:40 UTC
Created attachment 36201 [details]
Test cases for TREND
Comment 2 Yegor Kozlov 2018-10-21 18:32:40 UTC
Pretty impressive. This definitely should go in trunk. 

The not implemented cases aren't a problem of Commons Math.

From a mathematical standpoint one needs minimum two data points to perform linear regression. Excel handles the one-point case in which it returns the data value, but it is more an internal convention. OpenOffice and Google Spreadsheets return an error in such a case. POI can mimic Excel and return the data value too and we can add a simple check.

The case when all y- and x- values are the same is not obvious. Commons Math throws SingularMatrixException if the sample size is less than 5. Given G110:G114 and H110:H114 containing the same value (e.g. 5):
TREND(G110:G112,H110:H112) // throws SingularMatrixException
TREND(G110:G113,H110:H113) // throws SingularMatrixException
TREND(G110:G114,H110:H114) // evaluates to 5

So, POI only needs to handle singular matrices smaller than 5. Again, it can be done with a simple check without patching Commons Math.

>> Some of the code (especially for handling all the different shapes excel allows) is pretty rough, 
>> but I hope it's acceptable as a first implementation.

This is certainly okay for a first pass. 
I wonder if GROWTH, which has a similar semantics, can reuse the code from TREND.
Comment 3 Matthias Becht 2018-10-22 10:24:26 UTC
I'm still not entirely sure about all of Excel's behavior:

- So if all x and y values are the same, Excel just returns that value, no matter what?

- If the y's are different but all x's are the same, Commons Math works but gives a different answer than Excel. For instance the test "All x values the same" (B65; currently skipped) gives -32 but Excel gives 12, which is simply the average of the two y values (20 and 4). Is this just mathematically illegal, and Excel chooses to simply average the y values or what is happening here?

- When there is only one y value, Excel generally just returns that, but it does still calculate the linear regression if the constant is set to false in order to force the line to go through zero. I'm assuming it just adds (0,0) as another point in that case. When there is still only one y value but multiple x variables to go with it, the y value is always returned, even if the constant is set to false, which also makes sense since it would need more than just the point (0,0) to add another sample. I guess the easiest thing to do then would be to automatically add the point (0,0) if there is only one y and one x value and the constant is set to false. When there are multiple x variables, the new_x's, if given, need to be exactly the same dimensions as x (which has to be either a single column or row), or #REF! is given.

I must admit that I don't know much about the underlying mathematics of this, so there are probably many obvious things I'm missing.
Comment 4 Yegor Kozlov 2018-10-22 16:38:01 UTC
>> - So if all x and y values are the same, Excel just returns that value, no matter what? 
I concluded it empirically and it seems to be true. The only exception are cass when Excel returns !REF#, but the code in Trend performs this check before evaluation.

>> If the y's are different but all x's are the same, Commons Math works but gives a different answer than Excel. 
>> For instance the test "All x values the same" (B65; currently skipped) gives -32 but Excel gives 12,
>> which is simply the average of the two y values (20 and 4). Is this just mathematically illegal, 
>> and Excel chooses to simply average the y values or what is happening here?

A good catch. If all x values are the same then Excel really returns the average. 

TREND({20, 4}, {3, 3})                   12  
TREND({20, 4, 3},{3, 3, 3})              9 
TREND({20, 4, 3, 3},{3, 3, 3, 3})        7.5
TREND({20, 4, 3, 3, 5},{3, 3, 3, 3, 3})  7

changing an x value just beyond significance, e.g. 3 -> 3.0000001  changes the behavior. Excel starts performing regression and evaluates to 20

TREND({20, 4},{3.0000001, 3})                    20  
TREND({20, 4, 3},{3.0000001, 3, 3})              20 
TREND({20, 4, 3, 3},{3.0000001, 3, 3, 3})        20
TREND({20, 4, 3, 3, 5},{3.0000001, 3, 3, 3, 3})  20

>> When there is only one y value, Excel generally just returns that,
>> but it does still calculate the linear regression if the constant is set to false in order to force the line to go through zero.

Another good catch. I guess in this case the regression parameters will be {0, y/x} where y/x is the slope.
Comment 5 Matthias Becht 2018-10-24 07:23:48 UTC
Hmm, I just remembered what the biggest problem was, which I forgot to mention before. There needs to be at least one more sample than there are x variables in order for the multiple linear regression to work out (or at least that's how I understand it). Commons Math throws an IllegalArgumentException is there are too few samples, which is why I had added the check for `x[0].length >= x.length`. Excel, however, still calculates something, I'm just not sure exactly what. It almost seems as if it just somehow picks which variables to use.

Regarding all x's being the same: it seems that Excel does do a bit more than just return the average. If the constant is set to FALSE and values are given for new_x's, it pretends that only one point was given: (x value - constant, average y value). Using that, it can then do a simple linear regression between that point and (0, 0). I don't know what it does though when there are multiple x variables on top of that because it still calculates *something*.

I also realized later that what I said about Excel silently adding the point (0,0) when the constant is set to FALSE and there is only one y value is not entirely true either. Excel actually doesn't just return the y value when there are multiple x variables, it still calculates something, but again, I'm not entirely sure *what*.

I'm not entirely sure what the best solution is in this case because every option I tried caused other problems. Of course it is quite possible that there is a very obvious solution which I simply missed.

Sometimes I wish that instead of making Excel more "user-friendly", the developers had just stuck to the basic mathematical rules instead, especially since hitting any of these corner cases while trying to predict something probably means you're doing something wrong anyways.
Comment 6 Yegor Kozlov 2018-10-27 15:09:56 UTC
I'm inclined to commit the patch as it is. The code is great and perfectly covers the basic functionality. The edge cases is a subtle area . Excel  handles them differently from OpenOffice and Google Spreadsheets anyway and returning an error value from POI is certainly okay for now.

I attempted to handle the case of all X's being the same and return the average. With my changes the tests in the cells A65 and A66 pass, but after your last comment I realized the code is still incomplete. I attached my changes, but not sure if is worth including them.

Please let us know if you plan follow-up patches. If not, I'm going to check-in the original patch.

Regards,
Yegor
Comment 7 Yegor Kozlov 2018-10-27 15:11:31 UTC
Created attachment 36218 [details]
Trend handling all X's being the same

Remove the <skip> directive in cells A65 and A66. These tests pass with this patch.
Comment 8 Yegor Kozlov 2018-11-02 13:35:43 UTC
Patched applied in r1845586. The fix will be included in upcoming 4.0.1 release

Regards,
Yegor
Comment 9 Matthias Becht 2018-11-02 16:23:23 UTC
(In reply to Yegor Kozlov from comment #8)
> Patched applied in r1845586. The fix will be included in upcoming 4.0.1
> release

Sorry that it took me so long to reply, I was fairly busy this week.

That seems fine to me. I did actually start rewriting the code for some of the edge cases, but every time I though I had figured it out, Excel surprised me again, so I think the original patch is still the most usable solution. Maybe we'll figure out what Excel really does someday, but until then, I guess we'll just have to go with this.