Apache OpenOffice (AOO) Bugzilla – Issue 18704
statistics functions need review
Last modified: 2017-05-20 11:11:16 UTC
The discussion of recent article on Slashdot mentioned that, for all its money and resources, Microsoft has botched the statistical functions in the Excel spreadsheet: http://www.agresearch.cri.nz/Science/Statistics/exceluse1.htm http://www.practicalstats.com/Pages/excelstats.html http://www.cof.orst.edu/net/software/excel/no-stats.php http://www.npl.co.uk/ssfm/ssfm1/validate/testing/excel.html If with all their resources they can't do it right, I have to say that I'm concerned about a package (Calc) put together mostly by volunteer programmers (not statisticians and not numerical analysts, I would presume). This is not to slam those people, but statistics is no place for amateurs, even those copying formulas out of a book. I would suggest that all statistical functions in Calc be vetted by an computer-experienced statistician, who will run extensive tests and report on weak areas. All functions, especially the statistical functions, should also be examined by someone experienced in numerical computing. At the very, very least, the developers need to find a large suite of tests to thoroughly exercise the code.
You opened a task, so it's your turn. Frank
OK, Frank, so what do I do now? What I'm trying to do is to bring the issue of numerical accuracy and stability to the attention of those who work on Calc functions. Perhaps I should change it from a task to something else? I took a guess that a "task" would be the right category. It's been almost 20 years (gaah!) since I took a numerical analysis course, so to say that I'm a bit rusty on this would be an understatement! I'm willing to help look for test suites, reports, books, and articles on this sort of thing, but I'm not really qualified to meddle with the code. I just want to make sure that someone is listening and putting to good use whatever information I find, rather than talking to myself (I do too much of that already!).
Hi Eike, can you have a look at this ? I think it's your construction site so you can give him better feedback than I. Frank
First of all, the history of OpenOffice.org is that it arose from StarOffice that wasn't developed by volunteers at all but by contracted employees instead. Even today, about 90% of developers are not volunteers but still paid by Sun. Most Calc statistical functions were programmed by a "doctor of mathematics", which of course doesn't imply that there are no errors or that they are fail-safe and have no significant roundoff errors in specific situations. A good example is the standard deviation of 90000001, 90000002, 90000003 mentioned in http://www.agresearch.cri.nz/Science/Statistics/exceluse1.htm where we seem to use the same "naive" algorithm as the competitive product does, I guess. Thanks for those URLs anyways. I would happily accept patches of better algorithms if someone would take the burdon and review current implementations. For Calc's rehabilitation I must say that I don't think we're doing worse than any other spreadsheet application ;-) However, AFAIK it is common knowledge amongst statisticians that you just can't use a spreadsheet for doing real analysis. Source code is in sc/source/core/tool/interpr*.cxx, most statistical functions in interpr3.cxx and interpr5.cxx, if anyone wants to start on it?
I'm just a user, but also an actuary. I would be glad to help with a review of some of the financial calculations (I have written code in BASIC -- several versions -- and, years ago, FORTRAN). Actuarial emphasis was on difference equations which translates into computational formulas much easier than analytical statistical functions (for example, the complex Tschebischev functions are replaceable, with no loss of accuracy, by the much simpler Jenkens fifth-difference equations). If you want my input, please let me know which functions you would want me to review. Note that I recognize that I am not perfect, that looking at something too long makes for overlooking mistakes and I would want someone else to review anything I posted.
Robert, Thanks for your offer. As I already wrote, sc/source/core/tool/interpr[35].cxx contain most of the statistical funtions, the method names (ScInterpreter::Sc...) are mostly "self-speaking", unfortunately not all of them and comments are rare. Some basic functions like computing the variance are in interpr1.cxx. Of course an obvious glitch like the standard deviation of 90000001, 90000002, 90000003 would be nice to have fixed by using a better algorithm. The standard deviation is computed in interpr1.cxx method ScInterpreter::ScStDev(), which in turn makes use of ScInterpreter::GetStVarParams(). Additionally, the result is affected by ::rtl::math::approxSub(). Those approx...() functions are used to eliminate roundoff errors of finite precision representation of infinite binary digits periods. For example, 0.1+0.2-0.3 normally doesn't result in exactly 0.0 if not computed using such approx functions. The approx functions may be found in sal/inc/rtl/math.hxx
Adjusting target milestone
Having the "not determined" milestone target was on purpose here, since it isn't clear when and by whom all the work will be done. Definitely not a "I will do this for OOo2.0" task. Added the needhelp keyword, and reset target to "not determined".
*** Issue 27002 has been marked as a duplicate of this issue. ***
Set PleaseHelp target.
Note: Standard deviation and related functions are fixed with issue 22811.
CORREL, COVAR, PEARSON, RSQ, STEYX, SLOPE, INTERCEPT, FORECAST are fixed with issue 78250.
*** Issue 79944 has been marked as a duplicate of this issue. ***
Created attachment 52850 [details] patches for gammadist, chidist and inverse dist, details in the readme inside the zip
add me to CC
Note to myself: content of patches.zip attached is obsoleted by further work received by mail. Taken over by issue 90703.
GAMMADIST, CHIDIST, GAMMAINV, CHIINV fixed with issue 90703. Much improvement for BETAINV, FINV, TINV as well.
Reset assigne to the default "issues@openoffice.apache.org".