Issue 18704

Summary: statistics functions need review
Product: Calc Reporter: philmperry <philperry>
Component: codeAssignee: AOO issues mailing list <issues>
Status: ACCEPTED --- QA Contact:
Severity: Trivial    
Priority: P3 CC: 994phij, issues, khirano, pagalmes.lists, rb.henschel
Version: OOo 1.1 RCKeywords: needhelp
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: TASK Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on: 27597, 32833, 26022, 26836, 47296, 69069, 74704, 78250, 86294, 89691, 90703, 90759, 101316    
Issue Blocks:    
Attachments:
Description Flags
patches for gammadist, chidist and inverse dist, details in the readme inside the zip none

Description philmperry 2003-08-27 03:18:39 UTC
The discussion of recent article on Slashdot mentioned that, for all its money
and resources, Microsoft has botched the statistical functions in the Excel
spreadsheet:

http://www.agresearch.cri.nz/Science/Statistics/exceluse1.htm
http://www.practicalstats.com/Pages/excelstats.html
http://www.cof.orst.edu/net/software/excel/no-stats.php
http://www.npl.co.uk/ssfm/ssfm1/validate/testing/excel.html

If with all their resources they can't do it right, I have to say that I'm
concerned about a package (Calc) put together mostly by volunteer programmers
(not statisticians and not numerical analysts, I would presume). This is not to
slam those people, but statistics is no place for amateurs, even those copying
formulas out of a book.

I would suggest that all statistical functions in Calc be vetted by an
computer-experienced statistician, who will run extensive tests and report on
weak areas. All functions, especially the statistical functions, should also be
examined by someone experienced in numerical computing. At the very, very least,
the developers need to find a large suite of tests to thoroughly exercise the code.
Comment 1 frank 2003-08-27 09:33:47 UTC
You opened a task, so it's your turn.

Frank
Comment 2 philmperry 2003-08-29 03:33:13 UTC
OK, Frank, so what do I do now? What I'm trying to do is to bring the
issue of numerical accuracy and stability to the attention of those
who work on Calc functions. Perhaps I should change it from a task to
something else? I took a guess that a "task" would be the right
category. It's been almost 20 years (gaah!) since I took a numerical
analysis course, so to say that I'm a bit rusty on this would be an
understatement! I'm willing to help look for test suites, reports,
books, and articles on this sort of thing, but I'm not really
qualified to meddle with the code. I just want to make sure that
someone is listening and putting to good use whatever information I
find, rather than talking to myself (I do too much of that already!).
Comment 3 frank 2003-08-29 10:07:42 UTC
Hi Eike,

can you have a look at this ?

I think it's your construction site so you can give him better
feedback than I.

Frank
Comment 4 ooo 2003-09-01 12:36:01 UTC
First of all, the history of OpenOffice.org is that it arose from
StarOffice that wasn't developed by volunteers at all but by
contracted employees instead. Even today, about 90% of developers are
not volunteers but still paid by Sun.

Most Calc statistical functions were programmed by a "doctor of
mathematics", which of course doesn't imply that there are no errors
or that they are fail-safe and have no significant roundoff errors in
specific situations. A good example is the standard deviation of 
90000001, 90000002, 90000003 mentioned in
http://www.agresearch.cri.nz/Science/Statistics/exceluse1.htm where we
seem to use the same "naive" algorithm as the competitive product
does, I guess. Thanks for those URLs anyways.

I would happily accept patches of better algorithms if someone would
take the burdon and review current implementations. For Calc's
rehabilitation I must say that I don't think we're doing worse than
any other spreadsheet application ;-)  However, AFAIK it is common
knowledge amongst statisticians that you just can't use a spreadsheet
for doing real analysis.

Source code is in sc/source/core/tool/interpr*.cxx, most statistical
functions in interpr3.cxx and interpr5.cxx, if anyone wants to start
on it?
Comment 5 rblackeagle 2003-09-01 15:53:09 UTC
I'm just a user, but also an actuary.  I would be glad to help with a
review of some of the financial calculations (I have written code in
BASIC -- several versions -- and, years ago, FORTRAN).  Actuarial
emphasis was on difference equations which translates into
computational formulas much easier than analytical statistical
functions (for example, the complex Tschebischev functions are
replaceable, with no loss of accuracy, by the much simpler Jenkens
fifth-difference equations).

If you want my input, please let me know which functions you would
want me to review.  Note that I recognize that I am not perfect, that
looking at something too long makes for overlooking mistakes and I
would want someone else to review anything I posted.
Comment 6 ooo 2003-09-03 14:42:51 UTC
Robert,

Thanks for your offer. As I already wrote,
sc/source/core/tool/interpr[35].cxx contain most of the statistical
funtions, the method names (ScInterpreter::Sc...) are mostly
"self-speaking", unfortunately not all of them and comments are rare.
Some basic functions like computing the variance are in interpr1.cxx.
Of course an obvious glitch like the standard deviation of 90000001,
90000002, 90000003 would be nice to have fixed by using a better
algorithm. The standard deviation is computed in interpr1.cxx method
ScInterpreter::ScStDev(), which in turn makes use of
ScInterpreter::GetStVarParams(). Additionally, the result is affected
by ::rtl::math::approxSub(). Those approx...() functions are used to
eliminate roundoff errors of finite precision representation of
infinite binary digits periods. For example, 0.1+0.2-0.3 normally
doesn't result in exactly 0.0 if not computed using such approx
functions. The approx functions may be found in sal/inc/rtl/math.hxx
Comment 7 oc 2003-12-12 09:12:07 UTC
Adjusting target milestone
Comment 8 ooo 2003-12-12 13:45:30 UTC
Having the "not determined" milestone target was on purpose here, since it isn't
clear when and by whom all the work will be done. Definitely not a "I will do
this for OOo2.0" task. Added the needhelp keyword, and reset target to "not
determined".
Comment 9 frank 2004-03-29 13:00:26 UTC
*** Issue 27002 has been marked as a duplicate of this issue. ***
Comment 10 ooo 2004-08-04 10:40:48 UTC
Set PleaseHelp target.
Comment 11 niklas.nebel 2004-11-08 09:37:27 UTC
Note: Standard deviation and related functions are fixed with issue 22811.
Comment 12 ooo 2007-07-18 19:09:28 UTC
CORREL, COVAR, PEARSON, RSQ, STEYX, SLOPE, INTERCEPT, FORECAST are fixed with
issue 78250.
Comment 13 ooo 2007-07-24 12:56:05 UTC
*** Issue 79944 has been marked as a duplicate of this issue. ***
Comment 14 Regina Henschel 2008-04-16 13:20:36 UTC
Created attachment 52850 [details]
patches for gammadist, chidist and inverse dist, details in the readme inside the zip
Comment 15 Regina Henschel 2008-04-16 13:22:10 UTC
add me to CC
Comment 16 ooo 2008-06-13 16:54:30 UTC
Note to myself: content of patches.zip attached is obsoleted by further work
received by mail. Taken over by issue 90703.
Comment 17 ooo 2008-06-18 17:29:14 UTC
GAMMADIST, CHIDIST, GAMMAINV, CHIINV fixed with issue 90703.
Much improvement for BETAINV, FINV, TINV as well.
Comment 18 Marcus 2017-05-20 11:11:16 UTC
Reset assigne to the default "issues@openoffice.apache.org".