Issue 89980 - Calc: Implement statistical SUMMARY() function
Summary: Calc: Implement statistical SUMMARY() function
Status: CLOSED OBSOLETE
Alias: None
Product: Calc
Classification: Application
Component: code (show other issues)
Version: OOo 2.4.0
Hardware: All All
: P3 Trivial (vote)
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-05-26 21:03 UTC by discoleo
Modified: 2017-05-20 09:56 UTC (History)
2 users (show)

See Also:
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description discoleo 2008-05-26 21:03:48 UTC
Spreadsheets are used in academia and businesses alike.

In both instances, spreadsheet users often need to perform an exploratory data
analysis. They often need to compute some summaries of their data.

While Calc has (most) functions implemented separately, it lacks a unified
function to compute all parameters at once. MS Excel does come with such a
'Menu'-entry in the advanced statistical toolpack.

A.) COMPARATIVE SOFTWARE ANALYSIS
=================================

1.) MS Excel: does have a SUMMARY-Menu command in the advanced
    statistical package

2.) professional statistical programs:
    e.g. R (http://cran.R-project.org) has a summary() function

B.) UTILITY
===========
Both academia (research) and businesses alike would benefit from such a feature.

Despite my wish that people learn how to correctly perform descriptive
statistics, I have to admit that this is an illusive goal. It is much easier to
have a function that computes everything automatically and the users have
everything done correctly and without any user-interaction than hope for ALL
these people learning descriptive statistics.

Recently I examined a group of post-graduates during a post-graduate statistics
course, and I noticed that they prefer automatic analysis-tools. To my dismay,
they performed poorly with individual methods, but they all accessed the
toolkits summary command.

C.) MENU vs FUNCTION
====================
Excel did implement a menu-command. Unfortunately, IF the data-changes, the
computed statistics won't change and need to be manually updated. The user might
not notice that the data has changed.

Therefore I opt strongly against the Menu-command and for a proper-function.


D.) IMPLEMENTATION
==================

1.) Funcition-Name: obviously: SUMMARY()

2.) Parameters: (range; option)
  - range: the data range in the spreadsheet
  - option:
     -- default: "BRIEF" OR 0: compute most important summaries
     -- "DETAILED" OR 1: compute advanced summaries
     -- "PUBLICATION" OR 2: compute summaries and format as for
         publication in a professional journal
     -- "TABLE" OR 4: compute summaries as for publication,
         BUT display them in tabular format (for inclusion in a table)
     -- some other: ... lets leave the feature-list open
  - results:
     -- an array with the important summaries

IMPROTANT: This is an array function!

REASONING for Options:
 - most of the time, users need only some basic descriptive statistics.
 - I am against displaying ALL possible parameters, when most of them
   are of NO use.
 - the 'DETAILED' option will print more advanced statistics.
 - the 'PUBLICATION' OR 'PRINT'-option will additionally format the results
   as for print, so that the user can simply COPY/PASTE the results
 - the 'TABLE' as the 'PUBLICATION', but in tabular format

3.) COMPUTED OUTPUT
===================
 a.) the function shall verify IF the data is:
  -- numerical non-binary
  -- numerical or categorical, BUT binary
  -- categorical, NON-binary
  -- obviously, it can't detect ordinal values

 b.) it should verify IF the first row in the series is a LABEL
  -- it should skip any eventual labels

 c.) computed values should be:

 NUMERICAL DATA
   BRIEF DESCRIPTIVE STATISTICS
     min
     first quartile
     median
     third quartile
     max
     mean
     standard deviation
     median absolute deviation of the median (MAD)
      -- has the best possible breakdown point
    (I deliberately omit the mode; NO strong opinion about variance)

   ADVANCED DESCRIPTIVE STATISTICS
     ALL BRIEF PLUS additionally
     variance
     mode (well, lets include it)
     standard error
     percentile 2.5% and 97.5%
     [for the IPR95%]
     mean absolute deviation of the mean (ADM - or average deviation)
     [although this is better known as the previously mentioned MAD, it is
      actually a much less robust estimator than the MAD]
     very robust locationless estimators:
       Sn = 1.1926 * MEDi{MEDj|Xi-Xj|)
       Qn = QUANTILE( |Xi-Xj|; 0.25), where i<j
    [see http://www.jstor.org/pss/2291267 for details]

   PRINT
    - add text to the same cells and
    - group relevant parameters together
    - to allow easy COPY/PASTE:
    *median (range)* 'values for these parameters: actually median (min-max)'
    *median (interquartile range)* 'value (value - value)'
    *median (IPR95%)* 'value (value - value)'
    *mean (SD)* 'value (value)'

    - both the TEXT and the values should be provided in the same cell,
      so as to allow easy COPY/PASTE
    - both 3 versions of the MEDIAN (...) should be provided
      [they serve different purposes and range / IQR, IPR95% are alwys
       grouped with the median]
    - I prefer 'mean (SD)' over mean +- SD, because the mean +- SD makes
      actually NO sense; it is mean +- 2*SD that makes sense

   TABLE:
    - as PRINT, BUT display the names/text in one column
    - and the values in the next column
    - to easily COPY/PASTE a table

I will discuss the NON-numerical details in a later post.
Comment 1 discoleo 2008-06-18 11:09:47 UTC
There was a disucssion on the OASIS mailing lsit, too, although I am not that
happy with the outcom, see:

http://lists.oasis-open.org/archives/office-comment/200806/msg00003.html
Comment 2 oc 2008-07-15 10:44:10 UTC
reassigning features and enhancements to user requirements@openoffice.org which
will be the default owner for those tasks (was introduced some time ago)
Comment 3 Edwin Sharp 2013-12-13 11:23:42 UTC
Moved to

https://wiki.openoffice.org/wiki/Extensions/Ideas/Calc