Issue 47296 - HYPGEOMDIST function: error with big input values
Summary: HYPGEOMDIST function: error with big input values
Status: CLOSED FIXED
Alias: None
Product: Calc
Classification: Application
Component: editing
Version: OOo 1.1
Hardware: All
OS: Windows XP
Importance: P3 Trivial
Target Milestone: ---
Assignee: frank
QA Contact: issues@sc
URL:
Keywords:
Depends on:
Blocks: 18704
Reported: 2005-04-13 00:40 UTC by andi_p
Modified: 2013-08-07 15:13 UTC
CC List: 2 users

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
test program (5.25 KB, text/plain), 2005-09-05 21:41 UTC, kyoshida
proposed patch against SRC680_m121 (3.84 KB, patch), 2005-09-05 22:03 UTC, kyoshida
Forgot to "const" the function (3.85 KB, patch), 2005-09-07 04:50 UTC, kyoshida
revised patch (4.92 KB, patch), 2005-09-09 05:02 UTC, kyoshida
revised revised patch (4.90 KB, patch), 2005-09-09 05:37 UTC, kyoshida
revised test program (10.16 KB, text/plain), 2005-09-15 04:57 UTC, kyoshida
yet another revised patch (8.18 KB, patch), 2005-09-17 03:53 UTC, kyoshida

Description andi_p 2005-04-13 00:40:12 UTC
A few days ago I tried to do my homework with Calc.  When I use
=HYPGEOMDIST(1;50;500;5000)
it returns 0, but in Excel it returns 0.028164524.
Thank you for your attention.  That is all I can do right now.  Keep up the
good work.

best regards
andi_p
Comment 1 stx123 2005-04-13 09:32:52 UTC
I moved the issue to the component Spreadsheet. Thanks for the submission.
Comment 2 frank 2005-04-13 10:47:05 UTC
Hi Daniel,

please have a look at this one.

Frank
Comment 3 daniel.rentz 2005-04-13 13:22:16 UTC
Should be done together with general rework of statistical functions.
Comment 4 daniel.rentz 2005-04-13 13:23:12 UTC
title
Comment 5 daniel.rentz 2005-04-13 13:25:38 UTC
see also issue 18704
Comment 6 kyoshida 2005-09-04 10:14:37 UTC
kohei -> dr: I've found a solution.  I'll post a patch in a few days.

Kohei
Comment 7 kyoshida 2005-09-05 21:41:15 UTC
Created attachment 29334 [details]
test program
Comment 8 kyoshida 2005-09-05 21:55:34 UTC
Compile the test program with:
g++ -Os -o hypgeom hypgeom.cxx

and compute the reported test case:

[kohei@localhost ooo]$ ./hypgeom 1 50 500 5000
population (N) = 5000
 # success (M) = 500
 # samples (n) = 50
             x = 1
old: 0
new: 0.02816452367904049

which matches the correct value reported by andi_p.  This also matches the value
computed by R:

> dhyper(1, 500, 4500, 50)
[1] 0.02816452

Also, if you execute the program without arguments, it will run all possible
cases for N = 0 to 1000 using both the old and new algorithms and compare the
numerical differences between the two.  Note that this may take a few minutes
to complete.

Here is the result I get on my machine:

[kohei@localhost ooo]$ ./hypgeom
max diff: 3.885780586188048e-16
max diff: 3.885780586188048e-16
max diff: 4.440892098500626e-16
[...]
max diff: 1.332267629550188e-15
max diff: 1.332267629550188e-15
max diff: 1.332267629550188e-15
maximum difference between old and new methods: 1.332267629550188e-15
number of times comparisons returned a non-equal result: 0
number of comparisons performed: 2254200

What this means is that, across all possible scenarios for N <= 1000, the
maximum difference between the old and new algorithms is 1.332E-15, and no
result pair was considered non-equal by the approxEqual(a, b) method.  In
other words, it is safe to switch to the new algorithm without causing
significant deviation from the values calculated by the old one.
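
For illustration, approxEqual is essentially a relative-difference comparison
along these lines (a minimal sketch; the exact tolerance used by the attached
test program may differ):

  #include <algorithm>
  #include <cmath>

  // Relative-difference comparison: a and b count as "approximately
  // equal" when their difference is tiny relative to the larger
  // magnitude.  The tolerance here is illustrative only.
  bool approxEqual(double a, double b)
  {
      if (a == b)
          return true;
      double larger = std::max(std::fabs(a), std::fabs(b));
      return std::fabs(a - b) <= larger * 1e-14;
  }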

A patch will follow....
Comment 9 kyoshida 2005-09-05 22:03:27 UTC
Created attachment 29337 [details]
proposed patch against SRC680_m121
Comment 10 kyoshida 2005-09-05 22:27:07 UTC
The reason why the old algorithm failed is because of an overflow in
double-precision variable during a calculation of binomial coefficient (note
that a calculation of hypergeometric distribution involves calculations of three
binomial coefficients).

  binomial coefficient: (n, k) = n! / k!(n-k)!

In the old method, the three factorials n!, k! and (n-k)! were each
calculated outright, which caused the overflow.  In my proposed method, I
expand these factorials into their individual elements, e.g.

  10! -> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

and store them in either a numerator array or a denominator array (whichever
is appropriate).  The elements from all three binomial coefficient
calculations accumulate in these two arrays.

Once that is done, the elements in both arrays are sorted in descending
order, and divisions are performed sequentially, starting from the largest
elements (see my proposed patch for implementation details).

This is probably the safest approach when a calculation involves multiple
factorials, and I recommend applying the same technique to other calculation
methods that may have a similar numerical overflow problem.
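
In sketch form, reduced to a single binomial coefficient (the names here are
mine, not those of the actual patch; in the patch the factors of all three
coefficients accumulate in shared arrays before the divisions happen):

  #include <algorithm>
  #include <vector>

  // Expand each factorial of C(n, k) into its factors, collect them into
  // numerator/denominator arrays, sort both in descending order, then
  // divide the largest factors against each other first so the running
  // result stays near 1 and never overflows.
  double binomCoeffSafe(unsigned n, unsigned k)   // assumes k <= n
  {
      std::vector<double> num, den;
      for (unsigned i = 2; i <= n; ++i)     num.push_back(i);  // n!
      for (unsigned i = 2; i <= k; ++i)     den.push_back(i);  // k!
      for (unsigned i = 2; i <= n - k; ++i) den.push_back(i);  // (n-k)!

      std::sort(num.rbegin(), num.rend());  // descending
      std::sort(den.rbegin(), den.rend());

      double fResult = 1.0;
      size_t i = 0, j = 0;
      while (i < num.size() && j < den.size())
          fResult *= num[i++] / den[j++];   // pair largest with largest
      while (i < num.size())
          fResult *= num[i++];              // leftover numerator factors
      while (j < den.size())
          fResult /= den[j++];              // leftover denominator factors
      return fResult;
  }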

Kohei
Comment 11 kyoshida 2005-09-05 22:41:57 UTC
Setting target to 2.0.1
Comment 12 kyoshida 2005-09-05 22:47:11 UTC
set issue type to PATCH
Comment 13 kyoshida 2005-09-07 04:50:51 UTC
Created attachment 29360 [details]
Forgot to "const" the function
Comment 14 kyoshida 2005-09-07 04:52:45 UTC
kohei->dr: Forgot to const the SetBinomElements member function, as it does
not change the object state.  Please use this patch instead.

Bye,
Kohei
Comment 15 niklas.nebel 2005-09-08 10:23:34 UTC
I think there should be some kind of limit to the size of the arrays, or we'll
have memory problems for large values (in addition to the already-existing
runtime problems).
Otherwise this looks good and can go into 2.0.1.
Comment 16 kyoshida 2005-09-08 14:17:43 UTC
Niklas,

I can look into the array size issue.  Thanks for bringing it up.  If there
are no objections, I can simply use std::multiset's maximum allowable size as
the array size limit.

Will issue another patch soonish.

Kohei
Comment 17 kyoshida 2005-09-09 05:02:07 UTC
Created attachment 29412 [details]
revised patch
Comment 18 kyoshida 2005-09-09 05:16:12 UTC
I ended up setting the array size limit to 50000, but this can be changed
(feel free to change it).  multiset's max_size was 4294967295 in my
environment, but a calculation that made the arrays that big would probably
take longer than a normal user can tolerate.

The maximum workable number of successes (M) turned out to be the array size
limit minus one, i.e. with 49999 successes it calculates, but with 50000 it
errors out.

Cursory profiling reveals that most of the time is spent in the
SetBinomElements() calls, so that is where any speedup effort should go.

That said, this is still much better than the current situation IMHO.

Oh, one more thing.  I've removed the static_casts to unsigned long, because
switching to plain doubles made absolutely no difference.  The simpler, the
better.  Let me know if the patch looks good.
Comment 19 kyoshida 2005-09-09 05:37:20 UTC
Created attachment 29413 [details]
revised revised patch
Comment 20 kyoshida 2005-09-09 05:38:30 UTC
Fixed a typo (enumerator -> numerator).  No functionality change.
Comment 21 kyoshida 2005-09-09 18:10:07 UTC
Found a better algorithm.  Will issue yet another patch soon.
Comment 22 kyoshida 2005-09-15 04:57:34 UTC
Created attachment 29571 [details]
revised test program
Comment 23 kyoshida 2005-09-15 05:19:49 UTC
Compile and run the test program:

[kohei@localhost ooo]$ g++ -DDEBUG -Os -o hypgeom5 hypgeom5.cxx
[kohei@localhost ooo]$ ./hypgeom5 123 70000 1234 1234567
population (N) = 1234567
 # success (M) = 1234
 # samples (n) = 70000
             x = 123
old: inf (7.582e-05 sec)
new: 8.56113475055104e-10 (0.0004709 sec)

The corresponding R's answer is:

> dhyper( 123, 1234, 1234567-1234, 70000 )
[1] 8.561135e-10

which matches my answer. :-)  I have of course checked many other cases
against R, all of which turned out to be comparable.

Also, just like the previous test program, you can run it without arguments
to check all cases for N = 0 to 100 against the old algorithm.  It was a
"pass" on my machine.

The major difference from my first submitted algorithm is that the new one
does not insert factorial elements that would only be removed later, thus
avoiding unnecessary expansion and shrinkage of the containers.  For example,
in my first algorithm the container sometimes expanded to over 100000
elements, only to be reduced to 50 later on.  That was grossly inefficient
and made the calculation take far too long.  Thanks to this improvement, a
scenario that took over 10 seconds with my first algorithm now completes in a
fraction of a second.

Note that the new algorithm handles large values pretty well.

I have also switched from multiset to a sorted vector, which yielded an extra
50% performance increase.  set/multiset is designed to be efficient when a
good mix of insertion, search, and erasure is involved, but because my new
algorithm only ever inserts, I considered vector the better choice, and it is
indeed. :-)
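
A minimal sketch of the cancellation idea (the helper name is mine; the real
SetBinomElements in the patch differs in detail):

  #include <vector>

  // For C(n, k), the larger of k! and (n-k)! cancels completely against
  // the tail of n!, so only the surviving factors are ever inserted.
  // Nothing goes into the containers that would later have to be erased.
  void addBinomElements(unsigned n, unsigned k,
                        std::vector<double>& num, std::vector<double>& den)
  {
      if (k > n - k)
          k = n - k;                        // cancel the larger factorial
      for (unsigned i = n - k + 1; i <= n; ++i)
          num.push_back(i);                 // surviving part of n!
      for (unsigned i = 2; i <= k; ++i)
          den.push_back(i);                 // the smaller factorial remains
  }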

I will prepare a patch sometime after tomorrow.  I have to go sleep now.

Kohei
Comment 24 kyoshida 2005-09-16 01:55:44 UTC
Niklas or Daniel,

What error should I trigger when the array exceeds a set max size?  I could just
call SetIllegalArgument(), but it's not exactly an argument error.

Thanks,
Kohei
Comment 25 niklas.nebel 2005-09-16 10:10:09 UTC
I think SetNoValue (it shows "#VALUE!") would be best, being a bit less
specific than "illegal argument".  But there isn't really much consistency
about which of these errors is used.
Comment 26 kyoshida 2005-09-17 03:53:13 UTC
Created attachment 29615 [details]
yet another revised patch
Comment 27 kyoshida 2005-09-17 04:00:53 UTC
Here is, hopefully, my final patch.  I have done everything I could think of
to reduce the execution time.  The max array size is set to 500000 based on
an acceptable execution time in my environment (~0.8 seconds).  The required
array size happens to correspond to x + min(n, M), so it is evaluated before
the calculation begins.  It throws a #VALUE! error when the max array size
would be exceeded, per Niklas' suggestion (thanks!).
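
In sketch form (names are mine; only SetNoValue comes from the discussion
above):

  #include <algorithm>

  // The arrays hold at most x + min(n, M) factors, so the limit can be
  // enforced before any allocation happens.  In the patch the failure
  // path calls SetNoValue(), which makes the cell show #VALUE!.
  bool withinArrayLimit(double x, double n, double M)
  {
      const double fMaxArraySize = 500000.0;  // ~0.8 s worst case here
      return x + std::min(n, M) <= fMaxArraySize;
  }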

So... it's ready to go.  I hope you like my patch.

Kohei
Comment 28 niklas.nebel 2005-09-19 11:39:30 UTC
I'm taking this.
Comment 29 niklas.nebel 2005-09-19 11:43:35 UTC
This looks good.
I committed the last patch to CWS "dr41" (for OOo 2.0.1).
Thanks a lot, Kohei.
Comment 30 niklas.nebel 2005-10-07 14:02:08 UTC
back to QA for verification

re-open issue and reassign to fst@openoffice.org
Comment 31 niklas.nebel 2005-10-07 14:02:24 UTC
reassign to fst@openoffice.org
Comment 32 niklas.nebel 2005-10-07 14:02:32 UTC
reset resolution to FIXED
Comment 33 frank 2005-10-18 12:46:54 UTC
Found fixed on cws dr41 for Solaris, Linux and Windows
Comment 34 frank 2005-11-21 13:09:09 UTC
Found integrated on master m142 using Linux, Solaris and Windows Build