Apache OpenOffice (AOO) Bugzilla – Issue 47296

HYPGEOMDIST function: error with big input values

Last modified: 2013-08-07 15:13:53 UTC

A few days ago I tried to do my homework with Calc. When I use =HYPGEOMDIST(1;50;500;5000) it returns zero, while in Excel it returns 0.028164524. Thank you for the attention. That is what I can do right now. Keep up the good work. Best regards, andi_p

I moved the issue to the component Spreadsheet. Thanks for the submission.

Hi Daniel, please have a look at this one. Frank

Should be done together with general rework of statistical functions.


see also issue 18704

kohei -> dr: I've found a solution. I'll post a patch in a few days. Kohei

Created attachment 29334 [details] test program

Compile this test program with:

  g++ -Os -o hypgeom hypgeom.cxx

and compute the reported test case:

  [kohei@localhost ooo]$ ./hypgeom 1 50 500 5000
  population (N) = 5000
  # success (M) = 500
  # samples (n) = 50
  x = 1
  old: 0
  new: 0.02816452367904049

which matches the correct value reported by andi_p. This also matches the value computed by R:

  > dhyper(1, 500, 4500, 50)
  [1] 0.02816452

Also, execute the program without arguments, which will run all possible cases for N = 0 to 1000 using the old and new algorithms and compare the numerical differences between the two. Note this may take a few minutes to complete. Here is the result I get on my machine:

  [kohei@localhost ooo]$ ./hypgeom
  max diff: 3.885780586188048e-16
  max diff: 3.885780586188048e-16
  max diff: 4.440892098500626e-16
  [...]
  max diff: 1.332267629550188e-15
  max diff: 1.332267629550188e-15
  max diff: 1.332267629550188e-15
  maximum difference between old and new methods: 1.332267629550188e-15
  number of times comparisons returned a non-equal result: 0
  number of comparisons performed: 2254200

What this means is that, of all possible scenarios for N <= 1000, the maximum difference between the old and new algorithms is 1.332E-15, and no result pair was considered non-equal by the approxEqual( a, b ) method. In other words, it is safe to switch to the new algorithm without causing a significant deviation from the values calculated by the old one. A patch will follow.

Created attachment 29337 [details] proposed patch against SRC680_m121

The reason the old algorithm failed is an overflow in a double-precision variable during the calculation of a binomial coefficient (note that the hypergeometric distribution involves three binomial coefficients):

  binomial coefficient: C(n, k) = n! / ( k! (n-k)! )

In the old method, the three factorials n!, k! and (n-k)! were each computed outright, which caused the overflow. In my proposed method, I expand these factorials into their individual factors, e.g.

  10! -> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

and store them in a numerator array or a denominator array (whichever is appropriate). The factors from all three binomial coefficient calculations accumulate in these two arrays. Once that is done, the elements of both arrays are sorted in descending order (largest to smallest), and divisions are performed on all stored elements sequentially from the largest downward (see my proposed patch for implementation details). This is probably the safest way to calculate when multiple factorials are involved, and I recommend applying this approach to other calculation methods that may have a similar numerical overflow problem. Kohei

Setting target to 2.0.1

set issue type to PATCH

Created attachment 29360 [details] Forgot to "const" the function

kohei->dr: Forgot to make the SetBinomElements member function const, as it does not change the object state. Please use this patch instead. Bye, Kohei

I think there should be some kind of limit to the size of the arrays, or we'll have memory problems for large values (in addition to the already-existing runtime problems). Otherwise this looks good and can go into 2.0.1.

Niklas, I can look into the array size issue. Thanks for bringing it up. If no objection, I can just use the maximum allowable size of std::multiset to set array size limit. Will issue another patch soonish. Kohei

Created attachment 29412 [details] revised patch

I ended up setting the array size limit to 50000, but this can be changed (feel free to change it). multiset's max_size was 4294967295 in my environment, but a calculation that could make the arrays that big would probably take longer than a normal user can tolerate. The array size limit turned out to correspond to the number of successes (M) minus one, i.e. when the number of successes is 49999 it calculates, but at 50000 it errors out. A cursory profiling reveals that most of the time is spent in the SetBinomElements() calls, so if a speedup is needed, that's where it has to happen. That said, this is still much better than the current situation IMHO. Oh, one more thing: I've removed the static_casts to unsigned long, because switching to plain doubles made absolutely no difference. The simpler, the better. Let me know if the patch looks good.

Created attachment 29413 [details] revised revised patch

Fixed a typo (enumerator -> numerator). No functionality change.

Found a better algorithm. Will issue yet another patch soon.

Created attachment 29571 [details] revised test program

Compile and run the test program:

  [kohei@localhost ooo]$ g++ -DDEBUG -Os -o hypgeom5 hypgeom5.cxx
  [kohei@localhost ooo]$ ./hypgeom5 123 70000 1234 1234567
  population (N) = 1234567
  # success (M) = 1234
  # samples (n) = 70000
  x = 123
  old: inf (7.582e-05 sec)
  new: 8.56113475055104e-10 (0.0004709 sec)

The corresponding answer from R is:

  > dhyper( 123, 1234, 1234567-1234, 70000 )
  [1] 8.561135e-10

which matches my answer. :-) I have of course checked many other cases against R, and all turned out to be comparable. Also, just like the previous test program, you can run it without arguments to test all cases for N = 0 to 100 against the old algorithm. It passed on my machine.

The major difference from my first submitted algorithm is that the new one does not insert factorial elements that would later cancel, thus avoiding unnecessary expansion and shrinkage of the containers. For example, with my first algorithm the containers sometimes expanded to over 100000 elements, only to be reduced to 50 later on. That was very inefficient and caused the calculation to take far too much time. Because of this improvement, a scenario that took over 10 seconds to complete with my first algorithm is now reduced to a fraction of a second. Note that the new algorithm handles large values pretty well.

I have also switched from a multiset to a sorted vector, which yielded an extra 50% performance increase. set/multiset is designed to be efficient when a good mix of insertion, search and erasure is involved, but because my new algorithm only does insertion, I considered vector the better choice, and it is indeed. :-) I will prepare a patch sometime after tomorrow. I have to go sleep now. Kohei

Niklas or Daniel, What error should I trigger when the array exceeds a set max size? I could just call SetIllegalArgument(), but it's not exactly an argument error. Thanks, Kohei

I think SetNoValue (it shows "#VALUE!") would be best, being a bit more unspecific than "illegal argument". But there isn't really much consistency about which of these errors is used.

Created attachment 29615 [details] yet another revised patch

Here is, hopefully, my final patch. I have done everything I could think of to reduce the execution time. The max array size is set to 500000 based on an acceptable execution time in my environment (~0.8 seconds). This number happens to correspond to x + min( n, M ), so it is evaluated before the calculation begins. A #VALUE! error is thrown when the max array size is exceeded, per Niklas' suggestion (thanks!). So... it's ready to go. I hope you like my patch. Kohei

I'm taking this.

This looks good. I committed the last patch to CWS "dr41" (for OOo 2.0.1). Thanks a lot, Kohei.

Back to QA for verification: re-open issue and reassign to fst@openoffice.org

reassign to fst@openoffice.org

reset resolution to FIXED

Found fixed on cws dr41 for Solaris, Linux and Windows

Found integrated on master m142 using Linux, Solaris and Windows Build