Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing 
Summary:  HYPGEOMDIST function: error with big input values  

Product:  Calc  Reporter:  andi_p <jaguarandi_dna>  
Component:  editing  Assignee:  frank  
Status:  CLOSED FIXED  QA Contact:  issues@sc <issues>  
Severity:  Trivial  
Priority:  P3  CC:  issues, kyoshida  
Version:  OOo 1.1  
Target Milestone:    
Hardware:  All  
OS:  Windows XP  
Issue Type:  PATCH  Latest Confirmation in:    
Developer Difficulty:    
Issue Depends on:  
Issue Blocks:  18704  
Attachments: 

Description
andi_p
2005-04-13 00:40:12 UTC
I moved the issue to the component Spreadsheet. Thanks for the submission.

Hi Daniel, please have a look at this one. Frank

Should be done together with general rework of statistical functions. See also issue 18704.

kohei->dr: I've found a solution. I'll post a patch in a few days. Kohei

Created attachment 29334 [details]
test program
Compile this test program by:
g++ -Os -o hypgeom hypgeom.cxx
and compute the reported test case:
[kohei@localhost ooo]$ ./hypgeom 1 50 500 5000
population (N) = 5000
# success (M) = 500
# samples (n) = 50
x = 1
old: 0
new: 0.02816452367904049
which matches the correct value reported by andi_p. This also matches the value
computed by R:
> dhyper(1, 500, 4500, 50)
[1] 0.02816452
Also, executing the program without arguments runs all possible cases for
N = 0 to 1000 using both the old and new algorithms and compares the numerical
differences between the two. Note this may take a few minutes to complete.
Here is the result I get on my machine:
[kohei@localhost ooo]$ ./hypgeom
max diff: 3.885780586188048e-16
max diff: 3.885780586188048e-16
max diff: 4.440892098500626e-16
[...]
max diff: 1.332267629550188e-15
max diff: 1.332267629550188e-15
max diff: 1.332267629550188e-15
maximum difference between old and new methods: 1.332267629550188e-15
number of times comparisons returned a nonequal result: 0
number of comparisons performed: 2254200
What this means is that, of all possible scenarios for N <= 1000, the maximum
difference between the old and new algorithms is 1.332E-15, and no pair of
values was considered non-equal by the approxEqual( a, b ) method. In other
words, it's safe to switch to the new algorithm without causing significant
deviation from the values calculated by the old algorithm.
A patch will follow....
Created attachment 29337 [details]
proposed patch against SRC680_m121
The reason the old algorithm failed is an overflow in a double-precision variable during the calculation of a binomial coefficient (note that a hypergeometric distribution calculation involves three binomial coefficients):

binomial coefficient: C(n, k) = n! / (k! (n-k)!)

In the old method, the three factorials n!, k! and (n-k)! were calculated outright, which caused the overflow. In my proposed method, I expand these factorials into individual elements, e.g. 10! -> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and store them in an enumerator array or a denominator array (whichever is appropriate). These elements accumulate in the two arrays across all three binomial coefficient calculations. Once that's done, the elements in both arrays are sorted in descending order (from largest to smallest), and divisions are performed on all stored elements sequentially from the largest elements downward (see my proposed patch for implementation details). This is probably the safest way to calculate when multiple factorial elements are involved, and I recommend that the way I calculated the hypergeometric distribution here be applied to other calculation methods that may have a similar numerical overflow problem. Kohei

Setting target to 2.0.1

set issue type to PATCH

Created attachment 29360 [details]
Forgot to "const" the function
kohei->dr: Forgot to const the SetBinomElements member function, as it does not change the object state. Please use this patch instead. Bye, Kohei

I think there should be some kind of limit on the size of the arrays, or we'll have memory problems for large values (in addition to the already-existing runtime problems). Otherwise this looks good and can go into 2.0.1.

Niklas, I can look into the array size issue. Thanks for bringing it up. If there is no objection, I'll just use the maximum allowable size of std::multiset to set the array size limit. Will issue another patch soonish. Kohei

Created attachment 29412 [details]
revised patch
I ended up setting the array size limit to 50000, but this can be changed (feel free to change it). multiset's max_size was 4294967295 in my environment, but a calculation that could make the arrays that big would probably take longer than a normal user can tolerate.

The array size limit turned out to correspond to the number of successes (M) minus one, i.e. when the number of successes is 49999 it calculates, but at 50000 it errors out.

A cursory profiling reveals that most time is spent in the SetBinomElements() calls, so if a speedup is needed, that's where it has to happen. That said, this is still much better than the current situation IMHO.

Oh, one more thing: I've removed the static_casts to unsigned long, because switching to plain doubles made absolutely no difference. The simpler the better. Let me know if the patch looks good.

Created attachment 29413 [details]
revised revised patch
Fixed a typo (enumerator -> numerator). No functionality change.

Found a better algorithm. Will issue yet another patch soon.

Created attachment 29571 [details]
revised test program
Compile and run the test program:
[kohei@localhost ooo]$ g++ -DDEBUG -Os -o hypgeom5 hypgeom5.cxx
[kohei@localhost ooo]$ ./hypgeom5 123 70000 1234 1234567
population (N) = 1234567
# success (M) = 1234
# samples (n) = 70000
x = 123
old: inf (7.582e-05 sec)
new: 8.56113475055104e-10 (0.0004709 sec)
The corresponding R's answer is:
> dhyper( 123, 1234, 1234567-1234, 70000 )
[1] 8.561135e-10
which matches my answer. :) I have of course checked many other cases against
R, which all turned out to be comparable.
Also, just like the previous test program, you can run it without arguments to
check all cases for N = 0 to 100 against the old algorithm. It was a "pass"
on my machine.
The major difference from my first submitted algorithm is that the new
algorithm never inserts those factorial elements that would later be removed,
thus avoiding the unnecessary expansion and shrinkage of the containers. For
example, with my first algorithm the container sometimes expanded to over
100000 elements, only to be reduced to 50 later on. That was very inefficient,
and caused the calculation to take far too much time. Because of this
improvement, a scenario that took over 10 seconds to complete with my first
algorithm now finishes in a fraction of a second.
Note that my new algorithm handles large values pretty well.
I have also switched from multiset to a sorted vector, which yielded an extra
50% performance increase. set/multiset is designed to be efficient when a good
mix of insertion, search, and erasure is involved, but because my new
algorithm only does insertions, I considered vector to be the better choice,
and it is indeed. :)
I will prepare a patch sometime after tomorrow. I have to go sleep now.
Kohei
Niklas or Daniel, what error should I trigger when the array exceeds the set max size? I could just call SetIllegalArgument(), but it's not exactly an argument error. Thanks, Kohei

I think SetNoValue (it shows "#VALUE!") would be best, being a bit more unspecific than "illegal argument". But there isn't really much consistency about which of these errors is used.

Created attachment 29615 [details]
yet another revised patch
Here is, hopefully, my final patch. I have done everything I could think of to reduce the execution time. The max array size is set to 500000, based on an acceptable execution time in my environment (~0.8 seconds). This number happens to correspond to x + min( n, M ), so it is evaluated before the calculation begins. It throws a #VALUE! error when the max array size is exceeded, per Niklas' suggestion (thanks!). So... it's ready to go. I hope you like my patch. Kohei

I'm taking this. This looks good. I committed the last patch to CWS "dr41" (for OOo 2.0.1). Thanks a lot, Kohei.

back to QA for verification

reopen issue and reassign to fst@openoffice.org

reassign to fst@openoffice.org

reset resolution to FIXED

Found fixed on cws dr41 for Solaris, Linux and Windows

Found integrated on master m142 using Linux, Solaris and Windows Build