Apache OpenOffice (AOO) Bugzilla – Issue 47296

HYPGEOMDIST function: error with big input values

Last modified: 2013-08-07 15:13:53 UTC

A few days ago I tried to do my homework with Calc. When I use =HYPGEOMDIST(1;50;500;5000) it returns zero, while in Excel it returns 0.028164524. Thank you for the attention. That is what I can do right now. Keep up the good work. Best regards, andi_p

I moved the issue to the component Spreadsheet. Thanks for the submission.

Hi Daniel, please have a look at this one. Frank

Should be done together with general rework of statistical functions.


see also issue 18704

kohei -> dr: I've found a solution. I'll post a patch in a few days. Kohei

Created attachment 29334 [details] test program

Compile this test program with:

  g++ -Os -o hypgeom hypgeom.cxx

and compute the reported test case:

  [kohei@localhost ooo]$ ./hypgeom 1 50 500 5000
  population (N) = 5000
  # success (M) = 500
  # samples (n) = 50
  x = 1
  old: 0
  new: 0.02816452367904049

which matches the correct value reported by andi_p. This also matches the value computed by R:

  > dhyper(1, 500, 4500, 50)
  [1] 0.02816452

Also, execute the program without arguments, which will run all possible cases for N = 0 to 1000 using the old and new algorithms and compare the numerical differences between the two. Note this may take a few minutes to complete. Here is the result I get on my machine:

  [kohei@localhost ooo]$ ./hypgeom
  max diff: 3.885780586188048e-16
  max diff: 3.885780586188048e-16
  max diff: 4.440892098500626e-16
  [...]
  max diff: 1.332267629550188e-15
  max diff: 1.332267629550188e-15
  max diff: 1.332267629550188e-15
  maximum difference between old and new methods: 1.332267629550188e-15
  number of times comparisons returned a non-equal result: 0
  number of comparisons performed: 2254200

What this means is that, of all possible scenarios for N <= 1000, the maximum difference between the old and new algorithms is 1.332E-15, and no result pair was considered non-equal by the approxEqual( a, b ) method. In other words, it is safe to switch to the new algorithm without causing a significant deviation from the values calculated by the old one. A patch will follow.

Created attachment 29337 [details] proposed patch against SRC680_m121

The reason the old algorithm failed is an overflow in a double-precision variable during the calculation of a binomial coefficient (note that the hypergeometric distribution involves three binomial coefficients):

  binomial coefficient: C(n, k) = n! / ( k! (n-k)! )

In the old method, the three factorials n!, k! and (n-k)! were each computed outright, which caused the overflow. In my proposed method, I expand these factorials into their individual factors, e.g.

  10! -> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

and store them in a numerator array or a denominator array (whichever is appropriate). The factors from all three binomial coefficient calculations accumulate in these two arrays. Once that is done, the elements of both arrays are sorted in descending order (largest to smallest), and divisions are performed on all stored elements sequentially from the largest downward (see my proposed patch for implementation details). This is probably the safest way to calculate when multiple factorials are involved, and I recommend applying this approach to other calculation methods that may have a similar numerical overflow problem. Kohei

Setting target to 2.0.1

set issue type to PATCH

Created attachment 29360 [details] Forgot to "const" the function

kohei->dr: Forgot to make the SetBinomElements member function const, as it does not change the object state. Please use this patch instead. Bye, Kohei

I think there should be some kind of limit to the size of the arrays, or we'll have memory problems for large values (in addition to the already-existing runtime problems). Otherwise this looks good and can go into 2.0.1.

Niklas, I can look into the array size issue. Thanks for bringing it up. If no objection, I can just use the maximum allowable size of std::multiset to set array size limit. Will issue another patch soonish. Kohei

Created attachment 29412 [details] revised patch

I ended up setting the array size limit to 50000, but this can be changed (feel free to change it). multiset's max_size was 4294967295 in my environment, but a calculation that could make the arrays that big would probably take longer than a normal user can tolerate. The array size limit turned out to correspond to the number of successes (M) minus one, i.e. when the number of successes is 49999 it calculates, but at 50000 it errors out. A cursory profiling reveals that most of the time is spent in the SetBinomElements() calls, so if a speedup is needed, that's where it has to happen. That said, this is still much better than the current situation IMHO. Oh, one more thing: I've removed the static_casts to unsigned long, because switching to plain doubles made absolutely no difference. The simpler, the better. Let me know if the patch looks good.

Created attachment 29413 [details] revised revised patch

Fixed a typo (enumerator -> numerator). No functionality change.

Found a better algorithm. Will issue yet another patch soon.

Created attachment 29571 [details] revised test program

Compile and run the test program:

  [kohei@localhost ooo]$ g++ -DDEBUG -Os -o hypgeom5 hypgeom5.cxx
  [kohei@localhost ooo]$ ./hypgeom5 123 70000 1234 1234567
  population (N) = 1234567
  # success (M) = 1234
  # samples (n) = 70000
  x = 123
  old: inf (7.582e-05 sec)
  new: 8.56113475055104e-10 (0.0004709 sec)

The corresponding answer from R is:

  > dhyper( 123, 1234, 1234567-1234, 70000 )
  [1] 8.561135e-10

which matches my answer. :-) I have of course checked many other cases against R, and all turned out to be comparable. Also, just like the previous test program, you can run it without arguments to test all cases for N = 0 to 100 against the old algorithm. It passed on my machine.

The major difference from my first submitted algorithm is that the new one does not insert factorial elements that would later cancel, thus avoiding unnecessary expansion and shrinkage of the containers. For example, with my first algorithm the containers sometimes expanded to over 100000 elements, only to be reduced to 50 later on. That was very inefficient and caused the calculation to take far too much time. Because of this improvement, a scenario that took over 10 seconds to complete with my first algorithm is now reduced to a fraction of a second. Note that the new algorithm handles large values pretty well.

I have also switched from a multiset to a sorted vector, which yielded an extra 50% performance increase. set/multiset is designed to be efficient when a good mix of insertion, search and erasure is involved, but because my new algorithm only does insertion, I considered vector the better choice, and it is indeed. :-) I will prepare a patch sometime after tomorrow. I have to go sleep now. Kohei

Niklas or Daniel, What error should I trigger when the array exceeds a set max size? I could just call SetIllegalArgument(), but it's not exactly an argument error. Thanks, Kohei

I think SetNoValue (it shows "#VALUE!") would be best, being a bit more unspecific than "illegal argument". But there isn't really much consistency about which of these errors is used.

Created attachment 29615 [details] yet another revised patch

Here is, hopefully, my final patch. I have done everything I could think of to reduce the execution time. The max array size is set to 500000 based on an acceptable execution time in my environment (~0.8 seconds). This number happens to correspond to x + min( n, M ), so it is evaluated before the calculation begins. A #VALUE! error is thrown when the max array size is exceeded, per Niklas' suggestion (thanks!). So... it's ready to go. I hope you like my patch. Kohei

I'm taking this.

This looks good. I committed the last patch to CWS "dr41" (for OOo 2.0.1). Thanks a lot, Kohei.

Back to QA for verification: re-open issue and reassign to fst@openoffice.org

reassign to fst@openoffice.org

reset resolution to FIXED

Found fixed on cws dr41 for Solaris, Linux and Windows

Found integrated on master m142 using Linux, Solaris and Windows Build