Apache OpenOffice (AOO) Bugzilla – Issue 121421

rand() behaves poorly on some platforms

Last modified: 2013-07-12 11:13:42 UTC

Calc uses libc's rand(3) as a random number generator in Calc. The code is in: main/sc/source/core/tool/interpr1.cxx ScInterpreter::ScRandom() In general rand() is not a very good random number generator, especially on Windows: BSD 4.2 libc introduced random(3) which in general behaves better but may not be available in all platforms. Interestingly the methods used by MS-Excel are well documented and relatively easy to generate: MS-Excel up to 97 used this random generator: http://support.microsoft.com/kb/86523 For 2003, and upper they use a better method: http://support.microsoft.com/kb/828795 I would suggest implementing this last method.

It appears, based on the problems that Excel had getting a correct implementation of the Wichman-Hill algorithm, that one must be very careful about attempting a literal transcription of the method. First, salting must ensure that the initial IX, IY, and IZ values are positive nonzero values less than the corresponding modulus values (30269, 30307, and 30323). In addition, the possibility of values <= 0 as the result of integer multiplication overflows must be protected against. Note that an intermediate product before any MOD is taken can be as large as 5,212,632 and the division in the implementation of MOD must work with dividends that large as well. I'm also marginally suspicious of all of those floating-point divisions, since the moduli are all relatively prime to 2 (and 5). The arithmetic in the RANDOM line should all be in DOUBLE (if not greater) and it is not clear how many bits of the AMOD result should be considered reliable, since recurrence patterns will happen. Also, before claiming DIEHARD, it is really important to actually run it on the implemented code. I can't imagine that the defects noted in the Excel implementation would not have been caught. I just made an Excel 2010 Spreadsheet with 500 =RAND() cells. I don't see any of the problems that were noticed in the testing that was reported in 2004. It works fine in Excel 2013 Release Preview too. Finally, Excel 2003 SP3 also does just fine. Saving as an .XLS from 2003, it opened fine in AOO 3.4.1 and LibreOffice 3.6.2 although I did not do any testing -- just eye-balling, so I don't know about duplicates or other repetition problems. I am seeing =RAND() as decimal fractions to 9 or 10 decimal digits, depending on the application. I did not convert them to some range of integers and I didn't think to put them in a CSV and sort them. Some other time.

(In reply to comment #1) > It appears, based on the problems that Excel had getting a correct > implementation of the Wichman-Hill algorithm, that one must be very careful > about attempting a literal transcription of the method. > > First, salting must ensure that the initial IX, IY, and IZ values are > positive nonzero values less than the corresponding modulus values (30269, > 30307, and 30323). > > In addition, the possibility of values <= 0 as the result of integer > multiplication overflows must be protected against. Note that an > intermediate product before any MOD is taken can be as large as 5,212,632 > and the division in the implementation of MOD must work with dividends that > large as well. > The FORTRAN comment states clearly that IX, IY and IZ must be values between 1 and 30000. > I'm also marginally suspicious of all of those floating-point divisions, > since the moduli are all relatively prime to 2 (and 5). The arithmetic in > the RANDOM line should all be in DOUBLE (if not greater) and it is not clear > how many bits of the AMOD result should be considered reliable, since > recurrence patterns will happen. > I would think the divisions are made precisely to make the remainders more random. One thing that where we will necessarily lose is speed though: on FreeBSD random(3) is 2/3 the speed of rand(3). This algorithm will be at least three times slower. > Also, before claiming DIEHARD, it is really important to actually run it on > the implemented code. I can't imagine that the defects noted in the Excel > implementation would not have been caught. > Since the implementation is in C we could actually take the C code for comparison purposes before importing it. I don't see myself running the Diehard tests as I am rather lazy to fill forms though ;).

Created attachment 79968 [details] sample implementation I made a proof-of-concept implementation of the newer MS-Office method in C. Getting it right in C involved some subtle math tricks due to differences with FORTRAN but it looks like it is well behaved. To try it just cc -o newrand newrand.c ./newrand FWIW, the standard rand() function in FreeBSD is not good at all, the results are much better with random(). Still even when using rand() the results from this algorithm look good.

(In reply to comment #2) > (In reply to comment #1) [ ... ] > > First, salting must ensure that the initial IX, IY, and IZ values are > > positive nonzero values less than the corresponding modulus values (30269, > > 30307, and 30323). > The FORTRAN comment states clearly that IX, IY and IZ must be values between > 1 and 30000. I don't think it matters but it would be interesting to know what Wichman-Hill said about it. The use of Wichmann-Hill in the system, R, seeds with random values in the constraints I state. It seems strange to restrict the salt values to ones less than those that can be produced and consumed by the respective functions. > > > I'm also marginally suspicious of all of those floating-point divisions, > > since the moduli are all relatively prime to 2 (and 5). The arithmetic in > > the RANDOM line should all be in DOUBLE (if not greater) and it is not clear > > how many bits of the AMOD result should be considered reliable, since > > recurrence patterns will happen. > > > > I would think the divisions are made precisely to make the remainders more > random. One thing that where we will necessarily lose is speed though: on > FreeBSD random(3) is 2/3 the speed of rand(3). This algorithm will be at > least three times slower. The divisions just change the values from the integer modular generators into fractions, x, in the range 0 < x < 1. It doesn't do anything to increase randomness. The original values can be reconstructed pretty easily, especially with such small modulus values. Adding them together and discarding the integer part is what makes this generator interesting. There's also a serious error in the Microsoft description. It is true that if those fractions were random on [0,1] the fractional part of their sum is also random. But the IX, IY, and IZ sequences are hardly random at all. So the claim is insufficient as a rationale for the heightened randomness of the combination. The only remark that Donald Knuth makes on the Wichmann-Hill generator is that they provide a reliable method for computing the mod function of a double-word value and a single-word modulus. This arises in the solution to Exercise 3.2.1.1(9) in The Art of Computer Programming. I nosed around a bit more. There are special requirements on the initial seed to ensure that a maximum period is obtained. There are other conditions that I was unable to dig into quickly. [I collided with your proof of concept. I'll look at it now.]

Created attachment 79970 [details] Proof of concept in C This is more faithful to the original algorithm.

Created attachment 79971 [details] Proof of concept in C Oops ... small typo in a coefficient.

(In reply to comment #4) ... > > I nosed around a bit more. There are special requirements on the initial > seed to ensure that a maximum period is obtained. There are other > conditions that I was unable to dig into quickly. > In the test program I also notice there are issues with the seed: this is usually based on the time so running the program too fast will generate repeated values. I would think that Calc doesn't have that problem. as srand() is called elsewhere. > [I collided with your proof of concept. I'll look at it now.] I made some mistakes at first and it was just for fun, but it seems to give good results. One issue I see is that it is impossible to get a value of 1.

(In reply to comment #4) > > There's also a serious error in the Microsoft description. It is true that > if those fractions were random on [0,1] the fractional part of their sum is > also random. But the IX, IY, and IZ sequences are hardly random at all. So > the claim is insufficient as a rationale for the heightened randomness of > the combination. > > The only remark that Donald Knuth makes on the Wichmann-Hill generator is > that they provide a reliable method for computing the mod function of a > double-word value and a single-word modulus. This arises in the solution to > Exercise 3.2.1.1(9) in The Art of Computer Programming. > Apparently you have to pay to read this but the the title says it all ;). http://www.sciencedirect.com/science/article/pii/S016794730800162X Cheers, Pedro.

(In reply to comment #8) > (In reply to comment #4) > > > > Apparently you have to pay to read this but the the title says it all ;). > > http://www.sciencedirect.com/science/article/pii/S016794730800162X That's old news, I think. There was also an older report that is not behind a pay-wall.

(In reply to comment #7) > (In reply to comment #4) > > [I collided with your proof of concept. I'll look at it now.] > > I made some mistakes at first and it was just for fun, but it seems > to give good results. One issue I see is that it is impossible to get > a value of 1. That's correct, the results are strictly 0 < r < 1. I made a version, also adjusting to your latest changes. This version will generate sets of results for analysis and inspection. If you just run it, it makes 1000. What's more fun is >newrand 10000000 | sort | more Note that the individual IX, IY, and IZ will have begun to recycling several times, but not in synchronization. Sorting on one of those columns could be even more interesting.

Created attachment 79972 [details] Alternative for Generating Sets of Results This is adjusted to generate long runs of the random values. Enough digits are printed of the 0 < r < 1 random doubles so that where the values tail off at the extremes of precision (and the output conversion software) can be observed. An optional command-line parameter can specify the length of the sequence you want. The default for the parameter is 1000. Interesting results appear with runs of 1000000 or more. By then, each of the three short-width generation functions will have recycled a few times, but not in synchronization.

(In reply to comment #11) > Created attachment 79972 [details] > Alternative for Generating Sets of Results > > This is adjusted to generate long runs of the random values. Enough digits > are printed of the 0 < r < 1 random doubles so that where the values tail > off at the extremes of precision (and the output conversion software) can be > observed. > > An optional command-line parameter can specify the length of the sequence > you want. The default for the parameter is 1000. Interesting results > appear with runs of 1000000 or more. By then, each of the three short-width > generation functions will have recycled a few times, but not in > synchronization. Ignoring MAX_RAND is a very bad idea. I also found my implementation is incomplete: to avoid precision issues the complete implementation does some adjustments that involve substracting. That may be cause for microsoft's bug. The algorithm is rather old and thought for 16 bit architectures. There has been a revision in 2007 (which obviously didn't make it into Office 2003). I will update this soon.

(In reply to comment #12) > Ignoring MAX_RAND is a very bad idea. That would only be the case if the % failed in the initialization. There is no indication that has happened. A serious implementation would have to look at the seeding far more carefully. Although the %-operation I used to simplify things puts some bias into the seed values, this didn't seem important for proof-of-concept. (There is bias in the choking of seeds to under 30000 already. I simply don't believe that comment in the Fortran code.) > I also found my implementation is incomplete: to avoid precision issues the > complete implementation does some adjustments that involve substracting. > That may be cause for microsoft's bug. This is apparently not a problem on systems where sizeof(int) > 3. If it were, we could see negative IX,IY,IZ values at some point. It might be safer to declare those as longs, though. It will be interesting to see if that changes anything. > > The algorithm is rather old and thought for 16 bit architectures. There has > been a revision in 2007 (which obviously didn't make it into Office 2003). I > will update this soon. I noticed that too. The maximum periods of the individual IX, IY, and IZ series are terribly small. Do you have a reference on any analysis of the 2007 version and the rationale for the choice of its parameters? (In this case, watching out for multiplication range errors and %-buts will also be more important.)

Please notice the discussions in the thread http://www.mail-archive.com/libreoffice@lists.freedesktop.org/msg46285.html, to use the solutions from boost.

(In reply to comment #13) > (In reply to comment #12) > > > Ignoring MAX_RAND is a very bad idea. > That would only be the case if the % failed in the initialization. > The seed will be biased toward lower values. It's not a serious issue but it doesn't buy us anything either. > > I also found my implementation is incomplete: to avoid precision issues the > > complete implementation does some adjustments that involve substracting. > > That may be cause for microsoft's bug. > This is apparently not a problem on systems where sizeof(int) > 3. If it > were, we could see negative IX,IY,IZ values at some point. It might be > safer to declare those as longs, though. It will be interesting to see if > that changes anything. For the 2006 implementation you do need to do the adjustment to avoid 64 bit integer math. > > > > The algorithm is rather old and thought for 16 bit architectures. There has > > been a revision in 2007 (which obviously didn't make it into Office 2003). I > > will update this soon. > I noticed that too. The maximum periods of the individual IX, IY, and IZ > series are terribly small. Do you have a reference on any analysis of the > 2007 version and the rationale for the choice of its parameters? (In this > case, watching out for multiplication range errors and %-buts will also be > more important.) I found this on the net: http://www.eurometros.org/file_download.php?file_key=247 However I am having two problems: 1) seeding 2) It's not working (?).

(In reply to comment #14) > Please notice the discussions in the thread > http://www.mail-archive.com/libreoffice@lists.freedesktop.org/msg46285.html, > to use the solutions from boost. I am aware of that (plus I updated boost) but I don't want to add many dependencies on it if I can avoid it. On the other hand we are using rand() on other places and this might be a good opportunity to clean that up.

(In reply to comment #15) > (In reply to comment #13) > > (In reply to comment #12) [ ... ] > > > The algorithm is rather old and thought for 16 bit architectures. There has > > > been a revision in 2007 (which obviously didn't make it into Office 2003). I > > > will update this soon. > > I noticed that too. The maximum periods of the individual IX, IY, and IZ > > series are terribly small. Do you have a reference on any analysis of the > > 2007 version and the rationale for the choice of its parameters? (In this > > case, watching out for multiplication range errors and %-buts will also be > > more important.) > > I found this on the net: > http://www.eurometros.org/file_download.php?file_key=247 > > However I am having two problems: > 1) seeding > 2) It's not working (?). That's a nice paper. I'll see if I can get it working. The explanation is pretty clear, and it identifies the parts of the original that seem pretty suspicious. There is good analysis of what the actual maximum periods are too. The adjustments for 32-bit arithmetic are important. It is also interesting about performance comparisons and the use to generate multiple long sequences that do not correlate.

(In reply to comment #16) > (In reply to comment #14) > > Please notice the discussions in the thread > > http://www.mail-archive.com/libreoffice@lists.freedesktop.org/msg46285.html, > > to use the solutions from boost. > > I am aware of that (plus I updated boost) but I don't want to add many > dependencies on it if I can avoid it. > > On the other hand we are using rand() on other places and this might be a > good opportunity to clean that up. That's a nice find. The recommended boost::mt19937 is the Mersenne Twister. It has an insane period accomplished with a 2500 byte "state". Even so, it runs about 4 times as fast as the improved Wichmann-Hill algorithm in their 2005 paper's estimation. The improved Wichmann-Hill may rank better on x64 though.

(In reply to comment #15) ... > However I am having two problems: > 1) seeding > 2) It's not working (?). I got it working .. I had a typo that left an open comment :-P. Now I have to find a nice way to generate the seed. The old algorithm had a range for the seed. The new seed doesn't seem to have that condition. Concerning boost.. it would be really nice to have private extensions for other random functions. For the default case Wichmann-Hill 2006 looks good enough.. IMHO.

(In reply to comment #18) > That's a nice find. The recommended boost::mt19937 is the Mersenne Twister. > It has an insane period accomplished with a 2500 byte "state". Even so, it > runs about 4 times as fast as the improved Wichmann-Hill algorithm in their > 2005 paper's estimation. The improved Wichmann-Hill may rank better on x64 > though. I knew my ACM Digital Library subscription was good for something. The Mersenne Twister paper is the most-heavily cited paper of any in ACM TOMAS (Transactions on Modeling and Simulation). It is also the most downloaded paper. In addition to extensive analysis and provision of an abstract algorithm, there is a C Language implementation of mt19937 in the paper. This is apparently code-identical to the method used in boost::mt19937. According to the boost documentation, there was a tweak concerning the seed process when given an integer value as starter for the internal seed generation. That is probably unimportant for now. I don't know if that was an implementation issue or a problem in the original paper. I do notice in the C code that everything in MT Twister is done with 32-bit unsigned integers, although it returns a [0..1] float. It looks like 1 is a possible result. I can't post the paper, but I can extract the C code, for comparison with the 32-bit Wichmann-Hill. PS: Wichmann-Hill can return 0, but it can't return 1. I forgot that x - int(x) can catch a 0 (including by underflow) when the sum of fractions is near enough to 1, 2, or 3.

(In reply to comment #20) > I can't post the paper, but I can extract the C code, for comparison with > the 32-bit Wichmann-Hill. The Wikipedia treatment is good enough, and there are extensive links to implementations: http://en.wikipedia.org/wiki/Mersenne_twister I haven't checked the pseudo-code against the C version in the original paper.

Created attachment 79976 [details] Rough implementation Here is a very rough patch that I haven't even compile-tested. The idea is now to generate a basic seed using rand() to generate a long int and put it a constructor somewhere so that it's generated once.

Created attachment 79977 [details] Rough implementation (WIP) Still haven't compiled it but I found several typos.

Created attachment 79978 [details] Rough implementation (WIP) Still more cleanups.

(In reply to comment #24) > Created attachment 79978 [details] > Rough implementation (WIP) > > Still more cleanups. For the initialization, you can try something like this: long SimpleSeeder(long prev) { return (((69069L * prev) & 0x3FFFFFFF) | 1; /* not all that random but nonzero and less than the moduli, all up near 0x7FFFFFFF */ } ix = SimpleSeeder((long) time()); iy = SimpleSeeder(ix); iz = SimpleSeeder(iy); it = SimpleSeeder(iz); Not sure how you need to fit this into an initialization of static variables. I looked around for better initial generators and gave up for now. This hack should work for a while.

Created attachment 79981 [details] Experimental implementation This is a first version: it is still experimental and I would like some review before committing but it works here. My C++ is a bit rusty: originally I tried to define the seed as a private member of the ScInterpreter and to make ScGlobal a friend class that would set up the seed but I got trouble linking. If a reviewer with C++ fu wants to give it a try, let me know.

Created attachment 79982 [details] Improved Generator for producing sets of results (In reply to comment #25) > For the initialization, you can try something like this: > > long SimpleSeeder(long prev) > { return (((69069L * prev) & 0x3FFFFFFF) | 1; > /* not all that random but nonzero and less than > the moduli, all up near 0x7FFFFFFF */ > } This was the right idea, but the bigger problem was having sufficient variation from starter seeds chosen only a few seconds apart. Here is a program using the 32-bit updated Whitmann-Hill pseudo-random generator, similar to the one Pedro has put in the form of a patch to Calc. This one can be run standalone to see how varied the sequences are, how there is useful variance in the initial result as well as the continuation of a long series, etc. This is not the same as strenuous testing. It's time for that now.

Committed revision 1416271. I was able to completely remove the use of rand() by using the existing support from rtl/random to generate the seeds. Some hunting around the tree for uses of rand() would be good. I have tested my new implementation: it becomes a little sluggish after adding more than 1 million entries (what doesn't) but otherwise it seems to work well. Still adding additional random generation engines for boost as scaddins would be really nice.

Created attachment 79984 [details] Improved Generator for testing sets of results 0.02 For completeness, here is an improved standalone test of the Wichmann-Hill 32-bit pseudo-random generator from 2005-2006. Because it is standalone, instead of customized for Calc, there is an snewrandom() function for specifying a seed. Although a random source can be used, this will also provide good separation even with regular seeds (1,2,3,...) or using something with low entropy such as too-recent repetitions of time(NULL). I also made adjustments to stay within ISO/IEC C 1995. There should be no platform dependencies so long as the long type has at least 4 bytes. This may be useful in comparisons as well as testing the basic WH PRG.

Created attachment 79986 [details] Improved Generator for testing sets of results 0.03 Since I put code into the wild, I have cleaned up the edge cases and put in some defensive heuristics against "the bad seed." I'm done with this here. I may make a little library version, but that has nothing to do with the verification of WH2006-style algorithms as improvements for Calc, so I'll find a more general place for that work.

Created attachment 79990 [details] Final Generator for Generating Sets of results 0.04 OK, one last time. I finally found the theoretical foundation for making all of the pathological cases go away. It was very interesting to find that and see how to simplify the handling of user-supplied initial seeds. I'm making this replacement so that no one will be attempted to use the ones with deficiencies that I uploaded before. I am definitely done for now.

(In reply to comment #31) > Created attachment 79990 [details] > Final Generator for Generating Sets of results 0.04 > I'm making this replacement so that no one will be attempted to use the ones > with deficiencies that I uploaded before. So now one will be *tempted" ...

FWIW; Looking at: main/scaddins/source/analysis/analysis.cxx RANDBETWEEN uses libc's rand(): it would be great if if could use the standard RAND() instead. It looks pretty easy to solve but I don't know much about scaddins, any help the would be welcome.

"pfg" committed SVN revision 1418174 into trunk: i121421 - Mostly cosmetic fixes.

"pfg" committed SVN revision 1438322 into trunk: i121421 - Calc's RAND() behaves poorly on most platforms.