SA Bugzilla – Bug 1338
New ranking specifically for figuring out which tests to keep/not
Last modified: 2004-03-11 05:57:37 UTC
Hi. I've been doing some reading on Bayesian statistics, and have come across an interesting method of figuring out "how much data does this test normally give us" that turns out to also be found in Shannon information theory. See _Bayesian Theory_ by Bernardo and Smith, pg 82 (further bibliographic details available on request) and Shannon, 1948, "A Mathematical Theory of Communication" (as reprinted in _The Mathematical Theory of Communication_, 1949, Shannon and Weaver). I put together a mode for this in hit-frequencies. It's specifically of use for figuring out how much utility a test has, not so much for figuring out what the score range should be - there is a difference in this case. I'll attach the patch (which also allows hit-frequencies to process net and userconf tests with a -a flag). -Allen
Created attachment 535 [details] Patch to provide another ranking method (-N p(spam) or -n)
Oh - -n is the equivalent of -x for this ranking. The ranking does vary with the probability of spam one assumes; use -N p(spam) to set it (which will also turn on -n if that isn't already set). I wouldn't use a p(spam) over 0.5 (the default); a p(spam) of 0.1 should be a cautious setting, although I'm still experimenting with the effects of different p(spam) settings on this. -Allen
Subject: Re: [SAdev] New ranking specifically for figuring out which tests to keep/not

Do we have a translator? :-)
> Do we have a translator? :-)

Heh... you speak math? And know a good way for me to type it in? -Allen
Part of this, BTW, is the realization that there can be a distinction between "what tests do we want to keep" and "what tests should have the highest-magnitude scores". A test can be trusted with a high-magnitude score if it has a low false-positive/false-negative rate. A test should be kept if it tends to provide a lot of data - and that means it has a lot of hits and _most_ of those are spam or nonspam. A test that has an S/O of 1 but only a couple of hits may not be worth running, particularly for high-load tests (net tests, for instance) - it doesn't provide that much information in the average case. The formula I've programmed into hit-frequencies looks at the latter question, in that it is trying to figure out how much information a test tends to provide. -Allen
Subject: Re: [SAdev] New ranking specifically for figuring out which tests to keep/not

bugzilla-daemon@hughes-family.org writes:
> Part of this, BTW, is the realization that there can be a distinction
> between "what tests do we want to keep" and "what tests should have
> the highest-magnitude scores".

I understand your goal (if not the math), but why not just have one RANK? I generally use the current RANK number to determine whether tests are worth keeping or not, but I don't rely 100% on it -- I also factor in things like S/O ratio, hit rates, speed of the test, overlap with other tests, FP/FN reduction, etc. Basically, if you can come up with a new RANK that agrees with me more often than the current RANK, I think we should switch. :-)

Anyway, can you publish methods for selecting the following:

- the worst 10 tests
- the worst 10% of tests

(You probably should use all available nightly test results from the last day or two.) Then, let's look at the worst 10 and 10% of tests using the current RANK and your new RANK and see which agrees with our gut feeling better. If the new one is better, let's just make it the RANK. Another way is to remove the worst 10% using either method, run the GA, and then see which one does better on a separate test corpus.
> I understand your goal (if not the math)

I only really understand what the math _means_, not enough to re-derive it! Think of SA tests as a noisy communications channel. For a noiseless channel, the maximum information is conveyed per symbol if the symbols are equally likely. For binary symbols with placement determining meaning (the closest analogy for SA tests), the most information is conveyed if there's a 50/50 chance of a 0 or a 1. Otherwise, the more common of the pair doesn't tell you much (it's common, so you'd expect it anyway), and while the less common of the pair means more, it's also less common, so you don't _get_ that extra meaning very often. SA tests have noise - they have an soratio other than 1 or 0 - so they communicate a bit less information. That's _mostly_ factored in by the expression inside the log() function; I reduced the original down by canceling out terms, with the original being, essentially, observed/expected (and thus a log of 0 if the two are equal).

> why not just have one RANK?

score-ranges-from-freqs uses the existing scoring method, and as I stated, this method of RANKing is specifically not as useful for figuring out how trustworthy a test is - how high-magnitude a test's score should be allowed to be. If a test has a high RANK by this method but an S/O that isn't 1 or 0, then it probably should be used, but with a lower score (so that it can make a difference in combination with other low-scoring tests). If the S/O is closer to 0.5 than is tolerable but the test has a high RANK by this method, then it is a likely candidate for use as a sub-component of meta tests.

> I generally use the current RANK number to determine whether
> tests are worth keeping or not, but I don't rely 100% on it

I wouldn't do so either, including with this new one - but I believe it better integrates the hit ratios and soratio for determining whether tests are worth keeping or not.
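The "50/50 symbols carry the most information" point above is just the binary entropy function, which peaks at p = 0.5 and falls to zero at the extremes. A quick sketch (mine, not part of the patch), in nats to match the plain log() used there:

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a binary symbol with probability p of being 1.
    Maximized at p = 0.5 and dropping to 0 as p approaches 0 or 1 -
    i.e., equally likely symbols convey the most information per symbol."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))
```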
> Anyway, can you publish methods for selecting the following:
> - the worst 10 tests
> - the worst 10% tests

The worst are going to be the same in any case:

A. Non-nice tests with soratios below 0.5, and nice tests with soratios above 0.5.
B. 0-hitters.

See below for an alternative comparison.

> (You probably should use all available nightly test results from the
> last day or two.)

OK, I'll download them from the RSYNC server. (BTW, http://www.pathname.com/~corpus/ needs to be adjusted temporarily to reflect that we're doing a HEAD, non-dated, mass-check run; it isn't including mine, probably because it wasn't a NIGHTLY one and because it was completed before 9AM UTC.)

> Then, let's look at the worst 10 and 10% tests using the current RANK
> and your new RANK and see which agrees with our gut feeling better.

Agreed - modulo that the best comparison will probably be, say, which tests were below 0.8 or so with the old RANK versus which ones are in the equivalent decile with the new RANK.

> If the new one is better, let's just make it the RANK.

Again, it isn't suitable for that. I have in mind one possible replacement for the current RANK for figuring out score ranges (using a variant of the Robinson f(w) equation on the soratio), but it's going to require more testing (GA runs) than there's really going to be time for before we want to get 2.50 out the door. -Allen
I've done some examination of the results of applying the new ranking method to current rsync-server logs. The results were an interesting mixture of "yes, that's how it should be" and "WTH?". I did some further analysis and determined:

A. One problem was some people running with BAYES and some not, so the BAYES_ results were very irregular.

B. The old method, due to its heavy usage of soratios, spotted cases where a test had inconsistent results between corpora - not simply some people having hits and others not, which is tolerable, and not simply a test being consistently "just OK", which is likewise tolerable if the test has a high enough hit rate and/or can be combined with other tests to eliminate some of the FPs/FNs, but cases where applying the test would be downright disastrous in some circumstances.

C. The new method did do very well at spotting tests that simply don't have enough data for the old ranking method to be reliable, and at spotting tests that, if they involve heavy load for one reason or another (e.g., network tests), aren't good enough to be worth it under most circumstances.

D. Assuming a p(spam) of 0.1 gives overall better results, as I suspected.

Given the above, I have come up with both a revised hit-frequencies patch and a program specifically for use with multiple corpora. The latter evaluates for each corpus (spam/nonspam file pair) whether a test is highly problematic for that corpus. If this is true for any corpus for a given test, or if the test has fewer than 10 hits across all corpora (4 would be another logical number to use), then the worse of the old and new ranks is used for ordering. If neither of these is the case, then the better of the old and new ranks is used for ordering. I suggest that this combination is good for general usage.
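The combination rule described above is simple enough to sketch directly. This is my paraphrase in Python, not the actual hit-frequencies-multiple.pl code; the function name and the "higher rank is better" convention are my assumptions.

```python
def combined_rank(old_rank, new_rank, total_hits, problematic_anywhere,
                  min_hits=10):
    """Combine the old and new rankings as described above:
    - if the test is highly problematic on any corpus, or has too few
      hits across all corpora, be pessimistic and take the WORSE rank;
    - otherwise take the BETTER of the two ranks.
    Ranks are assumed to be on a higher-is-better scale."""
    if problematic_anywhere or total_hits < min_hits:
        return min(old_rank, new_rank)
    return max(old_rank, new_rank)
```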
For cases of tests with heavy load, I suggest using the new ranking method for deciding which tests to have enabled by default, and the combination for which ones to make available to people who can tolerate the additional load (e.g., have locally cached versions for RBL tests), along with other information (overlaps et al) for both cases. I will attach the new patch, the new program, data on which tests look "bad" on the old comparison but not on the combined method, data on which tests look "bad" on the combined method but not on the old method, and the full listing of the combined method's results for the tests that look "bad". -Allen
Created attachment 538 [details] Revised patch to hit-frequencies
Created attachment 539 [details] Program to compare corpi and combine ranking methods accordingly
BTW, the way to use the program for comparing corpora and combining ranking methods accordingly (hit-frequencies-multiple.pl is my current name for it) is to run it with all applicable options for hit-frequencies, except that -x and -n are unnecessary (-N can be used to supply a p(spam) different from 0.5; 0.1 seems to work well), followed by the names of each of the corpora. It expects corpora to be in the form spam-[name].log and nonspam-[name].log; it is capable of using corpora with only one or the other, albeit only in the overall ranking and not in determining tests that have severe problems on a particular corpus. -Allen
Created attachment 540 [details] Tests that the old ranking would suggest to remove but the combined method (with -N 0.1) would not
Created attachment 541 [details] Tests that the combined ranking (with -N 0.1) would suggest to remove but the old would not
Created attachment 542 [details] Full data on which tests the combined ranking (with -N 0.1) would suggest to remove
Created attachment 556 [details] hit-frequencies-multiple.pl revision to allow for ham-{name}.log
While waiting on the DNS testing to finish now that Osirusoft is back up, I decided to work a bit on the idea of taking tests that the old ranking would suggest removing but the new ranking would keep, and using them for meta tests.

I took the mass-check logs from everyone as of Jan 4th, ran my initial version of a possible-meta detection program on it (detect-meta.1.pl; it's limited to non-nice tests and positive correlations only at the moment, and doesn't handle __* tests all that well as yet), did some post-filtering (especially for things involving __* tests - I rather doubt people using Outlook as a mailer, for some masochistic reason, would be happy with meta tests using Outlook as a spamsign...), and selected out some possible meta tests. (These are focused on ones using rules that would be tossed out by the old ranking, but I also put in a few others I happened to notice.) I then wrote another, again rather simplistic first-draft, program to create the derived meta rules (create-meta.1.pl).

Now, before everyone screams at me that we're in a rules freeze and are about to start the GA run: I accounted for that. I also created a program, simulate-meta.1.pl, that simulates the simple variety (A && B) of meta tests that the previous two programs create (I decided to stay away, for the moment, from trying to do a full eval simulation of all meta tests - this approach is less likely to have eval/quoting/etcetera bugs), taking as input a file of said tests and a mass-check logfile, and outputting a logfile with the new meta tests added. Thus, we don't need to rerun mass-check in order to incorporate these meta tests, except that it might be a good idea to make sure that we don't add so many that the number of them overwhelms SA... I say this because I wound up creating a total of 3412 of the silly things! I really don't suggest adding all of these just yet - only the ones with a soratio of 1.
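The core of what simulate-meta.1.pl does - adding an A && B meta hit to a log line whenever both of its sub-rules already hit - can be sketched per message like this. This is an illustrative Python paraphrase operating on an already-parsed set of rule names, not the Perl script itself, which also handles the mass-check log format.

```python
def add_simple_metas(hits, metas):
    """Given the set of rule names that hit for one mass-check log entry,
    add every simple A && B meta test whose two sub-rules both hit.
    Returns a new set; the input set is not modified, so the original
    log data stays intact."""
    out = set(hits)
    for meta_name, rule_a, rule_b in metas:
        if rule_a in hits and rule_b in hits:
            out.add(meta_name)
    return out
```

Because this runs over existing logs, the mass-check corpus never needs to be regenerated just to evaluate candidate metas - which is the whole point of the simulation approach.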
I will nonetheless attach a listing of all of them plus a hit-frequencies ranking of them all (using the old ranking method, it being more suitable for this purpose). -Allen
Created attachment 557 [details] detect-meta.1.pl - first draft of program to find possible new meta tests
Created attachment 558 [details] create-meta.1.pl - first draft of program to create new meta tests from detect-meta.1.pl results
Created attachment 559 [details] simulate-meta.1.pl - program to add simple (A && B) meta tests to mass-check logs
Created attachment 560 [details] Possible new meta tests, semi-automatically created
Created attachment 561 [details] hit-frequencies (for mass-check logs from Jan 4th) results (old ranking) for possible new meta tests
> I really don't suggest adding all of these just yet - only the ones with a
> soratio of 1.

Make that "only the ones with a soratio of 1.000 which are in the top 100-200 by the old ranking system", actually. I hadn't realized exactly how many have a soratio of 1.000... -Allen
Oh: A. All of this is licensed as per the current License file; and B. Someone (I can try to do it this evening) will need to process Malte's mass-check files for the new ranking and for the results with the new meta tests. (For both of these, having info from as many people as possible is of even greater importance than it was before...) -Allen
Created attachment 567 [details] New hit-frequencies -p -m T_META with Malte's logs added
Created attachment 568 [details] New version of what hit-frequencies-multiple.pl -N 0.1 would suggest to remove, with Malte's info added
I would suggest that getting the GA runs finished and new score sets generated, then doing a release, will be enough work in the short term ;) -- esp. if we add in the new DNS code as well. If we add this stuff now, we'll never get 2.50 out ;) IMO, better to leave this one for the new dev cycle when that starts. (the logs will, of course, still be valid at that stage, so we can just pick it up as-is at that point.) opinions?
> I would suggest that getting the GA runs finished and new
> score sets generated, then doing a release, will be enough
> work in the short term ;) -- esp. if we add in the new DNS
> code as well.

I certainly hope it can be added...

> If we add this stuff now, we'll never get 2.50 out ;)

Point. I'd tried to simplify adding this stuff in as much as possible via simulate-meta.1.pl, but I can see how it might wind up being a bit overwhelming. I would like to ask that the new ranking method be kept in mind when deciding which tests to delete and which to keep. (I will be travelling on Monday and Tuesday and am not likely to be able to be in touch at all, BTW.)

> IMO, better to leave this one for the new dev cycle when
> that starts. (the logs will, of course, still be valid
> at that stage, so we can just pick it up as-is at that
> point.)

The logs can be reused via simulate-meta.1.pl, yes. -Allen
I agree that this is not required for 2.50, so I'm setting to 3.0.
All material regarding new meta tests on this bug should instead be on 1363, which is for 2.6/3.0. -Allen
Current network-testing results, new ranking (-N 0.1), using:

ham-nobayes-net-easmith.log
ham-nobayes-net-jm.log
ham-nobayes-net-kramer.log
ham-nobayes-net-mss.log
ham-nobayes-net-quinlan2.log (Dan, is this the one that should be used?)
ham-nobayes-net-rODbegbie.log
ham-nobayes-net-theo.log

and the equivalent spam-nobayes-net-*.log files. I will attach the new versions of hit-frequencies-multiple.pl and hit-frequencies.

OVERALL    SPAM  NONSPAM    S/O   RANK1  RANK2   SCORE  NAME
 166976   39614   127362  0.237   0.00   0.00    0.00   (all messages)
  28428   18284    10144  0.853   0.66   0.99    0.38   RCVD_IN_OSIRUSOFT_COM
   2678    2669        9  0.999   0.96   0.94    0.30   X_OSIRU_SPAMWARE_SITE
  19188   12798     6390  0.866   0.67   0.98    0.01   RCVD_IN_NJABL
   6838    6269      569  0.973   0.90   0.97    2.00   X_NJABL_OPEN_PROXY
   5274    5106      168  0.990   0.95   0.97    3.18   RCVD_IN_SBL
   1382    1381        1  1.000   0.96   0.88    3.25   RCVD_IN_DSBL
   2145    2136        9  0.999   0.96   0.92    0.01   RAZOR2_CF_RANGE_91_100
   4277    4137      140  0.990   0.94   0.96    3.91   RAZOR2_CHECK
  11555    7678     3877  0.864   0.65   0.95    2.28   RCVD_IN_RFCI
   1000       0     1000  0.000   0.96   0.49   -4.00   HABEAS_SWE
  18486    9646     8840  0.778   0.49   0.93    0.77   RCVD_IN_UNCONFIRMED_DSBL
   1871    1854       17  0.997   0.95   0.90    1.00   RCVD_IN_OPM
     86      86        0  1.000   0.96   0.41    0.01   RAZOR2_CF_RANGE_81_90
   1494    1412       82  0.982   0.91   0.86    0.61   RCVD_IN_RELAYS_ORDB_ORG
    407     398        9  0.993   0.94   0.66    0.01   RAZOR2_CF_RANGE_21_30
    145     138        7  0.984   0.91   0.47    0.01   RAZOR2_CF_RANGE_41_50
     69      65        4  0.981   0.90   0.34    0.01   RAZOR2_CF_RANGE_61_70
   1665    1557      108  0.979   0.90   0.87    2.25   RCVD_IN_ORBS
    160     145       15  0.969   0.87   0.45    0.01   RAZOR2_CF_RANGE_11_20
     61      57        4  0.979   0.90   0.32    0.01   RAZOR2_CF_RANGE_31_40
     45      42        3  0.978   0.90   0.28    0.01   RAZOR2_CF_RANGE_51_60
    155     137       18  0.961   0.85   0.43    0.01   RAZOR2_CF_RANGE_01_10
     86      79        7  0.973   0.88   0.36    1.00   ROUND_THE_WORLD_LOCAL
     40      37        3  0.975   0.89   0.26    0.01   RAZOR2_CF_RANGE_71_80
     41      32        9  0.920   0.74   0.21    2.80   ROUND_THE_WORLD
    413      17      396  0.121   0.65   0.26  -10.00   RCVD_IN_BONDEDSENDER
  35489   15176    20313  0.706   0.39   0.93    3.09   NO_MX_FOR_FROM
   1819     796     1023  0.714   0.35   0.44    0.50   X_NJABL_DIALUP
   1779     667     1112  0.659   0.28   0.33    0.62   X_OSIRU_DUL
      0       0        0  0.500   0.12   0.07    8.00   HABEAS_VIOLATOR
      0       0        0  0.500   0.12   0.07    4.00   HABEAS_HIL
      0       0        0  0.500   0.12   0.06    2.62   RCVD_IN_VISI
      0       0        0  0.500   0.12   0.06    2.73   X_OSIRU_SPAM_SRC
      0       0        0  0.500   0.12   0.06    2.72   X_OSIRU_OPEN_RELAY
      0       0        0  0.500   0.12   0.03    0.00   RCVD_IN_BL_SPAMCOP_NET
      0       0        0  0.500   0.12   0.03    0.00   RAZOR_CHECK
      0       0        0  0.500   0.12   0.03    0.00   RCVD_IN_DUL_FH
      0       0        0  0.500   0.12   0.03    0.00   RCVD_IN_RSS
      0       0        0  0.500   0.12   0.03    0.00   RCVD_IN_DUL
      0       0        0  0.500   0.12   0.03    0.00   RCVD_IN_RBL
   5649    1070     4579  0.429   0.08   0.00    0.36   X_OSIRU_DUL_FH
   8479     881     7598  0.272   0.02   0.00    0.81   RCVD_IN_MULTIHOP_DSBL
Created attachment 586 [details] Revised hit-frequencies version
Created attachment 587 [details] hit-frequencies-multiple.pl (more flexible filenames)
Subject: Re: [SAdev] New ranking specifically for figuring out which tests to keep/not

On Thu, Jan 16, 2003 at 12:43:39PM -0800, bugzilla-daemon@hughes-family.org wrote:
> ham-nobayes-net-quinlan2.log (Dan, is this the one that should be used?)

Per a message he sent me pre-trip-to-Boston this morning, yes.

> 35489 15176 20313 0.706 0.39 0.93 3.09 NO_MX_FOR_FROM

Yeah, we need to come up with what timeframe we should be using the netchecks from (I have a lot of ham hits for this rule, for instance, all from hosts that haven't existed in X years...). I then need to figure out how to remove net checks from only a certain section of the log file based on timestamp (since we want the results in the whole file for the rest of set1 testing). To do so, we'd probably want to split the logs based on the timestamp, check the net results for each section, then do the above removal based on the best-looking timestamp. I'm thinking a max of 3 months for net tests - anyone else? We also need to agree on what standards we're going to use to nuke rules before the GA run.
Subject: Re: New ranking specifically for figuring out which tests to keep/not

> ------- Additional Comments From felicity@kluge.net 2003-01-16 13:14
> Subject: Re: [SAdev] New ranking specifically for figuring out which tests
> to keep/not
>
> On Thu, Jan 16, 2003 at 12:43:39PM -0800, bugzilla-daemon@hughes-family.org
> wrote:
>> ham-nobayes-net-quinlan2.log (Dan, is this the one that should be used?)
>
> Per a message he sent me pre-trip-to-Boston this morning, yes.

Ah, thank you, I'd forgotten about his Boston trip.

>> 35489 15176 20313 0.706 0.39 0.93 3.09 NO_MX_FOR_FROM
>
> Yeah, we need to come up with what timeframe we should be using the
> netchecks from

No more than 6 months, I'd say - I actually didn't even _do_ net-checks for stuff before August of 2002. Less than 6 months is possible.

> (I have a lot of ham hits for this rule for instance, all
> from hosts that haven't existed in X years...)

Yes. My sole FP for that particular one is from August of 2002, as it turns out...

> I then need to figure out how to remove net checks from only a certain
> section of log file based on timestamp (since we want the results in the
> whole file for the rest of set1 testing).
>
> To do so, we'd probably want to split the logs based on the timestamp,
> check the net results for that section,

?

> then do the above removal based on the best looking timestamp.

Ah.

> I'm thinking a max of 3 months for net tests, anyone else?

Not sure as yet - I will have to see the results of different levels of removals.

> We also need to agree on what standards we're going to use to nuke rules
> before the GA run.

Yes. Some combination of the old and new ranks, S/O, and considerations of how much of a load on SA the rule type in question creates - low hits for a high-load rule is one indicator. (The GA _doesn't_ have info on how high the load for a rule is, unlike most of the other info; it also doesn't have some of the info that goes into the new ranking, namely inconsistencies in rules between corpora.) -Allen
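The "remove net checks older than the cutoff" step being discussed can be sketched per log entry as follows. This is an illustrative Python paraphrase under the assumption that each message's timestamp and hit list have already been parsed out of the mass-check log; the function name, the parsed representation, and the ~6-month default are mine (matching the cutoff suggested above), not code from the thread.

```python
from datetime import datetime, timedelta

def strip_stale_net_hits(hits, msg_time, net_rules, now, max_age_days=182):
    """Drop network-test hits from one message's rule list when the
    message is older than the cutoff (roughly 6 months by default),
    leaving non-network hits untouched so the rest of the log remains
    usable for set1 testing."""
    if now - msg_time > timedelta(days=max_age_days):
        return [h for h in hits if h not in net_rules]
    return list(hits)
```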
Created attachment 591 [details] Patch to hit-frequencies for new ranking + compensating for network tests being removed
Created attachment 592 [details] Version of hit-frequencies-multiple.pl that allows for compensating for network tests being removed
Attached are the results of doing hit-frequencies-multiple.pl -r -N 0.1 -P nobayes-net on the following:

ham-nobayes-net-easmith.log
ham-nobayes-net-jm.log
ham-nobayes-net-kramer.log
ham-nobayes-net-martinr.log
ham-nobayes-net-mgm.log
ham-nobayes-net-mss.log
ham-nobayes-net-quinlan2.log
ham-nobayes-net-rODbegbie.log
ham-nobayes-net-rpuhek.log
ham-nobayes-net-theo.log
spam-nobayes-net-easmith.log
spam-nobayes-net-jm.log
spam-nobayes-net-kramer.log
spam-nobayes-net-martinr.log
spam-nobayes-net-mgm.log
spam-nobayes-net-mss.log
spam-nobayes-net-quinlan2.log
spam-nobayes-net-rODbegbie.log
spam-nobayes-net-rpuhek.log
spam-nobayes-net-theo.log

after removing network tests older than 6 months (see Bug 1365). -Allen
Created attachment 593 [details] New ranking results for mass-check logs (network tests older than 6 months removed)
Created attachment 594 [details] Old ranking (with -A 38887:67613) for mass-check logs with net tests prior to last 6 months removed
Allen, I reckon the name of this stat should change from "RANK" or "RANK2" to "INFO". That's more accurate since, as you say, it's not a good reflection of a test's "rank" in the traditional sense, but rather of how much information it provides (in the Shannon sense, if I heard right ;).
Subject: Re: New ranking specifically for figuring out which tests to keep/not

> ------- Additional Comments From jm@jmason.org 2003-01-19 15:53
> Allen, I reckon the name of this stat should change from "RANK"
> or "RANK2" to "INFO". That's more accurate, as, as you say, it's
> not a good reflection of its "rank" in the traditional sense, more
> of how much information it provides

Good point; thanks! I'm currently converting it into rankings mainly because of headaches figuring out ways of interpreting it other than "more is better", at least at the moment...

> (in the Shannon sense if I heard right ;).

Essentially, although a bit of a scaling adjustment would need to be made to convert it into informational-entropy terms, since I simply used log() instead of log base 2. -Allen
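For reference, the scaling adjustment mentioned is just a change of logarithm base: values computed with the natural log() are in nats, and dividing by ln(2) converts them to bits, the usual Shannon unit. A one-line sketch (mine, for clarity):

```python
import math

def nats_to_bits(info_nats):
    """Convert information measured in nats (natural log) to bits
    (log base 2) by dividing by ln(2)."""
    return info_nats / math.log(2)
```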
Created attachment 598 [details] Latest version of patch to hit-frequencies - license as per current License file
Created attachment 601 [details] hit-frequencies-multiple.pl - latest version
Since this isn't going to be in 2.50, I'm aiming it at 2.60 instead. :)
old ticket, no movement.