SA Bugzilla – Bug 5711
RFE: "mass-check --reuse" to produce set1 results if possible, set0 otherwise
Last modified: 2007-11-19 14:57:40 UTC
In order to be able to generate set1 scores nightly, we need a way to run 'mass-check --net' much faster than it currently runs. In discussions on dev@ [1], we've decided that the best way to do this would be to add a switch, "--reuse-only", which only produces network-rule output for messages where the reused-lookups info is valid.

[1]: Subject: "Re: Nightly score generation for all scoresets", Fri 19 Oct 2007

As Daryl said:

> I'd settle for a --reuse-only
> run that includes all of your messages for set0 results and only
> reusable messages for set1 results... all done in a single mass-check.

So this would have to produce 4 output files:

- ham.log / spam.log = set0 mass-check output, containing set0 mass-check results for all messages
- ham-set1.log / spam-set1.log ? = set1 mass-check output, containing only the set1 results for messages where reusable info was present?

Maybe there's a better UI for that though... suggestions?

For what it's worth, here are the counts:

: exit=0 Wed Oct 31 18:00:04 GMT 2007; cd /home/corpus-rsync/corpus
: jm 72...; grep reuse=yes spam-net-*.log | wc -l
489909
: jm 73...; grep reuse=no spam-net-*.log | wc -l
105134
: jm 76...; grep reuse=yes ham-*.log | wc -l
66868
: exit=0 Wed Oct 31 18:01:48 GMT 2007; cd /home/corpus-rsync/corpus
: jm 77...; grep reuse=no ham-*.log | wc -l
253814

490k spams is pretty good, but 66k hams not so much. We need to improve that, I'd say.
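To put the counts above in proportion, here is a minimal Python sketch that turns them into reusable fractions. It is only an illustration: `count_reuse` and `reusable_fraction` are hypothetical helper names, and the only thing assumed about the log format is that each line carries a `reuse=yes` or `reuse=no` token, as the grep commands imply.

```python
import re
from collections import Counter

# Matches the reuse=yes / reuse=no token that the grep commands look for.
REUSE_RE = re.compile(r"\breuse=(yes|no)\b")

def count_reuse(lines):
    """Count how many lines are tagged reuse=yes and reuse=no."""
    counts = Counter()
    for line in lines:
        match = REUSE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def reusable_fraction(yes, no):
    """Fraction of messages whose network lookups can be reused."""
    total = yes + no
    return yes / total if total else 0.0

# Counts quoted in this comment (Oct 31 2007).
spam_fraction = reusable_fraction(489909, 105134)
ham_fraction = reusable_fraction(66868, 253814)
print(f"spam reusable: {spam_fraction:.1%}")
print(f"ham reusable:  {ham_fraction:.1%}")
```

With these numbers, roughly 82% of the spam corpus but only about 21% of the ham corpus has reusable data, which is the imbalance noted above.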
I haven't got to this yet; I wasted too much time last week attempting to optimize the mass-check server cache scheduler (I probably should have just used an SQL DB rather than DBM and called it a day).

Anyway, when I looked at this quickly last week, I believe that we can generate "reuse only" set0+set1 results using the current code (if it's slightly modified to not do the copy_config stuff it'll be faster though) using the --reuse option without the --net option. I haven't yet tested this, though, as I couldn't do it quickly since I don't have any unmarked mail.

Bayes (and thus sets 2+3) can be added fairly easily too (with a bit more code modification) and, again, without the need for a "--reuse-only" option.

I'd also like to have the output combined in a single log and to have whatever uses the logs to pick out what it wants to use. It already takes 20-30 minutes to rsync my mass-check logs (and only 40-45 minutes to do the mass-check) so multiple logs with overlapping data isn't too appealing to me.
(In reply to comment #1)
> I haven't got to this yet; I wasted too much time last week attempting to
> optimize the mass-check server cache scheduler (I probably should have just used
> an SQL DB rather than DBM and called it a day).
>
> Anyway, when I looked at this quickly last week, I believe that we can generate
> "reuse only" set0+set1 results using the current code (if it's slightly modified
> to not do the copy_config stuff it'll be faster though) using the --reuse option
> without the --net option. I haven't yet tested this, though, as I couldn't do
> it quickly since I don't have any unmarked mail.

ok, that works for me. I could probably provide some mail if you want...

> Bayes (and thus sets 2+3) can be added fairly easily too (with a bit more code
> modification) and, again, without the need for a "--reuse-only" option.

how does that work? if --reuse is stated, it contains the BAYES results too?

> I'd also like to have the output combined in a single log and to have whatever
> uses the logs to pick out what it wants to use. It already takes 20-30 minutes
> to rsync my mass-check logs (and only 40-45 minutes to do the mass-check) so
> multiple logs with overlapping data isn't too appealing to me.

ok, agreed. I guess that'd mean that if the data was reused, it contains "reuse=yes" and has set1 results; if the data was not available, it contains "reuse=no" and has set0 results.
(In reply to comment #2)
> ok, that works for me. I could probably provide some mail if you want...

I'll just run some through spamassassin -d. I'm just short on time right now. I'm actually working on the NetCache stuff right now... which is a bigger pay off than this.

> how does that work? if --reuse is stated, it contains the BAYES results too?

Yeah, I see no reason why we shouldn't re-use bayes results.

> ok, agreed. I guess that'd mean that if the data was reused, it contains
> "reuse=yes" and has set1 results; if the data was not available, it contains
> "reuse=no" and has set0 results.

Something like that. When adding in bayes too, we might want to use set3. I think adding a set=x to the result line would probably be best.
r595757 is a first step towards this. if --net is omitted but --reuse is present, it uses set1/set3, but zeroes all net rules (whether they are reuse rules or not) *and* the reuse rules that are non-net (in case there are any). however, I see DNS lookups in the debug log, probably from URIDNSBL, even though those rules should all have 0 scores. this is probably a bug...
(In reply to comment #4)
> r595757 is a first step towards this.

er, r595759
I have a fix for this -- will check in later
: jm 124...; svn commit -m "bug 5711: allow 'mass-check --reuse' without '--net' to reuse net-rule hits, and output mass-check results for scoreset 1; while lines that are not reusable use set 0. Also, fix a few tests to use 'tflags net' if they use network lookups (including calls to lookup_ptr().) Fix nightly mass-checks on the zone to use --reuse to gain this."
Sending build/nightlymc/corpus.doc
Sending build/nightlymc/corpus.fredt
Sending build/nightlymc/corpus.jm
Sending build/nightlymc/corpus.zmi
Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
Sending masses/mass-check
Sending rules/20_fake_helo_tests.cf
Sending rules/20_head_tests.cf
Sending rules/25_spf.cf
Transmitting file data .........
Committed revision 596095.

so mass-check log lines, if run with --reuse, but not --net, now will be run with either set 0 or set 1 depending on whether there were net rule hits to reuse, and will also contain a "set=0" or "set=1" to indicate that for greppability.

As part of the process, I debugged this on a disconnected machine to track down net rules that weren't declared as tflags net, and made a few additional tests into "tflags net" rules:

- SPF_* (several of these were missing 'tflags net')
- FAKE_HELO_* (they all use lookup_ptr() under some circumstances)
- ROUND_THE_WORLD_LOCAL (ditto -- despite the name!)

maybe the latter should be renamed, but I'm not bothered. Alternatively, maybe some of them should simply be deleted, since the results are kinda crappy. I've opened that issue as bug 5726.
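Since the combined log now marks each line with "set=0" or "set=1", whatever consumes the log can pick out the lines it wants. A minimal Python sketch of that step follows, under stated assumptions: `split_by_scoreset` is a hypothetical helper, and the sample lines are made-up placeholders rather than real mass-check output. Only the `reuse=` and `set=` tokens come from this bug.

```python
import re
from collections import defaultdict

# The set=N marker described in this comment; N is a scoreset number 0-3.
SET_RE = re.compile(r"\bset=([0-3])\b")

def split_by_scoreset(lines):
    """Group log lines by the scoreset named in their set=N token.

    Lines without a set=N token (comments, older logs) are skipped.
    """
    buckets = defaultdict(list)
    for line in lines:
        match = SET_RE.search(line)
        if match:
            buckets[int(match.group(1))].append(line)
    return buckets

# Placeholder lines: everything except reuse=/set= is invented for illustration.
sample = [
    "msg-1 RULE_A,RULE_B reuse=yes,set=1",
    "msg-2 RULE_C reuse=no,set=0",
    "msg-3 RULE_A reuse=yes,set=1",
    "# a comment line with no scoreset marker",
]

buckets = split_by_scoreset(sample)
print(len(buckets[0]), "set0 lines;", len(buckets[1]), "set1 lines")
```

A score generator for set1 would then read only `buckets[1]`, while set0 lines remain available in the same file, which avoids rsyncing overlapping logs.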
marking FIXED
btw, watch out -- if bayes isn't disabled in the mass-checks, autolearning will cause the "set=0" and "set=1" in the logs to become "set=2" and "set=3".
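The jump from "set=0"/"set=1" to "set=2"/"set=3" follows from how SpamAssassin numbers its scoresets: sets 1 and 3 include network tests, and sets 2 and 3 include Bayes. A small Python sketch of that mapping, with hypothetical helper names, may be useful when checking logs for accidental Bayes use:

```python
def describe_scoreset(n):
    """Decode a scoreset number into its net/bayes components.

    0 = no net, no bayes; 1 = net only; 2 = bayes only; 3 = net + bayes.
    """
    if n not in (0, 1, 2, 3):
        raise ValueError(f"not a scoreset number: {n!r}")
    return {"net": bool(n & 1), "bayes": bool(n & 2)}

def bayes_was_active(n):
    """True if a log line's set=N value implies Bayes was in use."""
    return describe_scoreset(n)["bayes"]

for n in range(4):
    print(n, describe_scoreset(n))
```

Any line for which `bayes_was_active` is true in a run that was meant to be Bayes-free would point at the autolearning effect described above.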