Bug 5711 - RFE: "mass-check --reuse" to produce set1 results if possible, set0 otherwise
Summary: RFE: "mass-check --reuse" to produce set1 results if possible, set0 otherwise
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Masses
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 enhancement
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
Reported: 2007-10-31 11:02 UTC by Justin Mason
Modified: 2007-11-19 14:57 UTC



Description Justin Mason 2007-10-31 11:02:37 UTC
In order to be able to generate set1 scores nightly we need a way to run
'mass-check --net' much faster than it currently runs.  In discussions on dev@ [1],
we've decided that the best way to do this would be to add a switch,
"--reuse-only", which only produces network-rule output for messages where the
reused-lookups info is valid.

[1]: Subject: "Re: Nightly score generation for all scoresets", Fri 19 Oct 2007

As Daryl said:

> I'd settle for a --reuse-only 
> run that includes all of your messages for set0 results and only 
> reusable messages for set1 results... all done in a single mass-check.

So this would have to produce 4 output files:

  - ham.log / spam.log = set0 mass-check output, containing set0 mass-check
results for all messages

  - ham-set1.log / spam-set1.log ? = set1 mass-check output, containing only the
set1 results for messages where reusable info was present?

maybe there's a better UI for that though... suggestions?


for what it's worth, here are the counts:

: exit=0 Wed Oct 31 18:00:04 GMT 2007; cd /home/corpus-rsync/corpus
: jm 72...; grep reuse=yes spam-net-*.log | wc -l
  489909
: jm 73...; grep reuse=no spam-net-*.log | wc -l
  105134
: jm 76...; grep reuse=yes ham-*.log | wc -l
   66868
: exit=0 Wed Oct 31 18:01:48 GMT 2007; cd /home/corpus-rsync/corpus
: jm 77...; grep reuse=no ham-*.log | wc -l
  253814

490k spams is pretty good, but 67k hams not so much.  We need to
improve that I'd say.
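
To put those counts in proportion, here is a small Python sketch that turns
them into reuse fractions. The numbers are copied from the grep output above;
the function name is made up for illustration, and the comparison assumes the
spam and ham greps cover comparable sets of logs.

```python
# Counts copied from the grep output above.
counts = {
    "spam": {"reuse_yes": 489909, "reuse_no": 105134},
    "ham": {"reuse_yes": 66868, "reuse_no": 253814},
}


def reuse_fraction(reuse_yes: int, reuse_no: int) -> float:
    """Fraction of log lines that carry reusable network-rule results."""
    return reuse_yes / (reuse_yes + reuse_no)


for kind, c in counts.items():
    frac = reuse_fraction(c["reuse_yes"], c["reuse_no"])
    print(f"{kind}: {frac:.1%} reusable")
```

Roughly four in five spam lines are reusable, against about one in five ham
lines.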
Comment 1 Daryl C. W. O'Shea 2007-11-01 13:41:39 UTC
I haven't got to this yet; I wasted too much time last week attempting to
optimize the mass-check server cache scheduler (I probably should have just used
an SQL DB rather than DBM and called it a day).

Anyway, when I looked at this quickly last week, I believe that we can generate
"reuse only" set0+set1 results using the current code (if it's slightly modified
to not do the copy_config stuff it'll be faster though) using the --reuse option
without the --net option.  I haven't yet tested this, though, as I couldn't do
it quickly since I don't have any unmarked mail.

Bayes (and thus sets 2+3) can be added fairly easily too (with a bit more code
modification) and, again, without the need for a "--reuse-only" option.

I'd also like to have the output combined in a single log and to have whatever
uses the logs to pick out what it wants to use.  It already takes 20-30 minutes
to rsync my mass-check logs (and only 40-45 minutes to do the mass-check) so
multiple logs with overlapping data isn't too appealing to me.
Comment 2 Justin Mason 2007-11-12 15:39:41 UTC
(In reply to comment #1)
> I haven't got to this yet; I wasted too much time last week attempting to
> optimize the mass-check server cache scheduler (I probably should have just used
> an SQL DB rather than DBM and called it a day).
> 
> Anyway, when I looked at this quickly last week, I believe that we can generate
> "reuse only" set0+set1 results using the current code (if it's slightly modified
> to not do the copy_config stuff it'll be faster though) using the --reuse option
> without the --net option.  I haven't yet tested this, though, as I couldn't do
> it quickly since I don't have any unmarked mail.

ok, that works for me.  I could probably provide some mail if you want...

> Bayes (and thus sets 2+3) can be added fairly easily too (with a bit more code
> modification) and, again, without the need for a "--reuse-only" option.

how does that work?  if --reuse is stated, it contains the BAYES results too?

> I'd also like to have the output combined in a single log and to have whatever
> uses the logs to pick out what it wants to use.  It already takes 20-30 minutes
> to rsync my mass-check logs (and only 40-45 minutes to do the mass-check) so
> multiple logs with overlapping data isn't too appealing to me.

ok, agreed.  I guess that'd mean that if the data was reused, it contains
"reuse=yes" and has set1 results; if the data was not available, it contains
"reuse=no" and has set0 results.
Comment 3 Daryl C. W. O'Shea 2007-11-12 15:55:16 UTC
(In reply to comment #2)
> ok, that works for me.  I could probably provide some mail if you want...

I'll just run some through spamassassin -d.  I'm just short on time at the
moment.  I'm actually working on the NetCache stuff right now... which is a
bigger payoff than this.

> how does that work?  if --reuse is stated, it contains the BAYES results too?

Yeah, I see no reason why we shouldn't re-use bayes results.

> ok, agreed.  I guess that'd mean that if the data was reused, it contains
> "reuse=yes" and has set1 results; if the data was not available, it contains
> "reuse=no" and has set0 results.

Something like that.  When adding in bayes too, we might want to use set3.  I
think adding a set=x to the result line would probably be best.
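
A rough Python sketch of how a log consumer might work out which scoreset a
line belongs to under this proposal: prefer an explicit "set=x" marker, and
fall back to the "reuse=yes" / "reuse=no" mapping from comment #2 when it is
absent. The function name and the regexes are illustrative assumptions, not
part of mass-check.

```python
import re
from typing import Optional

# Both patterns are assumptions about how the markers appear in a log line.
SET_RE = re.compile(r"\bset=(\d)\b")
REUSE_RE = re.compile(r"\breuse=(yes|no)\b")


def effective_set(line: str) -> Optional[int]:
    """Return the scoreset for a log line, or None if it cannot be determined."""
    m = SET_RE.search(line)
    if m:
        # An explicit marker wins.
        return int(m.group(1))
    m = REUSE_RE.search(line)
    if m:
        # Fallback from comment #2: reused data -> set1, otherwise set0.
        return 1 if m.group(1) == "yes" else 0
    return None
```

The explicit marker is checked first so that lines scored with Bayes (set 2 or
3) are not misread from the reuse flag alone.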
Comment 4 Justin Mason 2007-11-16 09:54:52 UTC
r595757 is a first step towards this.  if --net is omitted but --reuse is
present, it uses set1/set3, but zeroes all net rules (whether they are reuse
rules or not) *and* the reuse rules that are non-net (in case there are any).

however, I see DNS lookups in the debug log, probably from URIDNSBL, even though
those rules should all have 0 scores.  this is probably a bug...
Comment 5 Justin Mason 2007-11-16 09:55:18 UTC
(In reply to comment #4)
> r595757 is a first step towards this. 

er, r595759
Comment 6 Justin Mason 2007-11-18 07:55:42 UTC
I have a fix for this -- will check in later
Comment 7 Justin Mason 2007-11-18 12:38:56 UTC
: jm 124...; svn commit -m "bug 5711: allow 'mass-check --reuse' without '--net'
to reuse net-rule hits, and output mass-check results for scoreset 1; while
lines that are not reusable use set 0.  Also, fix a few tests to use 'tflags
net' if they use network lookups (including calls to lookup_ptr().)  Fix nightly
mass-checks on the zone to use --reuse to gain this."
Sending        build/nightlymc/corpus.doc
Sending        build/nightlymc/corpus.fredt
Sending        build/nightlymc/corpus.jm
Sending        build/nightlymc/corpus.zmi
Sending        lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
Sending        masses/mass-check
Sending        rules/20_fake_helo_tests.cf
Sending        rules/20_head_tests.cf
Sending        rules/25_spf.cf
Transmitting file data .........
Committed revision 596095.


so if mass-check is run with --reuse but without --net, each message is now
scored with either set 0 or set 1, depending on whether there were net rule
hits to reuse, and its log line contains "set=0" or "set=1" to indicate which,
for greppability.
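
As an illustration of what the marker allows, here is a minimal Python sketch
that splits one combined log into per-set buckets, so a single file can stand
in for the separate set0/set1 logs proposed in the description. The sample
lines are hypothetical, and the sketch assumes that comment lines start with
"#" and that "set=x" appears as its own field separated by spaces or commas.

```python
from collections import defaultdict


def split_by_set(lines):
    """Group log lines by the value of their set= field."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        for field in line.replace(",", " ").split():
            if field.startswith("set="):
                buckets[field[len("set="):]].append(line)
                break
        else:
            # No set= field found on this line.
            buckets["unknown"].append(line)
    return dict(buckets)


# Hypothetical sample lines, not real mass-check output.
sample = [
    "# hypothetical header comment",
    "Y 12 /corpus/spam/1 RULE_A,RULE_B time=1195000000,reuse=yes,set=1",
    ". 0 /corpus/ham/1 RULE_C time=1195000001,reuse=no,set=0",
    "Y 9 /corpus/spam/2 RULE_A time=1195000002,reuse=yes,set=1",
]
result = split_by_set(sample)
print({k: len(v) for k, v in result.items()})
```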

As part of the process, I debugged this on a disconnected machine to track down
net rules that weren't declared as tflags net, and made a few additional tests
into "tflags net" rules:

- SPF_* (several of these were missing 'tflags net')

- FAKE_HELO_* (they all use lookup_ptr() under some circumstances)

- ROUND_THE_WORLD_LOCAL (ditto -- despite the name!)

maybe the latter should be renamed, but I'm not bothered.
Alternatively, maybe some of them should simply be deleted, since
the results are kinda crappy.  I've opened that issue as bug 5726.
Comment 8 Justin Mason 2007-11-18 12:39:38 UTC
marking FIXED
Comment 9 Justin Mason 2007-11-19 14:57:40 UTC
btw, watch out -- if bayes isn't disabled in the mass-checks, autolearning will
cause the "set=0" and "set=1" in the logs to become "set=2" and "set=3".
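
The shift from 0/1 to 2/3 is consistent with SpamAssassin's scoreset
numbering, in which one bit stands for network tests and the other for Bayes.
A tiny Python sketch of that numbering follows; the function name is
hypothetical and this is not the code mass-check itself uses.

```python
def scoreset(net: bool, bayes: bool) -> int:
    """Scoreset number: bit 0 = network tests, bit 1 = Bayes."""
    return (1 if net else 0) | (2 if bayes else 0)


# With Bayes active, set 0 becomes set 2 and set 1 becomes set 3.
for net in (False, True):
    print(f"net={net}: {scoreset(net, False)} -> {scoreset(net, True)}")
```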