Bug 5376 - RFE: generate a "SpamAssassin Challenge" score-generation test
RFE: generate a "SpamAssassin Challenge" score-generation test
Status: NEW
Product: Spamassassin
Classification: Unclassified
Component: Masses
SVN Trunk (Latest Devel Version)
Other other
: P5 enhancement
: Undefined
Assigned To: SpamAssassin Developer Mailing List
:
Depends on:
Blocks: 4560
  Show dependency tree
 
Reported: 2007-03-11 06:35 UTC by Justin Mason
Modified: 2007-08-22 04:42 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2007-03-11 06:35:10 UTC
Talking to Henry, he suggested that we should define an RFP for the
SpamAssassin score-generation tool (ie. something like the GA or the
perceptron).

Thinking about this, I thought -- why not go bigger than that?

The Netflix Prize http://www.netflixprize.com/ is a machine-learning challenge
which 'seeks to substantially improve the accuracy of predictions about how
much someone is going to love a movie based on their movie preferences.'

Basically they provide a training set of data, and some basic rules (
http://www.netflixprize.com/rules ).  Third parties can then come up with
algorithms and implementations based on that data, and provide a set of
"scores" which Netflix then run on a hidden "test set" to estimate their
accuracy.  Netflix then publishes the results, and awards a prize for
the best-performing team.

We could do something similar for SpamAssassin... probably not with
a prize at the end of it, but still, I'd say there'd be some interest.

(Our rules would include stuff like score ranges: ie. low-hit-rate
rules and high-FP rules cannot have scores over N; rules that are
forgeable cannot have negative scores; etc.)
Comment 1 Justin Mason 2007-07-03 07:40:42 UTC
ok, I wrote up a spec of what we currently require:

http://wiki.apache.org/spamassassin/SpamAssassinChallenge

does this look like a reasonable idea?  Duncan?
Comment 2 Michael Parker 2007-07-03 08:23:29 UTC
Should there be something in there about code that can be used without any sort
of patent issues and available under the Apache License?
Comment 3 Theo Van Dinter 2007-07-03 08:27:25 UTC
I think our current "generate scores once a millenia" method is not really
functional for several reasons, but mainly that it takes too long between
updates and the scores are overly tuned for our messages and not necessarily
what other people see (despite having 42 rsync accounts, we have few (8 (but
really more like 6.5) at current count) people doing nightly runs, and I'm ~50%
of the total results).

Also, the last thing I saw about the perceptron, and why it didn't work for 3.2,
was that it needed more diverse data to score with.  Perhaps we just need better
rules?

Anyway, coming up with a new scoring algorithm isn't bad, I just don't think
it's going to make a lot of difference without other changes as well, such as
more frequent runs, more corpus runs/results from the masses, and/or a way for
individual locations to easily update their own scores.
Comment 4 Justin Mason 2007-07-03 08:43:55 UTC
(In reply to comment #3)
> I think our current "generate scores once a millenia" method is not really
> functional for several reasons, but mainly that it takes too long between
> updates and the scores are overly tuned for our messages and not necessarily
> what other people see (despite having 42 rsync accounts, we have few (8 (but
> really more like 6.5) at current count) people doing nightly runs, and I'm ~50%
> of the total results).
> 
> Also, the last thing I saw about the perceptron, and why it didn't work for 3.2,
> was that it needed more diverse data to score with.  Perhaps we just need better
> rules?

well, the GA did pretty well, so we came to the conclusion that the data was ok
-- just not the kind of data the perceptron's gradient-descent algorithm did
well with.

> Anyway, coming up with a new scoring algorithm isn't bad, I just don't think
> it's going to make a lot of difference without other changes as well, such as
> more frequent runs, more corpus runs/results from the masses, and/or a way for
> individual locations to easily update their own scores.

sure.  The benefit of this, however, is that we can define the problem and then
*other people* will do it for us ;)

Michael, good point.
Comment 5 Duncan Findlay 2007-07-04 00:25:26 UTC
Unfortunately it's been a while since I've looked at this stuff. (Actually, it's
been like 3 months... which is hardly a while, but it's been a busy 3 months...)

In no particular order:

- BAYES_* are marked as immutable right now (IIRC). This really limits
optimization in score sets 2 and 3.

- Score ranges need to be better defined. (Perhaps require that entries fit in
current score ranges?) If we don't clearly define/restrict score range, the best
submission will probably be the one with the least restricted scores. Score
ranges prevent scores from being over-optimized to our data set. Splitting our
data set into training and test sets doesn't really catch this
over-optimization, since both are part of our data set that has unique
characteristics. (I'm sure there are technical terms for this, I just don't
remember what they are...)

- I already have a copy of the test set (if I can find it). Does that make me
ineligible? :-)

- By requiring scores in the current format, we are eliminating a whole class of
scoring systems. For example, suppose I wanted to try a decision tree system to
detect spam based on SpamAssassin rules (this would obviously work very poorly),
it would be impossible to convert this into a set of scores.
  - The LR experiments Steve and I did relied on a logarithmic decision rule
(i.e. a message is spam if 1 / 1 + exp^(-(scores * rule_hits)) > probability
threshold). This is easy to convert into traditional SpamAssassin scores using
algebra, but other systems may not be.
  - If we scrap the requirements for output to be in terms of current
SpamAssassin scores, our score ranges problem becomes more significant -- score
ranges don't mean anything if we're not talking about traditional SpamAssassin
scores.
  - Ask me if this isn't clear -- it's tricky to explain.

- Our evaluation criteria is currently undefined. We need a clear, single
measurement to decide on a winner. (In our research, we used TCR on the test set
with lambda = 50 as our "goal" criteria.) Depending on how/if we resolve the
previous point, we need to set a threshold value (for example 5.0) as our sole
test point.

- Do you think people are actually going to be interested in this enough in
order to devote a good chunk of time toward it? I hope so...

Makes me think I should have submitted a talk to ApacheCon... it'd be a great
way to kick off this contest.
Comment 6 Duncan Findlay 2007-07-04 00:33:08 UTC
Oh, also, should we have any requirements on runtime? How automated the process
needs to be when it is submitted. Our experiments in LR, for example, contain a
fairly time consuming manual step right now :-) This sort of thing could
probably be worked out after we select a system, but guidelines might be good here.
Comment 7 Justin Mason 2007-07-04 04:32:38 UTC
(In reply to comment #5)
> - BAYES_* are marked as immutable right now (IIRC). This really limits
> optimization in score sets 2 and 3.

I think we do want to keep them immutable, though -- when we made them
mutable, it caused lots of user confusion and FAQs.  it makes the rescoring
job a little harder, but dealing with user complaints is a real pain!

> - Score ranges need to be better defined. (Perhaps require that entries fit in
> current score ranges?)

I dunno, I don't think the current ones are all that great anyway ;) I'd bet
people could do better.  We've certainly tweaked them repeatedly ourselves
in the past... ;)

I take your point about wanting to block submissions with totally unrestricted
scores though; I tried to do that with what's in the current doc.  Can you
suggest wording that goes a bit further?

> If we don't clearly define/restrict score range, the best
> submission will probably be the one with the least restricted scores. Score
> ranges prevent scores from being over-optimized to our data set. Splitting our
> data set into training and test sets doesn't really catch this
> over-optimization, since both are part of our data set that has unique
> characteristics. (I'm sure there are technical terms for this, I just don't
> remember what they are...)

Maybe we'd need to use different submitter accounts for the train and test
sets?  So the test set is made up of mail from an entirely different
submitter or set of submitters?  We could do that...

> - I already have a copy of the test set (if I can find it). Does that make me
> ineligible? :-)

Just don't look at it ;)

Seriously though, I don't know how we could avoid that problem.  By definition,
any member of the SpamAssassin PMC can look at the test set if they want, due
to how the project governance and oversight works.  All we can do is work on
trust I think.

> - By requiring scores in the current format, we are eliminating a whole class of
> scoring systems. For example, suppose I wanted to try a decision tree system to
> detect spam based on SpamAssassin rules (this would obviously work very poorly),
> it would be impossible to convert this into a set of scores.

I think this is acceptable.

I don't think we would want to replace the current scoring system with a more
complex system built around Bayesian probability combining, decision trees, or
anything else, without a *lot* more analysis and discussion, whereas this
"Challenge" is more likely to produce something like the perceptron; a drop-in
replacement for the offline rescoring component (I hope).

> - Our evaluation criteria is currently undefined. We need a clear, single
> measurement to decide on a winner. (In our research, we used TCR on the test set
> with lambda = 50 as our "goal" criteria.) Depending on how/if we resolve the
> previous point, we need to set a threshold value (for example 5.0) as our sole
> test point.

Well, I had a thought during the 3.2.0 scoring.  We actually have two
constraints:

  - results have to be below a certain FP% rate (around 0.4%) if at all
    possible (we wound up going over this for set 0, but we had no
    alternative :( )

  - and as low an FN% rate as possible, given the above.

So in other words, that's two constraints: TCR *and* a cut-off point for the
FP% rate.  does that sound ok?

And for purposes of the challenge -- obviously, it'd need to be better than
what we currently get, too. ;)

> - Do you think people are actually going to be interested in this enough in
> order to devote a good chunk of time toward it? I hope so...

it'd make a good project I think!  I've received several emails in the
past about the idea, including Gordon's most recently, and it'd be natural
way for machine-learning students to get Google Summer of Code funding.

> Makes me think I should have submitted a talk to ApacheCon... it'd be a great
> way to kick off this contest.

yeah definitely...

> Oh, also, should we have any requirements on runtime? How automated the
> process needs to be when it is submitted. Our experiments in LR, for example,
> contain a fairly time consuming manual step right now :-) This sort of thing
> could probably be worked out after we select a system, but guidelines might
> be good here.

Good point.  It'd need to be "fire and forget" automated, as much as the GA is
right now.  Hand-tweaking is hard work, timeconsuming, and error-prone
in my experience...

Comment 8 Duncan Findlay 2007-07-04 20:18:33 UTC
Re: Bayes rules.
No, they should not be immutable. If you want, we can require them to be "sane"
for some definition of sane. There's no compelling reason for them to be the
exact values they are.

Re: Specifying Score ranges
This is pretty tricky, I'm not sure what we can do here. The problem is that
score ranges are inherently non-mathematical, really more of a shot in the dark,
and there's no real way to evaluate them. Having a different corpus form the
test set is probably a better real-world test of the algorithm, but it also adds
a whole lot of luck (I think). If we were evaluating *algorithms* rather than
submitted sets of scores, we could try something like a cross-fold validation
but split into folds based on corpus as you suggest. I don't know if it would
work in practice (or even in theory for that matter).

Re: Evaluation
I'm not entirely sure what you're trying to say. Specifying a max FP rate and
minimizing FNs given that rate is not the same as minimizing TCR. TCR builds in
a relative cost of FPs and FNs (namely lambda), and is probably a simpler
criteria. I don't think we'll see really high FPs if we are trying to obtain
optimal TCR with lambda = 50.

(Remember that in terms of TCR(lambda=50), a score set with FP = 0.4% and FN =
0% is equivalent to FP = 0% and FN = 20% (assuming 50-50 ham/spam mix).)

Re: Fire and Forget
In development of these algorithms, it's much easier to get a process down then
automate it later. If we don't want to see/evaluate the results before requiring
people to have a fully automated system, we might be wasting effort.
Comment 9 Justin Mason 2007-07-06 04:58:20 UTC
(In reply to comment #8)
> Re: Bayes rules.
> No, they should not be immutable. If you want, we can require them to be "sane"
> for some definition of sane. There's no compelling reason for them to be the
> exact values they are.

Possibly not the exact values they are right now.  But I think we have to
disagree that they need to be immutable; I really would prefer that they are. 
Other developers' thoughts would be welcome here...

> Re: Specifying Score ranges
> This is pretty tricky, I'm not sure what we can do here. The problem is that
> score ranges are inherently non-mathematical, really more of a shot in the dark,
> and there's no real way to evaluate them. Having a different corpus form the
> test set is probably a better real-world test of the algorithm, but it also adds
> a whole lot of luck (I think). If we were evaluating *algorithms* rather than
> submitted sets of scores, we could try something like a cross-fold validation
> but split into folds based on corpus as you suggest. I don't know if it would
> work in practice (or even in theory for that matter).

maybe that would be a good additional test step.  I agree 10fcv is
really the best way to check out an algorithm's workability...


> Re: Evaluation
> I'm not entirely sure what you're trying to say. Specifying a max FP rate and
> minimizing FNs given that rate is not the same as minimizing TCR. TCR builds in
> a relative cost of FPs and FNs (namely lambda), and is probably a simpler
> criteria. I don't think we'll see really high FPs if we are trying to obtain
> optimal TCR with lambda = 50.
>
> (Remember that in terms of TCR(lambda=50), a score set with FP = 0.4% and FN =
> 0% is equivalent to FP = 0% and FN = 20% (assuming 50-50 ham/spam mix).)

ah, here's another problem with TCR I'd forgotten about.
it varies based on the size of the corpora:

: jm 372...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 1000 -ham 1000
# TCR(l=50):                  33.500000
: jm 373...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 2000 -ham 2000
# TCR(l=50):                  66.833333
: jm 374...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 3000 -ham 3000
# TCR(l=50):                  100.166667

(I can't believe I forgot about that!)
I think we should avoid it in general; if we can't compare results because
the corpus changes size, that's a bad thing.  I know it's nice to have a
single figure, but I haven't found a good one yet; nothing that's as
comprehensible or easy to use as FP%/FN%.


> Re: Fire and Forget
> In development of these algorithms, it's much easier to get a process down then
> automate it later. If we don't want to see/evaluate the results before requiring
> people to have a fully automated system, we might be wasting effort.

ok, maybe so.
Comment 10 Duncan Findlay 2007-07-06 13:43:43 UTC
Re: Bayes immutability

All I'm really trying to say is that during scoring runs, we should be changing
the BAYES_ scores. We can manually make them "sane" if necessary, but we need to
change the values periodically to reflect the changing importance of the Bayes
rules relative to other rules. (In our LR research, we would have given BAYES_99
a score of >6 if we could have, so realistically a score of 4.5 would be fair
for BAYES_99. The best way to determine what it should be is using a scoring
mechanism.)

I can agree to disagree.

Re: TCR

TCR = number of spam / (number of fns + lambda * number of fps)

If you remember what TCR represents, it makes sense that TCR depends on the
relative ratio of ham/spam of the corpus. (If you don't,
http://wiki.spamassassin.org/TotalCostRatio -- but the wiki page really
complicates the calculation) It's still fine for ranking different algorthims on
the same corpus, but if you're upset about it, I propose the following new
measurement. Let's call it the Findlay measurement:

F(lambda) = 1 / (FN% + FP% * lambda)

(Actually as defined above it's a function.)

This is exactly equal to TCR on a balanced (50/50) corpus and doesn't have the
"undesirable" properties you mentioned.

(Ok, don't call it the Findlay measurement... it's a stupid name, and it's a
fairly trivial derivation...)
Comment 11 Loren Wilton 2007-07-06 15:06:25 UTC
From a user point of view Bayes doesn't need to be immutable, but it does need 
to be monotonic, and the 99 rank needs to be a major fraction of the spam 
score, and 00 should be pretty much the negative of that.  So having bayes_99 
be 6.5 or 4.5 or 8 is fine, but having it be 3.2 or 2.6 isn't fine.

Of course from a computational point of view a score range restriction like 
that is probably worse to deal with than a fixed score, since it guarantees you 
an infinite number of possible solutions.  But from a human point of view it is 
a nice concept.
Comment 12 Justin Mason 2007-08-01 04:59:27 UTC
here's another machine-learning "Challenge" --

  http://challenge.spock.com/pages/learn_more

$50k prize on this one.  I doubt we could match that ;)
Comment 13 Justin Mason 2007-08-07 10:46:14 UTC
(In reply to comment #10)
> I propose the following new
> measurement. Let's call it the Findlay measurement:
> 
> F(lambda) = 1 / (FN% + FP% * lambda)
> 
> (Actually as defined above it's a function.)
> 
> This is exactly equal to TCR on a balanced (50/50) corpus and doesn't have the
> "undesirable" properties you mentioned.

it's not bad, but it needs work I'm afraid.  For example,
take a look at this:

  fn=3 fp=1.5 f=0.0128205128205128
  fn=18 fp=1.2 f=0.0128205128205128

in other words, if the FP% in this case dropped from 1.5% to 1.2%, that would
be enough to disguise the FN% rate dropping from 3% to *18%*. ;)

Comment 14 Duncan Findlay 2007-08-07 11:07:33 UTC
That's exactly right, one FP% is worth 50 FN%, if you set lambda = 50. If you
don't like that tradeoff, pick a different value for lambda. :-)

I agree it gets ugly when you have that many FPs...
Comment 15 Duncan Findlay 2007-08-07 11:25:15 UTC
I should point out also that where I wrote "FN%" and "FP%" I really meant the
proportion of FN/FPs and not percentage, so you need to divide the percentage by
100 before sticking it in the formula (or multiply the end result by 100). You
end up with F=1.28, not F=0.0128.
Comment 16 Mark Martinec 2007-08-07 12:51:06 UTC
Btw, the CEAS 2007 contest used the following metrics:

  Filters will be evaluated using the lam() metric. Lam() calculates the
  average of a filter's false-negative and false-positive rates, but performs
  the calculation in logit space. The exact formula used for the competition
  will be: 
    FPrate = (#ham-errors  + 0.5) / (#ham  + 0.5)
    FNrate = (#spam-errors + 0.5) / (#spam + 0.5)
    lam = invlogit((logit(FPrate) + logit(FNrate)) / 2)
  where 
    logit(p) = log(p/(1-p))
    invlogit(x) = (exp(x))/(1 + exp(x))
 The winner will be the filter with the lowest lam() score.


Here is an additional comment from Gordon V. Cormack
(on the ceas-challenge@eight.pairlist.net ML):


Lam is equivalent to diagnostic odds ratio which is
used extensively in the diagnostic test literature.

Intuitively, it rewards a fixed factor improvement
in either spam or ham missclassification rate by
the same amount.  So improving from .001 to .0009
fp gives the same score gain as from .01 to .009 fn.

Lam is *almost* the same thing as the geometric
mean of the two scores, which is perhaps more
intuitive for most.

Lam is discussed in the TREC 2005 proceedings
  http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05/

and DOR in the medical literature
  http://www.bmj.com/cgi/content/full/323/7305/157

In a nutshell, we are looking for a threshold-
independent measure that roughly captures the
overall discriminative ability of a filter.  It
turns out that lam works pretty well for this.
Better than, say "accuracy" which is wildly
threshold dependent; optimizing accuracy may
be totally inconsistent with good filtering.
While lam does not explicitly reward low fp
more than low fn, it does not actively
discourage it.

Other measures -- like ROC Area Under the
Curve -- could have been used but that would
have required filters to return scores rather
than categorical ham/spam judgements and we
considered this logistically unrealistic.
Comment 17 Justin Mason 2007-08-07 13:51:43 UTC
lam sounds promising -- let's look into that.

btw there's another issue with the F(lambda) idea -- if FP and FN
are 0 -- ie there were no misclassifications, which can happen on
small test corpora -- it yields a division by zero.
Comment 18 Sidney Markowitz 2007-08-07 23:47:11 UTC
> if FP and FN are 0 -- ie there were no misclassifications [...]
>  -- it yields a division by zero

That's an idea! Implement it and spammers will start producing perfectly
classifiable spam in an attempt to crash the filters with divide by zero errors.
We can foil them by leaving in a small bug to make perfection impossible, but
stay as close to perfect as we want!
Comment 19 Duncan Findlay 2007-08-11 01:30:04 UTC
Re: divide by 0
That's a computational issue, not a theoretical one. If there are no FPs or FNs,
then it makes sense for TCR (or F) to be infinite.

You just need to remember what TCR is, a ratio between the costs of dealing with
spam without and with a spam filter. (http://wiki.spamassassin.org/TotalCostRatio)
Comment 20 Justin Mason 2007-08-11 04:15:16 UTC
(In reply to comment #19)
> Re: divide by 0
> That's a computational issue, not a theoretical one. If there are no FPs or FNs,
> then it makes sense for TCR (or F) to be infinite.

it may make sense from the POV of TCR as an abstract concept, but it doesn't
make much sense for us to use it if it breaks like that. ;)
Comment 21 Justin Mason 2007-08-14 05:32:26 UTC
lam() avoids the problem noted in comment 13, but has another problem; it
doesn't have any concept of an FP being worse than an FN.  This means that e.g.
fp%=1.5 fn%=5 and fp%=5 fn%=1.5 have the same score, whereas the latter is much
worse for our users.
Comment 22 Justin Mason 2007-08-14 06:03:49 UTC
here's a version of lam() with a lambda calculation...

#!/usr/bin/perl
# http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376#c16
my ($lambda, $fppc, $fnpc, $nspam, $nham) = @ARGV;
my $fprate = ((($fppc * $nham) / 100) + 0.5) /  ($nham + 0.5);
my $fnrate = ((($fnpc * $nspam) / 100) + 0.5) / ($nspam + 0.5);
sub logit { my $p = shift; return log($p / (1-$p)); }
sub invlogit { my $x = shift; return exp($x) / (1 + exp($x)); }
my $llam = invlogit (($lambda * logit($fprate) + logit($fnrate)) / ($lambda + 1));
print "Llam(l=$lambda, fp=$fppc, fn=$fnpc, ns=$nspam, nh=$nham): $llam\n";



some results:

Llam(l=10, fp=1, fn=5, ns=10000, nh=10000): 0.011653428823489
Llam(l=10, fp=2, fn=5, ns=10000, nh=10000): 0.0218100293576761
Llam(l=10, fp=5, fn=1, ns=10000, nh=10000): 0.0433911604792563

so it avoids the problem with original lam().  however:

Llam(l=10, fp=1.5, fn=5, ns=10000, nh=10000): 0.0168115579465237
Llam(l=10, fp=1.0, fn=20, ns=10000, nh=10000): 0.0134020941849576

I think this is a problem.  IMO a 20% FN rate/1.0% FP rate should not score
better than 5% FN/1.5% FP.  it's good drop of the FP rate, sure -- but a filter
with 20% FNs is unusable. :(

so:

  TCR: varies widely based on size of corpora
  F(): good FP rates can mask terrible FN rates
  lam(): treats FPs and FNs as equal, no concept of lambda
  Llam(): again, good FP rates can mask terrible FN rates

we still don't have a good single-figure metric imo.
Comment 23 Justin Mason 2007-08-22 04:42:10 UTC
I think we should go ahead with using FP% and FN% -- a double-figure metric --
instead of any single figure metric.  I don't think any of the single-figure
metrics perform well enough in all situations to match FP%/FN%'s expressiveness.