Bug 4095 - Using Bayesian Filters to score rules
Summary: Using Bayesian Filters to score rules
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: unspecified
Hardware: Other other
Importance: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 4089
Depends on:
Blocks:
 
Reported: 2005-01-22 10:00 UTC by Marc Perkel
Modified: 2005-01-24 01:25 UTC (History)
0 users




Description Marc Perkel 2005-01-22 10:00:28 UTC
I want to throw out a thought. I think we can get rid of scores for rules and
let a Bayesian filter do automatic scoring.

Here's how it would work. We keep the rules, with an indication of whether each
rule is initially a black or white rule. As the initial messages come in, they
are evaluated against the rules, and the list of triggered rules is fed into a
SEPARATE Bayesian filter that is used only to score rules. If a message is
sufficiently extreme ham or spam, then it is autolearned by ALL the Bayesian
filters.

Once the system is trained then the bayesian filter for the rules is what
generates the score.

We also have to rethink the idea of scores, because a score will be a fraction
between 0 and 1 instead of points that are added or subtracted. The result is
not a yes or no, but rather a fraction that indicates how spammy or hammy the
message is. This result can be used to decide what to do with the message.

On my system I have moved away from the "this is spam" model. I have many
classifications:

ham - autolearned
nonspam - not spam, but not sure enough to autolearn
low-spam - probably spam, but a few false positives end up here
high-spam - these messages are bounced back to the sender - autolearned
veryhigh-spam - these messages are just dropped, so as not to become bounce
spam - autolearned
pure-trash - I drop these at connect time
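A toy mapping from the fractional score to the tiers above might look like this (the cutoff values are hypothetical; Marc doesn't give his actual thresholds, and pure-trash is rejected at connect time before scoring ever runs):

```python
# Hypothetical cutoffs for illustration only.
def classify(score):
    """Map a fractional 0..1 spam score to one of the tiers above."""
    if score < 0.05:
        return "ham"           # confident ham: autolearn
    if score < 0.50:
        return "nonspam"       # not spam, but not sure enough to autolearn
    if score < 0.80:
        return "low-spam"      # probably spam; quarantine rather than drop
    if score < 0.95:
        return "high-spam"     # bounce back to sender; autolearn as spam
    return "veryhigh-spam"     # drop silently; autolearn as spam
```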

The idea is that the scoring of these rules is automatic, based on the
reliability of the hits on those rules - and the score varies from server to
server based on the kind of spam and ham received. After the filter is trained,
you can write any rule you want, and if you write a good rule, it will develop
a good score. Rules that score in the middle can be automatically culled.

This Bayesian filter is separate and apart from the other Bayesian filters. The
other Bayesian filters report their (fractional) results to this filter and are
also automatically evaluated.

I'm getting close to 99.9% accuracy with the tricks I'm using so far. This one
could really kick the accuracy up there - but it requires a significant shift
in the way you think about spam and scoring.
Comment 1 Marc Perkel 2005-01-22 10:13:11 UTC
BTW - if we can come up with a way to do it - I'd love to challenge anyone here
to a spam filtering contest.

Marc Perkel
Comment 2 Daniel Quinlan 2005-01-22 14:20:28 UTC
Thanks for your suggestions, but we really need working SA code to evaluate
something elaborate like this.  Similar ideas have been discussed before, but
I don't believe the improvement was statistically significant.  And no code...
Comment 3 Marc Perkel 2005-01-22 14:31:31 UTC
Subject: Re:  Using Bayesian Filters to score rules

I know it would take a massive change - but I believe it will be worth it.

I have been testing with a second Bayesian filter and the results are
incredible. The second filter is spamprobe, but I'm feeding it just the
headers, and from the body of the message only the links, phone numbers,
and email addresses. It is close to 100% accurate, especially
where the normal Bayes filter fails. It is working so well that I believe
it could easily be adapted to score only rules.

I know it's going to take time to warm up to the idea. I figure if I
start talking about it now, then in two years everyone will finally get
it. ;)

But - it is a different way of thinking about things - and - you do have 
to roll it around in your mind for quite a while before you see the big 
picture. I think I could probably cobble something together out of 
spamprobe and a little perl to try it out. I'll let you all know if I 
get it working. If I do - then you smart folks can go back and do it right.
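The extraction Marc describes - keep the headers whole, and from the body keep only the links, email addresses, and phone numbers - could be sketched like this (the regexes are illustrative guesses, not the ones he actually used):

```python
import re

# Illustrative patterns for the "hot" parts of a message body.
URL_RE = re.compile(r'https?://[^\s<>"]+', re.I)
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
PHONE_RE = re.compile(
    r'\b(?:\+?\d{1,3}[-. ]?)?(?:\(\d{3}\)|\d{3})[-. ]?\d{3}[-. ]?\d{4}\b')

def hot_parts(headers, body):
    """Return the headers plus only the links/emails/phones from the body,
    discarding the rest of the body text."""
    tokens = [headers]
    for pat in (URL_RE, EMAIL_RE, PHONE_RE):
        tokens.extend(pat.findall(body))
    return "\n".join(tokens)
```

The output, rather than the raw message, is what would be fed to the secondary filter.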

Comment 4 Loren Wilton 2005-01-22 21:12:15 UTC
Subject: Re:   New: Using Bayesian Filters to score rules

> Once the system is trained then the bayesian filter for the rules is what
> generates the score.

1.    How do you decide when the system is sufficiently trained to switch
over to scoring this way?

2.    What do you use for scoring until the switchover?

3.    This implies that you have rules for most all of the things that are
interesting spam (and ham) signs.  This is a lot of work maintaining rules,
which currently isn't as mandatory with the current Bayes setup.

4.    If you don't have a rule for something then you can't score it.
Currently Bayes effectively generates the equivalent of its own rule for
that new spam sign.

Comment 5 Marc Perkel 2005-01-22 22:03:48 UTC
Subject: Re:  Using Bayesian Filters to score rules



bugzilla-daemon@bugzilla.spamassassin.org wrote:

>1.    How do you decide when the system is sufficiently trained to switch
>over to scoring this way?
What I envision is that rules are initially listed as white or black. No
scores are assigned. Then the initial messages are tested, and each returns
a list of the rules it triggered. If a message triggers, say, 3/4 black
rules or more, then it is learned as spam. If it triggers 3/4 white rules,
then it is learned as ham. Once the learning process begins, the rules
themselves develop scores just like Bayesian tokens develop scores. As the
system learns, the scoring takes care of itself. Messages end up with a
number between 0 and 1, and then you just have to figure out where you want
to call it spam and what to do with it.
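The bootstrap phase described above might be sketched like this (the 3/4 threshold comes from the comment; the function and everything else in it are assumptions for illustration):

```python
# Sketch: before any scores exist, autolearn a message only when at least
# 3/4 of its triggered rules agree on one colour.
def bootstrap_verdict(triggered, colour_of, threshold=0.75):
    """triggered: list of rule names that fired on the message.
    colour_of: mapping of rule name -> 'black' or 'white'.
    Returns 'learn-spam', 'learn-ham', or None (too ambiguous to learn)."""
    if not triggered:
        return None
    black = sum(1 for r in triggered if colour_of[r] == "black")
    frac = black / len(triggered)
    if frac >= threshold:
        return "learn-spam"
    if 1 - frac >= threshold:
        return "learn-ham"
    return None
```

Messages that fall in the ambiguous middle are simply not learned, which answers question 2: until the switchover, only the extreme messages feed the filter.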

>2.    What do you use for scoring until the switchover?
See above.

>3.    This implies that you have rules for most all of the things that are
>interesting spam (and ham) signs.  This is a lot of work maintaining rules,
>which currently isn't as mandatory with the current Bayes setup.
We still use the same rules we have now - we just don't need to score
them. If a rule is triggered, then the name of the rule goes into the
Bayesian filter.

>4.    If you don't have a rule for something then you can't score it.
>Currently Bayes effectively generates the equivalent of its own rule for
>that new spam sign.
Yes - Bayes generates its own scores from tokens. And we still have
Bayesian filters that look at the message. But what I'm saying is that
we also have another Bayesian filter that is fed only the list of rules
that were triggered, and it does the final scoring. And some of the
rules in that list include the results of the other Bayesian filters
looking at the message.

I hope I'm not losing everyone in this concept. It's really hard to get 
the big picture into words.

Comment 6 Justin Mason 2005-01-24 10:21:04 UTC
I looked into multiple levels of Bayesian filters, and eventually realised that
it's actually no different from a single Bayesian filter; it all comes down to
what tokens are selected.  Using a bigger token db has the same effect as using
2 token dbs.

I also tested ideas about giving parts of the message more statistical
importance (back in early 2002 iirc?), and found it to be less effective when
tested using 10-fold cross-validation.

Henry, Matt and I also tested using Bayes instead of SA's additive rule
scoring, without any useful results.

I'd suggest we'd need to see results from a 10-fold cross-validation run to
convince us that things haven't changed since those tests.  (Docs on this are
in the wiki, btw.)
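For reference, a generic 10-fold cross-validation harness of the kind Justin suggests could look like this (a sketch, not the SA mass-check tooling; `train` and `predict` stand in for whatever classifier is under test):

```python
import random

def cross_validate(messages, labels, train, predict, k=10, seed=42):
    """Shuffle once, split into k folds, and return mean held-out accuracy."""
    idx = list(range(len(messages)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = set(folds[i])
        train_idx = [j for j in idx if j not in test]
        model = train([messages[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        correct = sum(predict(model, messages[j]) == labels[j]
                      for j in folds[i])
        accs.append(correct / len(folds[i]))
    return sum(accs) / k
```

Each message is held out exactly once, so a proposed scoring scheme is always evaluated on mail it never trained on.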
Comment 7 Justin Mason 2005-01-24 10:25:18 UTC
*** Bug 4089 has been marked as a duplicate of this bug. ***
Comment 8 Marc Perkel 2005-01-24 11:19:03 UTC
Subject: Re:  Using Bayesian Filters to score rules

I'm seeing significantly different results than what you have seen. When
I exclude most of the message body - except for the "hot" parts: links,
email addresses, and phone numbers - and I enhance the headers with some
extra DNS info, I'm seeing more accurate results.

The reason this works is because the difference in the body of the 
messages between spam and ham isn't as great as the parts of the 
messages I'm looking at.

Look at it this way - if I can use an analogy.

By excluding the body, it's like having a bath tub 1/3 full of very hot
water. By including the body, it's like having the bath tub full of warm
water. The full tub might contain more total heat - but at a lower
temperature. And I think temperature - not total heat - is the best
way to detect spam accurately.

What I'm saying is that the bulk of the message body dilutes the 
bayesian results moving messages towards the center of the scale. 
Stripping out the bulk of the body makes the results move towards the 
ends of the scale.

And - getting back to the subject of this bug - I hope to be able to try 
replacing scores on rules with automatic bayesian scoring some time this 
week. I'll let you know how it does.