SA Bugzilla – Bug 4095
Using Bayesian Filters to score rules
Last modified: 2005-01-24 01:25:18 UTC
I want to throw out a thought. I think we can get rid of scores for rules and let a Bayesian filter do automatic scoring. Here's how it would work. We keep the rules, with an indication as to whether each rule is initially a black (spam) or white (ham) rule. As the initial messages come in they are evaluated against the rules, and the list of triggered rules is fed into a SEPARATE Bayesian filter that is used only to score rules. If a message is sufficiently extreme ham or spam then it is autolearned by ALL the Bayesian filters. Once the system is trained, the Bayesian filter for the rules is what generates the score.

We also have to rethink the idea of scores, because a score will be a fraction between 0 and 1 instead of points that are added or subtracted. The result is not a yes or no, but rather a fraction that indicates how spammy or hammy the message is. This result can be used to decide what to do with the message. On my system I have gotten away from the "this is spam" model. I have many classifications:

  ham           - autolearned
  nonspam       - not spam, but not sure enough to autolearn
  low-spam      - this is probably spam, but a few false positives end up here
  high-spam     - these messages are bounced to the sender - autolearned
  veryhigh-spam - these messages are just dropped, so as not to become bounce spam - autolearned
  pure-trash    - I drop these at connect time

The idea is that the scoring of these rules is automatic, based on the reliability of the hits on the rules - and - the score varies from server to server based on the kind of spam and ham received. After the filter is trained you can write any rule you want, and if you write a good rule it will develop a good score. Rules that score in the middle can be automatically culled. This Bayesian filter is separate and apart from the other Bayesian filters. The other Bayesian filters report to this filter with their (fractional) results and are also automatically evaluated.
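To make the proposal concrete, here is a minimal Python sketch of the idea, written by me for illustration only (the rule names, the smoothing formula, and the classification cutoffs are all my own invented choices, not anything from SA): a separate Bayes-style filter whose tokens are rule names rather than words, producing a fraction between 0 and 1 that is then mapped onto classifications like the ones listed above.

```python
import math
from collections import defaultdict

class RuleBayes:
    """Toy Bayesian filter whose tokens are rule names, not message words."""
    def __init__(self):
        self.spam_hits = defaultdict(int)  # rule name -> hits in spam
        self.ham_hits = defaultdict(int)   # rule name -> hits in ham
        self.nspam = 0
        self.nham = 0

    def learn(self, rules, is_spam):
        # feed the list of triggered rules in as learning tokens
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for r in rules:
            (self.spam_hits if is_spam else self.ham_hits)[r] += 1

    def rule_prob(self, rule):
        # per-rule spamminess with simple smoothing toward 0.5 for
        # rarely seen rules (an arbitrary illustrative formula)
        s = self.spam_hits[rule] / max(self.nspam, 1)
        h = self.ham_hits[rule] / max(self.nham, 1)
        n = self.spam_hits[rule] + self.ham_hits[rule]
        p = s / (s + h) if (s + h) else 0.5
        return (0.5 + n * p) / (1 + n)

    def score(self, rules):
        # combine per-rule probabilities in log-odds space -> 0..1
        logodds = sum(math.log(p / (1 - p))
                      for p in (self.rule_prob(r) for r in rules))
        return 1 / (1 + math.exp(-logodds))

# hypothetical cutoffs mapping the fraction onto the classifications above
CLASSES = [(0.05, "ham"), (0.60, "nonspam"), (0.80, "low-spam"),
           (0.95, "high-spam"), (0.99, "veryhigh-spam"), (1.01, "pure-trash")]

def classify(score):
    for cutoff, label in CLASSES:
        if score < cutoff:
            return label
```

Note that no scores are ever assigned by hand; a good rule drifts toward one end of the 0..1 scale as it is learned, and a useless rule sits near 0.5, which is what makes automatic culling possible.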
I'm getting close to 99.9% accuracy with the tricks I'm doing so far. This one could really kick the accuracy up there - but it requires a significant shift in the way you think about spam and scoring.
BTW - if we can come up with a way to do it - I'd love to challenge anyone here to a spam filtering contest. Marc Perkel
Thanks for your suggestions, but we really need working SA code to evaluate something elaborate like this. Similar ideas have been discussed before, but I don't believe the improvement was statistically significant. And no code...
Subject: Re: Using Bayesian Filters to score rules

I know it would take a massive change - but I believe it will be worth it. I have been testing using a second Bayesian filter and the results are incredible. The second filter is spamprobe - but I'm feeding it just the headers and, from the body of the message, only the links, phone numbers, and email addresses. And it is close to 100% accurate, especially where the normal Bayes filter fails. It is working so well that I believe this could easily be adapted to score only rules.

I know it's going to take time to warm up to the idea. I figure if I start talking about it now, then in two years everyone will finally get it. ;) But - it is a different way of thinking about things - and - you do have to roll it around in your mind for quite a while before you see the big picture. I think I could probably cobble something together out of spamprobe and a little perl to try it out. I'll let you all know if I get it working. If I do - then you smart folks can go back and do it right.
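The "headers plus hot parts" preprocessing described above could be sketched roughly as follows. This is my own guess at the extraction step, not the author's actual setup or spamprobe code, and the regexes are deliberately crude illustrations:

```python
import re

# Crude illustrative patterns; a real extractor would need a proper
# URI parser and smarter phone-number handling.
URL_RE   = re.compile(r'https?://[^\s">]+|www\.[^\s">]+', re.I)
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def hot_parts(headers, body):
    """Keep the full headers, but from the body keep only the 'hot'
    tokens: links, email addresses, and phone numbers."""
    kept = [headers.strip()]
    for rx in (URL_RE, EMAIL_RE, PHONE_RE):
        kept.extend(rx.findall(body))
    return "\n".join(kept)
```

The output of `hot_parts()` is what would be fed to the second filter, so the bulk of the body text never reaches it.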
Subject: Re: New: Using Bayesian Filters to score rules

> Once the system is trained then the bayesian filter for the rules is what
> generates the score.

1. How do you decide when the system is sufficiently trained to switch over to scoring this way?

2. What do you use for scoring until the switchover?

3. This implies that you have rules for most all of the things that are interesting spam (and ham) signs. This is a lot of work maintaining rules, which currently isn't as mandatory with the current Bayes setup.

4. If you don't have a rule for something then you can't score it. Currently Bayes effectively generates the equivalent of its own rule for that new spam sign.
Subject: Re: Using Bayesian Filters to score rules

lwilton@earthlink.net wrote (2005-01-22 21:12):

> 1. How do you decide when the system is sufficiently trained to switch
> over to scoring this way?

What I envision is that rules are listed as white or black initially. No scores are assigned. Then the initial messages are tested, and each returns a list of rules triggered. If a message triggers, say, 3/4 black rules or more then it is learned as spam. If it triggers 3/4 white rules, then it is learned as ham. Once the learning process begins, the rules themselves develop scores just like Bayesian tokens develop scores. As the system learns, the scoring takes care of itself. Messages end up with a number between 0 and 1, and then you just have to figure out where you want to call it spam and what to do with it.

> 2. What do you use for scoring until the switchover?

See above.

> 3. This implies that you have rules for most all of the things that are
> interesting spam (and ham) signs. This is a lot of work maintaining rules,
> which currently isn't as mandatory with the current Bayes setup.

We still use the same rules we have now - we just don't need to score them. If a rule is triggered, then the name of the rule goes into the Bayesian filter.

> 4. If you don't have a rule for something then you can't score it.
> Currently Bayes effectively generates the equivalent of its own rule for
> that new spam sign.

Yes - Bayes generates its own scores off of tokens. And we still have Bayesian filters that look at the message. But what I'm saying is that we also have another Bayesian filter that is fed only the list of rules that were triggered, and it does the final scoring.
And - some of the rules in that list include the results of other Bayesian filters looking at the message. I hope I'm not losing everyone with this concept. It's really hard to get the big picture into words.
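The "3/4 black rules" bootstrap described above could look something like this. This is a sketch under my own assumptions (the rule colour table, the function names, and the 0.75 threshold as a stand-in for "3/4" are all invented for illustration):

```python
# Hypothetical rule -> colour table; 'black' marks spam signs and
# 'white' marks ham signs, as in the initial rule listing.
RULE_COLOUR = {
    "VIAGRA_SUBJ": "black", "HTML_ONLY": "black", "FAKE_HELO": "black",
    "SIGNED_MAIL": "white", "IN_WHITELIST": "white",
}

def bootstrap_verdict(triggered, threshold=0.75):
    """Return 'spam' or 'ham' when at least `threshold` of the triggered
    rules agree on a colour; return None (skip learning) otherwise."""
    colours = [RULE_COLOUR[r] for r in triggered if r in RULE_COLOUR]
    if not colours:
        return None
    black = colours.count("black") / len(colours)
    if black >= threshold:
        return "spam"
    if 1 - black >= threshold:
        return "ham"
    return None
```

Messages that come back `None` are simply not used for training, so only the clear-cut cases seed the rule-scoring filter; after that, the learned per-rule scores take over from the black/white labels.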
I looked into multiple levels of Bayesian filters, and eventually realised that it's actually no different from a single Bayesian filter; it all comes down to which tokens are selected. Using a bigger token db has the same effect as using 2 token dbs. I also tested ideas about giving parts of the message more statistical importance (back in early 2002, iirc?), and found it to be less effective when tested using 10-fold cross-validation. Henry, Matt and I also tested using Bayes instead of SA's additive rule scoring, without any useful results. I'd suggest we'd need to see results from a 10-fold cross-validation run to convince us that things haven't changed since those tests. (doco on this is in the wiki btw.)
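For anyone unfamiliar with the evaluation method being asked for, here is a generic 10-fold cross-validation harness. This is only a sketch of the general technique, not the SA test rig from the wiki; the `train` and `predict` callables are placeholders for whatever classifier is under test:

```python
import random

def ten_fold(messages, labels, train, predict, folds=10, seed=0):
    """Shuffle the corpus, hold out each tenth in turn, train on the
    rest, and return the mean held-out accuracy across the folds."""
    idx = list(range(len(messages)))
    random.Random(seed).shuffle(idx)
    accs = []
    for f in range(folds):
        test = set(idx[f::folds])  # every folds-th shuffled index held out
        model = train([(messages[i], labels[i])
                       for i in idx if i not in test])
        hits = sum(predict(model, messages[i]) == labels[i] for i in test)
        accs.append(hits / len(test))
    return sum(accs) / folds
```

Because every message is scored exactly once while held out of training, the mean accuracy is a fairer estimate than testing on the training corpus, which is why results from a run like this are being requested before the proposal can be evaluated.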
*** Bug 4089 has been marked as a duplicate of this bug. ***
Subject: Re: Using Bayesian Filters to score rules

I'm seeing significantly different results than what you have seen. When I exclude most of the message body - except for the "hot" parts (links, email addresses, phone numbers) - and enhance the headers with some extra DNS info, I'm seeing more accurate results. The reason this works is that the difference between spam and ham in the body of the messages isn't as great as in the parts of the messages I'm looking at.

Look at it this way - if I can use an analogy. Excluding the body is like having a bath tub 1/3 full of very hot water. Including the body is like having the bath tub full of warm water. The full tub might contain more total heat - but less temperature. And I think temperature - not total heat - is the best way to detect spam accurately. What I'm saying is that the bulk of the message body dilutes the Bayesian results, moving messages towards the center of the scale. Stripping out the bulk of the body moves the results towards the ends of the scale.

And - getting back to the subject of this bug - I hope to be able to try replacing scores on rules with automatic Bayesian scoring some time this week. I'll let you know how it does.