SA Bugzilla – Bug 1829
autolearning should automatically balance ham/spam
Last modified: 2006-12-30 19:39:04 UTC
jm writes regarding the loss of the forgeable compensation rules:

> as noted in mail, this may be very bad news for auto-learning, since it'll
> massively reduce the # of mails with score < -2. The AWL helps a bit, since
> that'll bring scores down, but without -a, no auto-learned ham will be found.

My proposal is to keep a history of non-bayes message scores. Sort the history by the score. Learn on messages in the bottom quartile as ham and in the top quartile as spam.

This has several advantages:

1. adapts to changes in configuration and spam over time
2. easier to configure for users (only one option is needed, the percentage off the top and bottom to use)
3. automatically balances amount of spam and ham being used for autolearning
4. doesn't require us to ship negative rules, but still allows SA to easily adapt if the user has added their own negative rules. :-)
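To make the proposal concrete, here is a minimal sketch of the quartile idea in Python. It is not SpamAssassin code: the class name `ScoreHistory`, the window cap `max_size`, and the warm-up count `min_size` are assumptions made for illustration. Only the single "percentage off the top and bottom" option (`fraction` here) comes from the proposal.

```python
from bisect import bisect_left, insort
from collections import deque


class ScoreHistory:
    """Sliding window of non-Bayes scores (hypothetical helper, not SA code)."""

    def __init__(self, fraction=0.25, max_size=1000, min_size=20):
        self.fraction = fraction   # share taken off the top and off the bottom
        self.max_size = max_size   # assumed cap on the history length
        self.min_size = min_size   # assumed warm-up before anything is learned
        self._arrival = deque()    # scores in arrival order, used for eviction
        self._sorted = []          # the same scores, kept sorted

    def add(self, score):
        """Record a score, evicting the oldest one once the window is full."""
        if len(self._arrival) >= self.max_size:
            oldest = self._arrival.popleft()
            del self._sorted[bisect_left(self._sorted, oldest)]
        self._arrival.append(score)
        insort(self._sorted, score)

    def classify(self, score):
        """Return "ham", "spam", or None (do not learn) for a score."""
        n = len(self._sorted)
        k = int(n * self.fraction)
        if n < self.min_size or k == 0:
            return None
        ham_cutoff = self._sorted[k - 1]   # highest score in the bottom slice
        spam_cutoff = self._sorted[n - k]  # lowest score in the top slice
        if ham_cutoff >= spam_cutoff:
            return None                    # history too uniform to separate
        if score <= ham_cutoff:
            return "ham"
        if score >= spam_cutoff:
            return "spam"
        return None


history = ScoreHistory()
for s in range(20):
    history.add(float(s))
print(history.classify(3.0))   # ham
print(history.classify(10.0))  # None
print(history.classify(17.0))  # spam
```

Because the cutoffs are positions in the sorted history rather than fixed scores, the same number of messages falls into each slice, which is where the ham/spam balance in point 3 comes from.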
Subject: Re: [SAdev] New: autolearning discriminator should be replaced for 2.60

The problem with whitelisting is that there are very few rules you can write to give white points that spammers can't fake. I have one rule that gives a point - just one point - for not having any links in the message. I've done a number of tricks to get non-spam into bayes lately. The problem is that if everyone did what I'm doing then the spammers could do the same thing. It works because it's not an official set of rules.

Having said that - I have created a number of semi-secret white rules that are somewhat customized to my userbase. I noticed - for example - that users subscribe to a number of lists and I have whitelisted mail coming from these lists. I have also looked at where messages link to and if they have links to popular news sources - I had a few white points for that. I also host a lot of people who talk about political and legal issues. In these discussions there are common words and phrases that I have never seen in spam and I give a few white points for talking about politics and law. Then - there are hosts that are 100% non-spam hosts and I can look in the received lines for these hosts knowing that everything coming from them is not spam. Combining these tricks has really whitened my non-spam, to the point where many messages are getting double digit negative scores.

Having said that - what makes my tricks work is that this list applies to my servers alone and is not part of spam assassin. If these rules of mine became part of spam assassin then they would quickly turn from white rules to black rules as they are abused. But - the trick is to write some rules that look at these things and have the ability to keep the list in a separate file - so the rule refers to the file that contains the secret white information. That is - Duncan - why I'm asking for rules that refer to lists - prod prod..
Anyhow - I feel your pain - I have something that works but I can't recommend adding it to spam assassin or it will break it for me. But it's not hard to create something for yourselves that gathers non-spam on your system using personal rules. However - I think we need to come up with a trick to allow admins to easily personalize white rules in a text list format so as to have good data for training Bayes.
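As an illustration of a rule that refers to a private list kept in a separate text file, here is a small Python sketch. Everything in it is an assumption: the file format (one host per line, `#` for comments), the function names `load_list` and `white_points`, and the -1.0 score. It is not an existing SpamAssassin rule type.

```python
import re
from email import message_from_string


def load_list(lines):
    """Parse a hypothetical list file: one entry per line, '#' starts a comment."""
    entries = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip().lower()
        if entry:
            entries.add(entry)
    return entries


def white_points(raw_message, trusted_hosts, points=-1.0):
    """Return `points` if any Received header names a trusted host, else 0.0.

    A real rule should only look at Received lines added by relays the
    admin controls, since earlier ones can be forged by the sender.
    """
    msg = message_from_string(raw_message)
    for header in msg.get_all("Received", []):
        for token in re.findall(r"[A-Za-z0-9.-]+", str(header)):
            if token.strip(".").lower() in trusted_hosts:
                return points
    return 0.0
```

With something like this, `trusted_hosts = load_list(open(path))` keeps the site-specific data out of the shipped rules, so the list stays private to each installation.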
Note: This bug is about the autolearning discrimination code. The exact nature of the rules, positive or negative, being fed into it are orthogonal to how we decide to learn on a message or not.
Daniel, this is a good idea, and makes sense to do for 2.60. However, let me point you to my earlier (long) comments in bug #1589. I think you're hitting the same sort of thing. Let's broaden our thinking here a bit and consider that AWLs are essentially just another way of giving tokens a history and judging any given token based on the history. Sound familiar? In my proposed long-term goal, AWL would be redundant, and would be replaced by a header token of the form "token:fromaddr:ajs@ajs.com" which would be figured into the final result just like every other token, and AWL fades away into being yet-another-special-case. In fact this also gets you weighting based on the presence of any given header in a way that cannot be generally abused by spammers. Go check it out, and see what you think. I still agree that this proposed change is the right thing for 2.60, as what I propose will take at least a couple of months to put together.
Subject: Re: autolearning discriminator should be replaced for 2.60 > However, let me point you to my earlier (long) comments in bug #1589 Please create a new bug for this idea (which is interesting). You don't need to attach it to every barely-related bug in the database to get someone to read it. Daniel
Subject: Re: [SAdev] autolearning discriminator should be replaced for 2.60 I can see the logic in considering AWL to be redundant. My thinking is that AWL is based on the From address, and spammers can put anything in there for a From address. Most spam with links in it has a bogus From address anyhow. As to the AWL concept, it seems like something that would be more identifying than the From address is the Received lines - the hosts where spam comes from - and what spam links to. Those take a lot more trouble to fake.
I would like to second the original proposal in this bug. My ISP does not yet have an interface to sa-learn, only spamc to a server dedicated to SpamAssassin/spamd. To get Bayes rules to work I increased the autolearn ham threshold to 2.0 (based on noticing that I am not getting false negatives with lower scores than that) until I can gather 200 autolearned hams. Instead of maintaining a history of non-bayes scores to find certain percentile thresholds, would it work as well to simply update an average ham and average spam score? Or also approximate values for standard deviation? I don't see why it needs to be so exact as to keep a full history and use exact percentile thresholds.
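A rough sketch of that cheaper alternative, for comparison: keep only a running mean and standard deviation (Welford's online update) instead of the full history, and place the thresholds a fixed number of standard deviations from the mean. The names `RunningStats` and `thresholds` are made up for this example. The default `k` of 0.674 is the point where a normal distribution has its quartiles, and assuming scores are roughly normal is a simplification that real score distributions may not satisfy.

```python
import math


class RunningStats:
    """Running mean and standard deviation without storing past scores."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, score):
        """Fold one score into the statistics (Welford's algorithm)."""
        self.count += 1
        delta = score - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (score - self.mean)

    @property
    def stddev(self):
        """Population standard deviation of the scores seen so far."""
        if self.count == 0:
            return 0.0
        return math.sqrt(self._m2 / self.count)


def thresholds(stats, k=0.674):
    """Return (ham_cutoff, spam_cutoff) as mean -/+ k standard deviations."""
    spread = k * stats.stddev
    return stats.mean - spread, stats.mean + spread
```

This uses constant memory, but unlike exact percentiles it does not guarantee equal numbers of learned ham and spam when the score distribution is skewed.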
since it looks like there's been no work done on this, I'm punting it to 2.70.
*** Bug 2073 has been marked as a duplicate of this bug. ***
lowering pri on this. face it, the AWL works ;)
*** Bug 2229 has been marked as a duplicate of this bug. ***
re-raising priority, this bug is not about AWL despite off-topic comments from people
*** Bug 1722 has been marked as a duplicate of this bug. ***
move bug to Future milestone (previously set to Future -- I hope)
Ok, nothing has ever happened with this ticket, so closing. This kind of thing can be done via a plugin now anyway. :)