Bug 1829 - autolearning should automatically balance ham/spam
Summary: autolearning should automatically balance ham/spam
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other / other
Importance: P3 normal
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard: needs code
Keywords:
Duplicates: 1722 2073 2229
Depends on:
Blocks:
 
Reported: 2003-04-28 00:16 UTC by Daniel Quinlan
Modified: 2006-12-30 19:39 UTC
CC List: 3 users




Description Daniel Quinlan 2003-04-28 00:16:19 UTC
jm writes regarding the loss of the forgeable compensation rules:

> as noted in mail, this may be very bad news for auto-learning, since it'll
> massively reduce the # of mails with score < -2.   The AWL helps a bit, since
> that'll bring scores down, but without -a, no auto-learned ham will be found.

My proposal is to keep a history of non-bayes message scores.  Sort the history
by the score.  Learn on messages in the bottom quartile as ham and in the
top quartile as spam.

This has several advantages:

1. adapts to changes in configuration and spam over time
2. easier to configure for users (only one option is needed, the percentage
   off the top and bottom to use)
3. automatically balances amount of spam and ham being used for autolearning
4. doesn't require us to ship negative rules, but still allows SA to easily
   adapt if the user has added their own negative rules.  :-)
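For illustration, the quartile-based discriminator described above could be sketched roughly as follows. This is a minimal sketch, not SpamAssassin code; the class name, the `classify` interface, and the default history size are all assumptions:

```python
from collections import deque

class QuartileAutolearner:
    """Keep a rolling history of non-Bayes message scores and learn
    only on messages falling in the configured bottom/top percentile."""

    def __init__(self, percent=25, history_size=1000):
        self.percent = percent              # percentage off each end to use
        self.history = deque(maxlen=history_size)

    def classify(self, score):
        """Return 'ham', 'spam', or None (don't learn) for this score."""
        self.history.append(score)
        ranked = sorted(self.history)
        k = len(ranked) * self.percent // 100
        if k == 0:
            return None                     # not enough history yet
        ham_cutoff = ranked[k - 1]          # bottom-quartile boundary
        spam_cutoff = ranked[-k]            # top-quartile boundary
        if score <= ham_cutoff:
            return "ham"
        if score >= spam_cutoff:
            return "spam"
        return None
```

The single `percent` option corresponds to point 2 above: it is the only knob a user would set, and both cutoffs adapt automatically as the score history shifts.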
Comment 1 Marc Perkel 2003-04-28 07:10:19 UTC
Subject: Re: [SAdev]  New: autolearning discriminator should be
 replaced for 2.60

The problem with whitelisting is that there are very few rules you can 
write to give white points that spammers can't fake. I have one rule 
that gives a point - just one point - for not having any links in the 
message.

I've done a number of tricks to get non-spam into bayes lately. The 
problem is that if everyone did what I'm doing then the spammers could 
do the same thing. It works because it's not an official set of rules.

Having said that - I have created a number of semi-secret white rules 
that are somewhat customized to my userbase. I noticed - for example - 
that users subscribe to a number of lists and I have whitelisted mail 
coming from these lists. I have also looked at where messages link to, 
and if they have links to popular news sources - I add a few white points 
for that.

I also host a lot of people who talk about political and legal issues. 
In these discussions there are common words and phrases that I have 
never seen in spam and I give a few white points for talking about 
politics and law.

Then - there are hosts that are 100% non-spam hosts, and I can look in the 
received lines for these hosts knowing that everything coming from them 
is not spam. Combining these tricks has really whitened my non-spam, 
where many messages are getting double-digit negative scores.

Having said that - what makes my tricks work is that this list applies to 
my servers alone and is not part of SpamAssassin. If these rules of 
mine became part of SpamAssassin then they would quickly turn from 
white rules to black rules as they are abused. But - the trick is to 
write some rules that look at these things and have the ability to keep 
the list in a separate file - so the rule refers to the file that 
contains the secret white information.

That is - Duncan - why I'm asking for rules that refer to lists - prod 
prod..

Anyhow - I feel your pain - I have something that works, but I can't 
recommend adding it to SpamAssassin or it will break it for me. But 
it's not hard to create something for yourselves that gathers non-spam 
on your system using personal rules. However - I think we need to come 
up with a trick to allow admins to easily personalize white rules in a 
text list format so as to have good data for training Bayes.


Comment 2 Daniel Quinlan 2003-04-28 14:11:05 UTC
Note: This bug is about the autolearning discrimination code.  The exact
nature of the rules, positive or negative, being fed into it is orthogonal
to how we decide whether to learn on a message or not.
Comment 3 Aaron Sherman 2003-05-02 09:31:49 UTC
Daniel, this is a good idea, and makes sense to do for 2.60.

However, let me point you to my earlier (long) comments in bug #1589

I think you're hitting the same sort of thing. Let's broaden our thinking here a
bit and consider that AWLs are essentially just another way of giving tokens a
history and judging any given token based on that history. Sound familiar?

In my proposed long-term goal, AWL would be redundant, and would be replaced by
a header token of the form "token:fromaddr:ajs@ajs.com" which would be figured
into the final result just like every other token, and AWL fades away into being
yet-another-special-case. In fact this also gets you weighting based on the
presence of any given header in a way that cannot be generally abused by
spammers. Go check it out, and see what you think. I still agree that this
proposed change is the right thing for 2.60, as what I propose will take at
least a couple of months to put together.
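The header-token idea could be sketched like this. Only the "token:fromaddr:..." format comes from the comment above; the function, the second header name, and the normalization are illustrative assumptions:

```python
def header_tokens(headers):
    """Emit Bayes-style tokens (e.g. 'token:fromaddr:ajs@ajs.com') for
    selected message headers, so the classifier builds per-header
    history the same way it does for any other token.

    `headers` is a dict such as {'From': 'ajs@ajs.com'}."""
    # Which headers to tokenize, and the token label for each
    # (the Reply-To entry is a hypothetical addition).
    wanted = {"From": "fromaddr", "Reply-To": "replytoaddr"}
    tokens = []
    for name, label in wanted.items():
        value = headers.get(name)
        if value:
            # Lowercase so 'AJS@ajs.com' and 'ajs@ajs.com' share history.
            tokens.append(f"token:{label}:{value.strip().lower()}")
    return tokens
```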
Comment 4 Daniel Quinlan 2003-05-02 13:56:20 UTC
Subject: Re:  autolearning discriminator should be replaced for 2.60

> However, let me point you to my earlier (long) comments in bug #1589

Please create a new bug for this idea (which is interesting).  You don't
need to attach it to every barely-related bug in the database to get
someone to read it.

Daniel

Comment 5 Marc Perkel 2003-05-03 09:23:39 UTC
Subject: Re: [SAdev]  autolearning discriminator should be replaced
 for 2.60

I can see the logic in considering AWL to be redundant. My thinking is 
that AWL was based on the From address, and spammers can put anything in 
there for a From address. Most spam with links in it has a bogus From 
address anyhow.

As to the AWL concept, it seems like something more identifying than 
the From address would be the Received lines - the hosts where spam 
comes from - and what spam links to. Those take a lot more trouble 
to fake.

Comment 6 Sidney Markowitz 2003-05-28 16:24:37 UTC
I would like to second the original proposal in this bug. My ISP does not yet 
have an interface to sa-learn, only spamc to a server dedicated to 
SpamAssassin/spamd. To get Bayes rules to work I increased the autolearn ham 
threshold to 2.0 (based on noticing that I am not getting false negatives with 
lower scores than that) until I can gather 200 autolearned hams.

Instead of maintaining a history of non-bayes scores to find certain 
percentile thresholds, would it work as well to simply update an average ham 
and average spam score? Or also approximate values for standard deviation? I 
don't see why it needs to be so exact as to keep a full history and use exact 
percentile thresholds.
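The running-average alternative suggested above can indeed avoid keeping a full history: Welford's online algorithm tracks a mean and variance in constant memory. The class below is a sketch of that idea, not existing SpamAssassin code, and the thresholding policy mentioned afterwards is an assumption:

```python
class RunningStats:
    """Track the mean and variance of a score stream online
    (Welford's algorithm), without storing the scores themselves."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

One instance each for ham and spam scores gives the averages (and standard deviations) the comment asks about; a policy could then, say, autolearn a message as ham only when its score falls at least one standard deviation below the running ham mean.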
Comment 7 Theo Van Dinter 2003-06-12 15:46:54 UTC
Since it looks like there's been no work done on this, I'm punting it to 2.70.
Comment 8 Sidney Markowitz 2003-06-16 19:25:49 UTC
*** Bug 2073 has been marked as a duplicate of this bug. ***
Comment 9 Justin Mason 2004-01-30 20:07:58 UTC
Lowering priority on this.  Face it, the AWL works ;)
Comment 10 Daniel Quinlan 2004-03-12 22:08:32 UTC
*** Bug 2229 has been marked as a duplicate of this bug. ***
Comment 11 Daniel Quinlan 2004-08-13 22:37:05 UTC
re-raising priority, this bug is not about AWL despite off-topic comments from
people
Comment 12 Daniel Quinlan 2004-11-30 21:11:10 UTC
*** Bug 1722 has been marked as a duplicate of this bug. ***
Comment 13 Daniel Quinlan 2005-03-30 01:08:26 UTC
move bug to Future milestone (previously set to Future -- I hope)
Comment 14 Theo Van Dinter 2006-12-30 19:39:04 UTC
Ok, nothing has ever happened with this ticket, so closing.  This kind of thing
can be done via a plugin now anyway. :)