Bug 6828 - Adjust default autolearn settings to reduce Bayesian mistraining under default configuration
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: PC All
Importance: P2 normal
Target Milestone: 3.4.1
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2012-08-15 22:18 UTC by John Hardin
Modified: 2015-04-07 13:26 UTC

Description John Hardin 2012-08-15 22:18:40 UTC
Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3

If autolearning is enabled by default (which is a good idea) then the system should have very conservative defaults to reduce the possibility that spams will be learned as hams. It's better to take longer to get a corpus sufficient to enable Bayes analysis than it is to autolearn messages improperly.

See users list 2012-08-15 "Very spammy messages yield BAYES_00"
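
As a rough illustration of what the proposed change does, the sketch below models the autolearn decision as a comparison against the two thresholds. The function name autolearn_decision is hypothetical, the boundary handling (strictly below for ham, at or above for spam) is an assumption, and the model ignores every other check SpamAssassin makes before autolearning. It is a simplified sketch, not SpamAssassin's implementation.

```python
def autolearn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Return "ham", "spam", or None for a message's autolearn score.

    Hypothetical helper: a simplified model of the two-threshold decision.
    Defaults mirror the values discussed in this bug (0.1 and 12.0).
    """
    if score < nonspam_threshold:
        return "ham"    # low enough to be learned as ham
    if score >= spam_threshold:
        return "spam"   # high enough to be learned as spam
    return None         # between the thresholds: not learned at all


if __name__ == "__main__":
    # Compare the current default (0.1) with the proposed value (-3).
    for score in (-3.5, -0.5, 5.0, 15.0):
        current = autolearn_decision(score)
        proposed = autolearn_decision(score, nonspam_threshold=-3.0)
        print(score, current, proposed)
```

Under this model, a message scoring -0.5 is learned as ham with the 0.1 threshold but left alone with -3, which is the more conservative behaviour the proposal is after.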
Comment 1 Darxus 2012-08-15 22:22:56 UTC
Has anyone ever actually done any testing on autolearning to verify it helps or determine optimal thresholds?
Comment 2 John Hardin 2012-08-15 22:26:56 UTC
(In reply to comment #1)
> Has anyone ever actually done any testing on autolearning to verify it helps
> or determine optimal thresholds?

No idea. -3 was a WAG.
Comment 3 RW 2012-08-15 22:41:43 UTC
Bear in mind that hardly any default nice rules contribute to autolearning. All the contributing rules with non-negligible scores are DNS whitelists, the very thing that created the problem in the users list thread in the first place.

See also bug 6344
Comment 4 AXB 2012-08-16 06:25:26 UTC
(In reply to comment #0)
> Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3
> 
> If autolearning is enabled by default (which is a good idea) then the system
> should have very conservative defaults to reduce the possibility that spams
> will be learned as hams. It's better to take longer to get a corpus
> sufficient to enable Bayes analysis than it is to autolearn messages
> improperly.
> 
> See users list 2012-08-15 "Very spammy messages yield BAYES_00"

+1 on this.
Comment 5 AXB 2012-08-16 06:27:51 UTC
(In reply to comment #1)
> Has anyone ever actually done any testing on autolearning to verify it helps
> or determine optimal thresholds?

Tested and using -4 and autolearn only (no manual training) on a very mixed user base, and site-wide Bayes has been very reliable.
Comment 6 RW 2012-08-16 11:19:32 UTC
(In reply to comment #5)

> Tested and using -4 and autolearn only (no manual training) on a very mixed
> user base, and site-wide Bayes has been very reliable.

The trouble with making ham autolearning dependent on DNS whitelists is that the training can change dramatically with the scores of those rules. If you started training a while ago when RCVD_IN_DNSWL_MED scored -4, then you will have trained on a much wider selection than if you start over now. Currently you'll be reliant on RCVD_IN_DNSWL_HI and combinations like RCVD_IN_DNSWL_MED+RCVD_IN_RP_CERTIFIED, which will mean mostly autogenerated mail from companies like Amazon, and direct marketing mail, but probably almost no person-to-person mail.

Also, if someone turns off DNS whitelists, they won't learn any ham at all.
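
To make the dependency on rule scores concrete, here is a minimal sketch that sums the scores of the rules a message hits and checks the total against a ham autolearn threshold. Only the old RCVD_IN_DNSWL_MED score of -4 comes from this comment. Every other score, the -3.0 threshold, and the helper name autolearns_as_ham are placeholders chosen for illustration, not actual rule scores.

```python
def autolearns_as_ham(hits, rule_scores, nonspam_threshold=-3.0):
    """Hypothetical helper: True if the summed rule scores fall below the threshold."""
    total = sum(rule_scores[name] for name in hits)
    return total < nonspam_threshold


# Illustrative scores only; apart from the old MED score, these are assumptions.
OLD_SCORES = {
    "RCVD_IN_DNSWL_MED": -4.0,
    "RCVD_IN_DNSWL_HI": -5.0,
    "RCVD_IN_RP_CERTIFIED": -3.0,
}
NEW_SCORES = dict(OLD_SCORES, RCVD_IN_DNSWL_MED=-2.3)

if __name__ == "__main__":
    cases = [
        ["RCVD_IN_DNSWL_MED"],
        ["RCVD_IN_DNSWL_HI"],
        ["RCVD_IN_DNSWL_MED", "RCVD_IN_RP_CERTIFIED"],
        [],  # no whitelist hits, e.g. DNS whitelists turned off
    ]
    for hits in cases:
        print(hits, autolearns_as_ham(hits, OLD_SCORES), autolearns_as_ham(hits, NEW_SCORES))
```

With these placeholder values, a message that hits only RCVD_IN_DNSWL_MED is learned as ham under the old score but not the new one, and a message with no whitelist hits is never learned as ham.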
Comment 7 AXB 2012-08-16 11:55:35 UTC
(In reply to comment #6)
> (In reply to comment #5)
> 
> > Tested and using -4 and autolearn only (no manual training) on a very mixed
> > user base, and site-wide Bayes has been very reliable.
> 
> The trouble with making ham autolearning dependent on DNS whitelists is that
> the training can change dramatically with the scores of those rules. If you
> started training a while ago when RCVD_IN_DNSWL_MED scored -4, then you
> will have trained on a much wider selection than if you start over now.
> Currently you'll be reliant on RCVD_IN_DNSWL_HI and combinations like
> RCVD_IN_DNSWL_MED+RCVD_IN_RP_CERTIFIED, which will mean mostly autogenerated
> mail from companies like Amazon, and direct marketing mail, but probably
> almost no person-to-person mail.
> 
> Also, if someone turns off DNS whitelists, they won't learn any ham at all.

FWIW: On my systems all DNS whitelist/certifiers/SPF/DKIM are disabled.
(I don't trust third parties/keys for WLing)
Autolearning ham has never been an issue on a mixed language system.
(in the last 8 years, I have never fed Bayes manually )
Comment 8 RW 2012-08-16 11:59:35 UTC
(In reply to comment #7)

> FWIW: On my systems all DNS whitelist/certifiers/SPF/DKIM are disabled.
> (I don't trust third parties/keys for WLing)
> Autolearning ham has never been an issue on a mixed language system.
> (in the last 8 years, I have never fed Bayes manually )

Then how do you get to -4?
Comment 9 AXB 2012-08-16 12:24:31 UTC
(In reply to comment #8)
> (In reply to comment #7)
> 
> > FWIW: On my systems all DNS whitelist/certifiers/SPF/DKIM are disabled.
> > (I don't trust third parties/keys for WLing)
> > Autolearning ham has never been an issue on a mixed language system.
> > (in the last 8 years, I have never fed Bayes manually )
> 
> Then how do you get to -4?

from production settings:

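# Enable Bayes and autolearning; automatic token expiry is disabled
use_bayes 1
bayes_auto_learn 1
bayes_auto_expire 0

# Minimum numbers of learned ham and spam before Bayes results are used
bayes_min_ham_num 200
bayes_min_spam_num 200

# Autolearn thresholds for ham (nonspam) and spam
bayes_auto_learn_threshold_nonspam -3.0
bayes_auto_learn_threshold_spam 20.0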

"it just works"
Comment 10 Kevin A. McGrail 2012-08-16 12:55:22 UTC
Based on anecdotal complaints, it does seem that lowering the threshold for learning as ham makes sense, to try to avoid FNs slipping through. I think this is also being extrapolated to a spam threshold change.

Does anyone have suggestions for a testing protocol that might help decide the defaults? If I am thinking correctly, masscheck scoring is designed not to mark spam as ham or ham as spam, so the minimum sensible threshold for learning as spam is the spam threshold itself. That suggests 12.0 was an experienced guess, inflated for real-world safety, rather than a measured value.

Going further, my system is configured for 6.0 instead of 5.0, with a lot of single-fire rules and things that focus on scoring ham. That doesn't make it a good source of project-wide data concerning auto-learning thresholds.

In fact, I'm wondering a bit whether a default setup can score below zero very often, and whether we are now going to skew Bayes towards only certain classifications of ham.

And in the end, none of our tweaked system data and configuration are relevant to this discussion.


Looking at the thresholds, we really need a scientific approach based on the DEFAULT configurations to continue this discussion.

bayes_auto_learn_threshold_nonspam n.nn   (default: 0.1)
bayes_auto_learn_threshold_spam n.nn      (default: 12.0)
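
One possible shape for such a test, sketched under loud assumptions: take messages with known labels and their autolearn scores from a default configuration, then count for each candidate ham threshold how many hams would be learned and how many spams would be mislearned as ham. The function name sweep_ham_threshold is hypothetical, and the scores below are made-up toy values used only to exercise the code. They are not masscheck data.

```python
def sweep_ham_threshold(messages, thresholds):
    """Hypothetical helper for comparing candidate ham autolearn thresholds.

    messages:   list of (score, is_spam) pairs with known labels
    thresholds: candidate values for bayes_auto_learn_threshold_nonspam
    returns:    {threshold: (ham_learned, spam_mislearned_as_ham)}
    """
    results = {}
    for threshold in thresholds:
        # Labels of every message that would be autolearned as ham.
        learned = [is_spam for score, is_spam in messages if score < threshold]
        spam_mislearned = sum(learned)
        results[threshold] = (len(learned) - spam_mislearned, spam_mislearned)
    return results


# Toy data only: (score, is_spam). Invented for illustration.
TOY_MESSAGES = [
    (-5.2, False), (-3.4, False), (-1.0, False), (-0.4, True),
    (0.0, False), (0.05, True), (2.5, False), (7.0, True),
]

if __name__ == "__main__":
    for threshold, counts in sweep_ham_threshold(TOY_MESSAGES, [0.1, -1.0, -3.0]).items():
        print(threshold, counts)
```

A real evaluation would need scores produced by the default rule set on a labelled corpus. The sketch only shows the trade-off being argued about: a lower threshold learns fewer spams as ham, but also fewer hams.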

And, in the end, I wonder also if we are missing turning on bayes_auto_learn_on_error as a default.  I think for 3.4.0 turning this setting on and losing the backwards compatibility makes sense.

Regards,
KAM
Comment 11 John Hardin 2012-08-16 13:15:54 UTC
(In reply to comment #6)
> 
> Also, if someone turns off DNS whitelists, they won't learn any ham at all.

I'd point out the object of this exercise is to keep an unconfigured or minimally-configured SA install from going off the rails. If the admin is involved enough to be disabling DNSWL lookups, they are likely involved enough to look at and tune the autolearn settings, especially if given guidance in the wiki.
Comment 12 Mark Martinec 2012-08-16 14:59:39 UTC
> Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3

-1, I do not agree.

In 2007 we had to bump the ham threshold from -1 to 0.1
to widen a too narrow view on ham.

See Bug 5497 (and its predecessor Bug 5257).
Comment 13 Kevin A. McGrail 2012-08-17 16:14:16 UTC
(In reply to comment #12)
> > Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3
> 
> -1, I do not agree.
> 
> In 2007 we had to bump the ham threshold from -1 to 0.1
> to widen a too narrow view on ham.
> 
> See Bug 5497 (and its predecessor Bug 5257).

Agreed. As mentioned above, "none of our tweaked system data and configuration are relevant to this discussion."

I think bug 5497 remains open, and this should really be marked as a duplicate.

But perhaps we could use some additional information in the wiki to help admins? John, what do you think of that?
Comment 14 Kevin A. McGrail 2013-06-21 16:28:54 UTC
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target
Comment 15 Kevin A. McGrail 2015-04-07 13:26:10 UTC
Closing as won't fix.  Perhaps better for a Wiki or Readme entry about Bayes tweaks.