SA Bugzilla – Bug 6828
Adjust default autolearn settings to reduce Bayesian mistraining under default configuration
Last modified: 2015-04-07 13:26:10 UTC
Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3.

If autolearning is enabled by default (which is a good idea) then the system should have very conservative defaults to reduce the possibility that spams will be learned as hams. It's better to take longer to get a corpus sufficient to enable Bayes analysis than it is to autolearn messages improperly.

See users list 2012-08-15 "Very spammy messages yield BAYES_00"
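To make the proposal concrete, here is a minimal sketch of the threshold decision being discussed. It is an illustration, not SpamAssassin's actual code: the function name is hypothetical, the strict/inclusive comparisons are an assumption, and it ignores the question of which rules are allowed to count toward the autolearn score. The default values 0.1 and 12.0 are the ones quoted in this bug.

```python
from typing import Optional


def autolearn_decision(score: float,
                       nonspam_threshold: float = 0.1,
                       spam_threshold: float = 12.0) -> Optional[str]:
    """Return "ham", "spam", or None (not learned) for a message score.

    Simplified assumption: only the two thresholds are consulted.
    """
    if score < nonspam_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None  # score falls between the thresholds: do not learn


# A message scoring -0.5 is learned as ham under the current default,
# but is skipped under the proposed -3 threshold.
print(autolearn_decision(-0.5))
print(autolearn_decision(-0.5, nonspam_threshold=-3.0))
```

Lowering the ham threshold narrows the set of messages learned as ham; it does not change how spam is learned.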
Has anyone ever actually done any testing on autolearning to verify it helps or determine optimal thresholds?
(In reply to comment #1)
> Has anyone ever actually done any testing on autolearning to verify it helps
> or determine optimal thresholds?

No idea. -3 was a WAG.
Bear in mind that hardly any default nice rules contribute to autolearning. All the contributing rules with non-negligible scores are DNS whitelists, the very thing that created the problem in the user list thread in the first place. See also bug 6344.
(In reply to comment #0)
> Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3
>
> If autolearning is enabled by default (which is a good idea) then the system
> should have very conservative defaults to reduce the possibility that spams
> will be learned as hams. It's better to take longer to get a corpus
> sufficient to enable Bayes analysis than it is to autolearn messages
> improperly.
>
> See users list 2012-08-15 "Very spammy messages yield BAYES_00"

+1 on this.
(In reply to comment #1)
> Has anyone ever actually done any testing on autolearning to verify it helps
> or determine optimal thresholds?

Tested and using -4 with autolearn only (no manual training) on a very mixed user base, and site-wide Bayes has been very reliable.
(In reply to comment #5)
> tested and using -4 and autolearn only (no manual training) on a very mixed
> user base and site wide Bayes has been very reliable..

The trouble with making ham autolearning dependent on DNS whitelists is that the training can change dramatically with the scores of those rules. If you started training a while ago, when RCVD_IN_DNSWL_MED scored -4, then you will have trained on a much wider selection than if you start over now. Currently you'll be reliant on RCVD_IN_DNSWL_HI and combinations like RCVD_IN_DNSWL_MED+RCVD_IN_RP_CERTIFIED, which will mean mostly autogenerated mail from companies like Amazon, and direct marketing mail, but probably almost no person-to-person mail.

Also, if someone turns off DNS whitelists they won't learn any ham at all.
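The dependence on rule scores can be shown with a small sketch. It assumes a simplified model in which the autolearn score is just the sum of the scores of the rules that hit, and ham is learned when that sum is below the threshold. The function name is hypothetical, the -4 value is the older RCVD_IN_DNSWL_MED score mentioned in the comment above, and the lower value is a made-up stand-in for a later, weaker score.

```python
from typing import Dict, List


def qualifies_for_ham_autolearn(hits: List[str],
                                scores: Dict[str, float],
                                threshold: float) -> bool:
    """True if the summed rule scores fall below the ham threshold.

    Simplified assumption: every hit rule counts toward the autolearn score.
    """
    total = sum(scores[name] for name in hits)
    return total < threshold


HITS = ["RCVD_IN_DNSWL_MED"]
older_scores = {"RCVD_IN_DNSWL_MED": -4.0}  # score cited in the comment
later_scores = {"RCVD_IN_DNSWL_MED": -2.3}  # hypothetical lower score

# Same message, same rule hit, same -3 threshold; only the rule score differs.
print(qualifies_for_ham_autolearn(HITS, older_scores, -3.0))
print(qualifies_for_ham_autolearn(HITS, later_scores, -3.0))
```

Under these assumptions the same message is learned as ham with the older score and ignored with the later one, so the trained corpus shifts whenever the whitelist scores are retuned.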
(In reply to comment #6)
> (In reply to comment #5)
> > tested and using -4 and autolearn only (no manual training) on a very mixed
> > user base and site wide Bayes has been very reliable..
>
> The trouble with making ham autolearning dependent on DNS whitelists is that
> the training can change dramatically with the scores of those rules. If you
> started training a while ago when RCVD_IN_DNSWL_MED scored -4, then you
> will have trained on a much wider selection than if you start over now.
> Currently you'll be reliant on RCVD_IN_DNSWL_HI and combinations like
> RCVD_IN_DNSWL_MED+RCVD_IN_RP_CERTIFIED, which will mean mostly autogenerated
> mail from companies like Amazon, and direct marketing mail, but probably
> almost no person to person mail.
>
> Also if someone turns off DNS whitelists they won't learn any ham at all.

FWIW: On my systems all DNS whitelists/certifiers/SPF/DKIM are disabled. (I don't trust third parties/keys for WLing.) Autolearning ham has never been an issue on a mixed-language system. (In the last 8 years, I have never fed Bayes manually.)
(In reply to comment #7)
> FWIW: On my systems all DNS whitelist/certifiers/SPF/DKIM are disabled.
> (I don't trust third parties/keys for WLing)
> Autolearning ham has never been an issue on a mixed language system.
> (in the last 8 years, I have never fed Bayes manually )

Then how do you get to -4?
(In reply to comment #8)
> (In reply to comment #7)
> > FWIW: On my systems all DNS whitelist/certifiers/SPF/DKIM are disabled.
> > (I don't trust third parties/keys for WLing)
> > Autolearning ham has never been an issue on a mixed language system.
> > (in the last 8 years, I have never fed Bayes manually )
>
> Then how do you get to -4?

From production settings:

    use_bayes 1
    bayes_auto_learn 1
    bayes_auto_expire 0
    bayes_min_ham_num 200
    bayes_min_spam_num 200
    bayes_auto_learn_threshold_nonspam -3.0
    bayes_auto_learn_threshold_spam 20.0

"it just works"
It does seem that lowering the threshold for learning as ham makes sense, to try to avoid any FNs slipping through, based on anecdotal complaints. I think this is also being extrapolated to a spam threshold change as well.

Anyone have suggestions on a testing protocol that might help decide the defaults?

If I am thinking correctly, if we used masscheck data, the scoring is designed not to mark spam as ham and ham as spam. So the minimum threshold should be the spam threshold. This means that 12.0 was chosen more or less arbitrarily, as an experienced guess for a number inflated for real-world safety.

Going further, my system is configured for 6.0 instead of 5.0, with a lot of single-fire rules and things that focus on scoring ham. So it isn't a good source of project-wide data concerning auto-learning thresholds. In fact, I'm wondering a bit whether a default setup can score below zero very often, and whether we are now going to skew Bayes towards only certain classifications of ham.

And in the end, none of our tweaked system data and configuration are relevant to this discussion. Looking at the thresholds, we really need a scientific approach based on the DEFAULT configurations to continue this discussion.

    bayes_auto_learn_threshold_nonspam n.nn (default: 0.1)
    bayes_auto_learn_threshold_spam n.nn (default: 12.0)

Finally, I also wonder if we are missing turning on bayes_auto_learn_on_error as a default. I think for 3.4.0 turning this setting on and losing the backwards compatibility makes sense.

Regards,
KAM
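One possible shape for such a testing protocol, offered only as a sketch: take a hand-labeled corpus scored under the DEFAULT configuration, sweep candidate ham thresholds, and count how many hams would be learned and how many spams would be mislearned as ham at each one. The function name and the sample data below are hypothetical; the numbers are made up for illustration and are not masscheck results.

```python
from typing import Dict, List, Tuple


def sweep(messages: List[Tuple[float, bool]],
          thresholds: List[float]) -> Dict[float, Dict[str, int]]:
    """For each candidate ham threshold, count learned hams and mislearned spams.

    Each message is a (score, is_spam) pair; a message is treated as
    autolearned ham when its score is below the threshold (an assumption).
    """
    results = {}
    for t in thresholds:
        learned = [is_spam for score, is_spam in messages if score < t]
        results[t] = {
            "ham_learned": learned.count(False),
            "spam_mislearned": learned.count(True),
        }
    return results


# Made-up (score, is_spam) pairs, for illustration only.
SAMPLE = [(-5.0, False), (-3.5, False), (-1.5, False), (-0.2, True),
          (0.0, False), (2.0, False), (6.0, True), (14.0, True)]

for threshold, counts in sweep(SAMPLE, [0.1, -1.0, -3.0]).items():
    print(threshold, counts)
```

The trade-off raised in this bug shows up directly in the two counters: a lower threshold reduces spam mislearned as ham, but also shrinks and narrows the set of ham that gets learned.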
(In reply to comment #6)
> Also if someone turns off DNS whitelists they won't learn any ham at all.

I'd point out the object of this exercise is to keep an unconfigured or minimally-configured SA install from going off the rails. If the admin is involved enough to be disabling DNSWL lookups, they are likely involved enough to look at and tune the autolearn settings, especially if given guidance in the wiki.
> Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3

-1, I do not agree.

In 2007 we had to bump the ham threshold from -1 to 0.1 to widen a too narrow view on ham.

See Bug 5497 (and its predecessor Bug 5257).
(In reply to comment #12)
> > Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3
>
> -1, I do not agree.
>
> In 2007 we had to bump the ham threshold from -1 to 0.1
> to widen a too narrow view on ham.
>
> See Bug 5497 (and its predecessor Bug 5257).

Agreed. As mentioned above, "none of our tweaked system data and configuration are relevant to this discussion."

I think bug 5497 remains open and this should really be marked as a duplicate. But perhaps we could use some additional information in the wiki to help admins? John, what do you think of that?
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target
Closing as won't fix. Perhaps better for a Wiki or Readme entry about Bayes tweaks.