Bug 5257 - RFE: adaptive autolearning thresholds
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
Depends on: 5270
Reported: 2006-12-29 09:48 UTC by Justin Mason
Modified: 2019-09-25 03:33 UTC (History)

Description Justin Mason 2006-12-29 09:48:22 UTC
I think we need to reduce the frequency of autolearning mails as ham; it doesn't
seem to cause major trouble for me at least, but anecdotally it's not good.
Worth investigating in the 3.2.0 mass-check/rescoring, anyway.
Comment 1 Justin Mason 2007-01-14 06:43:26 UTC
this takes place after the perceptron run
Comment 2 Justin Mason 2007-02-24 13:09:08 UTC
OK, I've set the autolearn ham threshold to -1.0, which collects 1.21% of
ham. The autolearn spam threshold is then 12.0, for 81% of spam.
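For illustration only, here is a rough sketch of how coverage figures like these could be computed from a list of per-message scores. This is not the mass-check tooling; the function names are hypothetical, and the comparisons used (ham learned at or below its threshold, spam learned at or above its threshold) are assumptions rather than SpamAssassin's confirmed behaviour.

```python
def fraction_learned_as_ham(ham_scores, threshold):
    """Fraction of known-ham messages scoring at or below the ham threshold.

    Assumption: a message is autolearned as ham when score <= threshold.
    """
    if not ham_scores:
        raise ValueError("need at least one ham score")
    learned = sum(1 for score in ham_scores if score <= threshold)
    return learned / len(ham_scores)


def fraction_learned_as_spam(spam_scores, threshold):
    """Fraction of known-spam messages scoring at or above the spam threshold.

    Assumption: a message is autolearned as spam when score >= threshold.
    """
    if not spam_scores:
        raise ValueError("need at least one spam score")
    learned = sum(1 for score in spam_scores if score >= threshold)
    return learned / len(spam_scores)
```

Run over a real corpus, `fraction_learned_as_ham(scores, -1.0)` would give the share of ham collected at the -1.0 setting, under the assumptions stated above.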
Comment 3 Justin Mason 2007-02-24 13:11:09 UTC
Comment 4 Sidney Markowitz 2007-06-07 11:10:26 UTC
After reading the comments in bug 5497 and its talk about the complaints on the
user list about Bayes performance after this change, and reading over the
comments here that show that this change was made based on the supposition that
it was needed without clear statistics, I propose that this be reverted in time
for the 3.2.1 release.
Comment 5 Justin Mason 2007-06-07 11:38:41 UTC
ok; +1

I still think we're probably allowing too much spam to be autolearned as ham, but if
the fact that too little ham is being learned is having bad effects in itself
that are worse than that, we can revert to the 3.1.x behaviour.
Comment 6 Daryl C. W. O'Shea 2007-06-07 12:26:48 UTC
Comment 7 Sidney Markowitz 2007-06-07 12:56:37 UTC
Committed to branch 3.2 revision 545281.
Committed to trunk revision 545287.

Comment 8 Sidney Markowitz 2007-06-09 11:44:14 UTC
I'm reopening this because if there was a reason to open this in the first
place, then that reason still exists now that we reverted what was supposed to
fix it.

I think that we should consider how to have an adaptive autolearning threshold
based on sampling a configurable percentage of the best configurable percentage
of the ham and spam. To clarify: Identify the threshold score that gives us the
lowest scoring X% of the ham, then autolearn Y% of those hams. X is set at a
value which is unlikely to result in spam being learned as ham. Y is
configurable in case the volume of mail is too high to learn everything that is
below the threshold, but allows us to learn a representative sample of ham, not
just the very lowest scoring. That protects against an effect such as all mail
of a certain type triggering a 1.0 score rule and then Bayes incorrectly
learning that mail of that type is always spam.
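
A minimal sketch of the proposal above, assuming a list of recent ham scores is available to derive the threshold from. Nothing here is existing SpamAssassin code: `ham_threshold`, `should_autolearn_ham`, `x_percent` and `y_percent` are hypothetical names standing in for the X and Y described in this comment, and random sampling is just one possible way to pick the Y%.

```python
import math
import random


def ham_threshold(ham_scores, x_percent):
    """Return the score at or below which the lowest-scoring X% of ham falls."""
    if not ham_scores:
        raise ValueError("need at least one ham score")
    if not 0 < x_percent <= 100:
        raise ValueError("x_percent must be in (0, 100]")
    ordered = sorted(ham_scores)
    # Number of messages making up the lowest X% (always at least one).
    count = max(1, math.floor(len(ordered) * x_percent / 100))
    return ordered[count - 1]


def should_autolearn_ham(score, threshold, y_percent, rng=random):
    """Decide whether to learn a message as ham.

    Messages above the threshold are never learned; of those at or below
    it, a random Y% sample is learned (hypothetical sampling strategy).
    """
    if score > threshold:
        return False
    return rng.random() < y_percent / 100
```

For example, with ten ham scores from -3 to 6 and X = 30, the threshold comes out at -1; with Y = 50, roughly half of the messages scoring at or below -1 would then be learned. Sampling across everything under the threshold, rather than taking only the lowest scores, is what keeps the learned set representative.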
Comment 9 Justin Mason 2007-08-12 06:38:57 UTC
3.2.3 was released without these fixed, moving to 3.2.3
Comment 10 Justin Mason 2007-08-12 06:39:13 UTC
er, 3.2.4. ;)
Comment 11 Justin Mason 2007-12-05 02:06:13 UTC
no movement -> pushing out to 3.3.0, optimistically
Comment 12 Justin Mason 2009-06-29 04:27:05 UTC
pushing out further