|Summary:||RFE: adaptive autolearning thresholds|
|Product:||Spamassassin||Reporter:||Justin Mason <jm>|
|Component:||Rules||Assignee:||SpamAssassin Developer Mailing List <dev>|
|Version:||SVN Trunk (Latest Devel Version)|
|Bug Depends on:||5270|
Description Justin Mason 2006-12-29 09:48:22 UTC
I think we need to reduce how often mail is autolearned as ham. It doesn't seem to cause major trouble for me, at least, but anecdotally it's not good. Worth investigating in the 3.2.0 mass-check/rescoring, anyway.
Comment 1 Justin Mason 2007-01-14 06:43:26 UTC
this takes place after the perceptron run
Comment 2 Justin Mason 2007-02-24 13:09:08 UTC
OK, I've set the autolearn ham threshold to -1.0, which collects 1.21% of ham. autolearn spam threshold is then 12.0, for 81% of spam.
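For anyone wanting to reproduce percentages like these from their own corpus, here is a minimal sketch of the arithmetic. It assumes you already have a list of per-message scores for hand-classified ham and spam; the function names are hypothetical, and it ignores any additional conditions the real autolearn logic applies beyond a single score comparison.

```python
def ham_fraction_learned(ham_scores, threshold):
    """Fraction of ham scoring below the ham autolearn threshold.

    Simplifying assumption: a message is autolearned as ham purely
    because its score is below `threshold`.
    """
    if not ham_scores:
        return 0.0
    learned = sum(1 for score in ham_scores if score < threshold)
    return learned / len(ham_scores)


def spam_fraction_learned(spam_scores, threshold):
    """Fraction of spam scoring at or above the spam autolearn threshold.

    Same simplifying assumption as above, with the comparison reversed.
    """
    if not spam_scores:
        return 0.0
    learned = sum(1 for score in spam_scores if score >= threshold)
    return learned / len(spam_scores)
```

Running both functions over a mass-check corpus for a range of candidate thresholds would show how much ham and spam each setting collects.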
Comment 3 Justin Mason 2007-02-24 13:11:09 UTC
Comment 4 Sidney Markowitz 2007-06-07 11:10:26 UTC
After reading the comments in bug 5497, which discuss the complaints on the user list about Bayes performance after this change, and reading over the comments here, which show that the change was made on the supposition that it was needed, without clear statistics, I propose that it be reverted in time for the 3.2.1 release.
Comment 5 Justin Mason 2007-06-07 11:38:41 UTC
OK; +1. I still think we're probably autolearning too much spam as ham, but if learning too little ham is itself having bad effects that are worse than that, we can revert to the 3.1.x behaviour.
Comment 6 Daryl C. W. O'Shea 2007-06-07 12:26:48 UTC
Comment 7 Sidney Markowitz 2007-06-07 12:56:37 UTC
Committed to branch 3.2 revision 545281. Committed to trunk revision 545287.
Comment 8 Sidney Markowitz 2007-06-09 11:44:14 UTC
I'm reopening this because if there was a reason to open this in the first place, then that reason still exists now that we have reverted what was supposed to fix it.

I think that we should consider how to have an adaptive autolearning threshold, based on sampling a configurable percentage of the best-scoring configurable percentage of the ham and spam. To clarify: identify the threshold score that gives us the lowest scoring X% of the ham, then autolearn Y% of those hams.

X is set at a value which is unlikely to result in spam being learned as ham. Y is configurable in case the volume of mail is too high to learn everything that is below the threshold, but still allows us to learn a representative sample of ham, not just the very lowest scoring. That protects against an effect such as all mail of a certain type triggering a 1.0 score rule and then Bayes incorrectly learning that mail of that type is always spam.
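As a rough illustration of the X/Y scheme described above, here is a minimal sketch for the ham side. It assumes a list of recent ham scores is available to derive the threshold from; the function names, the percentile rounding, and the use of uniform random sampling for Y are all assumptions for illustration, not anything implemented in SpamAssassin.

```python
import math
import random


def adaptive_ham_threshold(ham_scores, x_percent):
    """Return the score at or below which the lowest-scoring
    x_percent of the given ham scores fall."""
    if not ham_scores:
        raise ValueError("need at least one ham score")
    if not 0 < x_percent <= 100:
        raise ValueError("x_percent must be in (0, 100]")
    ordered = sorted(ham_scores)
    # Always keep at least one message so the threshold is defined.
    count = max(1, math.floor(len(ordered) * x_percent / 100))
    return ordered[count - 1]


def should_autolearn_ham(score, threshold, y_percent, rng=random):
    """Decide whether to autolearn one message as ham.

    Messages above the threshold are never learned; of those at or
    below it, roughly y_percent are picked at random.
    """
    if score > threshold:
        return False
    return rng.random() < y_percent / 100
```

Sampling at random across everything under the threshold, rather than taking only the lowest scores, is what gives the representative sample the proposal asks for. The spam side would mirror this with the highest-scoring X% of spam.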
Comment 9 Justin Mason 2007-08-12 06:38:57 UTC
3.2.3 was released without these fixed, moving to 3.2.3
Comment 10 Justin Mason 2007-08-12 06:39:13 UTC
er, 3.2.4. ;)
Comment 11 Justin Mason 2007-12-05 02:06:13 UTC
no movement -> pushing out to 3.3.0, optimistically
Comment 12 Justin Mason 2009-06-29 04:27:05 UTC
pushing out further