Bug 5062

Summary: RFE: an autolearn plugin to implement "SA-Unsupervised"
Product: SpamAssassin
Reporter: Justin Mason <jm>
Component: Learner
Assignee: SpamAssassin Developer Mailing List <dev>
Status: NEW
Severity: enhancement
CC: antispam
Priority: P5
Version: 3.1.4
Target Milestone: Future
Hardware: Other
OS: other
Whiteboard:

Description Justin Mason 2006-08-23 09:08:12 UTC
In http://plg.uwaterloo.ca/~gvcormac/spamcormack06.pdf and earlier TREC
evaluation papers, Gordon Cormack discusses "SA-Unsupervised", a
configuration of SpamAssassin that uses a simple auto-learning procedure:

> SA-Unsupervised. SpamAssassin 2.3 (Unsupervised automated feedback.)
> filter-train is invoked after every message, but with SpamAssassin’s
> output classification rather than the gold standard; that is, its own
> judgement is fed back to itself as if it were the gold standard.

SA-Unsupervised outperforms SA-Standard -- the default auto-learning regime --
quite frequently ;) Given this, I think we should offer an autolearn plugin
which implements SA-Unsupervised autolearning, as an off-by-default option.
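
For concreteness, here is a minimal sketch of that feedback loop in
Python (classify() and train() are hypothetical stand-ins for the
filter's scoring and learning steps, not real SpamAssassin APIs):

    SPAM_THRESHOLD = 5.0  # SpamAssassin's default require_score

    def process_stream(messages, classify, train):
        """classify(msg) -> score; train(msg, is_spam) updates the learner."""
        for msg in messages:
            is_spam = classify(msg) >= SPAM_THRESHOLD
            # Unsupervised step: feed the filter's own verdict back to
            # the learner as if it were the gold standard, right or wrong.
            train(msg, is_spam)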
Comment 1 Justin Mason 2006-12-05 06:44:52 UTC
unlikely to happen in 3.2.0
Comment 2 Mark Martinec 2009-11-11 11:44:22 UTC
unlikely to happen in 3.3.0
Comment 3 Adam Katz 2009-11-12 15:13:54 UTC
(In reply to comment #0)
> in http://plg.uwaterloo.ca/~gvcormac/spamcormack06.pdf and earlier [...]

From the paper (as cited above),
>> SA-Unsupervised. SpamAssassin 2.3 (Unsupervised automated feedback.)
>> filter-train is invoked after every message, but with SpamAssassin’s
>> output classification rather than the gold standard; that is, its own
>> judgement is fed back to itself as if it were the gold standard.

Responding to the bug subject,
> Bug 5062 - RFE: an autolearn plugin to implement "SA-Unsupervised"  

Assuming a threshold of 5.0, that would be:

    bayes_auto_learn_threshold_nonspam 4.999
    bayes_auto_learn_threshold_spam    5.000

No code would be needed unless you want those values to track require_score dynamically.  However, I consider this dangerous; read on.
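
To make the mechanics concrete, here is a simplified sketch of the
threshold decision (an assumption-laden simplification: the real
autolearn code applies extra guards, e.g. minimum points from header and
body rules, which are omitted here):

    def autolearn(score, nonspam_threshold=0.1, spam_threshold=12.0):
        """Return 'ham', 'spam', or None (don't learn) for a message score."""
        if score < nonspam_threshold:
            return 'ham'
        if score >= spam_threshold:
            return 'spam'
        return None  # in the gap between the thresholds: not learned

    # Under the defaults (0.100/12.000) most messages fall in the gap and
    # are never learned.  With 4.999/5.000 the gap all but vanishes, so
    # nearly every message is learned with the filter's own verdict --
    # which is exactly the SA-Unsupervised regime.
    print(autolearn(5.2))              # None under the defaults
    print(autolearn(5.2, 4.999, 5.0))  # 'spam' under SA-Unsupervised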


Justin concluded:
> SA-Unsupervised outperforms SA-Standard -- the default auto-learning regime --
> quite frequently ;) Given this, I think we should offer an autolearn plugin
> which implements SA-Unsupervised autolearning, as an off-by-default option.

Perhaps I'm misreading the linked paper, but it does not appear to conclude that on any metric.  Here's Table III from section 7.1 (page 10)*, comparing the different SpamAssassin configurations:

FILTER            HAM MISCLASSIFY%   SPAM MISCLASSIFY%  OVERALL MISCLASSIFY%
SA-Supervised     0.07 (0.02-0.14)   1.51 (1.39-1.63)   1.24 (1.15-1.35)
SA-Bayes          0.17 (0.09-0.27)   2.10 (1.96-2.24)   1.74 (1.63-1.86)
SA-Nolearn        0.19 (0.11-0.30)   9.49 (9.21-9.78)   7.78 (7.54-8.02)
SA-Standard       0.07 (0.02-0.14)   7.49 (7.23-7.75)   6.12 (5.91-6.34)
SA-Unsupervised   0.11 (0.05-0.20)   8.11 (7.84-8.38)   6.63 (6.41-6.86)
SA-Human          0.09 (0.04-0.18)   1.06 (0.97-1.17)   0.88 (0.80-0.97)

* I omitted 1-AUC from that table; since the ROCs for SA-Standard and SA-Unsupervised (and SA-Nolearn) "intersect many times," it is uninformative.

So SA-Unsupervised misclassified 0.11% of ham and 8.11% of spam, for an overall misclassification rate of 6.63% within that paper's corpus.  SA-Standard had fewer misclassifications than SA-Unsupervised in every category (the opposite of what this bug claims!), though not by any statistically significant margin: comparing the ham confidence intervals, 0.02-0.14 vs 0.05-0.20 is almost wholly overlap, and their endpoints differ by only 0.03 and 0.06, both within the margin of error.
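
The overlap is easy to check directly (interval endpoints copied from
Table III above):

    # Confidence intervals for ham misclassification, from Table III:
    sa_standard     = (0.02, 0.14)
    sa_unsupervised = (0.05, 0.20)

    # The region the two intervals share:
    overlap = (max(sa_standard[0], sa_unsupervised[0]),
               min(sa_standard[1], sa_unsupervised[1]))
    print(overlap)  # (0.05, 0.14): most of each interval is shared, so
                    # the difference is not statistically significant.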

Later in the paper (p. 17), the spam learning performance table and its ham counterpart (Tables VI and VII) show SA-Unsupervised again tying SA-Standard on spam learning (with an insignificant 0.01% edge favoring SA-Standard), and LOSING to SA-Standard on ham learning in the initial phase while tying in the final phase.


I'll repeat and clarify:  In the corpora used by this study, SA-Standard and SA-Unsupervised perform essentially the same.  There is a difference favoring SA-Standard, but it is statistically insignificant.


The fact that they scored the same is quite interesting to me, as it was not at all expected.  Specifically, I wonder how the AWL affected it, and what (if any) learning thresholds are best.  My top suspicion is that the paper's corpora weren't conclusive enough, though SA-Unsupervised was far better at catching "Advertising" spam (Table X, p. 19), while no other genre had a stand-out (they all slightly favored SA-Standard).

I think if this idea is worth latching onto, the GA should begin to experiment with different autolearn thresholds.  As Justin understands the paper, setting the bayes_auto_learn_threshold* values to the marking threshold (4.999/5.000) is better than the default values (0.100/12.000).  While my take on the paper's data differs, it is still interesting to ponder how to properly tune these values.  I'd expect the best settings depend on some combination of the overall ham:spam ratio, the active ham:spam ratio in the bayes db, and the average scores of ham and spam messages.
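
As a starting point for that kind of experiment, here is a hypothetical
sweep harness (not existing SpamAssassin tooling; corpus is assumed to
be a list of (score, truly_spam) pairs from a hand-classified mailbox):

    def sweep(corpus, candidates):
        """Replay scored, gold-standard-labelled messages under candidate
        (nonspam, spam) autolearn thresholds and report how many messages
        each pair would learn, and how many it would learn wrongly."""
        for nonspam_t, spam_t in candidates:
            learned = mislearned = 0
            for score, truly_spam in corpus:
                if score < nonspam_t:        # would be learned as ham
                    learned += 1
                    mislearned += truly_spam
                elif score >= spam_t:        # would be learned as spam
                    learned += 1
                    mislearned += not truly_spam
            print(f"{nonspam_t}/{spam_t}: learned={learned} "
                  f"mislearned={mislearned}")

    # e.g. compare the default regime against SA-Unsupervised:
    # sweep(corpus, [(0.1, 12.0), (4.999, 5.0)])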