Bug 2981 - inoculation support?
Status: RESOLVED LATER
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P4 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2004-01-28 14:54 UTC by Justin Mason
Modified: 2004-08-27 10:07 UTC



Description Justin Mason 2004-01-28 14:54:40 UTC
BTW, here's an interesting idea we ran into at the Spam Conf 2004.

http://lists.netsys.com/pipermail/full-disclosure/2003-November/013840.html
http://www.nuclearelephant.com/projects/dspam/draft-spamfilt-inoculation-01.txt

Basically, it's quite simple -- a standard MIME wrapper for training
spam filters.

My issue with this proposal, however, is what happens when you have
a trained db with these tokens:

        SPAMCOUNT       HAMCOUNT        TOKEN
        1               3               foo
        1               3               bar

Note that both are hammy tokens: each was seen once in spam and 3 times in ham.

If you have 8 friends who have you in their inoculation lists, and they all
get copies of a *single* spam message containing "bar" as a token, and they all
inoculate you, that'll result in:

        SPAMCOUNT       HAMCOUNT        TOKEN
        1               3               foo
        9               3               bar

Hence "bar" becomes a strongly spammy token, even though in reality that was
the result of a single spam run.
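
To make the arithmetic concrete, here is a rough sketch of that scenario. The
naive_spamminess() ratio is only a stand-in I am assuming for illustration; it
is not the probability formula SpamAssassin's Bayes code uses, and the variable
names are made up.

```python
from collections import Counter


def naive_spamminess(spam_count, ham_count):
    """Crude per-token estimate: fraction of sightings that were spam.

    Illustrative assumption only, not SpamAssassin's actual formula.
    """
    return spam_count / (spam_count + ham_count)


# The trained db from the table above.
spam = Counter({"foo": 1, "bar": 1})
ham = Counter({"foo": 3, "bar": 3})

before = naive_spamminess(spam["bar"], ham["bar"])

# Eight friends each inoculate us with the same spam containing "bar".
for _ in range(8):
    spam.update(["bar"])

after = naive_spamminess(spam["bar"], ham["bar"])

print(f"bar before: {before:.2f}, after: {after:.2f}")
# prints "bar before: 0.25, after: 0.75"
```

One spam run seen by 8 people moves "bar" from hammy to spammy, while "foo"
is untouched.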

In other words, inoculation can skew Bayes training; inoculated tokens, IMO,
are likely to end up "stronger" than personally-trained tokens, because every
duplicate inoculation is counted as if it were a separate message.

This could be avoided by using a hash of the message body somehow as a message
identifier, so that once 1 person inoculates you for a given spam, you will
learn it once and ignore future inoculations -- but then the issue is: what is
a reliable message id for spam, given that spammers routinely evade body
hashing, fake message-id headers, etc.?
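
A minimal sketch of the dedup idea and of its weakness follows. Everything in
it is hypothetical: the InoculationLearner class, the whitespace/case
normalization, and the choice of SHA-256 are assumptions for illustration, not
anything from the inoculation draft or from SpamAssassin.

```python
import hashlib
from collections import Counter


class InoculationLearner:
    """Hypothetical learner that trains on each distinct body only once."""

    def __init__(self):
        self.spam_counts = Counter()
        self.seen = set()

    @staticmethod
    def body_id(body):
        # Assumed normalization: lowercase and collapse whitespace.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def inoculate(self, body):
        """Learn the body as spam unless its hash was already seen."""
        key = self.body_id(body)
        if key in self.seen:
            return False  # duplicate inoculation, ignored
        self.seen.add(key)
        # Count each distinct token once per message.
        self.spam_counts.update(set(body.lower().split()))
        return True


learner = InoculationLearner()

# Eight friends forward the identical spam: only the first one is learned.
results = [learner.inoculate("Buy bar now") for _ in range(8)]
count_after_duplicates = learner.spam_counts["bar"]

# The same spam run with a random suffix hashes differently and is
# learned again, which is exactly the evasion problem.
mutated = learner.inoculate("Buy bar now xq7z")
count_after_mutation = learner.spam_counts["bar"]
```

With identical copies "bar" is counted once, but a one-word mutation is enough
to get the same run counted again.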

Comments?
Comment 1 Daniel Quinlan 2004-03-11 15:23:13 UTC
+1 on reassigning this ticket to 3.1 since it is (a) non-trivial and (b) not
a feature we have even considered for 3.0.
Comment 2 Daniel Quinlan 2004-08-27 18:07:28 UTC
I don't think this idea is really catching on; closing as LATER.