Bug 6473

Summary: Making Bayes Learn RelayCountry Metadata
Product: Spamassassin Reporter: RW <rwmaillists>
Component: Plugins Assignee: SpamAssassin Developer Mailing List <dev>
Status: NEW ---    
Severity: enhancement CC: apache, billcole, giovanni, kmcgrail, rwmaillists
Priority: P2    
Version: unspecified   
Target Milestone: Undefined   
Hardware: PC   
OS: FreeBSD   
Whiteboard:
Attachments: Patch to add Bayes-specific Relaycountry metadata
Updated patch for Bayes-specific Relaycountry metadata

Description RW 2010-07-29 19:53:58 UTC
Created attachment 4794
Patch to add Bayes-specific Relaycountry metadata

Bayes doesn't learn tokens shorter than 3 characters and so discards all the two-letter country codes in the RelayCountry metadata.

As the existing format is well suited to header rules, and to avoid breaking existing local rules, I suggest adding additional metadata specifically for Bayes.

I've attached a patch. It produces a token for the first trusted country, plus a token for each country change, e.g.

 "US US CA NG"  becomes "Trusted_US USCA CANG"
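
The transformation in the example above can be sketched roughly as follows. This is an illustrative Python sketch only, not the attached patch (which is not reproduced here). The function name bayes_relay_tokens is hypothetical, and the sketch assumes that the first code in the list is the first trusted country; the actual patch may determine that differently.

```python
def bayes_relay_tokens(relay_countries):
    """Turn a RelayCountry-style string into Bayes-friendly tokens.

    Assumption (not confirmed by the patch): the first code in the
    list is the first trusted country.
    """
    codes = relay_countries.split()
    if not codes:
        return []

    # One token for the first (assumed trusted) country.
    tokens = ["Trusted_" + codes[0]]

    # One token per country change, preserving the order of the pair.
    for prev, cur in zip(codes, codes[1:]):
        if prev != cur:
            tokens.append(prev + cur)
    return tokens


print(" ".join(bayes_relay_tokens("US US CA NG")))  # Trusted_US USCA CANG
```

Every token produced this way is at least four characters long, so none of them falls under the three-character minimum that causes the bare two-letter codes to be discarded.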

I think this is better than simply having a token per country, as that loses all information about ordering. For example, if you are running SA in the UK then "TW" and "CZ TW" might be all spam, but "GB TW" and "US TW" could be less spammy due to travellers using TW IP addresses to connect to their submission servers.

Ordered pairs are also more resistant to forged headers. If a spammer adds extra Received headers as Bayes poison and sends the message through a foreign country, it will show up as a spammy pair rather than a hammy country code, e.g. CNGB is spammy because the ordering is wrong.
Comment 1 Henrik Krohns 2011-05-25 07:59:07 UTC
*** Bug 6433 has been marked as a duplicate of this bug. ***
Comment 2 Giovanni Bechis 2018-02-03 11:37:58 UTC
I think this could be useful; IMHO more food for Bayes is better.
Any opinions?
Comment 3 RW 2018-02-03 13:38:06 UTC
Created attachment 5522
Updated patch for Bayes-specific Relaycountry metadata
Comment 4 Bill Cole 2018-02-04 18:59:49 UTC
(In reply to Giovanni Bechis from comment #2)
> I think this could be useful; IMHO more food for Bayes is better.
> Any opinions?

+1
Comment 5 Kevin A. McGrail 2018-02-21 12:19:58 UTC
RW, any chance we can get an ICLA (https://www.apache.org/licenses/icla.pdf) so that we can consider this patch?
Comment 6 Henrik Krohns 2018-02-21 14:34:32 UTC
Sorry to be a downer, but in the words of Justin Mason, any Bayes modification should go through a ten-fold cross-validation (https://wiki.apache.org/spamassassin/TenFoldCrossValidation). A long time ago I messed around adding all sorts of tokens and did some 10fcv tests; sometimes the results were even worse. So I wouldn't necessarily go claiming moar crap the better.
Comment 7 Kevin A. McGrail 2018-02-21 14:36:54 UTC
(In reply to Henrik Krohns from comment #6)
> Sorry to be a downer, but in the words of Justin Mason, any Bayes
> modification should go through a ten-fold cross-validation
> (https://wiki.apache.org/spamassassin/TenFoldCrossValidation). A long time
> ago I messed around adding all sorts of tokens and did some 10fcv tests;
> sometimes the results were even worse. So I wouldn't necessarily go claiming
> moar crap the better.

Great point, Henrik. Ten-fold cross-validation (tfcv) is a standard requirement for me to consider the patch.