|
SA Bugzilla – Full Text Bug Listing |
Summary: | Making Bayes Learn RelayCountry Metadata | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | RW <rwmaillists> |
Component: | Plugins | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | NEW --- | ||
Severity: | enhancement | CC: | apache, billcole, giovanni, kmcgrail, rwmaillists |
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | Undefined | ||
Hardware: | PC | ||
OS: | FreeBSD | ||
Whiteboard: | |||
Attachments: |
Patch to add Bayes-specific Relaycountry metadata
Updated patch for Bayes-specific Relaycountry metadata |
*** Bug 6433 has been marked as a duplicate of this bug. ***

I think this could be useful; IMHO more food for Bayes is better. Any opinions?

Created attachment 5522
Updated patch for Bayes-specific Relaycountry metadata

(In reply to Giovanni Bechis from comment #2)
> I think this could be useful, IMHO more food for bayes is better.
> Any opinions ?
+1
RW, any chance we can get an ICLA https://www.apache.org/licenses/icla.pdf to consider this patch?

Sorry to be a downer, but in the words of Justin Mason, any Bayes modification should go through a https://wiki.apache.org/spamassassin/TenFoldCrossValidation. A long time ago I messed around adding all sorts of tokens and did some 10fcv tests; sometimes the results were even worse. So I wouldn't necessarily go claiming moar crap the better.

(In reply to Henrik Krohns from comment #6)
> Sorry to be a downer, but in the words of Justin Mason, any Bayes
> modification should go through a
> https://wiki.apache.org/spamassassin/TenFoldCrossValidation. Long time ago I
> messed around adding all sorts of tokens and did some 10fcv tests, sometimes
> results were even worse. So I wouldn't necessarily go claiming moar crap the
> better.
Great point, Henrik. tfcv is standard for me to consider the patch. |
Created attachment 4794
Patch to add Bayes-specific Relaycountry metadata

Bayes doesn't learn tokens shorter than 3 characters, and so it discards all the two-letter country codes in the RelayCountry metadata. As the existing format is well suited to header rules, and to avoid breaking existing local rules, I suggest adding additional metadata specifically for Bayes.

I've attached a patch. It produces a token for the first trusted country, plus a token for each country change, e.g. "US US CA NG" becomes "Trusted_US USCA CANG".

I think this is better than simply having a token per country, as that loses all information about ordering, e.g. if you are running SA in the UK then "TW" and "CZ TW" might be all spam, but "GB TW" and "US TW" could be less spammy due to travellers using TW IP addresses to connect to their submission servers.

Ordered pairs are also more resistant to forged headers. If a spammer adds extra Received headers as Bayes poison and sends the message through a foreign country, it will show as a spammy pair rather than a hammy country code, e.g. CNGB is spammy because the ordering is wrong.
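
For illustration only, the token scheme described above can be sketched in a few lines of Python. This is not the attached patch (which is a change to the Perl plugin). The function name `bayes_relay_tokens` is hypothetical, and the sketch assumes that the first code in the RelayCountry string is the first trusted country and that repeated adjacent codes produce no token.

```python
def bayes_relay_tokens(relay_countries: str) -> list[str]:
    """Turn a RelayCountry string into Bayes-friendly tokens.

    Assumption: the first code in the string is the first trusted country.
    """
    codes = relay_countries.split()
    if not codes:
        return []

    # One token for the first (trusted) country; long enough for Bayes to keep.
    tokens = [f"Trusted_{codes[0]}"]

    # One four-letter token per country change between adjacent relays.
    for prev, curr in zip(codes, codes[1:]):
        if prev != curr:
            tokens.append(prev + curr)
    return tokens


print(" ".join(bayes_relay_tokens("US US CA NG")))  # Trusted_US USCA CANG
```

Every token here is at least four characters long, so none would be dropped by the 3-character minimum, and because each pair keeps its order, "GB TW" and "TW GB" yield different tokens (GBTW vs. TWGB).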