Bug 5655 - Bayes not considering add_header'ed information
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 3.2.0
Hardware: All All
Importance: P3 normal
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
Reported: 2007-09-20 04:26 UTC by Matthias Leisi
Modified: 2021-04-12 23:32 UTC



Description Matthias Leisi 2007-09-20 04:26:23 UTC
It seems that Bayes is not considering headers added through add_header
directives in *.cf files:

--- cut ---
add_header all Spammy _SPAMMYTOKENS(5,compact)_
add_header all Hammy _HAMMYTOKENS(5,compact)_
 
ifplugin Mail::SpamAssassin::Plugin::ASN
asn_lookup asn.routeviews.org _ASN_ _ASNCIDR_
add_header all ASN _ASN_ _ASNCIDR_
endif

use_bayes 1
bayes_path /var/lib/nobody/.spamassassin/bayes
bayes_auto_learn 1
bayes_ignore_header X-Bogosity
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-DNSWL
bayes_ignore_header X-Whitelisted-Leisi
bayes_ignore_header X-Spam-Prev-Subject
bayes_ignore_header X-Virus-Scanned
bayes_ignore_header X-Sieve
--- cut ---
 
A typical mail from the users@spamass mailing list produces the following headers:

--- cut ---
X-Spam-ASN: AS3701 140.211.0.0/16
X-Spam-Hammy: 0.000-+--H*Ad:D*spamassassin.apache.org,
        0.000-+--Hlist-unsubscribe:sk:users-u,
        0.000-+--Hlist-help:sk:users-h,
        0.000-+--H*Ad:U*users, 0.000-+--HTo:U*users
X-Spam-Spammy: 0.999-1--belief, 0.913-+--H*r:unknown,
        0.900-+--H*x:Outlook,
        0.900-+--H*UA:Outlook, 0.890-+--H*UA:Microsoft
--- cut ---

Most of the mails to that list have "autolearn=ham", and thus I would expect
that some part of "AS3701 140.211.0.0/16" would end up in the X-Spam-Hammy items. 

The particular server where the headers are taken from is running SA 3.2.0,
using Perl 5.8.8; however a test installation using 3.2.3 seems to show the same
behaviour (but running on a smaller set of Bayes data).
Comment 1 Michael Parker 2007-09-20 09:36:59 UTC
Is there documentation that implies this will happen?

Headers are added at the end of processing so I would never expect them to be
considered.  What needs to happen is the plugin should add metadata that is then
used by other rules and bayes.

Perhaps a subject change is in order.
Comment 2 Matt Kettler 2007-09-20 16:00:18 UTC
Agreed with Michael. Unless a plugin adds the header early on it won't be
available as metadata for bayes. See how the relaycountry plugin adds it
directly to allow bayes to see it. (And Bayes will see it, even if there are no
related add_header directives).

Also, even if the bayes system did see it, and the message was autolearned, the
token in question would have to exist in bayes *before* the message was scanned.
(ie: autolearning doesn't go back and re-update the bayes score and tokens for
the message)

Finally, if you feed it to sa-learn later, spamassassin will explicitly ignore
these headers, and this is very much by design.

sa-learn explicitly strips out all the markups SA added (if any) before learning
the message. This prevents SA from learning about itself instead of the message.
Comment 3 Daryl C. W. O'Shea 2007-09-20 17:27:15 UTC
(In reply to comment #1)
> Is there documentation that implies this will happen?

Sort of.  The ASN plugin does say that it makes the metadata available to bayes,
but it (rightly) doesn't say to use add_header to make that happen.

> Headers are added at the end of processing so I would never expect them to be
> considered.  What needs to happen is the plugin should add metadata that is then
> used by other rules and bayes.

The ASN plugin *does* add the required metadata as soon as the dns queries are
processed.

The add_header stuff is just optional cosmetics for message markup.  You're
probably not seeing it in your HAMMYTOKENS headers since the data is unlikely to
be one of the 5 most significant tokens.  Run a message through with debug
enabled and you'll likely see the tokens you're looking for.  Once you've done
that please come back here and either provide debug output that shows it's not
working or close the bug if it is working.
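
For a quick check without a full debug run, one option is to widen the token lists temporarily. This is only a sketch: it reuses the reporter's own add_header lines, and the count of 20 is an arbitrary illustrative value. ASN-derived tokens that fall outside the 5 most significant ones might then become visible in the markup.

```
# Temporary diagnostic settings (20 is an arbitrary choice, not a recommendation).
# These replace the reporter's original "(5,compact)" lines.
add_header all Spammy _SPAMMYTOKENS(20,compact)_
add_header all Hammy _HAMMYTOKENS(20,compact)_
```

If no ASN-related token appears even in the longer lists, that points to the metadata not reaching the tokenizer at all, rather than to a ranking effect.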
Comment 4 RW 2010-05-07 18:01:04 UTC
Could someone take a look at this? I just came to report this problem and found this two-year-old bug report.

It's not just that the tokens don't make it to the top 5 hammy/spammy list. In my experience they never show up at all in spamassassin -D bayes. Sometimes the ASN debug output appears before tokenization, sometimes after; it seems to make no difference.

I would think that for most people the sole purpose of running this plugin is to make the information available to bayes. At the very least, the documentation should be fixed if the plugin is purely informational.
Comment 5 Mark Martinec 2011-01-06 14:55:55 UTC
(In reply to comment #3)
> The ASN plugin *does* add the required metadata as soon as the dns queries
> are processed.

Apparently it doesn't (or maybe it doesn't any more).

trunk:
  Bug 5655: Bayes not considering add_header'ed information:
  let the ASN plugin provide meta-information that can be
  tokenized by bayes in a form of X-ASN and X-ASN-Route
  header fields
Sending lib/Mail/SpamAssassin/Plugin/ASN.pm
Committed revision 1056043.
Comment 6 Mark Martinec 2011-01-07 11:27:32 UTC
(In reply to comment #4)
> It's not just that the tokens don't make it to the top 5 hammy/spammy list. In
> my experience they never show-up at all in spamassassin -D bayes. Sometimes the
> ASN debug appears before tokenization, sometimes after, it seems to make no
> difference.

Indeed that is true. The ASN plugin runs at priority 0, like most
other rules, and BAYES runs even earlier:

  60_shortcircuit.cf: priority BAYES_99 -400

so I don't see how meta-information provided by the ASN plugin
could contribute to bayes tokens, unless bayes priority is moved
up to run late, or ASN priority pushed down very early, earlier
than -400.

Btw, the meta-information provided by my yesterday's change (previous
posting) does at least end up in *learned* tokens, as bayes learning
comes late.

So, as it stands now, the ASN plugin is not useful for bayes, and the
documentation is misleading.
Comment 7 Mark Martinec 2011-01-07 11:39:25 UTC
> Indeed that is true. The ASN plugin runs at a priority 0 as most
> other rules, and the BAYES runs even earlier at -400.

Sorry, correction: ASN starts its queries early, in a "parsed_metadata"
plugin hook. But that is no guarantee that DNS results will be available
by the time the BAYES plugin does its tokenization.

Bumping up the BAYES_* rules priority to a positive value mitigates
the problem in most cases. If one doesn't use shortcircuit rules,
it should be safe to make bayes run late.
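
A minimal sketch of that workaround, assuming the stock BAYES_* rule names and no shortcircuit rules that depend on them. The value 500 is an arbitrary positive number chosen for illustration; it is not a tested recommendation.

```
# Hypothetical local.cf fragment: run the Bayes rules late so that
# DNS-based metadata has a better chance of arriving before tokenization.
# 500 is an arbitrary positive priority.
priority BAYES_00 500
priority BAYES_05 500
priority BAYES_20 500
priority BAYES_40 500
priority BAYES_50 500
priority BAYES_60 500
priority BAYES_80 500
priority BAYES_95 500
priority BAYES_99 500
```

This only mitigates the timing problem; it does not guarantee that DNS replies are in before Bayes runs.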
Comment 8 Benny Pedersen 2017-02-19 23:29:35 UTC
Also, ASN currently fails on IPv6 in SA 3.4.1. Maybe the rules are just missing the correct IPv6 config?
Comment 9 Henrik Krohns 2019-04-15 13:32:42 UTC
Trunk can now use GeoIP ASN for immediate lookups without DNS.

Sending        UPGRADE
Sending        lib/Mail/SpamAssassin/Plugin/ASN.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1857580.

In theory it should always be available to Bayes now, but for some reason the metadata stored with put_metadata doesn't seem to be used. Leaving this open to investigate further.
Comment 10 Henrik Krohns 2021-04-10 09:43:18 UTC
Bayes has access to ASN metadata now. Had to move processing from parsed_metadata to extract_metadata, so it happens before Bayes tokenizes.

Sending        trunk/lib/Mail/SpamAssassin/Plugin/ASN.pm
Sending        trunk/lib/Mail/SpamAssassin.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1888576.
Comment 11 RW 2021-04-12 19:25:42 UTC
As I understand it the default is using a DNS look-up to get the ASN information. So what happens when sa-learn runs? Does the look-up get repeated?

If it doesn't the ASN tokens will come only from auto-training with no possibility of correction.
Comment 12 Henrik Krohns 2021-04-12 23:32:49 UTC
DNS is only used for ASN if GeoIP is not available.

In reality Bayes can't even use any DNS based data, since tokenizing is done mere moments after async DNS is launched (dns prio -100, bayes -90). There is no time to wait for actual replies.

sa-learn doesn't even seem to harvest DNS replies if some plugin decides to send queries. I guess all this was the reason for the docs saying that network lookups are not used even without -L. In the commit you can see that I blocked ASN from sending DNS queries needlessly when learning; for some reason it was the only plugin that did.

Now that I think about it, I guess sa-learn -L should be made the default. I'm not sure what kind of network-based test would even produce usable Bayes data. ASN and any similar static data should be queried from a local database.
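
A minimal sketch of a local-database setup along those lines. The option names (asn_use_geodb, asn_prefer_geodb, asn_use_dns, geodb_module, geodb_options) are written from memory of the trunk/4.0 documentation and should be treated as assumptions to verify against the installed ASN plugin and Conf documentation. The database path is hypothetical.

```
ifplugin Mail::SpamAssassin::Plugin::ASN
  # Assumed option names; check them against your version's documentation.
  asn_use_geodb 1        # look up the ASN in a local GeoIP database
  asn_prefer_geodb 1     # skip DNS when the local lookup succeeds
  asn_use_dns 0          # never fall back to DNS queries

  # Hypothetical module choice and database location.
  geodb_module GeoIP2
  geodb_options asn:/usr/share/GeoIP/GeoLite2-ASN.mmdb

  # Optional cosmetic markup, as in the original report.
  add_header all ASN _ASN_
endif
```

With a local lookup the ASN is known immediately, so it does not depend on DNS replies arriving before Bayes tokenizes.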