SA Bugzilla – Bug 5655
Bayes not considering add_header'ed information
Last modified: 2021-04-12 23:32:49 UTC
It seems that Bayes is not considering headers added through add_header directives in *.cf files:

--- cut ---
add_header all Spammy _SPAMMYTOKENS(5,compact)_
add_header all Hammy _HAMMYTOKENS(5,compact)_
ifplugin Mail::SpamAssassin::Plugin::ASN
asn_lookup asn.routeviews.org _ASN_ _ASNCIDR_
add_header all ASN _ASN_ _ASNCIDR_
endif
use_bayes 1
bayes_path /var/lib/nobody/.spamassassin/bayes
bayes_auto_learn 1
bayes_ignore_header X-Bogosity
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-DNSWL
bayes_ignore_header X-Whitelisted-Leisi
bayes_ignore_header X-Spam-Prev-Subject
bayes_ignore_header X-Virus-Scanned
bayes_ignore_header X-Sieve
--- cut ---

A typical mail from the users@spamass mailing list produces the following headers:

--- cut ---
X-Spam-ASN: AS3701 140.211.0.0/16
X-Spam-Hammy: 0.000-+--H*Ad:D*spamassassin.apache.org, 0.000-+--Hlist-unsubscribe:sk:users-u, 0.000-+--Hlist-help:sk:users-h, 0.000-+--H*Ad:U*users, 0.000-+--HTo:U*users
X-Spam-Spammy: 0.999-1--belief, 0.913-+--H*r:unknown, 0.900-+--H*x:Outlook, 0.900-+--H*UA:Outlook, 0.890-+--H*UA:Microsoft
--- cut ---

Most of the mails to that list have "autolearn=ham", and thus I would expect that some part of "AS3701 140.211.0.0/16" would end up in the X-Spam-Hammy items.

The particular server where the headers are taken from is running SA 3.2.0, using Perl 5.8.8; however, a test installation using 3.2.3 seems to show the same behaviour (but running on a smaller set of Bayes data).
Is there documentation that implies this will happen? Headers are added at the end of processing so I would never expect them to be considered. What needs to happen is the plugin should add metadata that is then used by other rules and bayes. Perhaps a subject change is in order.
Agreed with Michael. Unless a plugin adds the header early on, it won't be available as metadata for Bayes. See how the RelayCountry plugin adds it directly to allow Bayes to see it. (And Bayes will see it even if there are no related add_header directives.)

Also, even if the Bayes system did see it and the message was autolearned, the token in question would have to exist in Bayes *before* the message was scanned (i.e., autolearning doesn't go back and re-update the Bayes score and tokens for the message).

Finally, if you feed it to sa-learn later, SpamAssassin will explicitly ignore these headers, and this is very much by design. sa-learn strips out all the markup SA added (if any) before learning the message. This prevents SA from learning about itself instead of the message.
(In reply to comment #1)
> Is there documentation that implies this will happen?

Sort of. The ASN plugin does say that it makes the metadata available to bayes, but it (rightly) doesn't say to use add_header to make that happen.

> Headers are added at the end of processing so I would never expect them to be
> considered. What needs to happen is the plugin should add metadata that is then
> used by other rules and bayes.

The ASN plugin *does* add the required metadata as soon as the DNS queries are processed. The add_header stuff is just optional cosmetics for message markup.

You're probably not seeing it in your HAMMYTOKENS headers since the data is unlikely to be one of the 5 most significant tokens. Run a message through with debug enabled and you'll likely see the tokens you're looking for. Once you've done that, please come back here and either provide debug output that shows it's not working or close the bug if it is working.
Could someone take a look at this? I just came to report this problem and found this two-year-old bug report. It's not just that the tokens don't make it to the top 5 hammy/spammy list. In my experience they never show up at all in spamassassin -D bayes. Sometimes the ASN debug appears before tokenization, sometimes after; it seems to make no difference. I would think that for most people the sole purpose of running this plugin is to make the information available to bayes. At the very least the documentation should be fixed, if it's purely informational.
(In reply to comment #3)
> The ASN plugin *does* add the required metadata as soon as the dns queries
> are processed.

Apparently it doesn't (or maybe it doesn't any more).

trunk:
Bug 5655: Bayes not considering add_header'ed information: let the ASN plugin provide meta-information that can be tokenized by bayes in a form of X-ASN and X-ASN-Route header fields

Sending lib/Mail/SpamAssassin/Plugin/ASN.pm
Committed revision 1056043.
(In reply to comment #4)
> It's not just that the tokens don't make it to the top 5 hammy/spammy list. In
> my experience they never show-up at all in spamassassin -D bayes. Sometimes the
> ASN debug appears before tokenization, sometimes after, it seems to make no
> difference.

Indeed that is true. The ASN plugin runs at priority 0, like most other rules, and BAYES runs even earlier:

60_shortcircuit.cf: priority BAYES_99 -400

so I don't see how meta-information provided by the ASN plugin could contribute to bayes tokens, unless the bayes priority is moved up to run late, or the ASN priority is pushed down to run very early, earlier than -400.

Btw, the meta-information provided by my change from yesterday (previous posting) does at least end up in *learned* tokens, as bayes learning comes late.

So, as it stands now, the ASN plugin is not useful for bayes, and the documentation is misleading.
> Indeed that is true. The ASN plugin runs at a priority 0 as most
> other rules, and the BAYES runs even earlier at -400.

Sorry, correction: ASN starts its queries early as a "parsed_metadata" plugin hook. But that is no guarantee that DNS results will be available by the time the BAYES plugin does its tokenization.

Bumping up the BAYES_* rules priority to a positive value mitigates the problem in most cases. If one doesn't use shortcircuit rules, it should be safe to make bayes run late.
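
For reference, the mitigation described above could look roughly like this in local.cf. This is only a sketch under stated assumptions: the value 10 is arbitrary (the comment only says "a positive value"), the list assumes the stock BAYES_* rule names, and it presumes that no shortcircuit rule depends on bayes running early.

--- cut ---
# Assumption: 10 is an arbitrary positive priority, later than the default 0.
# Higher priority values run later, which gives the asynchronous ASN DNS
# lookups more time to return before bayes tokenizes the message.
priority BAYES_00 10
priority BAYES_05 10
priority BAYES_20 10
priority BAYES_40 10
priority BAYES_50 10
priority BAYES_60 10
priority BAYES_80 10
priority BAYES_95 10
priority BAYES_99 10
--- cut ---

As noted, this only makes it more likely that the DNS answer has arrived by the time bayes runs; it is not a guarantee.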
Also, ASN currently fails on IPv6 in SA 3.4.1. Maybe the rules are just missing the correct IPv6 config?
Trunk can now use GeoIP ASN for immediate lookups without DNS.

Sending UPGRADE
Sending lib/Mail/SpamAssassin/Plugin/ASN.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1857580.

In theory it should always be available to Bayes now, but for some reason the metadata stored via put_metadata doesn't seem to be used. Leaving this open to investigate more.
Bayes has access to ASN metadata now. Had to move processing from parsed_metadata to extract_metadata, so it happens before Bayes tokenizes.

Sending trunk/lib/Mail/SpamAssassin/Plugin/ASN.pm
Sending trunk/lib/Mail/SpamAssassin.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1888576.
As I understand it, the default is to use a DNS lookup to get the ASN information. So what happens when sa-learn runs? Does the lookup get repeated? If it doesn't, the ASN tokens will come only from auto-training, with no possibility of correction.
DNS is only used for ASN if GeoIP is not available. In reality Bayes can't even use any DNS-based data, since tokenizing is done mere moments after async DNS is launched (DNS priority -100, Bayes -90). There is no time to wait for actual replies. sa-learn doesn't even seem to harvest DNS replies if some plugin decided to send queries. I guess all this was the reason for the docs saying that network lookups are not used even without -L.

In the commit you can see I blocked ASN from sending DNS queries needlessly when learning; for some reason it was the only plugin that did. Thinking about it now, I guess -L should be made the default for sa-learn. I'm not sure what kind of network-based test would even produce usable Bayes data. ASN and any similar static data should be queried from a local database.