SA Bugzilla – Bug 4787
BAYES_99 hits on all mail
Last modified: 2019-06-24 16:31:51 UTC
Anyone ever seen anything like this? I'm wondering if there is a compelling reason to make SA stop learning spam tokens if the ham:spam token ratio exceeds a certain level. I guess I could raise bayes_auto_learn_threshold_spam above 12 to limit the amount of spam tokens learned... but it would be cool to have the learner shut off spam token learning if the ratio is out of whack, or vice versa with ham token learning if it outweighs the spam token count. FWIW, this is the first time I've ever seen this happen since moving to SQL bayes.

[root@email spamassassin]# grep result: spamd.log | wc -l
12914
[root@email spamassassin]# grep result: spamd.log | grep BAYES_99 | wc -l
12909

Sending a ham sample through hits like this...

X-Spam-Bayes-Tc-Spammy: 100
X-Spam-Status: No, hits=4.5 required=5.0
X-Spam-Bayes-Spammy-Tokens: 1.000-+--H*RU:rdns, 1.000-+--H*RU:helo,
  1.000-+--H*RU:ident, 1.000-+--H*RU:intl, 1.000-+--H*RU:envfrom,
  1.000-+--H*RU:auth, 1.000-+--HTo:D*net, 1.000-+--H*Ad:D*net,
  1.000-+--H*F:D*com, 1.000-+--here
X-Spam-Bayes-Tc-Hammy:
X-Spam-Score: 4.5
X-Spam-Level: ****
X-Spam-Bayes-Hammy-Tokens:
X-Spam-Bayes: 1.0000
X-Spam-Bayes-Tc-Learned: 101
X-Spam-Bayes-Summary: Tokens: new, 53; hammy, 0; neutral, 1; spammy, 100.
X-Spam-Bayes-Tc: 154
X-Spam-Report: 4.5 points, 5.0 required
 * 1.0 NO_REAL_NAME From: does not include a real name
 * 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
 *     [score: 1.0000]

mysql> select * from bayes_vars;
+----+----------+------------+-----------+-------------+-------------+------------------+--------------------+------------------+------------------+
| id | username | spam_count | ham_count | token_count | last_expire | last_atime_delta | last_expire_reduce | oldest_token_age | newest_token_age |
+----+----------+------------+-----------+-------------+-------------+------------------+--------------------+------------------+------------------+
|  1 | $GLOBAL  |     523306 |     67272 |     4680055 |  1139497034 |             7200 |              51609 |       1139314224 |       1139501076 |
+----+----------+------------+-----------+-------------+-------------+------------------+--------------------+------------------+------------------+
1 row in set (0.00 sec)
this looks similar to something I brought up a while ago: having only a small number of tokens hit, and/or having only a small number of tokens in either the head or body independently, gives bad results from bayes.
Just had this happen again on another box running 0.1 and 12.0 autolearn thresholds. I believe the problem is that as ham tokens get expired and spam tokens are the only thing left in bayes_token, BAYES_99 starts hitting on all email. The minimum of 200 spam and 200 ham messages required is not a good thing to go off of (except on a brand-new install). I don't know how spam_count and ham_count in the bayes_vars table can accurately represent your token distribution... so having a min spam and min ham token count would be optimal. As you can see, bayes_vars says it's learned 93k ham, but out of that, there are currently only 137 ham tokens in the table.

mysql> select spam_count,ham_count,token_count from bayes_vars;
+------------+-----------+-------------+
| spam_count | ham_count | token_count |
+------------+-----------+-------------+
|    2463944 |     93579 |     3620764 |
+------------+-----------+-------------+

mysql> select sum(spam_count), sum(ham_count) from bayes_token;
+-----------------+----------------+
| sum(spam_count) | sum(ham_count) |
+-----------------+----------------+
|        34846107 |            137 |
+-----------------+----------------+
1 row in set (0.00 sec)

I propose the following additions to the bayes config options:

bayes_min_ham_tokens <num>
bayes_min_spam_tokens <num>

and maybe something extra on top of that, like a ham to spam token ratio:

bayes_ham_spam_token_ratio 0.5  # require 1 ham for every 2 spam tokens

Now I realize that it's expensive to sum(spam_count) and sum(ham_count) on the bayes_token table, so I think the bayes_vars table could add a couple of columns after `token_count`, say `spam_token_count` and `ham_token_count`, and auto-expiry/sa-learn could update those fields when they run. That way the query to pull those counts remains efficient. Also, if those 2 fields are available, calculating the ham:spam ratio is cake. If this sounds like an acceptable solution, I may work on it, unless anyone has reasons why it's not better to do it this way?
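For discussion's sake, the proposed gating rules could be sketched roughly like this (a minimal Python illustration, not SpamAssassin code; the function name, parameter names, and the 1000-token minimums are made up, while the 0.5/2.0 ratio bounds come from the proposal above):

```python
def autolearn_allowed(is_spam, spam_toks, ham_toks,
                      min_spam_toks=1000, min_ham_toks=1000,
                      min_ratio=0.5, max_ratio=2.0):
    """Hypothetical gate: refuse to auto-learn when either token pool
    is too small, or when learning would push the ham:spam token ratio
    further out of bounds."""
    if spam_toks < min_spam_toks or ham_toks < min_ham_toks:
        return False                 # not enough token data yet
    ratio = ham_toks / spam_toks     # ham:spam token ratio
    if is_spam and ratio < min_ratio:
        return False                 # already spam-heavy: skip spam learn
    if not is_spam and ratio > max_ratio:
        return False                 # already ham-heavy: skip ham learn
    return True
```

With the default 0.5-2.0 bounds, a box with a 0.74:1 ratio would still learn spam; tightening min_ratio to 0.75 would make it skip, matching the debug output shown later in this bug.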
+1 it sounds sane to me.
I'll have a POF ready soon on this... just finishing testing. In the last 4 hours, my ham:spam token ratio has dropped by a quarter point (trying to get under 2:1), so it's working as expected :)

2006-06-20 12:06:54.668510500 [18515] dbg: bayes: ham:spam token ratio (2.84:1), min ratio (0.5:1), max ratio (2:1)
2006-06-20 16:28:01.666901500 [10881] dbg: bayes: ham:spam token ratio (2.59:1), min ratio (0.5:1), max ratio (2:1)
Excuse the typo. That's POC, not POF :)
Created attachment 3567 [details]
proof of concept

This patch implements token tracking to prevent issues where lots of ham/spam has been learned, but all the tokens for one class have been expired, causing bayes to lean too far one way (BAYES_00 or BAYES_99 on all mail). It also implements ham:spam ratio restrictions, which prevent the autolearner from learning too much ham when the ratio is high, and too much spam when the ratio is low.

The proof of concept code only applies to BayesStore/SQL.pm, so in order to test it, you'd need to be using:

bayes_store_module Mail::SpamAssassin::BayesStore::SQL

Since the box I'm testing here learns a lot of spam and little ham, the token ratio is always on the bottom end of the min ratio:

[12005] dbg: bayes: ham:spam token ratio (0.74:1), min ratio (0.75:1), max ratio (1.25:1)
[12005] dbg: bayes: skip autolearn of spam because ham:spam token ratio (0.74) is less than min ratio (0.75)

As you can see from the autolearn results, it's skipped a bunch of spam learns today...

# grep -c autolearn=ham spamd.log
652
# grep -c autolearn=spam spamd.log
859
# grep -c autolearn=unavailable spamd.log
5141

But that's because I've set my min/max ratios so close, at 0.75-1.25. If you want to learn a lot more spam, you could simply use 0.5-2.0, which is the default... or you could even lower that 0.5 to something like 0.25 if you want to learn up to 4x more spam than ham.

Realize that this code is not drop-in ready, as it requires a couple of SQL alters to track spam/ham token counts:

ALTER TABLE bayes_vars ADD spam_token_count int(11) NOT NULL default '0' AFTER token_count;
ALTER TABLE bayes_vars ADD ham_token_count int(11) NOT NULL default '0' AFTER spam_token_count;
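To illustrate how the two new bayes_vars columns could stay accurate without rescanning bayes_token, here is a rough sketch of the bookkeeping (hypothetical Python, not the actual BayesStore/SQL.pm code from the attachment; all names are illustrative):

```python
class TokenCounters:
    """In-memory model of the proposed spam_token_count / ham_token_count
    columns: a token is counted for a class while its per-class count in
    bayes_token is non-zero."""

    def __init__(self):
        self.spam_token_count = 0
        self.ham_token_count = 0

    def learn(self, old_spam, old_ham, spam_delta, ham_delta):
        # called when sa-learn inserts/updates a token row
        if old_spam == 0 and old_spam + spam_delta > 0:
            self.spam_token_count += 1
        if old_ham == 0 and old_ham + ham_delta > 0:
            self.ham_token_count += 1

    def expire(self, spam_count, ham_count):
        # called when auto-expiry deletes a token row
        if spam_count > 0:
            self.spam_token_count -= 1
        if ham_count > 0:
            self.ham_token_count -= 1

    def ham_spam_ratio(self):
        # the cheap ratio read that replaces sum() over bayes_token
        return (self.ham_token_count / self.spam_token_count
                if self.spam_token_count else float("inf"))
```

The point of the design is that each learn/expire touches the counters with O(1) work, so reading the ratio never requires a full-table sum.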
I like it overall, I think -- esp if it fixes the problem! -- but:

1. bayes_min_ham_tokens/bayes_min_spam_tokens -- I'd prefer to leave those out unless they're needed;
2. it obviously needs to be implemented for the other BayesStore backends;
3. Michael, what's your take on the changes to the BayesStore APIs? Is that safe? (It's not a public plugin API, but still.) In particular, changing the number of returned items for nspam_nham_get() is a big change; adding an additional (and separate) API would be better.
4. I think there's probably some SQL doc that would also need changing.
Would it make sense to have the expire code also pay attention to the ratio, and extend the allowed age of the smaller group of tokens if the ratio is getting out of whack? You would want to limit how much the allowed age of a token could be extended (perhaps to double the normal value).
(In reply to comment #8)
> Would it make sense to have the expire code also pay attention to the ratio
> and extend the allowed age of the smaller group of tokens if the ratio is
> getting out of whack. You would what to limit how much the allowd age of a
> token could be extended (perhaps double the normal value).

I had it calculating the ratio in token_expiration() and skipping the expire if the ratio was out of whack, but I ended up pulling it, since removing that check seemed to help get the ratio back on track faster. YMMV.
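For what it's worth, the age-extension idea from comment #8 might look something like this (purely illustrative Python, not part of the patch; the 2x cap is from the comment, everything else is assumption):

```python
def expire_atime_deltas(base_delta, ham_toks, spam_toks,
                        min_ratio=0.5, max_ratio=2.0, cap=2.0):
    """Return (ham_delta, spam_delta): how long each class's tokens may
    live before expiry. The scarcer class gets its allowed age stretched
    when the ham:spam token ratio is out of bounds, capped at double the
    normal value."""
    ham_delta = spam_delta = base_delta
    if spam_toks and ham_toks:
        ratio = ham_toks / spam_toks
        if ratio < min_ratio:
            # ham is scarce: let ham tokens live longer
            ham_delta = base_delta * min(min_ratio / ratio, cap)
        elif ratio > max_ratio:
            # spam is scarce: let spam tokens live longer
            spam_delta = base_delta * min(ratio / max_ratio, cap)
    return ham_delta, spam_delta
```

As comment #9 notes, skipping or softening expiry like this may actually slow rebalancing compared to letting the learner-side ratio gate do the work, so this is a design option rather than a recommendation.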
Created attachment 3568 [details]
small fix

Somehow I got a stray 'i' on line 572 that shouldn't have been there.
(In reply to comment #7)
> I like it overall, I think -- esp if it fixes the problem! -- but:
> 1. bayes_min_ham_tokens/bayes_min_spam_tokens -- I'd prefer to leave those out
> unless they're needed;

bayes_min_ham_tokens/bayes_min_spam_tokens is what fixes this bug to begin with. The token ratio check was just something extra to help the learner learn what we need more of, and not what we don't. Having bayes_min_(ham|spam)_tokens ensures we don't learn when we don't have enough token data. bayes_min_(ham|spam)_num does not assure us of this, as expiry could knock off a lot of the token data, and the (ham|spam)_count in bayes_vars does not account for that. I've seen 200+ ham learned where the actual ham token count in bayes_token is very small.

If you don't do this, and opt just for the token ratio path, then you'd need ratio logic in is_scan_available() to skip bayes when the ratio is whacked out... because right now, the ratio logic is only applied to learn(), in hopes it will help equalize things.
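The distinction being argued here, between the existing message-count gate and the proposed token-count gate, can be shown with a small sketch (illustrative Python only; the 200-message minimum is the existing default mentioned above, while min_toks and the function name are made up):

```python
def bayes_scan_available(nspam_msgs, nham_msgs, spam_toks, ham_toks,
                         min_msgs=200, min_toks=500):
    """Hypothetical combined check: the current message-count gate plus
    the proposed token-count gate. The bug is that the message gate can
    pass long after expiry has emptied one class's token pool."""
    msg_gate = nspam_msgs >= min_msgs and nham_msgs >= min_msgs
    token_gate = spam_toks >= min_toks and ham_toks >= min_toks
    return msg_gate and token_gate
```

Plugging in the numbers from this bug: bayes_vars reports 93579 ham messages learned, so the message gate passes, yet only 137 ham token counts survive in bayes_token, so a token gate would correctly refuse to scan.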
Closing old stale bugs. Seems resolved.