Bug 4787 - BAYES_99 hits on all mail
Summary: BAYES_99 hits on all mail
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-02-09 17:28 UTC by Dallas Engelken
Modified: 2019-06-24 16:31 UTC (History)
1 user



Attachments:
  proof of concept (patch) by Dallas Engelken [HasCLA]
  small fix (patch) by Dallas Engelken [HasCLA]

Description Dallas Engelken 2006-02-09 17:28:45 UTC
Anyone ever seen anything like this?  I'm wondering if there is a compelling
reason to make SA stop learning spam tokens if the ham:spam token ratio exceeds
a certain level.

I guess I could raise bayes_auto_learn_threshold_spam above 12 to limit the
amount of spam tokens learned... but it would be cool to have the learner shut
off spam token learning if the ratio is out of whack, or vice versa with ham
token learning if it outweighs the spam token count.

FWIW, this is the first time I've ever seen this happen since moving to SQL
bayes.

[root@email spamassassin]# grep result: spamd.log | wc -l
12914
[root@email spamassassin]# grep result: spamd.log | grep BAYES_99 | wc -l
12909

Sending a ham sample through hits like this...

X-Spam-Bayes-Tc-Spammy: 100
X-Spam-Status: No, hits=4.5 required=5.0
X-Spam-Bayes-Spammy-Tokens: 1.000-+--H*RU:rdns, 1.000-+--H*RU:helo,
        1.000-+--H*RU:ident, 1.000-+--H*RU:intl, 1.000-+--H*RU:envfrom,
        1.000-+--H*RU:auth, 1.000-+--HTo:D*net, 1.000-+--H*Ad:D*net,
        1.000-+--H*F:D*com, 1.000-+--here
X-Spam-Bayes-Tc-Hammy:
X-Spam-Score: 4.5
X-Spam-Level: ****
X-Spam-Bayes-Hammy-Tokens:
X-Spam-Bayes: 1.0000
X-Spam-Bayes-Tc-Learned: 101
X-Spam-Bayes-Summary: Tokens: new, 53; hammy, 0; neutral, 1; spammy, 100.
X-Spam-Bayes-Tc: 154
X-Spam-Report: 4.5 points, 5.0 required
        *  1.0 NO_REAL_NAME From: does not include a real name
        *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
        *      [score: 1.0000]



mysql> select * from bayes_vars;
+----+----------+------------+-----------+-------------+-------------+------------------+--------------------+------------------+------------------+
| id | username | spam_count | ham_count | token_count | last_expire | last_atime_delta | last_expire_reduce | oldest_token_age | newest_token_age |
+----+----------+------------+-----------+-------------+-------------+------------------+--------------------+------------------+------------------+
|  1 | $GLOBAL  |     523306 |     67272 |     4680055 |  1139497034 |             7200 |              51609 |       1139314224 |       1139501076 |
+----+----------+------------+-----------+-------------+-------------+------------------+--------------------+------------------+------------------+
1 row in set (0.00 sec)
Comment 1 Theo Van Dinter 2006-02-09 23:13:30 UTC
This looks similar to something I brought up a while ago: having only a small number of tokens hit, and/or
having only a small number of tokens in either the head or the body independently, gives bad results from
Bayes.
Comment 2 Dallas Engelken 2006-03-15 15:58:30 UTC
Just had this happen again on another box running 0.1 and 12.0 autolearn thresholds.

I believe the problem is that as ham tokens get expired and spam tokens are the only
thing left in bayes_token, BAYES_99 starts hitting on all email.   The minimum of 200
spam and 200 ham messages required is not a good thing to go off of (except on a
brand new install).

I don't know how spam_count and ham_count in the bayes_vars table can accurately
represent your token distribution...  so having a min spam and min ham token
count would be optimal.   As you can see, bayes_vars says it's learned 93k ham,
but out of that, there are currently only 137 ham tokens in the table.

mysql> select spam_count,ham_count,token_count from bayes_vars;
+------------+-----------+-------------+
| spam_count | ham_count | token_count |
+------------+-----------+-------------+
|    2463944 |     93579 |     3620764 |
+------------+-----------+-------------+

mysql> select sum(spam_count), sum(ham_count) from bayes_token;
+-----------------+----------------+
| sum(spam_count) | sum(ham_count) |
+-----------------+----------------+
|        34846107 |            137 |
+-----------------+----------------+
1 row in set (0.00 sec)

I propose the following addition to the bayes config options:

bayes_min_ham_tokens   <num>
bayes_min_spam_tokens  <num>

and maybe something extra on top of that like ham to spam token ratio.

bayes_ham_spam_token_ratio  0.5  # require 1 ham for every 2 spam tokens

Now, I realize that it's more expensive to query sum(spam_count) and
sum(ham_count) on the bayes_token table, so I think the bayes_vars table could add a
couple of columns after `token_count`, say `spam_token_count` and `ham_token_count`,
and auto-expiry/sa-learn could update those fields when they run.  That way the
query to pull those counts remains efficient.  Also, if those 2 fields are
available, calculating the ham:spam ratio is cake.

If this sounds like an acceptable solution, I may work on it... unless anyone
has reasons why it's not better to do it this way?
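To make the proposal concrete, here is a minimal sketch (in Python, for illustration only; the actual implementation would live in the Perl BayesStore code) of the gating the new options would add. The function name, threshold defaults, and parameter names are assumptions, not from the patch:

```python
def autolearn_allowed(is_spam, ham_tokens, spam_tokens,
                      min_ham_tokens=1000, min_spam_tokens=1000,
                      min_ratio=0.5, max_ratio=2.0):
    """Decide whether auto-learning a message should proceed.

    Mirrors the proposal: require a minimum number of ham and spam
    tokens, and keep the ham:spam token ratio inside [min_ratio,
    max_ratio] by refusing to learn the class that is already dominant.
    """
    # Without enough surviving tokens of both classes, Bayes results
    # are unreliable, so don't learn at all.
    if ham_tokens < min_ham_tokens or spam_tokens < min_spam_tokens:
        return False
    ratio = ham_tokens / spam_tokens
    if is_spam:
        # Learning more spam pushes the ratio down; refuse if it is
        # already at or below the floor.
        return ratio >= min_ratio
    # Learning more ham pushes the ratio up; refuse if it is already
    # at or above the ceiling.
    return ratio <= max_ratio
```

With the defaults above, a database holding 1,500 ham tokens and 2,000 spam tokens (ratio 0.75:1) would still accept spam learns, while one at 0.4:1 would skip them until ham catches up.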
Comment 3 Justin Mason 2006-04-04 13:15:21 UTC
+1 it sounds sane to me.
Comment 4 Dallas Engelken 2006-06-20 21:31:52 UTC
I'll have a POF ready soon on this...  just finishing testing.   In the last 4
hours, my ham:spam token ratio has dropped by a quarter point (trying to get
under 2:1), so it's working as expected :)

2006-06-20 12:06:54.668510500  [18515] dbg: bayes: ham:spam token ratio 
(2.84:1), min ratio (0.5:1), max ratio (2:1) 

2006-06-20 16:28:01.666901500  [10881] dbg: bayes: ham:spam token ratio 
(2.59:1), min ratio (0.5:1), max ratio (2:1)
Comment 5 Dallas Engelken 2006-06-20 21:33:02 UTC
Excuse the typo.  That's POC, not POF :)
Comment 6 Dallas Engelken 2006-07-05 15:42:00 UTC
Created attachment 3567 [details]
proof of concept

This patch implements token tracking to prevent issues where lots of ham/spam
has been learned but all the tokens for one class have been expired, causing
Bayes to lean too far one way, i.e. BAYES_00 or BAYES_99 on all mail.

It also implements ham:spam ratio restrictions, which will prevent the
autolearner from learning too much ham when the ratio is high, and too much
spam when the ratio is low.

The proof of concept code only applies to BayesStore/SQL.pm, so in order to
test it, you'd need to be using

bayes_store_module Mail::SpamAssassin::BayesStore::SQL

Since the box I'm testing on learns a lot of spam and little ham, the
token ratio is always on the bottom end of the min ratio.

[12005] dbg: bayes: ham:spam token ratio (0.74:1), min ratio (0.75:1), max
ratio (1.25:1)
[12005] dbg: bayes: skip autolearn of spam because ham:spam token ratio (0.74)
is less than min ratio (0.75)

As you can see from the autolearn results, it's skipped a bunch of spam learns
today...

# grep -c autolearn=ham spamd.log
652
# grep -c autolearn=spam spamd.log
859
# grep -c autolearn=unavailable spamd.log
5141

But that's because I've set my min/max ratios so close, at 0.75-1.25.  If you
want to learn a lot more spam, you could simply use 0.5-2.0, which is the
default... or you could even lower that 0.5 to something like 0.25 if you want
to learn up to 4x more spam than ham.

Realize that this code is not drop-in ready, as it requires a couple of SQL
alters to track spam/ham token counts.

ALTER TABLE bayes_vars ADD spam_token_count int(11) NOT NULL default '0' AFTER
token_count;
ALTER TABLE bayes_vars ADD ham_token_count int(11) NOT NULL default '0' AFTER
spam_token_count;
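To illustrate why the two extra columns matter, here is a small standalone sketch using Python's built-in SQLite (table and column names follow the patch; types and data are simplified, and this is not the actual SpamAssassin schema code). The learner keeps the denormalized totals in bayes_vars in step with bayes_token, so the ratio check is a single-row read instead of a full-table sum():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified versions of the two tables touched by the patch.
cur.execute("""CREATE TABLE bayes_token (
                   spam_count INTEGER NOT NULL DEFAULT 0,
                   ham_count  INTEGER NOT NULL DEFAULT 0)""")
cur.execute("""CREATE TABLE bayes_vars (
                   id INTEGER PRIMARY KEY,
                   token_count INTEGER NOT NULL DEFAULT 0,
                   spam_token_count INTEGER NOT NULL DEFAULT 0,
                   ham_token_count  INTEGER NOT NULL DEFAULT 0)""")
cur.execute("INSERT INTO bayes_vars (id) VALUES (1)")

# The learner inserts per-token counts (3 tokens: 15 spam, 5 ham total)...
cur.executemany("INSERT INTO bayes_token VALUES (?, ?)",
                [(10, 0), (5, 2), (0, 3)])
# ...and updates the running totals in the same transaction, so later
# reads never need to scan bayes_token.
cur.execute("""UPDATE bayes_vars
               SET token_count = token_count + 3,
                   spam_token_count = spam_token_count + 15,
                   ham_token_count  = ham_token_count  + 5
               WHERE id = 1""")

# Cheap single-row lookup for the ham:spam ratio check.
spam_toks, ham_toks = cur.execute(
    """SELECT spam_token_count, ham_token_count
       FROM bayes_vars WHERE id = 1""").fetchone()
ratio = ham_toks / spam_toks
```

The expensive alternative, `SELECT sum(spam_count), sum(ham_count) FROM bayes_token`, walks every token row; with the millions of tokens shown earlier in this bug, doing that on each message would be prohibitive.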
Comment 7 Justin Mason 2006-07-05 16:29:34 UTC
I like it overall, I think -- esp if it fixes the problem! -- but:

1. bayes_min_ham_tokens/bayes_min_spam_tokens -- I'd prefer to leave those out
unless they're needed;

2. it obviously needs to be implemented for the other BayesStore backends;

3. Michael, what's your take on the changes to the BayesStore APIs?  is that
safe?  (it's not a public plugin API, but still).  

in particular, changing the number of returned items for nspam_nham_get() is a
big change; adding an additional (and separate) API would be better.

4. I think there's probably some SQL doc that would also need changing.
Comment 8 Tom Schulz 2006-07-05 17:56:33 UTC
Would it make sense to have the expire code also pay attention to the ratio
and extend the allowed age of the smaller group of tokens if the ratio is
getting out of whack?  You would want to limit how much the allowed age of a
token could be extended (perhaps double the normal value).
Comment 9 Dallas Engelken 2006-07-05 18:20:14 UTC
(In reply to comment #8)
> Would it make sense to have the expire code also pay attention to the ratio
> and extend the allowed age of the smaller group of tokens if the ratio is
> getting out of whack?  You would want to limit how much the allowed age of a
> token could be extended (perhaps double the normal value).

I had it calculating the ratio in token_expiration() and skipping expiry if the
ratio was out of whack, but I ended up pulling it because doing so seemed to
help get the ratio back on track faster.  YMMV.
Comment 10 Dallas Engelken 2006-07-05 18:25:39 UTC
Created attachment 3568 [details]
small fix

Somehow I got an 'i' stuck in there on line 572 that shouldn't have been there.
Comment 11 Dallas Engelken 2006-07-05 18:37:45 UTC
(In reply to comment #7)
> I like it overall, I think -- esp if it fixes the problem! -- but:
> 1. bayes_min_ham_tokens/bayes_min_spam_tokens -- I'd prefer to leave those out
> unless they're needed;

bayes_min_ham_tokens/bayes_min_spam_tokens is what fixes this bug to begin
with.  The token ratio check was just something extra to help the learner learn
more of what we need and less of what we don't.

Having bayes_min_(ham|spam)_tokens ensures we don't learn when we don't have
enough token data.  bayes_min_(ham|spam)_num does not assure us of this, as
expiry could knock off a lot of the token data, and the (ham|spam)_count in
bayes_vars does not account for that.    I've seen 200+ ham learned where the
actual ham token count in bayes_token is very small.

If you don't do this, and opt just for the token ratio path, then you'd need to
have ratio logic in is_scan_available() to skip Bayes when the ratio is whacked
out...  because right now, the ratio logic is only being applied to learn(), in
hopes it will help equalize it.
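The ratio-only alternative mentioned in the last paragraph, gating the scan itself rather than the learner, might look roughly like this (Python sketch for illustration; the function name and bounds are assumptions, not code from the patch):

```python
def scan_ratio_ok(ham_tokens, spam_tokens, min_ratio=0.5, max_ratio=2.0):
    """Skip Bayes scoring entirely when the surviving token data is so
    lopsided that BAYES_00/BAYES_99 would fire on nearly every message."""
    # An empty class means no meaningful ratio: refuse to scan.
    if ham_tokens == 0 or spam_tokens == 0:
        return False
    ratio = ham_tokens / spam_tokens
    return min_ratio <= ratio <= max_ratio
```

With the counts from comment 2 (137 ham tokens vs ~35 million spam token hits), this guard would have disabled Bayes scoring instead of letting BAYES_99 fire on everything.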
Comment 12 Henrik Krohns 2019-06-24 16:31:51 UTC
Closing old stale bugs.  Seems resolved.