Bug 8094 - Non balanced bayes ratio in db makes the accuracy plummet
Status: REOPENED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 4.0.0
Hardware: PC Linux
Importance: P4 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-12-23 11:33 UTC by Mika Ilmaranta
Modified: 2023-01-01 18:54 UTC

Description Mika Ilmaranta 2022-12-23 11:33:34 UTC
spamassassin-4.0.0-0.30.svn1903083

Does the SA Bayes implementation assume a 50-50 ham-spam ratio?

We have been seeing poor accuracy on systems where the ratio is not balanced but ranges between 97-3 and 83-17.

Is it possible to change that, or to make an alternative Bayes implementation that would also take into account the probability implied by the ham-spam ratio in the db?

Here's an example of such a system where the ratio is not balanced.
$ sa-learn --dump magic
--
0.000          0          3          0  non-token data: bayes db version
0.000          0    6184682          0  non-token data: nspam
0.000          0   29523157          0  non-token data: nham
0.000          0    2225793          0  non-token data: ntokens
--
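For reference, the ham/spam split implied by the dump above can be computed with a short script. This is only a sketch: it assumes the value of each "non-token data" line sits in the third column, as in the output shown, and the function name parse_magic is made up for illustration.

```python
# Parse the "non-token data" lines of `sa-learn --dump magic` output
# and report the ham/spam message split.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0    6184682          0  non-token data: nspam
0.000          0   29523157          0  non-token data: nham
0.000          0    2225793          0  non-token data: ntokens
"""


def parse_magic(text):
    """Return {name: value} for each 'non-token data' line (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, name = line.split("non-token data:")
        # Assumption: the third whitespace-separated column holds the value.
        values[name.strip()] = int(fields.split()[2])
    return values


magic = parse_magic(DUMP)
total = magic["nspam"] + magic["nham"]
spam_share = magic["nspam"] / total
ham_share = magic["nham"] / total
print(f"ham {ham_share:.1%} / spam {spam_share:.1%}")
```

For the numbers above this gives about 82.7% ham and 17.3% spam, i.e. the 83-17 end of the range mentioned.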
Comment 1 Bill Cole 2022-12-23 18:56:27 UTC
No documented explicit assumption exists in the code regarding the ratio of ham to spam in the Bayes training corpus. I don't believe there has been significant attention to the details of the Bayes implementation in many years, however, so it is possible that some assumption is implied in the code and no one has noticed.
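
To make the question concrete, here is a sketch of what a ratio assumption would mean at the level of a single token. It is plain textbook Bayes' rule, not SpamAssassin's code; the function name, the prior_spam parameter, and the token counts (1000 hits in each class) are all hypothetical, and only nspam and nham are taken from the report.

```python
def token_spam_prob(spam_hits, ham_hits, nspam, nham, prior_spam=0.5):
    """P(spam | token) via Bayes' rule from per-class token frequencies.

    Illustrative only: this is NOT SpamAssassin's implementation.
    """
    p_tok_given_spam = spam_hits / nspam
    p_tok_given_ham = ham_hits / nham
    num = p_tok_given_spam * prior_spam
    den = num + p_tok_given_ham * (1.0 - prior_spam)
    # Fall back to a neutral 0.5 for a token never seen in either class.
    return num / den if den else 0.5


nspam, nham = 6184682, 29523157   # message counts from the report
spam_hits = ham_hits = 1000       # hypothetical token counts

# Implicit 50-50 prior: only the per-class frequencies matter.
balanced = token_spam_prob(spam_hits, ham_hits, nspam, nham)

# Prior taken from the training ratio instead.
db_prior = nspam / (nspam + nham)
weighted = token_spam_prob(spam_hits, ham_hits, nspam, nham, prior_spam=db_prior)

print(f"50-50 prior:    {balanced:.3f}")
print(f"db-ratio prior: {weighted:.3f}")
```

With these made-up counts the 50-50 prior rates the token as fairly spammy (about 0.83), because it is relatively more frequent in the smaller spam corpus, while the db-ratio prior puts it at 0.5. Whether either behaviour is closer to what the SA code actually does would need to be checked against the source.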

I don't believe we have any data that could confirm or refute a relationship between Bayes accuracy and the ham/spam training ratio. Anecdotally, I just checked 3 systems I work with which do not have discernible Bayes errors and none of them has more than 5% spam in the training DB. 

One known source of Bayes inaccuracy is failure to expire the Bayes DB regularly. Over time, the character of spam evolves and as a result the scores of older tokens are increasingly obsolete. If your Bayes DB is dominated by tokens more than about 2 weeks old, it will not be very accurate. If you use MySQL for the Bayes DB, you may find it necessary to forcibly expire the DB, particularly on an active server. It is also possible to damage the accuracy of the Bayes DB by improper training, especially by use of the 'autolearn' feature of SA or learning user-identified ham/spam without robust oversight. 

We are always open to improved implementations of our existing tactics such as Bayesian analysis and the plugin architecture facilitates creating alternatives. I don't believe that there is anyone currently working on an alternative Bayes implementation, and the place to ask a broader audience about that would be our Developers' mailing list, which is open to the public. I would not expect anyone to take on such a task without a well-defined reproducible (or at least broadly recognized) problem. It also may be helpful to raise this issue with the broader SA community by discussing it on the SpamAssassin Users mailing list, if only to solidify whether others see the same problem. 

Because it is so hard to nail down Bayes problems as due to actual bugs in code, rather than mis-training, the standard response to chronic misfires of the BAYES_* rules is to wipe and retrain the DB with recent hand-classified ham and spam, as it is generally not possible to identify the messages one would need to forget to undo the complex damage that mislearning can cause.  

I am resolving this bug as "works for me" because it does not identify a reproducible error and we are not in a position to replace/refactor the Bayes implementation without a concrete definition of what needs fixing and what could constitute a fix.
Comment 2 Mika Ilmaranta 2022-12-27 14:20:35 UTC
Expire seems to be running regularly.

0.000          0 1672099441          0  non-token data: last expiry atime
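
A quick way to read that value, assuming it is a Unix epoch timestamp in seconds (the variable names are only for illustration, and the comparison time is this comment's own timestamp):

```python
from datetime import datetime, timezone

last_expiry_atime = 1672099441  # value from the dump line above
comment_time = datetime(2022, 12, 27, 14, 20, 35, tzinfo=timezone.utc)

# Interpret the value as seconds since the Unix epoch, in UTC.
expiry_time = datetime.fromtimestamp(last_expiry_atime, tz=timezone.utc)
print(expiry_time.isoformat())   # 2022-12-27T00:04:01+00:00
print(comment_time - expiry_time)
```

That is roughly 14 hours before this comment was written.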

We use BDB, but this was easier to find
https://github.com/apache/spamassassin/blob/trunk/sql/bayes_mysql.sql

Is expire really based on a token's access time (atime), since there is no ctime available?

Shouldn't expire throw away everything older than two weeks based on ctime, not atime?
Comment 3 Mika Ilmaranta 2023-01-01 12:49:18 UTC
So, basically you are saying that autolearn is broken?

And that expire is broken because it uses atime, a problem that a db reset takes care of?

For some of us getting enough rational ham-spam samples is impossible due to the nature of our users. With the exception of admins and honeypots, but that seems to be insufficient for bayes to make good decisions.

We are getting a 44-10 ratio of bayes false negatives, and have dropped the negative BAYES_* scores to -0.001 just to keep track of what is going on. We also altered the window for autolearn:

bayes_auto_learn_threshold_spam 7.0
bayes_auto_learn_threshold_nonspam -2.0

May take some time to see any change.
Comment 4 Kevin A. McGrail 2023-01-01 15:51:43 UTC
> So, basically you are saying that autolearn is broken?

Autolearn just slowly reinforces any bias in your content and learning. I generally recommend disabling it.

> For some of us getting enough rational ham-spam samples is impossible due to
> the nature of our users. With the exception of admins and honeypots, but
> that seems to be insufficient for bayes to make good decisions.

Agreed, unfortunately: getting sufficient corpora to properly train your Bayesian system is difficult.

What is the bug here? From what I see, this should be a discussion on the users list, not a bug.
Comment 5 Mika Ilmaranta 2023-01-01 18:54:53 UTC
We have now turned bayes_auto_learn_on_error on to see if it makes any difference.