Bug 7965 - SQL storage backend miscalculates mean score for AWL (and others?)
Summary: SQL storage backend miscalculates mean score for AWL (and others?)
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner (show other bugs)
Version: 3.4.6
Hardware: PC Linux
: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-03-27 23:16 UTC by Matija Nalis
Modified: 2022-04-11 06:14 UTC (History)
3 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Diff for SQLBasedAddrList patch None Paul Stead [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Matija Nalis 2022-03-27 23:16:15 UTC
When using AWL with default BerkleyDB storage backend, feeding the same mail over and over again does not change its score from the mean, which is correct behaviour.

However, when I use SQL to store the results instead,this behaviour changes in incorrect way, causing the score to be further and further reduced on each new pass. 

E.g. when nuking AWL storage and starting from scratch, the following mail has score 16.641, and I can feed it many times and the score remains the same with default BerkleyDB storage:

> Mar 27 19:17:57.039 [4227] dbg: auto-whitelist: db-based adminoffice@unist.hr|ip=170.246 scores 0/0
> Mar 27 19:17:57.039 [4227] dbg: auto-whitelist: db-based adminoffice@unist.hr|ip=none scores 0/0
> Mar 27 19:17:57.040 [4227] dbg: auto-whitelist: AWL active, pre-score: 16.641, autolearn score: 16.641, mean: undef, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.c
> Mar 27 19:17:57.040 [4227] dbg: auto-whitelist: add_score: new count: 1, new totscore: 16.641
> Mar 27 19:17:57.043 [4227] dbg: auto-whitelist: post auto-whitelist score: 16.641
> 
> Mar 27 19:18:07.542 [4285] dbg: auto-whitelist: db-based adminoffice@unist.hr|ip=170.246 scores 1/16.641
> Mar 27 19:18:07.542 [4285] dbg: auto-whitelist: AWL active, pre-score: 16.641, autolearn score: 16.641, mean: 16.641, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.
> Mar 27 19:18:07.543 [4285] dbg: auto-whitelist: add_score: new count: 2, new totscore: 33.282
> Mar 27 19:18:07.546 [4285] dbg: auto-whitelist: post auto-whitelist score: 16.641
> 
> Mar 27 19:18:18.406 [4342] dbg: auto-whitelist: db-based adminoffice@unist.hr|ip=170.246 scores 2/33.282
> Mar 27 19:18:18.406 [4342] dbg: auto-whitelist: AWL active, pre-score: 16.641, autolearn score: 16.641, mean: 16.641, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.
> Mar 27 19:18:18.407 [4342] dbg: auto-whitelist: add_score: new count: 3, new totscore: 49.923
> Mar 27 19:18:18.410 [4342] dbg: auto-whitelist: post auto-whitelist score: 16.641

However when I nuke AWL storage and starting from scratch with SQL backend (by using "auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList"), the following mail which has score 16.641 initially, changes (reduces) the score on each subseqent invocation, e.g.:

> Mar 27 19:12:47.675 [2717] dbg: auto-whitelist: sql-based adminoffice@unist.hr|hotelfrontera.cl scores 0, msgcount 0
> Mar 27 19:12:47.676 [2717] dbg: auto-whitelist: sql-based adminoffice@unist.hr|none scores 0, msgcount 0
> Mar 27 19:12:47.676 [2717] dbg: auto-whitelist: AWL active, pre-score: 16.641, autolearn score: 16.641, mean: undef, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.c
> Mar 27 19:12:47.678 [2717] dbg: auto-whitelist: sql-based add_score/insert score 16.641: amavis|adminoffice@unist.hr|170.246|1|16.641|hotelfrontera.cl
> Mar 27 19:12:47.679 [2717] dbg: auto-whitelist: post auto-whitelist score: 16.641
> 
> Mar 27 19:13:37.620 [2815] dbg: auto-whitelist: sql-based adminoffice@unist.hr|hotelfrontera.cl scores 16.641, msgcount 1
> Mar 27 19:13:37.620 [2815] dbg: auto-whitelist: AWL active, pre-score: 16.641, autolearn score: 16.641, mean: 16.641, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.
> Mar 27 19:13:37.623 [2815] dbg: auto-whitelist: sql-based add_score/insert score 16.641: amavis|adminoffice@unist.hr|170.246|1|16.641|hotelfrontera.cl
> Mar 27 19:13:37.624 [2815] dbg: auto-whitelist: post auto-whitelist score: 16.641
> 
> Mar 27 19:13:48.304 [2873] dbg: auto-whitelist: sql-based adminoffice@unist.hr|hotelfrontera.cl scores 16.641, msgcount 2
> Mar 27 19:13:48.304 [2873] dbg: auto-whitelist: AWL active, pre-score: 16.641, autolearn score: 16.641, mean: 8.320, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.c
> Mar 27 19:13:48.308 [2873] dbg: auto-whitelist: sql-based add_score/insert score 16.641: amavis|adminoffice@unist.hr|170.246|1|16.641|hotelfrontera.cl
> Mar 27 19:13:48.309 [2873] dbg: auto-whitelist: post auto-whitelist score: 12.481
> 
> Mar 27 19:13:58.940 [2933] dbg: auto-whitelist: sql-based adminoffice@unist.hr|hotelfrontera.cl scores 16.641, msgcount 3
> Mar 27 19:13:58.940 [2933] dbg: auto-whitelist: AWL active, pre-score: 16.641, autolearn score: 16.641, mean: 5.547, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.c
> Mar 27 19:13:58.942 [2933] dbg: auto-whitelist: sql-based add_score/insert score 16.641: amavis|adminoffice@unist.hr|170.246|1|16.641|hotelfrontera.cl
> Mar 27 19:13:58.943 [2933] dbg: auto-whitelist: post auto-whitelist score: 11.094
> 
> Mar 27 19:14:09.724 [3031] dbg: auto-whitelist: sql-based adminoffice@unist.hr|hotelfrontera.cl scores 16.641, msgcount 4
> Mar 27 19:14:09.724 [3031] dbg: auto-whitelist: AWL active, pre-score: 16.641, autolearn score: 16.641, mean: 4.160, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.c
> Mar 27 19:14:09.727 [3031] dbg: auto-whitelist: sql-based add_score/insert score 16.641: amavis|adminoffice@unist.hr|170.246|1|16.641|hotelfrontera.cl
> Mar 27 19:14:09.729 [3031] dbg: auto-whitelist: post auto-whitelist score: 10.401

This should not be happening; the score should remain the same regardless if ones uses SQL od BDB storage.

I'm using spamassassin package 3.4.6-1 from Debian Bullseye.

Note: I highly suspect this also affects other learners using SQL, for example TxRep (https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943), and maybe bayes too?
Comment 1 Matija Nalis 2022-03-27 23:39:24 UTC
Ops, that was with current trunk (1899268) version of SQLBasedAddrList.pm, sorry.

The 3.4.6 version of SQLBasedAddrList.pm returns score correctly (albeit clean 3.4.6 does complain a lot about "SQL error: Duplicate entry 'amavis-adminoffice@unist.hr-hotelfrontera.cl-170.246' for key 'PRIMARY'"):

> Mar 28 01:35:05.570 [10739] dbg: auto-whitelist: sql-based adminoffice@unist.hr|hotelfrontera.cl scores 0, msgcount 0
> Mar 28 01:35:05.571 [10739] dbg: auto-whitelist: sql-based adminoffice@unist.hr|none scores 0, msgcount 0
> Mar 28 01:35:05.571 [10739] dbg: auto-whitelist: AWL active, pre-score: 12.692, autolearn score: 12.692, mean: undef, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera.
> Mar 28 01:35:05.573 [10739] dbg: auto-whitelist: sql-based add_score/insert score 12.692: amavis|adminoffice@unist.hr|170.246|1|12.692|hotelfrontera.cl
> Mar 28 01:35:05.573 [10739] dbg: auto-whitelist: post auto-whitelist score: 12.692
> 
> Mar 28 01:35:15.181 [10961] dbg: auto-whitelist: sql-based adminoffice@unist.hr|hotelfrontera.cl scores 12.692, msgcount 1
> Mar 28 01:35:15.181 [10961] dbg: auto-whitelist: AWL active, pre-score: 12.692, autolearn score: 12.692, mean: 12.692, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera
> Mar 28 01:35:15.182 [10961] dbg: auto-whitelist: sql-based add_score/insert amavis|adminoffice@unist.hr|170.246|1|12.692|hotelfrontera.cl: SQL error: Duplicate entry 'amavis-adminoffice@unist
> Mar 28 01:35:15.184 [10961] dbg: auto-whitelist: sql-based add_score/update new msgcount: 2, new totscore: 25.384 for 2|12.692|amavis|adminoffice@unist.hr|hotelfrontera.cl|170.246
> Mar 28 01:35:15.184 [10961] dbg: auto-whitelist: post auto-whitelist score: 12.692
> 
> Mar 28 01:35:24.964 [11234] dbg: auto-whitelist: sql-based adminoffice@unist.hr|hotelfrontera.cl scores 25.384, msgcount 2
> Mar 28 01:35:24.964 [11234] dbg: auto-whitelist: AWL active, pre-score: 12.692, autolearn score: 12.692, mean: 12.692, IP: 170.246.172.8, address: adminoffice@unist.hr signed by hotelfrontera
> Mar 28 01:35:24.966 [11234] dbg: auto-whitelist: sql-based add_score/insert amavis|adminoffice@unist.hr|170.246|1|12.692|hotelfrontera.cl: SQL error: Duplicate entry 'amavis-adminoffice@unist
> Mar 28 01:35:24.968 [11234] dbg: auto-whitelist: sql-based add_score/update new msgcount: 3, new totscore: 38.076 for 3|12.692|amavis|adminoffice@unist.hr|hotelfrontera.cl|170.246
> Mar 28 01:35:24.968 [11234] dbg: auto-whitelist: post auto-whitelist score: 12.692
Comment 2 Paul Stead 2022-04-08 12:53:20 UTC
Created attachment 5766 [details]
Diff for SQLBasedAddrList

Looks as though commit 1872280 didn't take into account that totscore needs to be added to the score, not the score set to the current message score. The attached diff should now account for that.

This is as was initially intended as can be seen by the update on line 345 of SQLBasedAddrList.pm

Can you test the attached diff and see if it works as intended? My testing seems to now keep a proper score in AWL
Comment 3 Henrik Krohns 2022-04-11 06:14:37 UTC
Good catch, it fixes it without a doubt, so committing before it's forgotten. Also improved tests.

Sending        trunk/lib/Mail/SpamAssassin/SQLBasedAddrList.pm
Sending        trunk/t/sql_based_whitelist.t
Transmitting file data ..done
Committing transaction...
Committed revision 1899727.