Bug 7943 - TxRep gives nonsensical scores?
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 3.4.6
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2021-11-12 03:03 UTC by Matija Nalis
Modified: 2021-11-12 03:48 UTC

Description Matija Nalis 2021-11-12 03:03:58 UTC
TxRep seems to return nonsensical scores. I'm using a MySQL table, in case it matters (the DB files became unusable for me long ago due to heavy locking & timeouts).

I've finally taken some time to debug it. The first issue was that 3.4.6 was generating many identical MSGID tokens ("da39a3ee5e6b4b0d3255bfef95601890afd80709@sa_generated" had count>10 within a few minutes), which would then get reused by both ham and spam because "that mail was already seen".

(I've partially tracked that problem down to how the sha1 hash for "xxxxxx@sa_generated" is created in 3.4.6: TxRep was using "Mail::SpamAssassin::Plugin::Bayes->get_msgid()", which seems to be case-sensitive and only works for one capitalization of "Message-Id"; otherwise it tries to fall back to using a hash of the date/body, but...)
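
A side note on the repeated token: the hex string in it is the SHA-1 digest of empty input, which would be consistent with the fallback ending up hashing nothing at all. That reading is inferred from the digest alone; what 3.4.6 actually passes to the hash is not shown in this report. A minimal check, written in Python purely for illustration (`TOKEN_HASH` is just the value copied from above):

```python
import hashlib

# Hash part of the repeated "<hash>@sa_generated" token quoted above.
TOKEN_HASH = "da39a3ee5e6b4b0d3255bfef95601890afd80709"

# SHA-1 digest of zero bytes of input.
empty_sha1 = hashlib.sha1(b"").hexdigest()

# True when the token's hash is exactly the digest of empty input.
print(empty_sha1 == TOKEN_HASH)
```

If that is what happens, every message that hits the fallback would get the same MSGID token, which matches the count>10 seen in the table.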

Anyway, I saw that SVN trunk has changed that part of the code, so I simply disabled MSGID tokens with "txrep_track_messages 0" and truncated the txrep table, hoping that would solve the issue. It did not: it still returned strange results (spammy scores for ham, etc.).

I then tried the SVN trunk version of TxRep.pm, with no luck (it still gave wrong results, and I had to copy over the new generate_msgid() to get it running).

I then nuked the txrep table, added some debug output, and started feeding one clearly ham e-mail through "spamassassin -L -t" several times. This is how the MySQL table looked after each of the first 5 runs (I'm only focusing on the EMAILIP tag here, but the other tags show the same problem):

        +----------+---------------+------+----------+----------+----------+---------------------+
        | username | email         | ip   | msgcount | totscore | signedby | last_hit            |
        +----------+---------------+------+----------+----------+----------+---------------------+
1st     | amavis   | hepi@hep.hr   | none |        1 |   -10.21 | spf      | 2021-11-12 03:07:03 |
2nd     | amavis   | hepi@hep.hr   | none |        2 |   -10.21 | spf      | 2021-11-12 03:09:27 |
3rd     | amavis   | hepi@hep.hr   | none |        3 |   -10.21 | spf      | 2021-11-12 03:10:24 |
4th     | amavis   | hepi@hep.hr   | none |        4 |   -10.21 | spf      | 2021-11-12 03:11:17 |
5th     | amavis   | hepi@hep.hr   | none |        5 |   -10.21 | spf      | 2021-11-12 03:12:54 |

I've added the following debug statement just after this line:
 $delta = ($self->total() + $msgscore) / (1 + $self->count()) - $msgscore;

dbg("TxRep:   mn %s _formula delta = (total()=%0.3f + msgscore=%0.3f) / (1 + count()=%0.3f) - msgscore=%0.3f = %0.3f", $tag_id, $self->total(), $msgscore, $self->count(), $msgscore, $delta);


And this is what it printed for those first 5 runs:
dbg: TxRep: mn EMAILIP _formula delta = (total()=0.000 + msgscore=-10.210) / (1 + count()=0.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=1.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=2.000) - msgscore=-10.210 = 3.403
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=3.000) - msgscore=-10.210 = 5.105
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=4.000) - msgscore=-10.210 = 6.126

This looks wrong. I started with a TXREP score of 0 in SA, and after receiving 5 HAM messages from that sender, TXREP now returns a high positive (SPAM) score:
 3.1 TXREP                  TXREP: Score normalizing based on sender's reputation

The more HAM I feed it, the higher the SPAM score gets.

I'm thinking $delta is supposed to get slightly more negative with each HAM that passes through, or at least remain the same, and definitely not start classifying the email as SPAM. Is my assumption correct? Any idea how $delta calculation should actually work here?
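
The logged deltas can be reproduced outside SpamAssassin. The sketch below is not TxRep code; it only replays the formula quoted above in Python, under the assumption (taken from the table) that the stored total stays at -10.21 after the first run while the count keeps growing. `txrep_delta` and `replay_stale_total` are hypothetical names.

```python
def txrep_delta(total, count, msgscore):
    """The delta expression quoted from TxRep.pm, transcribed to Python."""
    return (total + msgscore) / (1 + count) - msgscore


def replay_stale_total(msgscore, runs):
    """Replay `runs` identical messages, assuming the stored total is only
    written on the first run while the count is incremented every time."""
    total, count, deltas = 0.0, 0, []
    for _ in range(runs):
        deltas.append(txrep_delta(total, count, msgscore))
        if count == 0:
            total = msgscore  # assumption: total is never updated again
        count += 1
    return deltas


for run, delta in enumerate(replay_stale_total(-10.21, 5), start=1):
    print(f"run {run}: delta = {delta:.3f}")
```

This gives 0.000, 0.000, 3.403, 5.105 and 6.126, the same values as the dbg lines. With the total stuck at s and count() = n, the expression reduces to s * (1 - n) / (n + 1), so for a negative s the delta keeps climbing toward -s (here +10.21) as more ham arrives.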
Comment 1 Matija Nalis 2021-11-12 03:48:07 UTC
One observation: it seems that "totscore" is not always being updated while "msgcount" is. Should it have been?
If it were updated at the same rate, that formula *would* keep delta at zero, e.g.:

dbg: TxRep: mn EMAILIP _formula delta = (total()=0.000 + msgscore=-10.210) / (1 + count()=0.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=1.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-20.420 + msgscore=-10.210) / (1 + count()=2.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-30.630 + msgscore=-10.210) / (1 + count()=3.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-40.840 + msgscore=-10.210) / (1 + count()=4.000) - msgscore=-10.210 = 0.000
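
Why the accumulated case stays at zero can be seen by rearranging the same expression: for count > 0, (total + s) / (count + 1) - s equals count / (count + 1) * (total / count - s), i.e. the gap between the stored mean and the current score, slightly shrunk. The Python sketch below (hypothetical helper names, not TxRep code) checks that rearrangement numerically and confirms that identical scores with a correctly accumulated total give a delta of zero.

```python
import math


def delta_original(total, count, msgscore):
    """Expression as quoted from TxRep.pm."""
    return (total + msgscore) / (1 + count) - msgscore


def delta_rearranged(total, count, msgscore):
    """Same quantity, written as a shrunken (stored mean - msgscore) gap."""
    if count == 0:
        return 0.0  # empty history with total 0: (0 + s) / 1 - s
    return count / (count + 1) * (total / count - msgscore)


score = -10.21
total = 0.0
deltas = []
for count in range(5):
    a = delta_original(total, count, score)
    b = delta_rearranged(total, count, score)
    assert math.isclose(a, b, abs_tol=1e-9)
    deltas.append(a)
    total += score  # total accumulated correctly over identical ham

# True: every delta is zero up to floating-point noise.
print(max(abs(d) for d in deltas) < 1e-9)
```

Seen this way, a stored total of -10.21 with a count of 2 looks like a mean of -5.105, and 2/3 * (-5.105 + 10.21) is the 3.403 from the third line of the original log.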


I've seen in the code that calling add_score() is sometimes tied to the (non-default) "txrep_autolearn 1" setting. Enabling autolearn does indeed make "totscore" change, but in a wrong way too, and "msgcount" also gets increased by 2 instead of by 1. Even with autolearn enabled, the miscalculation that turns ham into spam is still there:

+----------+---------------+------+----------+----------+----------+---------------------+
| username | email         | ip   | msgcount | totscore | signedby | last_hit            |
+----------+---------------+------+----------+----------+----------+---------------------+
| amavis   | hepi@hep.hr   | none |        2 |   -30.21 | spf      | 2021-11-12 04:41:52 |
| amavis   | hepi@hep.hr   | none |        4 | -23.4033 | spf      | 2021-11-12 04:43:22 |
| amavis   | hepi@hep.hr   | none |        6 |  -22.042 | spf      | 2021-11-12 04:43:58 |
| amavis   | hepi@hep.hr   | none |        8 | -21.4586 | spf      | 2021-11-12 04:44:30 |
| amavis   | hepi@hep.hr   | none |       10 | -21.1344 | spf      | 2021-11-12 04:44:59 |




dbg: TxRep: mn EMAILIP _formula delta = (total()=0.000 + msgscore=-10.210) / (1 + count()=0.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-30.210 + msgscore=-10.210) / (1 + count()=2.000) - msgscore=-10.210 = -3.263
dbg: TxRep: mn EMAILIP _formula delta = (total()=-23.403 + msgscore=-10.210) / (1 + count()=4.000) - msgscore=-10.210 = 3.487
dbg: TxRep: mn EMAILIP _formula delta = (total()=-22.042 + msgscore=-10.210) / (1 + count()=6.000) - msgscore=-10.210 = 5.603
dbg: TxRep: mn EMAILIP _formula delta = (total()=-21.459 + msgscore=-10.210) / (1 + count()=8.000) - msgscore=-10.210 = 6.691

 3.3 TXREP                  TXREP: Score normalizing based on sender's reputation
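
The autolearn log above is internally consistent with the same formula: feeding each (msgcount, totscore) pair from the table into the expression reproduces the logged deltas. The stored totals also happen to fit -20 - 10.21 / (2k - 1) for the k-th run. That is only a pattern noticed in these five rows, not a claim about what TxRep intends to store. A Python sketch (values copied from the table; helper names are hypothetical):

```python
def txrep_delta(total, count, msgscore):
    """The delta expression quoted from TxRep.pm, transcribed to Python."""
    return (total + msgscore) / (1 + count) - msgscore


MSGSCORE = -10.21

# (msgcount, totscore) as stored after runs 1..5 with txrep_autolearn 1.
rows_after_run = [(2, -30.21), (4, -23.4033), (6, -22.042),
                  (8, -21.4586), (10, -21.1344)]

# Each run reads the state the previous run left behind; run 1 sees an empty row.
state_before_run = [(0, 0.0)] + rows_after_run[:-1]
deltas = [txrep_delta(total, count, MSGSCORE) for count, total in state_before_run]
print([round(d, 3) for d in deltas])

# Observed pattern only: totscore after run k is close to -20 - 10.21 / (2k - 1).
fitted = [-20 + MSGSCORE / (2 * k - 1) for k in range(1, len(rows_after_run) + 1)]
for (_, totscore), fit in zip(rows_after_run, fitted):
    print(totscore, round(fit, 4))
```

If the pattern is real, the constant -20 might be a fixed amount added on the autolearn path, but that is a guess that would need checking against TxRep.pm.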