Bug 5185 - Bayesian learning uses different message checksums during exiscan_acl and later sa_learn
Summary: Bayesian learning uses different message checksums during exiscan_acl and lat...
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: 3.1.7
Hardware: Other other
: P5 blocker
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-11-13 16:32 UTC by Richard van der Hoff
Modified: 2019-08-09 12:01 UTC (History)
5 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Prelimary Patch to remove the received header from the msg_id sha1 calculation patch None Kevin A. McGrail [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Richard van der Hoff 2006-11-13 16:32:49 UTC
This is a bit of a crossover bug between spamassassin and exim.

I'm using the exiscan support to do my spam scanning at SMTP time with exim.
This works well, however it seems that message checksums used to establish
whether or not a message has already been learnt is different between scanning
at SMTP time and scanning after delivery.

The implication is that should a SMTP-time scan result in an autolearn which I
later wish to revert (or, more often, run a learning process over a batch of
messages without relearning those which have already been autolearnt), the
message is always just relearnt.

After a considerable amount of digging, it appears this is due to a difference
in the top Received header. From
http://www.exim.org/exim-html-4.63/doc/html/spec_html/ch43.html#id2689094:

       Once a message is accepted, the timestamp in the Received: header line is
changed to the time of acceptance ...

Thoughts from the spamassassin community would be appreciated on whether:
(a) Bayes.pm should ignore the timestamp on the Received: header when
calculating checksums
(b) exim shouldn't be rewriting the timestamp between the first pass and delivery
(c) what I'm doing is silly/it doesn't matter anyway.
Comment 1 Theo Van Dinter 2006-12-06 13:18:35 UTC
I vote for B.  If a Received header exists, it shouldn't be modified afterwards.
Comment 2 Daryl C. W. O'Shea 2006-12-06 14:12:47 UTC
The problem is the received header doesn't really exist... when scanned at SMTP
time the header SA sees is the one forged by the mail filter interface, not the
one actually added by the MTA.

I know nothing about how the checksum is generated... if this is an issue, does
the later addition or more Received headers, or other headers, affect the
checksum too?
Comment 3 Theo Van Dinter 2006-12-06 14:18:10 UTC
(In reply to comment #2)
> I know nothing about how the checksum is generated... if this is an issue, does
> the later addition or more Received headers, or other headers, affect the
> checksum too?

it depends.  if there is only 1 received header, then yes, the addition matters.
 basically the message id is a sha1 of the date, the first received header (the
bottom one), and a certain amount of the body.

so if there isn't a received header ...  there should be!  lots of things will
break if there is no received header.  it sounded like there was a header added,
then filtering, then the header is modified, but I could be wrong.
Comment 4 Justin Mason 2006-12-07 03:15:29 UTC
I see the problem.  When SpamAssassin is integrated with an MTA with a
milter-type layer, and allowed to reject messages as part of the SMTP
transaction before the message is filed to the delivery spool, this can arise.

here's the timeline:

1. spammer connects to SMTP, issues HELO,MAIL,RCPT,DATA.  crucially,
they do _not_ forge any additional Received headers of their own.
The message just looks like this:

    From: spammer <...>
    To: whoever <...>
    Message-Id: <...>
    Subject: make money fast

    blah blah

2. MTA passes message to spam-checking milter.  milter (note: not MTA)
synthesises a Received header with the SMTP-session metadata, as we suggest:

    Received: from helo (rdns [remoteip]) by MTA for scanning
    From: spammer <...>
    To: whoever <...>
    Message-Id: <...>
    Subject: make money fast

    blah blah

3. SpamAssassin scans message, and autolearns, using that synthetic Received
header as part of the hashed data for the bayes_seen ID.

4. SpamAssassin passes results back to milter.  milter decides to deliver
message.  Milter returns "DELIVER" (or whatever) to MTA, which takes over.

5. MTA generates a "real" delivery header:

    Received: from helo (rdns [remoteip]) by MTA with ESMTP id kAFKfrDH031427;
Wed, 15 Nov 2006 12:41:53 -0800 (PST)
    From: spammer <...>
    To: whoever <...>
    Message-Id: <...>
    Subject: make money fast

    blah blah


note the addition of the real ID string representing the delivery.

One possible fix would be to "reserve" an ID string at step (2)
and ensure the synthetic string matches the string generated by the
"real" MTA code in step (5).   However that may require major
changes to how MTAs work... most SpamAssassin integration layers
can't mandate that much change to the MTA.   Also, there'd need
to be additional code added to revert that change if the message
is rejected at step (4), instead of delivered.

in my opinion, there's a bug here we need to fix, alright.  We can't
expect MTAs to make those changes...

Comment 5 Justin Mason 2007-02-21 12:06:21 UTC
pushing out to 3.3.0, since I don't think it's a 3.2.0 blocker. shout (or change
the milestone) if you disagree....
Comment 6 Richard van der Hoff 2007-02-21 14:17:12 UTC
It's a shame, and I'd love to see this fixed, but no, I'd have to agree that it
shouldn't block a release.
Comment 7 Justin Mason 2010-01-27 02:31:36 UTC
moving some 3.3.0-targeted bugs into the vague Future.  feel free to retarget to 3.3.1 if you think you'll be able to work on them
Comment 8 Justin Mason 2010-01-27 03:16:48 UTC
reassigning, too
Comment 9 Kevin A. McGrail 2012-01-17 16:07:58 UTC
Comment 4 is extremely well written about the issue and I am intimately familiar with this type of scenario.

Here's a shorter version of the issue:

- SA generates the message checksum from the received header timestamp and mail id.

- In a milter environment using a synthesized receiver header to call SpamAssassin, this causes the same message to get two checksums because the timestamp is different and (I think) the mail id is different than what the email will get when the MTA really delivers the email.

- This annoys bayes people trying to unlearn a message because the checksums don't match.  You can also ending up with "duplicates" trying to do batch processing on the same message later.

If the milter can get access to the real message ID at the time it synthesizes the Received header, than perhaps we can create message checksums using only the day and the message ID.  

My theory is that this will produce a checksum that is distinct enough to be unique, reduce emails with two different checksums to those emails received very close to midnight that cross the day barrier between milter scanning and actual MTA delivery.

If the milters really can't get the real message ID, there may not be a good solution beyond recommending that those using milters not use sa-learn midstream.

I'll email David Skoll with Roaring Penguin and ask his input.  He's a milter expert and might know if we can get the real message id in time for the synthesized header for sendmail/postfix/exim/etc.

Regards,
KAM
Comment 10 Richard van der Hoff 2012-01-17 16:32:34 UTC
Thanks Kevin. This one still annoys me from time to time, so it would be nice to have some progress. Rethinking how checksums are generated sounds like a good idea.

By the way, comment 3 and comment 4 both suggest this will only affect messages with no Received header. I'm pretty sure that's not the case.
Comment 11 Kevin A. McGrail 2012-01-17 17:54:33 UTC
DFS made a good point and I think we need to change the checksum calculation.

3.4.0 is the right time to do this because it will invalidate current checksums so we can give people good notice of the issue.

Moving this to a blocker.

Right now, we use the last received.  Let's drop that entirely as he suggests:

Based on the comments:

  # Use sha1_hex(Date:, last received: and top N bytes of body)
  # where N is MIN(1024 bytes, 1/2 of body length)

and the code says:

  unshift(@msgid, sha1_hex($date."\000".$rcvd."\000".$body).'@sa_generated');

so instead use:

  unshift(@msgid, sha1_hex($date . "\000" . $body) . '@sa_generated');

Richard, any chance you could test with this change ASAP on your production machine if we got you a patch?

Regards,
KAM
Comment 12 Richard van der Hoff 2012-01-17 17:57:49 UTC
(In reply to comment #11)
> Richard, any chance you could test with this change ASAP on your production
> machine if we got you a patch?
> 
Absolutely. Sounds great.
Comment 13 Kevin A. McGrail 2012-01-17 20:27:04 UTC
Created attachment 5036 [details]
Prelimary Patch to remove the received header from the msg_id sha1 calculation

This is a patch against trunk but you could likely just edit Bayes.pm (for example, /usr/local/perl5.14.0/lib/site_perl/5.14.0/Mail/SpamAssassin/Plugin/Bayes.pm) manually and change just the one line:

unshift(@msgid, sha1_hex($date."\000".$body).'@sa_generated');

regards,
KAM
Comment 14 Richard van der Hoff 2012-01-18 00:19:20 UTC
Thanks Kevin. I've installed the patch, but I'm still seeing different message-ids being calculated at SMTP time and at a sbsequent sa_learn. I'm not quite sure why right now, and I'll have to leave it for now, but I'll do some more digging over the next couple of days.
Comment 15 Kevin A. McGrail 2012-01-18 15:47:59 UTC
(In reply to comment #14)
> Thanks Kevin. I've installed the patch, but I'm still seeing different
> message-ids being calculated at SMTP time and at a sbsequent sa_learn. I'm not
> quite sure why right now, and I'll have to leave it for now, but I'll do some
> more digging over the next couple of days.

That would imply content or date headers are changing, I think. Or that get_msgid isn't used in both SA programs...

Let me know what you find out.
Comment 16 Richard van der Hoff 2012-01-22 18:32:47 UTC
Ok, sorry for the delay.

So the problem (now) seems to be that, at SMTP time, the message body is LF-terminated, but when I later access it via IMAP, it's CRLF terminated.

Obviously that's not SA's fault, but certainly the easiest fix from my POV would be to make get_msgid line-terminator-agnostic...
Comment 17 Kevin A. McGrail 2012-01-23 18:04:23 UTC
(In reply to comment #16)
> Ok, sorry for the delay.
> 
> So the problem (now) seems to be that, at SMTP time, the message body is
> LF-terminated, but when I later access it via IMAP, it's CRLF terminated.
> 
> Obviously that's not SA's fault, but certainly the easiest fix from my POV
> would be to make get_msgid line-terminator-agnostic...

I think stripping CR and LF for the purposes of generating the msg id should be fine.

Thoughts on this type of change to get_msgid?

   # Make a copy since pristine_body is a reference ...
   my $body = join('', $msg->get_pristine_body());
+  #Stripping all CR and LF so that testing midstream from MTA and post delivery don't 
+  #generate different id's simply because of LF<->CR<->CRLF changes.
+  $body =~ s/[\r\n]//g;
   if (length($body) > 64) { # Small Body?

regards,
KAM
Comment 18 Richard van der Hoff 2012-02-12 19:21:18 UTC
Sorry (again) for the slow response.

(In reply to comment #17)
> I think stripping CR and LF for the purposes of generating the msg id should be
> fine.
> 
> Thoughts on this type of change to get_msgid?
> 
>    # Make a copy since pristine_body is a reference ...
>    my $body = join('', $msg->get_pristine_body());
> +  #Stripping all CR and LF so that testing midstream from MTA and post
> delivery don't 
> +  #generate different id's simply because of LF<->CR<->CRLF changes.
> +  $body =~ s/[\r\n]//g;
>    if (length($body) > 64) { # Small Body?

Well, that seems fine, and certainly solves my problem. I've applied the patch, and can confirm that I finally have consistent msgids between SMTP time and subsequent relearning :)

However, a few thoughts:
 - stripping out the LFs seems like overkill to me - just removing CRs would do the job
 - $body could be utterly huge, so removing all the CRs and LFs could be expensive - particularly since we know we'll only look at the first 1024 bytes. But given all the other work spamassassin does, perhaps this is a complete non-issue?
Comment 19 Kevin A. McGrail 2012-02-12 19:32:23 UTC
> Well, that seems fine, and certainly solves my problem. I've applied the patch,
> and can confirm that I finally have consistent msgids between SMTP time and
> subsequent relearning :)

Excellent.


> However, a few thoughts:
>  - stripping out the LFs seems like overkill to me - just removing CRs would do
> the job

Between unix and dos and mta handling, to me, it seemed safest to remove all of them and be done with it. 

>  - $body could be utterly huge, so removing all the CRs and LFs could be
> expensive - particularly since we know we'll only look at the first 1024 bytes.
> But given all the other work spamassassin does, perhaps this is a complete
> non-issue?

No, this is a good point. I'll move the cr/lf strip to AFTER the body is reduced so it can be more efficient.


svn commit -m 'Changing the way the msgid is calculated to remove last received and to also remove cr/lf from the body. bug 5185.'  
Sending        lib/Mail/SpamAssassin/Plugin/Bayes.pm
Transmitting file data .
Committed revision 1243302.
Comment 20 RW 2012-02-13 15:24:54 UTC
I'm sorry to come to this after it's committed, but removing the received header from msgid is a very bad idea. 

There needs to be some part of msgid that isn't under the control of spammers, otherwise it's trivial for them to prevent their spam ever being learned. They can generate as many spams with the same msgid as they like, and they can prime the database with an initial dummy high-scoring spam that has no usable tokens in common with the rest.

IMO the best way to handle this is to add some kind of UID to one of the headers (preferably with some control over where it goes). Other than that IMO it's better to put it back as it was.
Comment 21 Kevin A. McGrail 2012-02-13 15:36:14 UTC
(In reply to comment #20)
> I'm sorry to come to this after it's committed, but removing the received
> header from msgid is a very bad idea. 
> 
> There needs to be some part of msgid that isn't under the control of spammers,
> otherwise it's trivial for them to prevent their spam ever being learned. They
> can generate as many spams with the same msgid as they like, and they can prime
> the database with an initial dummy high-scoring spam that has no usable tokens
> in common with the rest.
> 
> IMO the best way to handle this is to add some kind of UID to one of the
> headers (preferably with some control over where it goes). Other than that IMO
> it's better to put it back as it was.

The received header is not available in the mail stream at the time many people are running SA.  Therefore, the synthesized header and the final header don't match which generates two disparate msg_id's which is the problem we are trying to solve.

And since SA is both an API and a filtering software, we can't guarantee how it is used so adding headers is not a solution.

I'm very open to other ideas but just adding the received header back isn't feasible as it will cause issues.

However, the msgid is still generated including a substantial portion of the body of the email.  How can they generate a message ID for an email with differing content as you propose?

Regards,
KAM
Comment 22 RW 2012-02-13 16:55:00 UTC
(In reply to comment #21)

> The received header is not available in the mail stream at the time many people
> are running SA.  Therefore, the synthesized header and the final header don't
> match which generates two disparate msg_id's which is the problem we are trying
> to solve.

I understand the problem I just think it's less objectionable than the solution. The problem doesn't hit everyone, you can work around it by turning off auto-learning, and I doubt it does all that much damage in the first place.

> I'm very open to other ideas but just adding the received header back isn't
> feasible as it will cause issues.

Consider hashing in relays-external to make it harder to exploit. Perhaps you might need to remove some information from the first section if it's not stable between the temporary and permanent header.  
 
> However, the msgid is still generated including a substantial portion of the
> body of the email.  How can they generate a message ID for an email with
> differing content as you propose?

Just start with 1k of untokenizable text, perhaps hidden in mime or html.

Since it's pristine_body I guess it could just be 1024 spaces.
Comment 23 Kevin A. McGrail 2012-02-13 17:09:48 UTC
> > However, the msgid is still generated including a substantial portion of the
> > body of the email.  How can they generate a message ID for an email with
> > differing content as you propose?
> 
> Just start with 1k of untokenizable text, perhaps hidden in mime or html.
> 
> Since it's pristine_body I guess it could just be 1024 spaces.

Perhaps generate the msg_id off of more content than 1024 bytes AND why not strip all whitespace? Is sha1_hex that slow that generating it for the entire message is computationally unwanted?

> Consider hashing in relays-external to make it harder to exploit. Perhaps you
> might need to remove some information from the first section if it's not stable
> between the temporary and permanent header.  

Right now, before the change, I believed we used the most recent received header as essentially nothing more than a "unique" timestamp header because it's the only header that is trustable not to have been mutated.

Since that header isn't reliably available, I feel removing it is best but perhaps as you said, it's either rcvd header w/ 1024 bytes or no rcvd header & all the body (perhaps sans whitespace).

I think if we can get a msg_id that is more unique to the message sans the transport path, it could IMPROVE bayes use.
Comment 24 Michael Parker 2012-02-13 17:11:06 UTC
At the very least it sounds like maybe we should make the behavior configurable, with the default being the previous behavior.
Comment 25 Kevin A. McGrail 2012-02-13 17:13:00 UTC
(In reply to comment #24)
> At the very least it sounds like maybe we should make the behavior
> configurable, with the default being the previous behavior.

I feel we need to aim for a solution that works for everyone as the goal before we add yet another configuration option.  And making a change in msg_id calculation is ideal for a major release like 3.4.0.
Comment 26 Richard van der Hoff 2012-02-26 21:46:18 UTC
A few thoughts on this from me:

(In reply to comment #10)
> By the way, comment 3 and comment 4 both suggest this will only affect 
> messages with no Received header. I'm pretty sure that's not the case.

This, on further inspection, is a lie. We use the earliest Received header, so the MTA's frobbing of the time on the last Received header only matters if there were no other Received headers. I probably looked at a message with no Received headers when I reported this, so I missed the CRLF vs LF issue. I still think both issues need addressing, however.

Anyway:

(In reply to comment #20)
> There needs to be some part of msgid that isn't under the control of spammers,
> otherwise it's trivial for them to prevent their spam ever being learned. They
> can generate as many spams with the same msgid as they like, and they can prime
> the database with an initial dummy high-scoring spam that has no usable tokens
> in common with the rest.

Given that the earliest Received header is most certainly under the control of the spammers, I certainly don't think we've made anything worse in this regard, and whilst what we have now might not be perfect, I think calls to put it back as it was are overstating matters.

Perhaps I'm being dense, but I don't really see how the spammers can use this to their advantage. Is preventing your spams being learnt really that useful?


(In reply to comment #25)
> I feel we need to aim for a solution that works for everyone as the goal before
> we add yet another configuration option.

Agreed. Flexibility is all well and good, but having millions of configuration options makes it really hard for people to get a piece of software working as it should.


(In reply to comment #23)
> I think if we can get a msg_id that is more unique to the message sans the
> transport path, it could IMPROVE bayes use.

Whilst that's true, I have another suggestion.  At the end of the day, we're just trying to uniquely identify a particular message on our server, right? Even if I get two copies of a spam, I can learn them as spam separately, I just want to prevent re-learning each one on subsequent folder scans etc. So how about trying to extract the local message-id from the most recent Received header, rather than all this messing about with checksums etc?
Comment 27 Kevin A. McGrail 2013-06-21 14:50:58 UTC
Leaving msg_id calculation in for 3.4.0 and moving this to 3.4.1.
Comment 28 Kevin A. McGrail 2015-04-13 21:12:15 UTC
Without complaints, the 3.4.0 get_msgid has not had any complaints so considering this resolved.
Comment 29 Ian Turner 2017-07-16 13:11:57 UTC
This is not resolved in 3.4.0. There is still a problem with the newline removal.

The problem is that we take 1/2 the body before removing newlines. Thus we could get a different portion of the body depending on the line endings.

Consider these two messages:
"ABCDEFG\r\n\r\n"
"ABCDEFG\n\n"
They are the same except for newlines but will yield a different result using the current algorithm.
"ABCDEFG\r\n\r\n" -> "ABCDE"
"ABCDEFG\n\n" -> "ABCD"
A similar problem could occur if taking the first kilobyte of message.

The issue only presents itself for certain patterns of messages and newlines, obviously.

Removing the newlines before truncating the message works, but I'm sure there's a more efficient fix available. Perhaps go through the message until reaching 1kb of non-newline characters?

Happy to report this as a new bug if you think this one is too stale.
Comment 30 Richard van der Hoff 2017-08-05 12:41:55 UTC
Yup, Ian's absolutely right. Obviously taking half of the message (or the first 1024) bytes will give you a different amount of text depending on whether you're counting CRs or not.
Comment 31 Kevin A. McGrail 2017-08-09 20:58:29 UTC
Reopening for 3.4.2
Comment 32 Kevin A. McGrail 2018-09-04 14:25:30 UTC
Pushing to 3.4.3
Comment 33 Mark Thomas 2019-02-13 18:15:45 UTC
Test comment after migration. Please ignore. Sorry for the noise.
Comment 34 Mark Thomas 2019-02-13 18:17:26 UTC
Test comment 2 after migration. Please ignore. Sorry for the noise.
Comment 35 Henrik Krohns 2019-08-09 12:01:08 UTC
Fixed in trunk.

- Added Message::get_pristine_body_digest(), Message::get_msgid(), Message::generate_msgid() functions
- Removed Plugin::Bayes::get_msgid() function
- Fix TxRep and Bayes usage of above
- generate_msgid() now uses To + Date + whole LF normalized pristine_body

Sending        trunk/UPGRADE
Sending        trunk/lib/Mail/SpamAssassin/Message.pm
Sending        trunk/lib/Mail/SpamAssassin/Plugin/Bayes.pm
Sending        trunk/lib/Mail/SpamAssassin/Plugin/TxRep.pm
Sending        trunk/t/bayesbdb.t
Sending        trunk/t/bayesdbm.t
Sending        trunk/t/bayesdbm_flock.t
Sending        trunk/t/bayessdbm.t
Sending        trunk/t/bayessdbm_seen_delete.t
Sending        trunk/t/bayessql.t
Transmitting file data ..........done
Committing transaction...
Committed revision 1864788.