Bug 7933 - Catch really old mails
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
Importance: P2 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2021-10-05 20:07 UTC by jidanni
Modified: 2021-10-06 08:53 UTC

Attachment Type Modified Status Actions Submitter/CLA Status
Old mail not detected message/rfc822 None jidanni@jidanni.org [NoCLA]

Description jidanni 2021-10-05 20:07:27 UTC
Maybe old dates like: 
Date: Mon, 06 Jul 2020 11:09:58 -0700 (PDT)
should trigger something.

"Hopdelta" says:
Sender                                             Recipient                                          Time                   Delta
Start                                              gmail.com                                          02:09:58 2020/07/07
[127.0.1.1]                                        smtp.gmail.com                                     02:09:58 2020/07/07     0s
PDT                                                mail-wm1-x331.google.com                           02:09:58 2020/07/07     0s
mail-wm1-x331.google.com                           shenron.openstreetmap.org                          02:10:00 2020/07/07     2s
shenron.openstreetmap.org                          100.96.133.195                                     22:53:01 2021/10/05     1s  43m  20h 455d
postfix-inbound-0.inbound.mailchannels.net         pdx1-sub0-mail-mx22.g.dreamhost.com                22:53:02 2021/10/05     1s

Maybe even a 0.1 score would be good.
No, I don't know what counts as old enough: one week, one month, one year?
Maybe separate rules for each.

Also, some folks would in fact like to give it a negative score.
Well, if there were a rule for it, then they could.
Otherwise they would need to write a fancy parser...
Comment 1 Bill Cole 2021-10-05 20:26:27 UTC
(In reply to jidanni from comment #0)
> Maybe old dates like: 
> Date: Mon, 06 Jul 2020 11:09:58 -0700 (PDT)
> should trigger something.

Like the DATE_IN_PAST_* rules? 

Can you provide an example of a message that doesn't hit any of those which you think should be hit by a new rule?
Comment 2 jidanni 2021-10-05 20:50:49 UTC
Created attachment 5755 [details]
Old mail not detected

Why doesn't this trigger?

header DATE_IN_PAST_96_XX	eval:check_for_shifted_date('undef', '-96')
describe DATE_IN_PAST_96_XX	Date: is 96 hours or more before Received: date
Comment 3 Bill Cole 2021-10-05 22:01:18 UTC
(In reply to jidanni from comment #2)
> Created attachment 5755 [details]
> Old mail not detected
> 
> Why doesn't this trigger
> 
> header DATE_IN_PAST_96_XX	eval:check_for_shifted_date('undef', '-96')
> describe DATE_IN_PAST_96_XX	Date: is 96 hours or more before Received: date

Good question... 

If I'm reading the code correctly, the reason for this is that there are plausible and parseable Received headers which have times close to the Date header. If I strip out the Received headers from 2020, it triggers that rule.

The comments in the code imply that not using the smallest Date/Received difference resulted in false positives. 

Since DATE_IN_PAST_96_XX and its siblings are fairly strong rules with scores set by the RuleQA process (current scores for DATE_IN_PAST_96_XX: 2.600 2.070 1.233 3.405), I do not believe it would be polite to users to modify the behavior of the underlying eval function at this point. It currently measures the apparent delay between message composition and initial submission, not total transit time. RuleQA shows that metric correlating rather well with spamminess.

It may be useful to add a different test that looks at a more strictly specified date comparison, such as using the last Received header or the last "trusted" Received header instead of the current practice of using the smallest time delta in a parseable Received header relative to the Date header. That would require a new eval in Plugin/HeaderEval.pm. Whether a measurement of putative total transit time actually correlates either way to ham or spam is anyone's guess. In the sample case, it seems likely to me that the message is not spam, but rather some sort of re-injected mail originally sent to a discussion list.
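
To make the difference between the two measurements concrete, here is a minimal Python sketch. It is not the SpamAssassin code (that is Perl, in Plugin/HeaderEval.pm) and it ignores trust boundaries entirely. The function names and the sample headers are hypothetical; the timestamps are only loosely modeled on the hop times reported in this bug.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Illustrative message only: hosts are made up, times loosely follow the report.
SAMPLE = (
    "Received: from hop3.example by mx.example; Tue, 05 Oct 2021 22:53:02 +0800\n"
    "Received: from hop2.example by hop3.example; Tue, 07 Jul 2020 02:10:00 +0800\n"
    "Received: from hop1.example by hop2.example; Mon, 06 Jul 2020 11:09:58 -0700\n"
    "Date: Mon, 06 Jul 2020 11:09:58 -0700\n"
    "Subject: test\n"
    "\n"
    "body\n"
)


def received_times(msg):
    """Return datetimes parsed from Received: headers, topmost (latest hop) first."""
    times = []
    for value in msg.get_all("Received", []):
        # The timestamp is whatever follows the last ';' in the header.
        _, sep, stamp = value.rpartition(";")
        if not sep:
            continue
        try:
            times.append(parsedate_to_datetime(stamp.strip()))
        except (TypeError, ValueError):
            continue  # skip stamps that cannot be parsed
    return times


def date_deltas_hours(raw):
    """Return (smallest delta, latest-hop delta) in hours, or None if no usable hops."""
    msg = message_from_string(raw)
    date = parsedate_to_datetime(msg["Date"])
    deltas = [(t - date).total_seconds() / 3600 for t in received_times(msg)]
    if not deltas:
        return None
    smallest = min(deltas, key=abs)  # the "smallest time delta" style of comparison
    latest = deltas[0]               # Date: versus the most recent Received: header
    return smallest, latest


smallest, latest = date_deltas_hours(SAMPLE)
print(f"smallest delta: {smallest:.2f} h, latest-hop delta: {latest:.2f} h")
```

On this sample the smallest delta stays near zero because of the 2020 hops, while the latest-hop delta is far beyond 96 hours, which is the gap a separate eval would have to measure.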
Comment 4 jidanni 2021-10-05 22:38:00 UTC
Well fine, perhaps change
>   describe DATE_IN_PAST_96_XX  Date: is 96 hours or more before Received: date
<   describe DATE_IN_PAST_96_XX  Date: is 96 hours or more before EARLIEST Received: date

Anyway maybe there should be a
<   describe DATE_IN_PAST_96_XX2 Date: is 96 hours or more before LATEST Received: date
to really catch them all, even if they aren't spam. Perhaps score 0.1
for now.
Comment 5 Loren Wilton 2021-10-06 08:53:04 UTC
I would agree that having one or more checks against the latest received date would be handy. 

I've also seen a few cases where even the latest received date is bogus (I'm not an ISP), so an ability to check against the SA system date would be nice. I trust my own system date.
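
A rough Python sketch of a check against the local clock, with separate buckets for the week/month/year question from the description. Nothing here exists in SpamAssassin: the function name, the bucket names, and the thresholds are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical thresholds; the thread does not settle on what counts as "old".
THRESHOLDS = [
    ("year", timedelta(days=365)),
    ("month", timedelta(days=30)),
    ("week", timedelta(days=7)),
]


def age_bucket(date_header, now=None):
    """Return the largest threshold the Date: header exceeds, or None if recent."""
    if now is None:
        now = datetime.now(timezone.utc)  # the scanning system's own clock
    sent = parsedate_to_datetime(date_header)
    if sent.tzinfo is None:
        # Assumption: treat a Date: with no usable zone as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    age = now - sent
    for name, limit in THRESHOLDS:
        if age >= limit:
            return name
    return None


# Scan time fixed here for a repeatable example; in practice leave `now` unset.
scan_time = datetime(2021, 10, 5, 14, 53, 2, tzinfo=timezone.utc)
print(age_bucket("Mon, 06 Jul 2020 11:09:58 -0700", now=scan_time))
```

Because the comparison uses only the Date: header and the local clock, a bogus Received: header cannot hide the age, though a forged Date: still can.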