Bug 6152 - Suggested improvement to INVALID_DATE_TZ_ABSURD rule
Summary: Suggested improvement to INVALID_DATE_TZ_ABSURD rule
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other All
: P2 enhancement
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-10 05:02 UTC by Steve Freegard
Modified: 2009-07-17 02:06 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description Steve Freegard 2009-07-10 05:02:22 UTC
In 20_head_tests.cf - currently:

header INVALID_DATE_TZ_ABSURD	Date =~ /[-+](?:1[4-9]\d\d|[2-9]\d\d\d)$/

This allows +1300, -0510 or -0503 which are all still absurd offsets.

Suggested replacement rule:

header INVALID_DATE_TZ_ABSURD   Date =~ /[+-](?:[2-9]\d\d\d|1[3-9]\d\d|\d\d[1-2|4-9]\d|\d\d\d[1-9])/

This only allows -1200 to +1200 but allows for half-hour offsets like +1030.

Cheers,
Steve.
Comment 1 Matt Kettler 2009-07-10 05:15:08 UTC
What about Kathmandu/Nepal time, where the time zone is officially UTC +5:45?

Chatham Islands also has an odd offset (UTC +12:45)
Comment 2 Steve Freegard 2009-07-10 05:32:47 UTC
(In reply to comment #1)
> What about Kathmandu/Nepal time, where the time zone is officially UTC +5:45?
> 
> Chatham Islands also has an odd offset (UTC +12:45)

Ok - revised rule to handle nn45 and nn15; just in case....

header INVALID_DATE_TZ_ABSURD   Date =~
/[+-](?:[2-9]\d\d\d|1[3-9]\d\d|\d\d[2|5-9]\d|\d\d[1|4][0-4|6-9]|\d\d0[1-9])/
Comment 3 Justin Mason 2009-07-10 05:45:25 UTC
checked in for testing:

: 3...; svn commit -m "bug 6152: updated" rulesrc/sandbox/jm/20_bug6152.cf 
Sending        rulesrc/sandbox/jm/20_bug6152.cf
Transmitting file data .
Committed revision 792929.
Comment 4 Karsten Bräckelmann 2009-07-10 10:11:27 UTC
(In reply to comment #0)
> This allows +1300, -0510 or -0503 which are all still absurd offsets.

UTC +1300 is an entirely valid offset in New Zealand and far east parts of Russia during daylight saving time. Also valid for Kiribati and Tonga all year round. One of a few references I just checked is [1].

Even more strange, UTC +1400 apparently is a valid offset too, in Kiribati. *shrug*  Let's hope they never will adopt daylight saving time... ;)

[1] http://en.wikipedia.org/wiki/UTC+13
Comment 5 Karsten Bräckelmann 2009-07-10 10:25:55 UTC
(In reply to comment #2)
> \d\d[2|5-9]\d

In a char class, a pipe is not an alternation but a pipe char...

> \d\d0[1-9]

And here we're missing thirty-odd offsets other than xx30. The 0 should be [03] instead.
Comment 6 Karsten Bräckelmann 2009-07-10 10:30:16 UTC
Also we lost the $ anchoring at the end of the header.  Sorry for the noise. :)
Comment 7 Karsten Bräckelmann 2009-07-10 11:10:02 UTC
OK, more constructive comments. A better version should be this RE:

  /[-+](?:[2-9]\d\d\d|1[5-9]\d\d|\d\d(?:[25-9]\d|[03][^0]|[14][^5]))$/

Unless negative look-ahead assertions do have a significant performance impact, we could even do it the other way round and actually define what we consider to be a sane offset. Like this.

  /[-+](?!(?:0\d|1[0-4])(?:[03]0|[14]5))\d{4}$/

This looks out for a four digit offset, that does not match the sane offsets defined in the leading (?! ) part. Probably better comprehensible, apart from the reversed logic. ;)

This one is my proposal.

Being slightly more anal, cutting off at +1400, not allowing 14-odd fractions either, would just bloat the RE and isn't worth it IMHO. Same for differentiating further between positive and negative possible offsets.
Comment 8 Steve Freegard 2009-07-10 15:17:23 UTC
(In reply to comment #7)
> Unless negative look-ahead assertions do have a significant performance impact,
> we could even do it the other way round and actually define what we consider to
> be a sane offset. Like this.
> 
>   /[-+](?!(?:0\d|1[0-4])(?:[03]0|[14]5))\d{4}$/
> 
> This looks out for a four digit offset, that does not match the sane offsets
> defined in the leading (?! ) part. Probably better comprehensible, apart from
> the reversed logic. ;)
> 
> This one is my proposal.
>

Thanks for all the feedback.  The version above is much easier to read.  My only comment would be to remove the $ anchor as the offset isn't always at the end of the date header - consider these examples that I just pulled out from my spamtrap mailbox:

Date: Tue, 12 May 2009 07:30:19 -0700 (PDT)
Date: Thu, 14 May 2009 18:48:45 +0000 (GMT+00:00)

Although I guess you could handle these cases by add changing the end of the regexp:

\d{4}(?:\s\(\S+\))?$/

> Being slightly more anal, cutting off at +1400, not allowing 14-odd fractions
> either, would just bloat the RE and isn't worth it IMHO. Same for
> differentiating further between positive and negative possible offsets.

Yeah; definitely agree - your proposed regexp is good enough and considerably better than the current rule without adding unnecessary bloat.
Comment 9 Justin Mason 2009-07-16 14:56:53 UTC

looks good:

MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME WHO/AGE

    0    0.2133   0.0207   0.912    0.75    0.01  T_INVALID_DATE_TZ_ABSURD_BUG6152  
    0    0.0001   0.0207   0.005    0.47    0.20  INVALID_DATE_TZ_ABSURD  

(that's the header INVALID_DATE_TZ_ABSURD   Date =~ /[-+](?!(?:0\d|1[0-4])(?:[03]0|[14]5))\d{4}$/ variant).

svn commit -m "bug 6152: update INVALID_DATE_TZ_ABSURD, thanks to Steve Freegard" rulesrc/sandbox/jm/20_bug6152.cf rules/20_head_tests.cf
Sending        rules/20_head_tests.cf
Deleting       rulesrc/sandbox/jm/20_bug6152.cf
Transmitting file data .
Committed revision 794884.
Comment 10 Steve Freegard 2009-07-16 17:11:48 UTC
> MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME WHO/AGE 
>     0    0.2133   0.0207   0.912    0.75    0.01 
> T_INVALID_DATE_TZ_ABSURD_BUG6152  

> (that's the header INVALID_DATE_TZ_ABSURD   Date =~
> /[-+](?!(?:0\d|1[0-4])(?:[03]0|[14]5))\d{4}$/ variant).

Might also be worth trying another test rule without the ending $ - as I've seen a few trap hits lately ending with "+0710 (PST)".

Also; are these messages in your corpus definitely ham:

/local/cor/recent/ham/priv.wall.200812172200.mbox.511930
/local/cor/recent/ham/priv.wall.200901300200.mbox.1848790
/local/cor/recent/ham/priv.wall.200902250200.mbox.8213512

If you can post the date headers from these (if they are ham) then we might be able to workaround them depending on how broken they are.

Cheers,
Steve.
Comment 11 Karsten Bräckelmann 2009-07-16 17:34:30 UTC
(In reply to comment #9)
> (that's the header INVALID_DATE_TZ_ABSURD   Date =~
> /[-+](?!(?:0\d|1[0-4])(?:[03]0|[14]5))\d{4}$/ variant).
> 
> svn commit -m "bug 6152: update INVALID_DATE_TZ_ABSURD, thanks to Steve
> Freegard" rulesrc/sandbox/jm/20_bug6152.cf rules/20_head_tests.cf

Hey! ;)

(In reply to comment #10)
> Might also be worth trying another test rule without the ending $ - as I've
> seen a few trap hits lately ending with "+0710 (PST)".

Agreed, see comment 8. Although the regex should be sufficiently tight, that *anything* at all after that is worth scoring.

> Also; are these messages in your corpus definitely ham:

That's Justin's corpus.

> If you can post the date headers from these (if they are ham) then we might be
> able to workaround them depending on how broken they are.

I'd also be interested in Daryl's FPs on this rule. Daryl?
Comment 12 Justin Mason 2009-07-17 02:06:54 UTC
(In reply to comment #11)
> (In reply to comment #9)
> > (that's the header INVALID_DATE_TZ_ABSURD   Date =~
> > /[-+](?!(?:0\d|1[0-4])(?:[03]0|[14]5))\d{4}$/ variant).
> > 
> > svn commit -m "bug 6152: update INVALID_DATE_TZ_ABSURD, thanks to Steve
> > Freegard" rulesrc/sandbox/jm/20_bug6152.cf rules/20_head_tests.cf
> 
> Hey! ;)

oops ;) sorry.  hey, you should have just checked it in to your own sandbox! ;)

> 
> (In reply to comment #10)
> > Might also be worth trying another test rule without the ending $ - as I've
> > seen a few trap hits lately ending with "+0710 (PST)".
> 
> Agreed, see comment 8. Although the regex should be sufficiently tight, that
> *anything* at all after that is worth scoring.

ok. will try that again.

header BUG6152_INVALID_DATE_TZ_ABSURD   Date =~ /[-+](?!(?:0\d|1[0-4])(?:[03]0|[14]5))\d{4}/



> > Also; are these messages in your corpus definitely ham:
> 
> That's Justin's corpus.

one guy, on one public list, http://taint.org/x/2009/fps_bug6152.txt :

Date: Wed, 17 Dec 2008 23:01:06 +1559
Date: Thu, 29 Jan 2009 20:53:54 +1859
Date: Tue, 24 Feb 2009 00:52:32 +1859