Bug 6353 - [review] rule FH_FROMEML_NOTLD is wrong
Summary: [review] rule FH_FROMEML_NOTLD is wrong
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: 3.3.0
Hardware: Other Linux
: P5 normal
Target Milestone: 3.3.2
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard: ready to commit
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-26 09:12 UTC by a16x
Modified: 2011-05-28 18:55 UTC (History)
6 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description a16x 2010-02-26 09:12:32 UTC
Rule "FH_FROMEML_NOTLD" is wrong in "72_active.cf".

Current is:

header   FH_FROMEML_NOTLD       From:addr !~ /\./

This rule checks the "." in the whole address, not in the
part after the "@".
This rule can be bypassed easily.
We got lot of spam from ab.cd@ru

I suggest this version: 

header   FH_FROMEML_NOTLD      From:addr !~ /(.*)@(.*)\.(.*)/
Comment 1 Per Jessen 2010-02-26 10:05:51 UTC
Wouldn't this be more appropriate/accurate:

header   FH_FROMEML_NOTLD      From:addr !~ /\.[a-z]+$/
Comment 2 a16x 2010-02-27 22:40:28 UTC
>Wouldn't this be more appropriate/accurate:
>header   FH_FROMEML_NOTLD      From:addr !~ /\.[a-z]+$/

Thanks, it's a good idea!
But it has to be case-insentive:

header   FH_FROMEML_NOTLD      From:addr !~ /\.[a-z]+$/i
Comment 3 a16x 2010-02-27 23:00:09 UTC
I mean case-insensitive.
Comment 4 mouss 2010-02-28 17:48:58 UTC
(In reply to comment #2)
> >Wouldn't this be more appropriate/accurate:
> >header   FH_FROMEML_NOTLD      From:addr !~ /\.[a-z]+$/
> 
> Thanks, it's a good idea!
> But it has to be case-insentive:
> 
> header   FH_FROMEML_NOTLD      From:addr !~ /\.[a-z]+$/i

this will miss
From: john.the.ripper


header   FH_FROMEML_NOTLD      From:addr !~ /\@.*+\./

or something like /\@[-\.0-9a-z]+\./

it would be nice if SA provided someehting like From:domain, which would be empty if there is no '@'. we could then simply use
Comment 5 mouss 2010-02-28 17:49:52 UTC
(In reply to comment #4)
> (In reply to comment #2)
> > >Wouldn't this be more appropriate/accurate:
> > >header   FH_FROMEML_NOTLD      From:addr !~ /\.[a-z]+$/
> > 
> > Thanks, it's a good idea!
> > But it has to be case-insentive:
> > 
> > header   FH_FROMEML_NOTLD      From:addr !~ /\.[a-z]+$/i
> 
> this will miss
> From: john.the.ripper
> 
> 
> header   FH_FROMEML_NOTLD      From:addr !~ /\@.*+\./
> 
> or something like /\@[-\.0-9a-z]+\./
> 
> it would be nice if SA provided someehting like From:domain, which would be
> empty if there is no '@'. we could then simply use

(argh. TAB doesn't work on the web:)

... we could simply use
   From:domain !~ /\./
Comment 6 a16x 2010-03-01 08:03:06 UTC
I tried Mouss's version:

header   FH_FROMEML_NOTLD      From:addr !~ /\@.*+\./

It seems to be buggy, see result below!

testlinux:/# spamassassin --lint
Feb 28 23:13:07.491 [3009] warn: config: invalid regexp for rule FH_FROMEML_NOTLD: /\@.*+\./: Nested quantifiers in regex; marked by <-- HERE in m/\@.*+ <-- HERE \./
Feb 28 23:13:07.728 [3009] warn: config: warning: description exists for non-existent rule FH_FROMEML_NOTLD
Feb 28 23:13:08.445 [3009] warn: lint: 2 issues detected, please rerun with debug enabled for more information
Comment 7 Per Jessen 2010-03-01 09:08:28 UTC
How about this:

header   FH_FROMEML_NOTLD      From:addr !~ /\@.+\.[a-z]+$/
Comment 8 Henrik Krohns 2010-03-01 09:23:59 UTC
(In reply to comment #5)
>
> ... we could simply use
>    From:domain !~ /\./

Yeah let's bloat the code for a very marginal case, which can easily be handled with re itself. :-)

I've committed to test rules to my sandbox. So let's just wait for results and not beat the horse anymore.

header T_BUG6353_1 From:addr !~ /\@.+\.[a-z]+$/i [if-unset: foo@bar.com]
header T_BUG6353_2 From:addr !~ /\.[a-z]+$/i [if-unset: foo@bar.com]
Comment 9 a16x 2010-03-01 09:49:51 UTC
I tested Per Jessen's code:

header   FH_FROMEML_NOTLD      From:addr !~ /\@.+\.[a-z]+$/i

It's working perfectly!

From address should have a "@" followed by a part before the dot,
and a part after the dot. This code checks it all. It also handles
UPPERcase and lowercase chars, because test is case-insensitive.
Comment 10 Mark Martinec 2010-03-01 11:51:09 UTC
> header FH_FROMEML_NOTLD From:addr !~ /\@.+\.[a-z]+$/i

This regexp is too strict, it does not allow for IDN ccTLD domains
like test.xn--p1ai

http://en.wikipedia.org/wiki/Internationalized_domain_name
http://icann.org/en/topics/idn/fast-track/string-evaluation-completion-en.htm
Comment 11 Per Jessen 2010-03-01 12:36:30 UTC
Good point, let's add a hyphen:

header FH_FROMEML_NOTLD From:addr !~ /\@.+\.[a-z-]+$/i

I guess one issue might be exactly how critical the RE should be about what makes a TLD or not. There are obviously many strings that fit [a-z-]+ without them being TLDs.
Comment 12 Henrik Krohns 2010-03-01 14:14:36 UTC
Ok I really doubt the point here is to test a _valid_ TLD. It would require much more code, which someone is free to create.

I've added this to the mix:

header T_BUG6353_3 From:addr !~ /\@.+?\.[a-z]/i [if-unset: foo@bar.com]
Comment 13 Mark Martinec 2010-03-01 19:16:32 UTC
> Ok I really doubt the point here is to test a _valid_ TLD. It would require
> much more code, which someone is free to create.

Why invite the trouble and ask for another round of a bug report
on "the FH_FROMEML_NOTLD is FP-ing on my Russian domain", when
it can be done right the first time around.

> I've added this to the mix:
> header T_BUG6353_3 From:addr !~ /\@.+?\.[a-z]/i [if-unset: foo@bar.com]

What's the point of if-unset? - the rule description says "E-mail address
doesn't have TLD", and a missing From header field fits the description,
it doesn't have a TLD.

The proposed rule is also missing anchoring, matching addresses
like test@a.b.0.

The .+ should be excluding further @ characters, otherwise
a <"test@foo"@.com> would not get a hit.

So I think either a:
  From:addr !~ /\@[^@]+\.(?:[a-z]{2,}|xn--[a-z0-9]+(?:-[a-z0-9]*)?)$/i

or a simple:
  From:addr !~ /\@[^@]+\.(?:[a-z0-9-]+)$/i

would do, I suppose.
Comment 14 Henrik Krohns 2010-03-01 19:33:58 UTC
(In reply to comment #13)
> > Ok I really doubt the point here is to test a _valid_ TLD. It would require
> > much more code, which someone is free to create.
> 
> Why invite the trouble and ask for another round of a bug report
> on "the FH_FROMEML_NOTLD is FP-ing on my Russian domain", when
> it can be done right the first time around.

If you are referring to IDN, my third rule handles it just fine. What I'm meant with valid is checking the TLD exists. That's a completely another matter.

> > I've added this to the mix:
> > header T_BUG6353_3 From:addr !~ /\@.+?\.[a-z]/i [if-unset: foo@bar.com]
> 
> What's the point of if-unset? - the rule description says "E-mail address
> doesn't have TLD", and a missing From header field fits the description,
> it doesn't have a TLD.

I don't agree. Missing header is again a completely different thing than a "missing TLD". It makes much more sense to have a separate MISSING_FROM rule then, if one doesn't exists.

> The proposed rule is also missing anchoring, matching addresses
> like test@a.b.0.
> 
> The .+ should be excluding further @ characters, otherwise
> a <"test@foo"@.com> would not get a hit.
> 
> So I think either a:
>   From:addr !~ /\@[^@]+\.(?:[a-z]{2,}|xn--[a-z0-9]+(?:-[a-z0-9]*)?)$/i
> 
> or a simple:
>   From:addr !~ /\@[^@]+\.(?:[a-z0-9-]+)$/i
> 
> would do, I suppose.

I'm not saying you are wrong, but we are getting into unnecessary complex territories. This is turning into a VALID_FROM_ADDRESS.

As I somehow dragged myself into this, I've added yet another rules to the mix.

header T_BUG6353_4 From:addr !~ /\@[^@]+\.(?:[a-z]{2,}|xn--[a-z0-9]+(?:-[a-z0-9]*)?)$/i
header T_BUG6353_5 From:addr !~ /\@[^@]+\.(?:[a-z]{2,}|xn--[a-z0-9]+(?:-[a-z0-9]*)?)$/i [if-unset: foo@bar.com]
header T_BUG6353_6 From:addr !~ /\@[^@]+\.(?:[a-z0-9-]+)$/i
header T_BUG6353_7 From:addr !~ /\@[^@]+\.(?:[a-z0-9-]+)$/i [if-unset: foo@bar.com]
header T_MISSING_FROM From =~ /^UNSET$/ [if-unset: UNSET]
Comment 15 Henrik Krohns 2010-03-01 19:49:19 UTC
And never mind my rants, I'm really tired.. Marks rules are of course better, and probably T_BUG6353_7 is the most elegant to go with.
Comment 16 Mark Martinec 2010-03-01 19:57:06 UTC
> And never mind my rants, I'm really tired.. Marks rules are of course better,
> and probably T_BUG6353_7 is the most elegant to go with.

Sorry the (?: ... ) is redundant, I was just stripping down the larger rule,
but failed to go all the way.  It should be: /\@[^@]+\.[a-z0-9-]+$/i
Comment 17 Sidney Markowitz 2010-03-01 23:35:45 UTC
I'm jumping in after just skimming this bug's comments with too much in my inbox to read thoroughly. Please excuse me if this comment is actually irrelevant or redundant... If it makes sense to really test for a valid TLD in this rule, we do have some code that is supposed to check for a valid TLD that can be re-used for this by making it an eval rule. If we are going to do the right thing about IDN ccTLD domains that's where we have to do it correctly anyway.

Is it worth making this an eval rule to make it strict about real TLDs?
Comment 18 Mark Martinec 2010-03-02 00:12:33 UTC
> we do have some code that is supposed to check for a valid TLD that
> can be re-used for this by making it an eval rule. If we are going to do the
> right thing about IDN ccTLD domains that's where we have to do it correctly
> anyway.
> Is it worth making this an eval rule to make it strict about real TLDs?

Can't hurt to explore this possibility. Depending on a size of a
change required, we could then decide which way to go for 3.3.1 .
Comment 19 mouss 2010-03-02 01:00:57 UTC
(In reply to comment #6)
> I tried Mouss's version:
> 
> header   FH_FROMEML_NOTLD      From:addr !~ /\@.*+\./
> 
> It seems to be buggy, see result below!
> 
> testlinux:/# spamassassin --lint
> Feb 28 23:13:07.491 [3009] warn: config: invalid regexp for rule
> FH_FROMEML_NOTLD: /\@.*+\./: Nested quantifiers in regex; marked by <-- HERE in
> m/\@.*+ <-- HERE \./
> Feb 28 23:13:07.728 [3009] warn: config: warning: description exists for
> non-existent rule FH_FROMEML_NOTLD
> Feb 28 23:13:08.445 [3009] warn: lint: 2 issues detected, please rerun with
> debug enabled for more information

sure it's bogus: it has a ".*+". should be
    From:addr !~ /\@.+\./
without the '*'.
Comment 20 mouss 2010-03-02 01:06:25 UTC
(In reply to comment #8)
> (In reply to comment #5)
> >
> > ... we could simply use
> >    From:domain !~ /\./
> 
> Yeah let's bloat the code for a very marginal case, which can easily be handled
> with re itself. :-)
> 


I'd agree not to blow code if you agree not to add bogus rules;-p come on... 

there is no more bloat in providing From:domain than providing From:addr or From:name.
Comment 21 Henrik Krohns 2010-03-03 14:15:09 UTC
http://ruleqa.spamassassin.org/?rule=%2F%28BUG6353%7CMISSING_FROM%29

Pretty much fifty-fifty on what to choose. Actually the simplest rule has slightly better ratio.
Comment 22 Henrik Krohns 2011-05-11 05:03:03 UTC
Seems MISSING_FROM rule looks good, will be committing it.

0.2571 	0.0015 	0.994 	0.65 	T_MISSING_FROM

Here are all the NOTLD variants which do not overlap with MISSING_FROM. I think Mark's no. 5 is the most "robust" one rule wise and ratios are only marginally different.

0.3620 	0.0154 	0.959 	0.67 	T_BUG6353_5
0.3616 	0.0154 	0.959 	0.66 	T_BUG6353_1
0.3119 	0.0154 	0.953 	0.65 	T_BUG6353_7
0.3616 	0.0144 	0.962 	0.67 	T_BUG6353_2
0.3059 	0.0124 	0.961 	0.65 	T_BUG6353_3

Giving this a review since there are lots to choose. Feel free to vote on MISSING_FROM score if you feel like it or have different opinion on the NOTLD rule.
Comment 23 Mark Martinec 2011-05-11 10:34:23 UTC
> Giving this a review since there are lots to choose. Feel free to vote on
> MISSING_FROM score if you feel like it or have different opinion on the NOTLD
> rule.

If you mean this one:
  header T_MISSING_FROM From =~ /^UNSET$/ [if-unset: UNSET]

we already have rules:
  header __HAS_RCVD  exists:Received
  header __HAS_MESSAGE_ID  exists:Message-Id
  header __HAS_DATE  exists:Date
  header __HAS_SUBJECT  exists:Subject
(which really test for existence of a header field, not just
comparing it to some magic reserved string)

so why not add a:
  header __HAS_FROM  exists:From

and than can have a:
  meta T_MISSING_FROM !__HAS_FROM
Comment 24 AXB 2011-05-11 10:41:48 UTC
(In reply to comment #23)
> > Giving this a review since there are lots to choose. Feel free to vote on
> > MISSING_FROM score if you feel like it or have different opinion on the NOTLD
> > rule.
> 
> If you mean this one:
>   header T_MISSING_FROM From =~ /^UNSET$/ [if-unset: UNSET]
> 
> we already have rules:
>   header __HAS_RCVD  exists:Received
>   header __HAS_MESSAGE_ID  exists:Message-Id
>   header __HAS_DATE  exists:Date
>   header __HAS_SUBJECT  exists:Subject
> (which really test for existence of a header field, not just
> comparing it to some magic reserved string)
> 
> so why not add a:
>   header __HAS_FROM  exists:From
> 
> and than can have a:
>   meta T_MISSING_FROM !__HAS_FROM

+1 for the meta
Comment 25 Henrik Krohns 2011-05-11 10:44:38 UTC
(In reply to comment #23)
> > Giving this a review since there are lots to choose. Feel free to vote on
> > MISSING_FROM score if you feel like it or have different opinion on the NOTLD
> > rule.
> 
> If you mean this one:
>   header T_MISSING_FROM From =~ /^UNSET$/ [if-unset: UNSET]
> 
> we already have rules:
>   header __HAS_RCVD  exists:Received
>   header __HAS_MESSAGE_ID  exists:Message-Id
>   header __HAS_DATE  exists:Date
>   header __HAS_SUBJECT  exists:Subject
> (which really test for existence of a header field, not just
> comparing it to some magic reserved string)
> 
> so why not add a:
>   header __HAS_FROM  exists:From

Exactly what I was planning to do, already added it my cf to verify if it has any difference on the unset one :)
Comment 26 John Hardin 2011-05-11 13:03:00 UTC
(In reply to comment #24)
> (In reply to comment #23)
> >
> > so why not add a:
> >   header __HAS_FROM  exists:From
> 
> +1 for the meta

Agreed. +1
Comment 27 Kevin A. McGrail 2011-05-11 15:02:45 UTC
(In reply to comment #26)
> (In reply to comment #24)
> > (In reply to comment #23)
> > >
> > > so why not add a:
> > >   header __HAS_FROM  exists:From
> > 
> > +1 for the meta
> 
> Agreed. +1

+1 as well so I believe you can commit
Comment 28 Kevin A. McGrail 2011-05-28 09:41:20 UTC
Henrik,

Do we have a need for a patch on 3.3/trunk for this or is this something from your sandbox we are expecting to auto-promote?  Just trying to get a clear idea so we can build the next tar for 3.3.2rc.
Comment 29 Henrik Krohns 2011-05-28 09:46:19 UTC
(In reply to comment #28)
> Henrik,
> 
> Do we have a need for a patch on 3.3/trunk for this or is this something from
> your sandbox we are expecting to auto-promote?  Just trying to get a clear idea
> so we can build the next tar for 3.3.2rc.

Yeah I see the exists:From rule works fine now, I'll commit it to main rules today..

Also I'll update the NOTLD variant to Mark's no. 5 proposal, since noone commented on that.
Comment 30 Henrik Krohns 2011-05-28 18:55:21 UTC
Closing this now, but please comment if MISSING_FROM should have a score set?? Will it be autogenerated? In any case even default score of 1 seems fine looking at the ruleqa.


trunk:
Sending        rules/20_head_tests.cf
Sending        rulesrc/sandbox/emailed/00_FVGT_File001.cf
Transmitting file data ..
Committed revision 1128753.

3.3:
Sending        rules/20_head_tests.cf
Sending        rulesrc/sandbox/emailed/00_FVGT_File001.cf
Transmitting file data ..
Committed revision 1128754.