Bug 6817 - email with "Mr" or "Mrs" scored +3.6??
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.3.2
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2012-07-21 20:33 UTC by Jason Haar
Modified: 2012-12-18 03:27 UTC



Description Jason Haar 2012-07-21 20:33:49 UTC
Hi there

We just got a bunch of legitimate email marked as spam, and the dominant rule was HK_NAME_MR_MRS from 72_active.cf

I realize a lot of "stranger" email will contain Mr/Mrs - but so does a lot of legit email. Three points for two chars is always a bad idea ;-)
Comment 1 Marc Andre Selig 2012-07-23 23:29:51 UTC
(In reply to comment #0)

> We just got a bunch of legitimate email marked as spam, and the dominant
> rule was HK_NAME_MR_MRS from 72_active.cf
> 
> I realize a lot of "stranger" email will contain Mr/Mrs - but so does a lot
> of legit email. Three points for two chars is always a bad idea ;-)

This rule (or, more specifically, __HK_NAME_MR_MRS which is referenced by it) triggers only if the display name in the From: field starts with "Mr" or "Mrs" or "Miss", followed by a word boundary.

It is common for people to put a name or a role designation in the display name field.  For example, you probably have "Jason Haar" in there, but somebody else might put a role description such as "Acme Sales".  You do *not* have "Mr Jason Haar" as display name in the From: field of your messages.  If you did, this would indeed appear a bit funny.

In short, the rule does not trigger every time somebody writes "Mr" in a message, but only if the word appears at the beginning of the display name in the From: field.  Apparently, this can distinguish spam from ham in the test corpora.  For most people, it probably works pretty well.
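
To make the distinction concrete, here is a small Python sketch (an illustration, not SpamAssassin code) that applies the pattern from the rule's definition, /^M(?:RS?|ISS)\b/mi as it appears in the 20_hk.cf patch later in this thread, to a few display names. The `hits` helper and every sample name other than "Jason Haar" and "Acme Sales" are made up for the example, and Python's `re` is assumed to treat this particular pattern the same way Perl does.

```python
import re

# Pattern of __HK_NAME_MR_MRS, with Perl's /mi flags translated to Python.
PATTERN = re.compile(r"^M(?:RS?|ISS)\b", re.IGNORECASE | re.MULTILINE)

def hits(display_name: str) -> bool:
    """Return True if the display name starts with Mr, Mrs or Miss."""
    return PATTERN.search(display_name) is not None

for name in ["Mr Jason Haar", "Mrs. Smith", "Jason Haar", "Acme Sales", "Mister Jones"]:
    print(f"{name!r}: {'hit' if hits(name) else 'no hit'}")
```

Only the first two names hit: the title has to be the first word of the display name, and "Mister" is not covered by the alternation.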

If it does not work for you, just turn it off in local.cf.
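
A minimal sketch of such an override, assuming a standard local.cf: a score of 0 disables a rule in SpamAssassin, while a smaller positive score keeps it as a weak signal. The value 1.0 is only an illustration, not a recommendation made anywhere in this thread.

```
# local.cf: disable the rule entirely
score HK_NAME_MR_MRS 0

# ...or keep it, but with a lower score (illustrative value)
# score HK_NAME_MR_MRS 1.0
```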
Comment 2 Jason Haar 2012-07-24 00:07:58 UTC
thanks for answering

I didn't notice the From: hook. You are correct - the emails in question are from a local school - so "Mr" in the From line is understandable - if non-standard

We have already altered the score down - but now that I know it's related to From: too, I'm going to increase the score and deal with the school via a separate rule

Thanks!

Jason
Comment 3 Jason Haar 2012-07-24 00:16:06 UTC
sorry - I forgot to mark as resolved :-)
Comment 4 Jose Borges Ferreira 2012-12-13 00:14:00 UTC
I would like to re-open this bug because it hits on mr@domain.tld.

Generally speaking this wouldn't be a big issue, but this rule has scores like "3.797 3.561 3.797 3.561", which are very high.

I can't provide a sample, but it is very simple to replicate the bug.
Comment 5 Kevin A. McGrail 2012-12-14 13:20:12 UTC
(In reply to comment #4)
> I would like to re-open this bug because it hits on mr@domain.tld.
> 
> Generally speaking this wouldn't be a big issue, but this rule has scores
> like "3.797 3.561 3.797 3.561", which are very high.
> 
> I can't provide a sample, but it is very simple to replicate the bug.

I think the scores are a bit high for this rule as well.  

Henrik, I've added some 2.0 and 3.0 scores to enforce some masscheck limits to this and the related rules in your sandbox.

Index: rulesrc/sandbox/hege/20_hk.cf
===================================================================
--- rulesrc/sandbox/hege/20_hk.cf       (revision 1421841)
+++ rulesrc/sandbox/hege/20_hk.cf       (working copy)
@@ -128,11 +128,17 @@
 header         __HK_NAME_DR            From:name =~ /^DR\b/mi
 header         __HK_NAME_FROM          From:name =~ /^FROM\b/mi
 meta           HK_NAME_MR_MRS          __HK_NAME_MR_MRS && !FREEMAIL_FROM
+score          HK_NAME_MR_MRS          2.0
 meta           HK_NAME_FM_MR_MRS       __HK_NAME_MR_MRS && FREEMAIL_FROM
+score           HK_NAME_FM_MR_MRS       3.0
 meta           HK_NAME_DR              __HK_NAME_DR && !FREEMAIL_FROM
+score           HK_NAME_DR             2.0
 meta           HK_NAME_FM_DR           __HK_NAME_DR && FREEMAIL_FROM
+score           HK_NAME_FM_DR           3.0
 meta           HK_NAME_FROM            __HK_NAME_FROM && !FREEMAIL_FROM
+score           HK_NAME_FROM            2.0
 meta           HK_NAME_FM_FROM         __HK_NAME_FROM && FREEMAIL_FROM
+score           HK_NAME_FM_FROM         3.0
 
 endif

Running make test and will commit in a little bit.  Feel free to veto!
Comment 6 John Hardin 2012-12-14 15:06:43 UTC
Any value to adding a (?!\@) in there to avoid the "mr@" case?
Comment 7 AXB 2012-12-14 15:16:19 UTC
+1 for that idea
Comment 8 Kevin A. McGrail 2012-12-14 15:18:28 UTC
(In reply to comment #6)
> Any value to adding a (?!\@) in there to avoid the "mr@" case?

I think this would work but is your way more efficient?

From:name =~ /^M(?:RS?|ISS)[^@\b]/mi
Comment 9 Kevin A. McGrail 2012-12-14 15:37:26 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > Any value to adding a (?!\@) in there to avoid the "mr@" case?

Playing with this more, I don't know what we are trying to hit and not hit.

Here's a stub.  I need more examples of what we are trying to hit/not hit to effectively work on this.  For example, a co-worker, whom I'm going to start getting to help with this project, pointed out that all we need is a period match for the cases I've identified.  He's correct, but I think that's only because I need more test cases.

use strict;

my (@name, $name);

#SHOULD NOT HIT
$name[0] = 'mrfixit@test.com';

#SHOULD HIT
$name[1] = 'mr.FixIt@test.com';

#SHOULD NOT HIT
$name[2] = 'mr@test.com';

foreach $name (@name) {
  # Print each address followed by whether the candidate pattern matches it
  print "$name - ";
  print(($name =~ /^M(?:RS?|ISS)\.\b/mi) ? "HIT" : "no hit");
  print "\n";
}
Comment 10 Marc Andre Selig 2012-12-14 15:46:47 UTC
#SHOULD HIT
$name[3] = 'Mr Paul Guss <mrpaul@bk.ru>';
$name[4] = 'Mr  Johnson Kwame <johnsonkwame@cantv.net>';
$name[5] = '"Mrs elisabeth .Alfredo" <assist_socialpmb@ifi.com.br>';
$name[6] = '"MRS. S. F. GADDAFI" <123miinfo@yippy.com>';
$name[7] = '"Mr. David Freder"<admin@r-e-m.co.za>';
$name[8] = '"Mr.Chamberlain"<audit@senate.gov>';

(from actual spam tagged with HK_NAME_MR_MRS)
Comment 11 John Hardin 2012-12-14 15:49:28 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > Any value to adding a (?!\@) in there to avoid the "mr@" case?
> 
> I think this would work but is your way more efficient?
> 
> From:name =~ /^M(?:RS?|ISS)[^@\b]/mi

That's not syntactically equivalent. Also, I don't know whether \b can even appear in a character class as it's a zero-length assertion.
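
For what it's worth, inside a bracketed character class Perl reads \b as a backspace character rather than a word boundary, and Python's `re` does the same, so [^@\b] means "one character that is neither @ nor backspace". The Python sketch below (an illustration, not SpamAssassin code) shows what that does in practice; `mrs@test.com` and the bare `Mr` are made-up variations on the examples in comment 9.

```python
import re

# Comment 8's proposal: [^@\b] consumes one character that is not "@" or backspace.
CHAR_CLASS = re.compile(r"^M(?:RS?|ISS)[^@\b]", re.IGNORECASE | re.MULTILINE)

def hits(text: str) -> bool:
    return CHAR_CLASS.search(text) is not None

print(hits("mr@test.com"))       # the "@" right after "mr" blocks the match
print(hits("mrfixit@test.com"))  # "f" satisfies [^@\b], so this matches
print(hits("mrs@test.com"))      # backtracks to "mr" + "s", so this matches too
print(hits("Mr"))                # no character left to consume, so no match
```

So the character-class version drops the word-boundary requirement and still lets a bare "mrs@" address through.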

Absent any testing, something like this:

    /^M(?:RS?|ISS)\b(?!\@)/mi

(In reply to comment #9)

I don't think we want to explicitly require a period, as that would miss something like "MR BOZO FRAUDSTER" which we do want to score.

Also: why is From:name hitting on the from _address_ ? Does :name default to the email address if there is no display name present?

Perhaps this would be better:

    /^M(?:RS?|ISS)\b(?!\S*\@)/mi

...to completely exclude non-display-name From headers where the email address begins with the targeted text.
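
A quick way to compare the two lookahead variants is to run them over the bare addresses from comment 9 and a few of the display names from comment 10. The Python sketch below is an illustration only (the real rule is a Perl regex inside SpamAssassin). It assumes the matched text is either the display name with its quotes removed or, when no display name exists, the bare address; the `report` helper is hypothetical.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE
NO_AT_NEXT = re.compile(r"^M(?:RS?|ISS)\b(?!\@)", FLAGS)        # first suggestion
NO_AT_IN_WORD = re.compile(r"^M(?:RS?|ISS)\b(?!\S*\@)", FLAGS)  # second suggestion

samples = [
    "mr@test.com",         # bare addresses from comment 9
    "mr.FixIt@test.com",
    "mrfixit@test.com",
    "Mr Paul Guss",        # display names from comment 10, quotes removed
    "MRS. S. F. GADDAFI",
    "Mr.Chamberlain",
]

def report(text):
    """Return (first pattern hits, second pattern hits) for one input."""
    return (NO_AT_NEXT.search(text) is not None,
            NO_AT_IN_WORD.search(text) is not None)

for text in samples:
    first, second = report(text)
    print(text, first, second)
```

With these inputs the two patterns disagree only on `mr.FixIt@test.com`: the first still hits it, the second does not. Neither hits the other two bare addresses, and both hit all three display names.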
Comment 12 Henrik Krohns 2012-12-14 15:58:51 UTC
(In reply to comment #11)
>     /^M(?:RS?|ISS)\b(?!\S*\@)/mi
> 
> ...to completely exclude non-display-name From headers where the email
> address begins with the targeted text.

I'd vote for this.
Comment 13 Kevin A. McGrail 2012-12-14 16:11:36 UTC
(In reply to comment #11)
> > From:name =~ /^M(?:RS?|ISS)[^@\b]/mi
> 
> That's not syntactically equivalent. Also, I don't know whether \b can even
> appear in a character class as it's a zero-length assertion.

Agreed.  I wasn't paying enough attention, sorry.

> Also: why is From:name hitting on the from _address_ ? Does :name default to
> the email address if there is no display name present?

I think this is the crux of the issue. That appears to be a bug that was fixed in trunk.  See bug 6354.

I don't see how we can make the rule more friendly to mr@domain.tld unless they are running trunk.  What's the *specific* From that you see so I can confirm it's likely to be resolved already in 3.4?
 
Beyond that, I think the correct answer is to require 3.4 and to lower the scores.
Comment 14 RW 2012-12-14 16:35:28 UTC
(In reply to comment #9)

> #SHOULD HIT
> $name[1] = 'mr.FixIt@test.com';
 
IMO that shouldn't hit.  mr. in an address could just be a jokey address. It's completely different from "Mr John Smith" which is a misplaced attempt at professional formality.
Comment 15 Kevin A. McGrail 2012-12-14 17:46:05 UTC
(In reply to comment #14)
> (In reply to comment #9)
> 
> > #SHOULD HIT
> > $name[1] = 'mr.FixIt@test.com';
>  
> IMO that shouldn't hit.  mr. in an address could just be a jokey address.
> It's completely different from "Mr John Smith" which is a misplaced attempt
> at professional formality.

Yeah, it's taken a bit to think about what should and shouldn't hit.

But if my theory is correct, the current 3.4 trunk WON'T hit on that because it doesn't have a name portion.

So "Mr. Fix It" <mr.fixit@fixit.com> would hit but <mr.fixit@fixit.com> would not because Name portion is blank.
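
That theory can be mimicked outside SpamAssassin. The Python sketch below uses the standard library's `email.utils.parseaddr` to split a From value into display name and address, then applies the rule's pattern to the name only. It is an analogy for the behaviour described above, not how SpamAssassin itself parses From:name, and the `name_hits` helper is hypothetical.

```python
import re
from email.utils import parseaddr

PATTERN = re.compile(r"^M(?:RS?|ISS)\b", re.IGNORECASE | re.MULTILINE)

def name_hits(from_value: str) -> bool:
    """Match the pattern against the display name only, never the address."""
    display_name, _address = parseaddr(from_value)
    return PATTERN.search(display_name) is not None

print(name_hits('"Mr. Fix It" <mr.fixit@fixit.com>'))  # display name is "Mr. Fix It"
print(name_hits('<mr.fixit@fixit.com>'))               # display name is empty
```

The first value hits because its display name starts with "Mr."; the second has no display name, so there is nothing for the pattern to match.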

Not sure if that's still going to reduce FPs so step 1 is lowering the scores.

Step 2 is likely to figure out what are some examples that should and should not hit based on 3.4 and improve the rule.

I'm planning on committing a version-based if block around the rule, plus lower score ceilings for masscheck, soon.  A make test failed on something else, so I'm working on that first.

Index: rulesrc/sandbox/hege/20_hk.cf
===================================================================
--- rulesrc/sandbox/hege/20_hk.cf       (revision 1421841)
+++ rulesrc/sandbox/hege/20_hk.cf       (working copy)
@@ -124,15 +124,25 @@
 
 ifplugin Mail::SpamAssassin::Plugin::FreeMail
 
-header         __HK_NAME_MR_MRS        From:name =~ /^M(?:RS?|ISS)\b/mi
+#REQUIRING VERSION 3.4 BECAUSE From:name works improperly prior to that version.
+if (version >= 3.004000)
+  header               __HK_NAME_MR_MRS        From:name =~ /^M(?:RS?|ISS)\b/mi
+  meta            HK_NAME_MR_MRS          __HK_NAME_MR_MRS && !FREEMAIL_FROM
+  score           HK_NAME_MR_MRS          2.0
+  meta            HK_NAME_FM_MR_MRS       __HK_NAME_MR_MRS && FREEMAIL_FROM
+  score           HK_NAME_FM_MR_MRS       3.0
+endif
+
 header         __HK_NAME_DR            From:name =~ /^DR\b/mi
 header         __HK_NAME_FROM          From:name =~ /^FROM\b/mi
-meta           HK_NAME_MR_MRS          __HK_NAME_MR_MRS && !FREEMAIL_FROM
-meta           HK_NAME_FM_MR_MRS       __HK_NAME_MR_MRS && FREEMAIL_FROM
 meta           HK_NAME_DR              __HK_NAME_DR && !FREEMAIL_FROM
+score           HK_NAME_DR             2.0
 meta           HK_NAME_FM_DR           __HK_NAME_DR && FREEMAIL_FROM
+score           HK_NAME_FM_DR           3.0
 meta           HK_NAME_FROM            __HK_NAME_FROM && !FREEMAIL_FROM
+score           HK_NAME_FROM            2.0
 meta           HK_NAME_FM_FROM         __HK_NAME_FROM && FREEMAIL_FROM
+score           HK_NAME_FM_FROM         3.0
 
 endif
Comment 16 RW 2012-12-14 18:58:27 UTC
(In reply to comment #15)
> So "Mr. Fix It" <mr.fixit@fixit.com> would hit but <mr.fixit@fixit.com>
> would not because Name portion is blank.

I have seen a fair amount of legitimate email with something like:

"foo@example"<foo@example.com>
Comment 17 John Hardin 2012-12-14 20:45:33 UTC
(In reply to comment #14)
> (In reply to comment #9)
> 
> > #SHOULD HIT
> > $name[1] = 'mr.FixIt@test.com';
>  
> IMO that shouldn't hit.  mr. in an address could just be a jokey address.
> It's completely different from "Mr John Smith" which is a misplaced attempt
> at professional formality.

+1
Comment 18 John Hardin 2012-12-14 20:54:12 UTC
(In reply to comment #13)
> (In reply to comment #11)
> I don't see how we can make the rule more friendly to mr@domain.tld unless
> they are running trunk.

See comment #12
Comment 19 Jose Borges Ferreira 2012-12-15 04:14:44 UTC
The rule looks good and safer.

I have 979 samples hitting this rule and all the spam is still hitting.

The ham that still hits is mostly from an education-related mailing list.

Mr Jaime Garcia Salinas <jaime.garciasalinas@uqconnect.edu.au>
"Mr. Debojit Boro" <deb0001@tezu.ernet.in>
Mr Gerhard Bissels <gerhard.bissels@library.coop>
Miss Emily Telford <emily.telford@uqconnect.edu.au>
"Mr. Jamie Pruden" <jpruden@siprep.org>

These matches are within the spirit of the rule, and since the overall scores are below 2, I think that is acceptable.

The only case that wasn't considered is the use of a Latin charset, like this:
 =?UTF-8?B?TWlzc8OjbyBJdGluZXJhbnRl?= <notification+xxxxxxxx@facebookmail.com>
(This is the raw form of "Missão Itinerante".)
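
To see what the rule has to work with here, the encoded word can be decoded with Python's standard `email.header.decode_header`. The sketch below also checks the rule's pattern against the decoded bytes and against the decoded text. Whether SpamAssassin matches on raw UTF-8 bytes is an assumption made for this illustration and is not confirmed anywhere in this thread.

```python
import re
from email.header import decode_header

RAW = "=?UTF-8?B?TWlzc8OjbyBJdGluZXJhbnRl?="

# decode_header returns (bytes, charset) pairs for encoded words.
raw_bytes, charset = decode_header(RAW)[0]
text = raw_bytes.decode(charset)
print(text)

FLAGS = re.IGNORECASE | re.MULTILINE
# Byte-wise: the first byte of "ã" is not a word character, so \b matches after "Miss".
byte_hit = re.search(rb"^M(?:RS?|ISS)\b", raw_bytes, FLAGS) is not None
# Character-wise: "ã" is a letter, so there is no word boundary after "Miss".
text_hit = re.search(r"^M(?:RS?|ISS)\b", text, FLAGS) is not None
print(byte_hit, text_hit)
```

Under that assumption, the hit comes from the word boundary falling between "Miss" and the first byte of the accented character.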

Like the above, this is still an acceptable FP. Again, the overall score is still low.

Overall I would give a +1 to the patch.
Comment 20 Phil Randal 2012-12-17 10:25:02 UTC
We also get lots of hits on ham from schools.

They like to be formal, alas.

I'd seriously recommend ditching the rule, or at least scoring it MUCH lower.
Comment 21 Kevin A. McGrail 2012-12-18 03:27:16 UTC
Let's see if this helps now that rules are being published again.

svn commit -m 'Lowered score and added condition for 3.4 because From:name works differently prior - bug 6817'
Sending        rulesrc/sandbox/hege/20_hk.cf
Transmitting file data .
Committed revision 1423268.
[root@devel trunk]#