SA Bugzilla – Bug 6817
email with "Mr" or "Mrs" scored +3.6??
Last modified: 2012-12-18 03:27:16 UTC
Hi there,

We just got a bunch of legitimate email marked as spam, and the dominant rule was HK_NAME_MR_MRS from 72_active.cf.

I realize a lot of "stranger" email will contain Mr/Mrs - but so does a lot of legit email. Three points for two chars is always a bad idea ;-)
(In reply to comment #0)
> We just got a bunch of legitimate email marked as spam, and the dominant
> rule was HK_NAME_MR_MRS from 72_active.cf
>
> I realize a lot of "stranger" email will contain Mr/Mrs - but so does a lot
> of legit email. Three points for two chars is always a bad idea ;-)

This rule (or, more specifically, __HK_NAME_MR_MRS, which it references) triggers only if the display name in the From: field starts with "Mr" or "Mrs" or "Miss", followed by a word boundary.

It is common for people to put a name or a role designation in the display name field. For example, you probably have "Jason Haar" in there, but somebody else might put a role description such as "Acme Sales". You do *not* have "Mr Jason Haar" as the display name in the From: field of your messages. If you did, this would indeed appear a bit funny.

In short, the rule does not trigger every time somebody writes "Mr" in a message, but only if that word is at the beginning of the display name in the From: field. Apparently, this can distinguish spam from ham in the test corpora.

For most people, it probably works pretty well. If it does not work for you, just turn it off in local.cf.
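To make the distinction above concrete, here is a rough Python sketch. The pattern is a port of the sub-rule's Perl pattern `/^M(?:RS?|ISS)\b/mi`; the helper name `display_name_hits` and the sample display names are hypothetical, and running the pattern through Python's `re` module is an assumption made for illustration, not how SpamAssassin itself evaluates the header.

```python
import re

# Python port of the Perl pattern /^M(?:RS?|ISS)\b/mi.
# re.ASCII restricts \b to ASCII word characters (an assumption made to
# stay close to byte-oriented matching).
MR_MRS = re.compile(r"^M(?:RS?|ISS)\b", re.IGNORECASE | re.MULTILINE | re.ASCII)


def display_name_hits(display_name: str) -> bool:
    """Return True if the display name starts with Mr, Mrs or Miss as a whole word."""
    return MR_MRS.search(display_name) is not None


# Hypothetical display names mirroring the examples in the comment above.
for name in ["Jason Haar", "Acme Sales", "Mr Jason Haar", "Mrs. Smith", "Mrinal Sen"]:
    print(f"{name!r}: {display_name_hits(name)}")
```

Only the names that begin with the title as a separate word hit; a name that merely starts with the letters "Mr" (such as "Mrinal") does not, because there is no word boundary after the prefix.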
Thanks for answering.

I didn't notice the From: hook. You are correct - the emails in question are from a local school - so "Mr" in the From line is understandable, if non-standard.

We have already altered the score down, but now that I know it's related to From: too, I'm going to increase the score and deal with the school via a separate rule.

Thanks!
Jason
sorry - I forgot to mark as resolved :-)
I would like to re-open this bug because it hits on mr@domain.tld.

Generally speaking this wouldn't be a big issue, but this rule has scores like "3.797 3.561 3.797 3.561", which are very high.

I can't provide a sample, but it is very simple to replicate the bug.
(In reply to comment #4)
> I would like to re-open this bug because it hits on mr@domain.tld .
>
> Generally speaking this wouldn't be a big issue but this rules scores like
> "3.797 3.561 3.797 3.561" that are very high.
>
> I can't provide a sample, but is very simple to replicate the bug.

I think the scores are a bit high for this rule as well.

Henrik, I've added some 2.0 and 3.0 scores to enforce some masscheck limits to this and the related rules in your sandbox.

Index: rulesrc/sandbox/hege/20_hk.cf
===================================================================
--- rulesrc/sandbox/hege/20_hk.cf (revision 1421841)
+++ rulesrc/sandbox/hege/20_hk.cf (working copy)
@@ -128,11 +128,17 @@
 header __HK_NAME_DR From:name =~ /^DR\b/mi
 header __HK_NAME_FROM From:name =~ /^FROM\b/mi
 meta HK_NAME_MR_MRS __HK_NAME_MR_MRS && !FREEMAIL_FROM
+score HK_NAME_MR_MRS 2.0
 meta HK_NAME_FM_MR_MRS __HK_NAME_MR_MRS && FREEMAIL_FROM
+score HK_NAME_FM_MR_MRS 3.0
 meta HK_NAME_DR __HK_NAME_DR && !FREEMAIL_FROM
+score HK_NAME_DR 2.0
 meta HK_NAME_FM_DR __HK_NAME_DR && FREEMAIL_FROM
+score HK_NAME_FM_DR 3.0
 meta HK_NAME_FROM __HK_NAME_FROM && !FREEMAIL_FROM
+score HK_NAME_FROM 2.0
 meta HK_NAME_FM_FROM __HK_NAME_FROM && FREEMAIL_FROM
+score HK_NAME_FM_FROM 3.0
 endif

Running make test and will commit in a little bit. Feel free to veto!
Any value to adding a (?!\@) in there to avoid the "mr@" case?
+1 for that idea
(In reply to comment #6)
> Any value to adding a (?!\@) in there to avoid the "mr@" case?

I think this would work, but is your way more efficient?

From:name =~ /^M(?:RS?|ISS)[^@\b]/mi
(In reply to comment #8)
> (In reply to comment #6)
> > Any value to adding a (?!\@) in there to avoid the "mr@" case?

Playing with this more, I don't know what we are trying to hit and not hit. Here's a stub. I need more examples of what we are trying to hit/not hit to effectively work on this.

For example, a co-worker, whom I'm going to start getting to help with this project, pointed out that all we need is a period match for the cases I've identified. He's correct, but I think that's only because I need more case scenarios.

use strict;

my (@name, $name);

#SHOULD NOT HIT
$name[0] = 'mrfixit@test.com';
#SHOULD HIT
$name[1] = 'mr.FixIt@test.com';
#SHOULD NOT HIT
$name[2] = 'mr@test.com';

# Prints "1" after the name when the pattern matches, nothing otherwise.
foreach $name (@name) {
  print "$name - ";
  print $name =~ /^M(?:RS?|ISS)\.\b/mi;
  print "\n";
}
#SHOULD HIT
$name[3] = 'Mr Paul Guss <mrpaul@bk.ru>';
$name[4] = 'Mr Johnson Kwame <johnsonkwame@cantv.net>';
$name[5] = '"Mrs elisabeth .Alfredo" <assist_socialpmb@ifi.com.br>';
$name[6] = '"MRS. S. F. GADDAFI" <123miinfo@yippy.com>';
$name[7] = '"Mr. David Freder"<admin@r-e-m.co.za>';
$name[8] = '"Mr.Chamberlain"<audit@senate.gov>';

(from actual spam tagged with HK_NAME_MR_MRS)
(In reply to comment #8)
> (In reply to comment #6)
> > Any value to adding a (?!\@) in there to avoid the "mr@" case?
>
> I think this would work but is your way more efficient?
>
> From:name =~ /^M(?:RS?|ISS)[^@\b]/mi

That's not syntactically equivalent. Also, I don't know whether \b can even appear in a character class, as it's a zero-length assertion. Absent any testing, something like this:

/^M(?:RS?|ISS)\b(?!\@)/mi

(In reply to comment #9)
I don't think we want to explicitly require a period, as that would miss something like "MR BOZO FRAUDSTER", which we do want to score.

Also: why is From:name hitting on the from _address_? Does :name default to the email address if there is no display name present? Perhaps this would be better:

/^M(?:RS?|ISS)\b(?!\S*\@)/mi

...to completely exclude non-display-name From headers where the email address begins with the targeted text.
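For anyone wanting to compare the three candidate patterns side by side, the following is a small, untested-against-SpamAssassin Python sketch. The patterns are ports of the Perl ones quoted in this thread; the dictionary keys, the `hits` helper and the choice of Python's `re` module are assumptions for illustration. The sample strings are taken from comments #9 and #10 (display names only) plus the "MR BOZO FRAUDSTER" example.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.ASCII

# Hypothetical labels for the three variants discussed in this thread.
PATTERNS = {
    "original": re.compile(r"^M(?:RS?|ISS)\b", FLAGS),
    "no_at_next": re.compile(r"^M(?:RS?|ISS)\b(?!@)", FLAGS),
    "no_at_in_token": re.compile(r"^M(?:RS?|ISS)\b(?!\S*@)", FLAGS),
}

SAMPLES = [
    "mr@test.com",
    "mr.FixIt@test.com",
    "mrfixit@test.com",
    "MR BOZO FRAUDSTER",
    "Mr.Chamberlain",
    "MRS. S. F. GADDAFI",
]


def hits(pattern_name: str, text: str) -> bool:
    """Return True if the named pattern variant matches the text."""
    return PATTERNS[pattern_name].search(text) is not None


for text in SAMPLES:
    row = ", ".join(f"{name}={hits(name, text)}" for name in PATTERNS)
    print(f"{text!r}: {row}")
```

In this sketch, `(?!@)` only rejects an "@" immediately after the title, so "mr.FixIt@test.com" still hits, while `(?!\S*@)` rejects any "@" before the next whitespace and so excludes both address-like strings while keeping the spaced and dotted display names.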
(In reply to comment #11)
> /^M(?:RS?|ISS)\b(?!\S*\@)/mi
>
> ...to completely exclude non-display-name From headers where the email
> address begins with the targeted text.

I'd vote for this.
(In reply to comment #11)
> > From:name =~ /^M(?:RS?|ISS)[^@\b]/mi
>
> That's not syntactically equivalent. Also, I don't know whether \b can even
> appear in a character class as it's a zero-length assertion.

Agreed. I wasn't paying enough attention, sorry.

> Also: why is From:name hitting on the from _address_ ? Does :name default to
> the email address if there is no display name present?

I think this is the crux of the issue. That appears to be a bug that was fixed in trunk. See bug 6354.

I don't see how we can make the rule more friendly to mr@domain.tld unless they are running trunk. What's the *specific* From that you see, so I can confirm it's likely to be resolved already in 3.4?

Beyond that, I think the correct answer is to require 3.4 and to lower the scores.
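On the character-class question: in Perl, `\b` inside `[...]` is a literal backspace rather than a word boundary, and Python's `re` module behaves the same way, so the difference can be sketched in Python. The names `CLASS_VARIANT`, `BOUNDARY_VARIANT` and the two helpers are hypothetical, and this is only an illustration of why the two patterns are not equivalent.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.ASCII

# Inside [...] the escape \b means backspace (0x08), not a word boundary,
# so [^@\b] consumes any one character other than "@" or backspace.
CLASS_VARIANT = re.compile(r"^M(?:RS?|ISS)[^@\b]", FLAGS)
BOUNDARY_VARIANT = re.compile(r"^M(?:RS?|ISS)\b(?!@)", FLAGS)


def class_hits(text: str) -> bool:
    """Match using the character-class variant."""
    return CLASS_VARIANT.search(text) is not None


def boundary_hits(text: str) -> bool:
    """Match using the word-boundary plus lookahead variant."""
    return BOUNDARY_VARIANT.search(text) is not None


for text in ["mrfixit@test.com", "mr@test.com", "Mr", "Mrs Smith"]:
    print(f"{text!r}: class={class_hits(text)}, boundary={boundary_hits(text)}")
```

The character-class variant hits "mrfixit@test.com" (the "f" satisfies the class) and misses a bare "Mr" (there is no character left to consume), which is the opposite of what the boundary variant does in both cases.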
(In reply to comment #9)
> #SHOULD HIT
> $name[1] = 'mr.FixIt@test.com';

IMO that shouldn't hit. mr. in an address could just be a jokey address. It's completely different from "Mr John Smith", which is a misplaced attempt at professional formality.
(In reply to comment #14)
> (In reply to comment #9)
>
> > #SHOULD HIT
> > $name[1] = 'mr.FixIt@test.com';
>
> IMO that shouldn't hit. mr. in an address could just be a jokey address.
> It's completely different from "Mr John Smith" which is a misplaced attempt
> at professional formality.

Yeah, it's taken a bit to think about what should and shouldn't hit. But if my theory is correct, the current 3.4 trunk WON'T hit on that because it doesn't have a name portion.

So "Mr. Fix It" <mr.fixit@fixit.com> would hit but <mr.fixit@fixit.com> would not, because the name portion is blank.

Not sure if that's still going to reduce FPs, so step 1 is lowering the scores. Step 2 is likely to figure out some examples that should and should not hit based on 3.4 and improve the rule.

I'm planning on committing a version-based if block around the rule, plus lower score ceilings for masscheck, soon. A make test failed on something else, so I'm working on that.

Index: rulesrc/sandbox/hege/20_hk.cf
===================================================================
--- rulesrc/sandbox/hege/20_hk.cf (revision 1421841)
+++ rulesrc/sandbox/hege/20_hk.cf (working copy)
@@ -124,15 +124,25 @@
 ifplugin Mail::SpamAssassin::Plugin::FreeMail
-header __HK_NAME_MR_MRS From:name =~ /^M(?:RS?|ISS)\b/mi
+#REQUIRING VERSION 3.4 BECAUSE From:name works improperly prior to that version.
+if (version >= 3.004000)
+ header __HK_NAME_MR_MRS From:name =~ /^M(?:RS?|ISS)\b/mi
+ meta HK_NAME_MR_MRS __HK_NAME_MR_MRS && !FREEMAIL_FROM
+ score HK_NAME_MR_MRS 2.0
+ meta HK_NAME_FM_MR_MRS __HK_NAME_MR_MRS && FREEMAIL_FROM
+ score HK_NAME_FM_MR_MRS 3.0
+endif
+
 header __HK_NAME_DR From:name =~ /^DR\b/mi
 header __HK_NAME_FROM From:name =~ /^FROM\b/mi
-meta HK_NAME_MR_MRS __HK_NAME_MR_MRS && !FREEMAIL_FROM
-meta HK_NAME_FM_MR_MRS __HK_NAME_MR_MRS && FREEMAIL_FROM
 meta HK_NAME_DR __HK_NAME_DR && !FREEMAIL_FROM
+score HK_NAME_DR 2.0
 meta HK_NAME_FM_DR __HK_NAME_DR && FREEMAIL_FROM
+score HK_NAME_FM_DR 3.0
 meta HK_NAME_FROM __HK_NAME_FROM && !FREEMAIL_FROM
+score HK_NAME_FROM 2.0
 meta HK_NAME_FM_FROM __HK_NAME_FROM && FREEMAIL_FROM
+score HK_NAME_FM_FROM 3.0
 endif
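The theory above (blank name portion means no hit) can be illustrated with a rough Python sketch. It uses the standard library's `email.utils.parseaddr` as a stand-in for display-name extraction; the two helper functions are hypothetical, and the "fall back to the address" behaviour is modeled only on the suspicion voiced in this thread about pre-3.4 From:name, not on the actual SpamAssassin code.

```python
import re
from email.utils import parseaddr

MR_MRS = re.compile(r"^M(?:RS?|ISS)\b", re.IGNORECASE | re.MULTILINE | re.ASCII)


def name_only(from_header: str) -> str:
    """Display name only; an empty string when the header has none."""
    display_name, _address = parseaddr(from_header)
    return display_name


def name_or_address(from_header: str) -> str:
    """Display name, falling back to the address (assumed pre-3.4 behaviour)."""
    display_name, address = parseaddr(from_header)
    return display_name or address


def rule_hits(text: str) -> bool:
    """Apply the ported __HK_NAME_MR_MRS pattern to the extracted text."""
    return MR_MRS.search(text) is not None


HEADERS = [
    '"Mr. Fix It" <mr.fixit@fixit.com>',
    "<mr.fixit@fixit.com>",
    "mr@domain.tld",
]

for header in HEADERS:
    strict = rule_hits(name_only(header))
    fallback = rule_hits(name_or_address(header))
    print(f"{header!r}: name_only={strict}, name_or_address={fallback}")
```

Under these assumptions, only the header with a real display name hits when the name portion is left blank, whereas the fallback variant also hits on the bare addresses, which matches the mr@domain.tld report.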
(In reply to comment #15)
> So "Mr. Fix It" <mr.fixit@fixit.com> would hit but <mr.fixit@fixit.com>
> would not because Name portion is blank.

I have seen a fair amount of legitimate email with something like:

"foo@example"<foo@example.com>
(In reply to comment #14)
> (In reply to comment #9)
>
> > #SHOULD HIT
> > $name[1] = 'mr.FixIt@test.com';
>
> IMO that shouldn't hit. mr. in an address could just be a jokey address.
> It's completely different from "Mr John Smith" which is a misplaced attempt
> at professional formality.

+1
(In reply to comment #13)
> (In reply to comment #11)
> I don't see how we can make the rule more friendly to mr@domain.tld unless
> they are running trunk.

See comment #12
The rule looks good and safer. I have 979 samples hitting this rule, and all the spam is still hitting. The ham that still hits is mostly from an education-related mailing list:

Mr Jaime Garcia Salinas <jaime.garciasalinas@uqconnect.edu.au>
"Mr. Debojit Boro" <deb0001@tezu.ernet.in>
Mr Gerhard Bissels <gerhard.bissels@library.coop>
Miss Emily Telford <emily.telford@uqconnect.edu.au>
"Mr. Jamie Pruden" <jpruden@siprep.org>

These matches are within the spirit of the rule, and since overall scores are below 2 I think this is acceptable.

The only use case that wasn't considered is a Latin-charset encoded display name like this:

=?UTF-8?B?TWlzc8OjbyBJdGluZXJhbnRl?= <notification+xxxxxxxx@facebookmail.com>

(This is the raw form of "Missão Itinerante".) Like the above, this is still an acceptable FP; again, the overall score is still low.

Overall I would give a +1 to the patch.
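As a possible explanation for the "Missão Itinerante" hit, here is a hedged Python sketch. It decodes the encoded-word with the standard library's `email.header.decode_header` and applies the ported pattern once to the UTF-8 bytes and once to the decoded text. Whether SpamAssassin actually matched on bytes in this case is an assumption; the thread only reports that the rule hit. The names `BYTE_RULE` and `TEXT_RULE` are hypothetical.

```python
import re
from email.header import decode_header

ENCODED = "=?UTF-8?B?TWlzc8OjbyBJdGluZXJhbnRl?="

# decode_header returns (bytes, charset) pairs for encoded-words.
raw_bytes, charset = decode_header(ENCODED)[0]
text = raw_bytes.decode(charset)

# Bytes pattern: \b only knows ASCII word characters, so the first byte of
# the UTF-8 sequence for "ã" counts as a non-word character.
BYTE_RULE = re.compile(rb"^M(?:RS?|ISS)\b", re.IGNORECASE | re.MULTILINE)
# Text pattern: "ã" is a Unicode word character, so there is no boundary
# between "Miss" and "ão".
TEXT_RULE = re.compile(r"^M(?:RS?|ISS)\b", re.IGNORECASE | re.MULTILINE)

byte_hit = BYTE_RULE.search(raw_bytes) is not None
text_hit = TEXT_RULE.search(text) is not None

print(f"decoded: {text!r}")
print(f"byte-level match: {byte_hit}")
print(f"character-level match: {text_hit}")
```

If the byte-level reading is what happened, then matching against decoded, character-aware text would be one way to avoid this particular FP, though that is outside what the patch in this bug changes.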
We also get lots of hits on ham from schools. They like to be formal, alas. I'd seriously recommend ditching the rule, or at least scoring it MUCH lower.
Let's see if this helps now that rules are being published again.

svn commit -m 'Lowered score and added condition for 3.4 because From:name works differently prior - bug 6817'
Sending rulesrc/sandbox/hege/20_hk.cf
Transmitting file data .
Committed revision 1423268.
[root@devel trunk]#