Bug 5736 - FPs on FROM_DOMAIN_NOVOWEL & URI_NOVOWEL
Summary: FPs on FROM_DOMAIN_NOVOWEL & URI_NOVOWEL
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.2.3
Hardware: Other other
Importance: P5 normal
Target Milestone: 3.2.6
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 5653
Depends on:
Blocks:
 
Reported: 2007-11-28 18:00 UTC by Matt Kettler
Modified: 2008-10-31 05:26 UTC



Description Matt Kettler 2007-11-28 18:00:59 UTC
As per the list email, unterm-durchschnitt.de and unterm-durchschnitt.com are
both domains that are suffering from FPs caused by these rules.

The problem of course being "rchschn" is a series of 7 non-vowels, which is just
enough to trigger the rules. 
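
A minimal sketch of why this domain trips the heuristic, assuming the rules look for a run of seven or more consecutive consonant letters. The actual rule regexes are not quoted in this report, so the character class and the helper name `longest_consonant_run` are illustrative assumptions:

```python
import re

# Assumption: a, e, i, o, u and y count as vowels; every other letter is a
# "non-vowel". The real rule definitions are not quoted in this report.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]+", re.IGNORECASE)

THRESHOLD = 7  # "a series of 7 non-vowels" is enough to trigger, per the report


def longest_consonant_run(domain: str) -> str:
    """Return the longest run of consecutive consonant letters in domain."""
    runs = CONSONANT_RUN.findall(domain)
    return max(runs, key=len, default="")


for domain in ("unterm-durchschnitt.de", "example.com"):
    run = longest_consonant_run(domain)
    verdict = "would hit" if len(run) >= THRESHOLD else "no hit"
    print(f"{domain}: longest run {run!r} ({len(run)} letters) -> {verdict}")
```

Under these assumptions the German domain yields the run "rchschn", which is exactly at the threshold, while an ordinary English domain stays well below it.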

Clearly the intent of the rule is to try to match obviously invalid domains that
are randomly generated. However, the assumption that 7 consonants in a row is
well beyond what any legitimate domain would have is obviously invalid,
particularly in languages like German where long consonant strings are more common.

Quite frankly, neither of these rules has a particularly high hit rate and
FROM_DOMAIN_NOVOWEL is almost zero in hits. FROM_LOCAL_NOVOWEL does much better,
but that's not a problem here. 

The question is, do we try to alter the rules to require 8? 9? or drop them? Or
some of each (ie: drop FROM_DOMAIN_NOVOWEL and modify URI_NOVOWEL)?
Comment 1 Justin Mason 2007-11-29 02:08:59 UTC
the accuracy of those rules isn't great these days:

http://ruleqa.spamassassin.org/?daterev=20071118-r596065-n&rule=%2FNOVOWEL&srcpath=&g=Change

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
0.00000   0.7266   0.0863   0.894    0.76    2.90  URI_NOVOWEL
0.00000   0.0768   0.0198   0.795    0.63    3.00  FROM_DOMAIN_NOVOWEL

89% and 79% accuracy, and pretty low hit rates.  I suggest we reduce them to
about 0.5 points each and make them 'tflags userconf' so the rescorer doesn't
set scores for them automatically.
Comment 2 Matt Kettler 2007-11-29 03:17:15 UTC
Well, that deals with the immediate problem.

I would still review their accuracy again in the not-too-distant future and
consider modifying or removing them if their performance continues to be poor.
Particularly FROM_DOMAIN_NOVOWEL, which technically no longer meets the S/O
requirements (although just barely under the line, it is under the line).

Comment 3 Justin Mason 2007-11-29 03:34:55 UTC
(In reply to comment #2)
> I would still review their accuracy again in the not-too-distant future and
> consider modifying or removing them if their performance continues to be poor.
> Particularly FROM_DOMAIN_NOVOWEL, which technically no longer meets the S/O
> requirements (although just barely under the line, it is under the line).

yeah, agreed.
Comment 4 Daryl C. W. O'Shea 2007-11-29 07:45:08 UTC
*** Bug 5653 has been marked as a duplicate of this bug. ***
Comment 5 Justin Mason 2008-01-01 12:42:51 UTC
moving on to 3.2.5, so 3.2.4 can be released
Comment 6 Justin Mason 2008-06-01 03:37:16 UTC
moving to 3.2.6 so that we can release a 3.2.5
Comment 7 Chris Vigelius 2008-07-26 07:58:38 UTC
Hello,

I'd like to emphasize that something should be done about this problem in the near future, as we are being hit very hard with false positives by this.

I am the webmaster for lichtschlag-buchverlag.de and lichtschlag-medien.de (a German publishing company), and as soon as we use the name "Lichtschlag" both as part of the sender address and in the signature, both FROM_DOMAIN_NOVOWEL and URI_NOVOWEL will trigger, resulting in a false positive with score > 5.

So this poor guy is now in the bizarre situation that he can't use his own name in the sender address, and can't mention his own (and perfectly legit) domain in mail, without the risk of getting marked as spam - in my opinion this is clearly unacceptable and should be addressed.
Comment 8 Sidney Markowitz 2008-07-26 12:59:05 UTC
I put a modified version of the rules without "h" in the sandbox to make it not sensitive to names with ch, sh, and th in them. Let's see how performance compares with the existing rules. The rule is a heuristic anyway.
Comment 9 Christian Schneider 2008-09-17 07:15:34 UTC
I own the domain "chrschn.de" and also suffer from this problem. In particular, I have problems sending mails to GMX addresses. As GMX uses SA as part of their spam detection, any mail from my domain is guaranteed to end up in the spam folder of the recipient as soon as it includes a URL to my website.

I think treating "h" as a vowel in the rule, as already suggested, could actually avoid most of the long-running non-vowel sequences in the German language.
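
A rough sketch of the effect of dropping "h" from the non-vowel class, using the domains reported so far in this bug. Both patterns are assumptions for illustration (the sandbox rule itself is not quoted here), and `qwrtzpsdfg.example` is a made-up stand-in for a randomly generated domain:

```python
import re

# Assumed character classes; the real rule regexes are not shown in this report.
WITH_H = re.compile(r"[bcdfghjklmnpqrstvwxz]{7}", re.IGNORECASE)
WITHOUT_H = re.compile(r"[bcdfgjklmnpqrstvwxz]{7}", re.IGNORECASE)

DOMAINS = [
    "unterm-durchschnitt.de",     # from the description
    "lichtschlag-buchverlag.de",  # from comment 7
    "chrschn.de",                 # from comment 9
    "qwrtzpsdfg.example",         # hypothetical random-looking domain
]


def hits(pattern: re.Pattern, domain: str) -> bool:
    """True if the domain contains seven consecutive letters from the class."""
    return pattern.search(domain) is not None


for domain in DOMAINS:
    print(f"{domain:28} with h: {hits(WITH_H, domain)!s:5}  "
          f"without h: {hits(WITHOUT_H, domain)}")
```

With these assumed patterns, the three legitimate German domains match only when "h" counts as a non-vowel, while the random-looking string matches either way. This is only a plausibility check; actual performance has to come from corpus results.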
Comment 10 Matt Kettler 2008-10-30 06:07:48 UTC
How did Sidney's improved rule perform?

We've got another domain FPing on this one, rssgmbh.de. This one is certainly quite reasonable for the company "RSS GmbH". You'll not typically see this in English speaking areas, but GmbH is a German company structure similar to LLC.

Comment 11 Matt Kettler 2008-10-30 06:46:57 UTC
Ok, digging in ruleqa, Sidney's rule is having a lower hit rate, but the FP rate is more-or-less zero.

The spam hits went from 370 to 237, and the nonspam hits went from 19 to 0. The difference in hits is 133:19 spam:ham or a 7:1 ratio in the affected hits, or a S/O of 0.875. This is cutting into the rule's effectiveness a bit, but it's clearly taking out most of the FPs as well.
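
The arithmetic above can be reproduced as follows. This sketch follows the comment's own calculation on raw hit counts; ruleqa's S/O column may be derived differently (for example from per-corpus percentages), so treat the variable names and the method as illustrative:

```python
# Hit counts quoted in the comment above (existing rule vs. Sidney's version).
spam_before, spam_after = 370, 237
ham_before, ham_after = 19, 0

# Hits that the modified rule no longer matches.
spam_lost = spam_before - spam_after
ham_lost = ham_before - ham_after

# S/O of the affected hits: spam / (spam + ham), computed on raw counts.
so_affected = spam_lost / (spam_lost + ham_lost)

print(f"affected hits: {spam_lost}:{ham_lost} spam:ham")
print(f"ratio: {spam_lost / ham_lost:.1f}:1")
print(f"S/O of affected hits: {so_affected:.3f}")
```

In other words, roughly one in eight of the hits removed by the change was a legitimate message.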

Since the existing rule is excessively punishing legitimate domains in Germany, I suggest we switch to Sidney's version if we're going to keep this rule.

http://ruleqa.spamassassin.org/20081029-r708834-n/T_SIDNEY_FROM_DOMAIN_NOVOWEL/detail


http://ruleqa.spamassassin.org/20081029-r708834-n/FROM_DOMAIN_NOVOWEL/detail


At the very least, if we don't do this, we should push Justin's rescore of 0.5 out over sa-update. These rules hit way too little spam and have *WAY* too high a S/O to justify their high scores.

This bug has been open for a shamefully long period of time for a bug that's causing FPs of 5.9 for legitimate domains. Let's try to get moving on this one.

Comment 12 Justin Mason 2008-10-30 07:32:28 UTC
(In reply to comment #11)
> Since the existing rule is excessively punishing legitimate domains in Germany,
> I suggest we switch to Sidney's version if we're going to keep this rule.

+1, and possibly combined with a lower score anyway.

Matt, feel free to check that in; you don't need to wait for reviews for changes to rules...
Comment 13 Karsten Bräckelmann 2008-10-30 10:13:47 UTC
The ruleqa results are heavily biased anyway. The only ham hits are in Michael's corpus, which is quite "small" compared to Daryl's and Justin's ham corpus. Extrapolating the number of hams to align the corpora draws an even much worse picture and makes the S/O ratio drop significantly -- below the already *poor* 0.5 it shows today (which is without Theo's massive corpus, granted). Most of the English-centric ham corpora are much less likely to contain German company domains.

I kind of wonder if From headers are a good indicator today anyway. Most of my spam shows a forged sender. The increasing problem of backscatter supports this.

+1 for seriously down-scoring FROM_DOMAIN_NOVOWEL, if we keep it at all.


Let's just hope GMX uses sa-update. Ironically, a German company. If they don't, I'm afraid it'll take quite some GMX users complaining, to gently massage the message from front-line support down to the tech staff.

(GMX themself evaded this rule, FWIW, using gmx-gmbh.de with a hyphen. Doh!)
Comment 14 Justin Mason 2008-10-31 05:24:33 UTC
(In reply to comment #11)
> Since the existing rule is excessively punishing legitimate domains in Germany,
> I suggest we switch to Sidney's version if we're going to keep this rule.
> 
> http://ruleqa.spamassassin.org/20081029-r708834-n/T_SIDNEY_FROM_DOMAIN_NOVOWEL/detail
> 
> 
> http://ruleqa.spamassassin.org/20081029-r708834-n/FROM_DOMAIN_NOVOWEL/detail
> 
> 
> At the very least, if we don't do this, we should push Justin's rescore of 0.5
> out over sa-update. These rules hit way too little spam and have *WAY* too high
> a S/O to justify their high scores.
> 
> This bug has been open for a shamefully long period of time for a bug that's
> causing FPs of 5.9 for legitimate domains. Let's try to get moving on this one.

I've done this now... both installing Sidney's version, and down-scoring all 3 rules to 0.5 each (they don't deserve higher with those S/Os).

in trunk:


: 145...; svn commit -m "bug 5736: fix FROM_LOCAL_NOVOWEL, FROM_DOMAIN_NOVOWEL, URI_NOVOWEL to avoid common FPs; reduce their scores to 0.5 points each, too" rules
Sending        rules/20_head_tests.cf
Sending        rules/20_uri_tests.cf
Sending        rules/50_scores.cf
Transmitting file data ...
Committed revision 709387.
: 146...; svn commit -m "bug 5736: fix FROM_LOCAL_NOVOWEL, FROM_DOMAIN_NOVOWEL, URI_NOVOWEL to avoid common FPs; reduce their scores to 0.5 points each, too" rulesrc
Sending        rulesrc/sandbox/sidney/70_other.cf
Transmitting file data .
Committed revision 709388.

in 3.2.x:

: 151...; svn commit -m "bug 5736: fix FROM_LOCAL_NOVOWEL, FROM_DOMAIN_NOVOWEL, URI_NOVOWEL to avoid common FPs; reduce their scores to 0.5 points each, too" rules
Sending        rules/20_head_tests.cf
Sending        rules/20_uri_tests.cf
Sending        rules/50_scores.cf
Transmitting file data ...
Committed revision 709389.

and in 3.2.x updates:

: 173...; svn commit -m "bug 5736: fix FROM_LOCAL_NOVOWEL, FROM_DOMAIN_NOVOWEL, URI_NOVOWEL to avoid common FPs; reduce their scores to 0.5 points each, too"
Sending        20_head_tests.cf
Sending        20_uri_tests.cf
Sending        50_scores.cf
Transmitting file data ...
Committed revision 709393.

and pushed a 3.2.x update. (3.3.0 update will go out automatically tonight.)
Comment 15 Justin Mason 2008-10-31 05:26:31 UTC
oh yeah, also:

: 189...; svn commit -m "bug 5736: also set the tflags for the NOVOWEL rules to be 'userconf' so their scores are not set by the rescorer"
Sending        rules/20_head_tests.cf
Sending        rules/20_uri_tests.cf
Transmitting file data ..
Committed revision 709397.