SA Bugzilla – Bug 5736
FPs on FROM_DOMAIN_NOVOWEL & URI_NOVOWEL
Last modified: 2008-10-31 05:26:31 UTC
As per the list email, unterm-durchschnitt.de and unterm-durchschnitt.com are both domains that are suffering from FPs caused by these rules. The problem, of course, is that "rchschn" is a series of 7 non-vowels, which is just enough to trigger the rules. Clearly the intent of the rule is to try to match obviously invalid domains that are randomly generated. However, the assumption that 7 consonants in a row is well beyond what any legitimate domain would have is obviously invalid, particularly in languages like German, where long consonant strings are more common. Quite frankly, neither of these rules has a particularly high hit rate, and FROM_DOMAIN_NOVOWEL is almost zero in hits. FROM_LOCAL_NOVOWEL does much better, but that's not a problem here. The question is, do we try to alter the rules to require 8? 9? Or drop them? Or some of each (i.e. drop FROM_DOMAIN_NOVOWEL and modify URI_NOVOWEL)?
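To make the trigger condition concrete, here is a minimal Python sketch that finds the longest run of non-vowel letters in a domain. The consonant class (every ASCII letter except a, e, i, o, u and y) and the helper names `longest_consonant_run` and `would_trigger` are assumptions for illustration; they are not the actual rule definitions.

```python
import re

# Assumed consonant class: all ASCII letters except a, e, i, o, u, y.
# The real rules may use a different class; this is illustrative only.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]+", re.IGNORECASE)


def longest_consonant_run(domain: str) -> str:
    """Return the longest run of consecutive consonant letters in domain."""
    runs = CONSONANT_RUN.findall(domain)
    return max(runs, key=len, default="")


def would_trigger(domain: str, threshold: int = 7) -> bool:
    """True if the domain has at least `threshold` consonants in a row."""
    return len(longest_consonant_run(domain)) >= threshold


for d in ("unterm-durchschnitt.de", "unterm-durchschnitt.com"):
    run = longest_consonant_run(d)
    # Show the run, its length, and the outcome at thresholds 7 and 8.
    print(d, run, len(run), would_trigger(d, 7), would_trigger(d, 8))
```

Under these assumptions the longest run in both domains is exactly 7 letters, so a threshold of 8 would no longer match them.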
the accuracy of those rules isn't great these days:

http://ruleqa.spamassassin.org/?daterev=20071118-r596065-n&rule=%2FNOVOWEL&srcpath=&g=Change

  0.00000   0.7266   0.0863   0.894   0.76   2.90  URI_NOVOWEL
  0.00000   0.0768   0.0198   0.795   0.63   3.00  FROM_DOMAIN_NOVOWEL

89% and 79% accuracy, and pretty low hit rates. I suggest we reduce them to about 0.5 points each and make them 'tflags userconf' so the rescorer doesn't set scores for them automatically.
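The 89% and 79% figures can be reproduced from the quoted rows, assuming the second and third columns are the spam and ham hit percentages and the fourth is S/O (the column headers are not part of the quoted output, so this reading is an assumption). A small Python sketch, with `s_over_o` as a hypothetical helper name:

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Spam hit rate as a fraction of the combined spam + ham hit rate."""
    return spam_pct / (spam_pct + ham_pct)


# (spam%, ham%) pairs copied from the rows quoted above.
rows = {
    "URI_NOVOWEL": (0.7266, 0.0863),
    "FROM_DOMAIN_NOVOWEL": (0.0768, 0.0198),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(f"{name}: S/O = {s_over_o(spam_pct, ham_pct):.3f}")
```

With these inputs the computed values round to 0.894 and 0.795, matching the fourth column.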
Well, that deals with the immediate problem. I would still review their accuracy again in the not-too-distant future and consider modifying or removing them if their performance continues to be poor. Particularly FROM_DOMAIN_NOVOWEL, which technically no longer meets the S/O requirements (although just barely under the line, it is under the line).
(In reply to comment #2)
> I would still review their accuracy again in the not-too-distant future and
> consider modifying or removing them if their performance continues to be poor.
> Particularly FROM_DOMAIN_NOVOWEL, which technically no longer meets the S/O
> requirements (although just barely under the line, it is under the line).

yeah, agreed.
*** Bug 5653 has been marked as a duplicate of this bug. ***
moving on to 3.2.5, so 3.2.4 can be released
moving to 3.2.6 so that we can release a 3.2.5
hello, I'd like to emphasize that something should be done about this problem in the near future, as we are hit very hard with false positives by this. I am the webmaster for lichtschlag-buchverlag.de and lichtschlag-medien.de (a German publishing company), and as soon as we use the name "Lichtschlag" as part of the sender address and in the signature, both FROM_DOMAIN_NOVOWEL and URI_NOVOWEL will trigger, resulting in a false positive with score > 5. So this poor guy is now in the bizarre situation that he can't use his own name in the sender address, and can't mention his own (and perfectly legit) domain in mail, without the risk of getting marked as spam. In my opinion this is clearly unacceptable and should be addressed.
I put a modified version of the rules without "h" in the sandbox to make it not sensitive to names with ch, sh, and th in them. Let's see how performance compares with the existing rules. The rule is a heuristic anyway.
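As a rough illustration of why dropping "h" helps, the Python sketch below compares two patterns on the domains reported so far. Both patterns are assumptions about the general shape of the check (7 consecutive letters from a consonant class that excludes y); they are not copied from the sandbox rule. The last entry is a made-up random-looking domain, included only to show that such strings still match.

```python
import re

# Assumed shape of the existing check: 7 consecutive consonants.
WITH_H = re.compile(r"[bcdfghjklmnpqrstvwxz]{7}", re.IGNORECASE)
# Same class with "h" removed, so ch/sh/th break up a consonant run.
WITHOUT_H = re.compile(r"[bcdfgjklmnpqrstvwxz]{7}", re.IGNORECASE)

domains = [
    "unterm-durchschnitt.de",
    "lichtschlag-buchverlag.de",
    "lichtschlag-medien.de",
    "xkqzvbnmw.com",  # hypothetical random-looking domain, not from the bug
]

for d in domains:
    # Columns: domain, hit with "h" in the class, hit without "h".
    print(d, bool(WITH_H.search(d)), bool(WITHOUT_H.search(d)))
```

In this sketch the three German domains match only the pattern that includes "h", while the random-looking string matches both.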
I own the domain "chrschn.de" and also suffer from this problem. In particular, I have problems sending mails to GMX addresses. As GMX uses SA as part of their spam detection, any mail from my domain is guaranteed to end up in the spam folder of the recipient as soon as it includes a URL to my website. I think treating "h" as a vowel in the rule, as already suggested, could actually avoid most of the long-running non-vowel sequences in the German language.
How did Sidney's improved rule perform? We've got another domain FPing on this one, rssgmbh.de. This one is certainly quite reasonable for the company "RSS GmbH". You'll not typically see this in English-speaking areas, but GmbH is a German company structure similar to LLC.
Ok, digging in ruleqa, Sidney's rule is having a lower hit rate, but the FP rate is more-or-less zero. The spam hits went from 370 to 237, and the nonspam hits went from 19 to 0. The difference in hits is 133:19 spam:ham, or a 7:1 ratio in the affected hits, or an S/O of 0.875. This is cutting into the rule's effectiveness a bit, but it's clearly taking out most of the FPs as well.

Since the existing rule is excessively punishing legitimate domains in Germany, I suggest we switch to Sidney's version if we're going to keep this rule.

http://ruleqa.spamassassin.org/20081029-r708834-n/T_SIDNEY_FROM_DOMAIN_NOVOWEL/detail

http://ruleqa.spamassassin.org/20081029-r708834-n/FROM_DOMAIN_NOVOWEL/detail

At the very least, if we don't do this, we should push Justin's rescore of 0.5 out over sa-update. These rules hit way too little spam and have *WAY* too high a S/O to justify their high scores.

This bug has been open for a shamefully long period of time for a bug that's causing FPs of 5.9 for legitimate domains. Let's try to get moving on this one.
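The arithmetic behind the 133:19 figure can be checked with a few lines of Python. This sketch works on the raw hit counts quoted above, as the comment does, and the function name `removed_hits` is hypothetical.

```python
def removed_hits(old_spam: int, new_spam: int, old_ham: int, new_ham: int):
    """Return (spam removed, ham removed, spam:ham ratio, S/O of removed hits)."""
    lost_spam = old_spam - new_spam
    lost_ham = old_ham - new_ham
    ratio = lost_spam / lost_ham
    s_o = lost_spam / (lost_spam + lost_ham)
    return lost_spam, lost_ham, ratio, s_o


# Hit counts quoted in the comment above: old rule vs. Sidney's version.
lost_spam, lost_ham, ratio, s_o = removed_hits(370, 237, 19, 0)
print(lost_spam, lost_ham, ratio, s_o)
```

The hits removed by the new rule come out as 133 spam and 19 ham, which is a 7:1 ratio and an S/O of 0.875 for that slice.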
(In reply to comment #11)
> Since the existing rule is excessively punishing legitimate domains in Germany,
> I suggest we switch to Sidney's version if we're going to keep this rule.

+1, and possibly combined with a lower score anyway. Matt, feel free to check that in; you don't need to wait for reviews for changes to rules...
The ruleqa results are heavily biased anyway. The only ham hits are in Michael's corpus, which is quite "small" compared to Daryl's and Justin's ham corpus. Extrapolating the number of hams to align the corpora paints an even worse picture and makes the S/O ratio drop significantly -- below the already *poor* 0.5 it shows today (which is without Theo's massive corpus, granted). Most of the English-centric ham corpora are much less likely to contain German company domains.

I kind of wonder if From headers are a good indicator today anyway. Most of my spam shows a forged sender. The increasing problem of backscatter supports this.

+1 for seriously down-scoring FROM_DOMAIN_NOVOWEL, if we keep it at all.

Let's just hope GMX uses sa-update. Ironically, a German company. If they don't, I'm afraid it'll take quite some GMX users complaining to gently massage the message from front-line support down to the tech staff. (GMX themselves evaded this rule, FWIW, using gmx-gmbh.de with a hyphen. Doh!)
(In reply to comment #11)
> Since the existing rule is excessively punishing legitimate domains in Germany,
> I suggest we switch to Sidney's version if we're going to keep this rule.
>
> http://ruleqa.spamassassin.org/20081029-r708834-n/T_SIDNEY_FROM_DOMAIN_NOVOWEL/detail
>
> http://ruleqa.spamassassin.org/20081029-r708834-n/FROM_DOMAIN_NOVOWEL/detail
>
> At the very least, if we don't do this, we should push Justin's rescore of 0.5
> out over sa-update. These rules hit way too little spam and have *WAY* too high
> a S/O to justify their high scores.
>
> This bug has been open for a shamefully long period of time for a bug that's
> causing FPs of 5.9 for legitimate domains. Let's try to get moving on this one.

I've done this now... both installing Sidney's version, and down-scoring all 3 rules to 0.5 each (they don't deserve higher with those S/Os).

in trunk:

: 145...; svn commit -m "bug 5736: fix FROM_LOCAL_NOVOWEL, FROM_DOMAIN_NOVOWEL, URI_NOVOWEL to avoid common FPs; reduce their scores to 0.5 points each, too" rules
Sending rules/20_head_tests.cf
Sending rules/20_uri_tests.cf
Sending rules/50_scores.cf
Transmitting file data ...
Committed revision 709387.

: 146...; svn commit -m "bug 5736: fix FROM_LOCAL_NOVOWEL, FROM_DOMAIN_NOVOWEL, URI_NOVOWEL to avoid common FPs; reduce their scores to 0.5 points each, too" rulesrc
Sending rulesrc/sandbox/sidney/70_other.cf
Transmitting file data .
Committed revision 709388.

in 3.2.x:

: 151...; svn commit -m "bug 5736: fix FROM_LOCAL_NOVOWEL, FROM_DOMAIN_NOVOWEL, URI_NOVOWEL to avoid common FPs; reduce their scores to 0.5 points each, too" rules
Sending rules/20_head_tests.cf
Sending rules/20_uri_tests.cf
Sending rules/50_scores.cf
Transmitting file data ...
Committed revision 709389.
and in 3.2.x updates:

: 173...; svn commit -m "bug 5736: fix FROM_LOCAL_NOVOWEL, FROM_DOMAIN_NOVOWEL, URI_NOVOWEL to avoid common FPs; reduce their scores to 0.5 points each, too"
Sending 20_head_tests.cf
Sending 20_uri_tests.cf
Sending 50_scores.cf
Transmitting file data ...
Committed revision 709393.

and pushed a 3.2.x update. (3.3.0 update will go out automatically tonight.)
oh yeah, also:

: 189...; svn commit -m "bug 5736: also set the tflags for the NOVOWEL rules to be 'userconf' so their scores are not set by the rescorer"
Sending rules/20_head_tests.cf
Sending rules/20_uri_tests.cf
Transmitting file data ..
Committed revision 709397.