SA Bugzilla – Bug 3013
fp: Opengroupware mailer
Last modified: 2005-03-11 09:04:29 UTC
SpamAssassin has two rules, RATWARE_HASH_2 and RATWARE_HASH_2_V2, which are triggered by X-Mailer: headers longer than 16 and 14 characters, respectively, in which every character is one of (A-Z, a-z, 0-9, ., _). The OpenGroupware mailer uses the following header:

X-Mailer: OpenGroupware.org

which has 17 of the above characters in its tag, triggering both rules. OpenGroupware generates legitimate email, is not a spam mailer, and shouldn't be given a relatively high spam score.

The OpenGroupware maintainers have been contacted (see http://bugzilla.opengroupware.org/bugzilla/show_bug.cgi?id=607), but understandably feel that the header is completely legitimate and that this problem should be fixed in SpamAssassin. This is a recurring problem, as seen in http://bugzilla.spamassassin.org/show_bug.cgi?id=2108.

Can these rules have a list of exceptions added to them? I'm not a regular expression expert, so I don't know whether a maintainable exception list is implementable.
I was just noticing that my results for those rules kind of suck:

0.090 0.1192 0.0235 0.835 1.00 1.00 RATWARE_HASH_2_V2
0.067 0.0867 0.0235 0.787 0.79 1.00 RATWARE_HASH_2

So I don't know if this is really a problem or not.
For example, my FPs include:

X-Mailer: ClassifiedVentures
X-Mailer: com.reunion.site.mail
X-Mailer: com.snowball.mail
X-Mailer: webmail.delfi.lt

None of my valid hits use '_' or '.', so dropping those two characters would solve the bottom three there as well as OpenGroupware and the other ticket mentioned before. ClassifiedVentures is still a problem, but I can't really think of a way to tell it apart from 'ckGmqXGFWNfaNAxRse'.

FYI, if I remove the underscore and period:

0.080 0.1087 0.0153 0.877 1.00 0.01 T_RATWARE_HASH_2_V2
0.061 0.0814 0.0153 0.842 0.86 0.01 T_RATWARE_HASH_2
0.090 0.1192 0.0235 0.835 0.83 2.67 RATWARE_HASH_2_V2
0.067 0.0867 0.0235 0.787 0.66 0.00 RATWARE_HASH_2

So for me they're a net plus. The FPs are the same for V2 and non-V2: "ClassifiedVentures".
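To illustrate the effect of dropping '.' and '_' from the character class, here is a rough Python sketch. The two patterns are hypothetical stand-ins built from the description in this report (a run of 16 or more allowed characters); the actual RATWARE_HASH_2 rule text is not quoted in this ticket and may well differ.

```python
import re

# Hypothetical patterns, NOT the real SpamAssassin rules.
# Assumption: the rule fires when the whole X-Mailer value is a run of
# 16 or more characters from the allowed set.
WITH_PUNCT = re.compile(r"^[A-Za-z0-9._]{16,}$")     # allows '.' and '_'
WITHOUT_PUNCT = re.compile(r"^[A-Za-z0-9]{16,}$")    # '.' and '_' removed

# X-Mailer values mentioned in this ticket.
SAMPLES = [
    "OpenGroupware.org",
    "ClassifiedVentures",
    "com.reunion.site.mail",
    "com.snowball.mail",
    "webmail.delfi.lt",
    "ckGmqXGFWNfaNAxRse",
]


def hits(pattern, value):
    """Return True if the X-Mailer value matches the given pattern."""
    return pattern.match(value) is not None


if __name__ == "__main__":
    for value in SAMPLES:
        print(value, hits(WITH_PUNCT, value), hits(WITHOUT_PUNCT, value))
```

Under these assumed patterns, the dotted names stop matching once '.' is removed, while ClassifiedVentures and the random-looking string both still match, which is the remaining FP described above.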
Removing the . and _ sounds like a decent solution to me (though my experience with SpamAssassin is limited to the past week). With web addresses so common, . has become a common delimiter to separate words. For instance, Java uses the .tld.domain.project.subproject naming scheme for classes. It's no accident that many of the X-Mailer: headers use internet domains as their identifiers.

I think you're right that there's no simple way of distinguishing 'ckGmqXGFWNfaNAxRse' from ClassifiedVentures using regular expressions. Assuming what you're really looking for is randomly generated X-Mailer strings (or some ratware guy just hitting keys on his keyboard), you might just look at the "information content" of the string. 'ckGmqXGFWNfaNAxRse' is a random string of upper/lowercase text, whereas 'ClassifiedVentures' is not random at all. The random string contains more "information"; the non-random one contains less. A simple test might be to try compressing the string. If it's very compressible, it has low information content and wasn't generated randomly. If it's not very compressible, it has high information content and was probably generated randomly.

Slightly off topic, but could this kind of test be applied to other parts of a message too? I've noticed a lot of spam with random strings inserted in an attempt to get past filters. If you could identify these strings as random, you could add to a mail's spam rating.
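A minimal Python sketch of the compressibility idea, using zlib from the standard library. The function name and the sample inputs are made up for illustration, and nothing here is part of SpamAssassin. One caveat: on a string as short as an X-Mailer value, a general-purpose compressor's fixed overhead can dominate the result, so the sketch uses longer inputs, closer to the "random strings in the message body" case.

```python
import random
import string
import zlib


def compression_ratio(text):
    """Compressed size divided by original size (lower = more compressible)."""
    data = text.encode("utf-8")
    if not data:
        raise ValueError("empty input")
    return len(zlib.compress(data, 9)) / len(data)


# Illustrative inputs (assumptions, not data from this ticket):
# a repetitive English-like passage and a run of random letters.
rng = random.Random(0)  # fixed seed so the example is repeatable
random_text = "".join(rng.choice(string.ascii_letters) for _ in range(400))
english_text = "the quick brown fox jumps over the lazy dog " * 10

random_ratio = compression_ratio(random_text)
english_ratio = compression_ratio(english_text)

if __name__ == "__main__":
    print("random :", round(random_ratio, 2))
    print("english:", round(english_ratio, 2))
```

The random letters should compress noticeably worse than the repetitive text. Whether a usable threshold exists for real mail would have to be checked against a corpus.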
more accuracy and performance bugs going to 3.1.0 milestone
testing now... should have results in a day or so.
ok, current versions ignore X-Mailer lines with a "."