SA Bugzilla – Bug 891
REPEATED_URLS test
Last modified: 2002-10-14 19:09:55 UTC
submitted by Ben Jackson as follows:

I get a lot of catalog and newsletter spam (no, I didn't sign up for it, but yes, someone probably did type my email address into a box somewhere). One popular format for these messages is:

    Enlarge your spam!
    http://a.b.c/huge-url-with-ids
    Get a free widget!
    http://a.b.c/huge-url-with-ids
    See what our naughty parakeet did on camera!
    http://a.b.c/huge-url-with-ids

These all match the pattern "repeated lines starting with a URL on the same server". In my last 3000 junked messages, over 500 match this test for 3 or more URLs.

So here's the rule:

    rawbody  REPEATED_URLS  eval:check_for_repeated_urls()
    describe REPEATED_URLS  Catalog/newsletter lists of URLs (bjj)
    score    REPEATED_URLS  1.2

And the code in EvalTests.pm:

    # Look for many lines with a URL at the beginning pointing to the same
    # site.  This is common to catalog and newsletter junk.
    sub check_for_repeated_urls {
        my ($self, $body) = @_;

        my %url_hosts = ();
        foreach (@{$body}) {
            next unless m,^https?://(\S+?)(/|$),;
            ++$url_hosts{$1};
        }
        foreach (keys %url_hosts) {
            return 1 if $url_hosts{$_} >= 3;
        }
        return 0;
    }
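For readers less familiar with Perl, a minimal Python translation of the check above (the function name and regex mirror the Perl; the translation itself is mine, not part of the patch) behaves like this:

```python
import re
from collections import Counter

# Lines that begin with a URL; the lazy (\S+?) followed by (/|$)
# captures just the host part, mirroring m,^https?://(\S+?)(/|$),
HOST_RE = re.compile(r'^https?://(\S+?)(/|$)')

def check_for_repeated_urls(body_lines):
    """Return True if 3+ lines start with a URL on the same host."""
    hosts = Counter()
    for line in body_lines:
        m = HOST_RE.match(line)
        if m:
            hosts[m.group(1)] += 1
    return any(count >= 3 for count in hosts.values())
```

Three line-leading URLs on the same host fire the rule; URLs spread across different hosts, or URLs not at the start of a line, do not.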
Hmmm... the ratio is a little lower than I'd like:

    OVERALL%   SPAM%  NONSPAM%   S/O   RANK  SCORE  NAME
       11410    3714      7696  0.33   0.00   0.00  (all messages)
     100.000  32.550    67.450  0.33   0.00   0.00  (all messages as %)
       2.375   6.058     0.598  0.91   0.45   1.20  REPEATED_URLS

After looking at the FPs a bit (meaning, not very hard), I changed the regular expression from

    m,^https?://(\S+?)(/|$),

to

    m,^https?://(\S+),

which improved the S/O ratio a bit at the expense of the spam percentage:

       0.929   2.585     0.130  0.95   0.50   1.20  REPEATED_URLS

I think that could be improved further, but 0.95 falls a little below my comfort level for a single new rule. For the newer version, the spams that are matched are already very highly scored (average 23, standard deviation 7.6). I suspect the rule would also work better if integrated into the URI code; rawbody will miss a lot of potential hits.
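The effect of that regex change can be sketched in Python (an illustration of the two patterns quoted above, not code from the patch): the original pattern keys each line on the host alone, while the revised greedy pattern keys on the entire URL, so distinct paths on one host no longer count toward the same bucket.

```python
import re

# Original: lazy match up to the first '/' (or end) captures the host.
host_only = re.compile(r'^https?://(\S+?)(/|$)')
# Revised: greedy match captures the whole non-whitespace URL.
full_url = re.compile(r'^https?://(\S+)')

lines = ["http://a.b.c/page1", "http://a.b.c/page2"]

host_keys = {host_only.match(l).group(1) for l in lines}
url_keys = {full_url.match(l).group(1) for l in lines}
# host_keys collapses both lines into one key ("a.b.c");
# url_keys keeps them distinct, so the >= 3 threshold is harder to reach.
```

That stricter keying is what trades away some spam hit rate for fewer false positives on legitimate newsletters that link to many pages on one site.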
I've seen a lot of this spam, particularly from Azoogle and "affiliated" sites. They're imitating a popular newsletter format for "news headlines". I did some testing on this a few weeks ago, looking for ways to catch these, but I don't think we can catch this format without a good number of FPs on legitimate news-headline newsletters. Sorry, I'm closing this bug as WONTFIX.