Bug 891 - REPEATED_URLS test
Summary: REPEATED_URLS test
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (Eval Tests)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other / other
Importance: P2 normal
Target Milestone: ---
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-09-14 02:20 UTC by Daniel Quinlan
Modified: 2002-10-14 19:09 UTC




Description Daniel Quinlan 2002-09-14 02:20:17 UTC
Submitted by Ben Jackson as follows:

I get a lot of catalog and newsletter spam (no, I didn't sign up for it
but yes, someone probably did type my email address into a box somewhere).
One popular format for these messages is:

Enlarge your spam!
http://a.b.c/huge-url-with-ids

Get a free widget!
http://a.b.c/huge-url-with-ids

See what our naughty parakeet did on camera!
http://a.b.c/huge-url-with-ids

These all match the pattern "repeated lines starting with a URL on
the same server".  In my last 3000 junked messages, over 500 match
this test with 3 or more URLs.

So here's the rule:

rawbody REPEATED_URLS eval:check_for_repeated_urls()
describe REPEATED_URLS Catalog/newsletter lists of URLs (bjj)
score REPEATED_URLS 1.2

And the code in EvalTests.pm:

# Look for many lines with a URL at the beginning pointing to the same
# site.  This is common to catalog and newsletter junk.
sub check_for_repeated_urls {
	my ($self, $body) = @_;

	my %url_hosts = ();

	foreach (@{$body}) {
		next unless m,^https?://(\S+?)(/|$),;
		++$url_hosts{$1};
	}
	foreach (keys %url_hosts) {
		return 1 if $url_hosts{$_} >= 3;
	}
	return 0;
}
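For readers more comfortable outside Perl, the same check can be sketched in Python. This is an illustration only, not SpamAssassin code; the function name mirrors the eval test above, but the `threshold` parameter and `HOST_RE` name are my own.

```python
import re
from collections import Counter

# Same pattern as the Perl rule: non-greedy host capture, stopping at the
# first "/" or at end of line.
HOST_RE = re.compile(r'^https?://(\S+?)(/|$)')

def check_for_repeated_urls(body_lines, threshold=3):
    """Return True if `threshold` or more body lines begin with a URL
    pointing at the same host."""
    hosts = Counter()
    for line in body_lines:
        m = HOST_RE.match(line)
        if m:
            hosts[m.group(1)] += 1
    return any(count >= threshold for count in hosts.values())
```

Run against a message body like the example above (three `http://a.b.c/...` lines), it fires; with URLs spread across different hosts, it stays quiet.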
Comment 1 Daniel Quinlan 2002-09-14 02:21:28 UTC
Hmmm... The ratio is a little lower than I like:

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  11410     3714     7696    0.33    0.00    0.00  (all messages)
100.000   32.550   67.450    0.33    0.00    0.00  (all messages as %)
  2.375    6.058    0.598    0.91    0.45    1.20  REPEATED_URLS

After looking at the FPs a bit (meaning, not very hard), I changed the regular
expression from m,^https?://(\S+?)(/|$), to m,^https?://(\S+), and improved
the S/O ratio a bit at the expense of the spam percentage.

  0.929    2.585    0.130    0.95    0.50    1.20  REPEATED_URLS

I think that could be improved a bit, but 0.95 falls a little below my
comfort level for a single new rule.  For the newer version, the spams
that are matched are already very highly scored (23 average, 7.6
standard deviation).
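The effect of that regex change can be sketched as follows (a hedged illustration; `repeat_count` and the variable names are mine): dropping the non-greedy stop at "/" makes the captured key the host *plus* the full path, so the rule only fires when the exact same URL repeats, not merely the same host.

```python
import re
from collections import Counter

LOOSE = re.compile(r'^https?://(\S+?)(/|$)')  # original: captures host only
TIGHT = re.compile(r'^https?://(\S+)')        # revised: captures host plus path

def repeat_count(pattern, lines):
    """Highest repeat count among the keys the pattern extracts."""
    keys = Counter(m.group(1) for m in map(pattern.match, lines) if m)
    return max(keys.values(), default=0)

lines = [
    "http://a.b.c/offer?id=1",
    "http://a.b.c/offer?id=2",
    "http://a.b.c/offer?id=3",
]
# Same host, different paths: only the host-only pattern sees three repeats,
# which is why the tighter pattern trades spam hits for fewer FPs.
```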

I suspect the rule might also work better if integrated into the URI
code.  rawbody will miss a lot of potential hits.
Comment 2 Justin Mason 2002-10-15 03:09:55 UTC
I've seen a lot of this spam, particularly from Azoogle and "affiliated"
sites.  They're imitating a popular "news headlines" newsletter format.

I did a little testing a few weeks ago on ways to catch these, but I
don't think we can catch this format without a good deal of FPs for
legitimate news-headline newsletters.  Sorry, I'm closing this bug
as a WONTFIX.