Bug 1108

Summary: all caps headers
Product: Spamassassin
Reporter: Doug McCasland <dougm>
Component: Regression Tests
Assignee: Daniel Quinlan <quinlan>
Status: RESOLVED FIXED
Severity: enhancement
CC: dev
Priority: P2
Version: 2.42
Target Milestone: ---
Hardware: Other
OS: Linux
Whiteboard:

Description Doug McCasland 2002-10-12 14:16:51 UTC
Suggestion for test:

A + score for any header that is all-caps.  I have found 99% of msgs that have 
the To: address in all caps to be spam.
Comment 1 Theo Van Dinter 2002-10-12 15:14:37 UTC
Subject: Re: [SAdev]  New: all caps headers

On Sat, Oct 12, 2002 at 02:16:51PM -0700, bugzilla-daemon@hughes-family.org wrote:
> A + score for any header that is all-caps.  I have found 99% of msgs that have 
> the To: address in all caps to be spam.

In a quick test, I have 27 spam hits and 74 nonspam hits.
Comment 2 Theo Van Dinter 2002-10-12 15:18:19 UTC
*** Bug 1111 has been marked as a duplicate of this bug. ***
Comment 3 Theo Van Dinter 2002-10-12 15:18:40 UTC
*** Bug 1110 has been marked as a duplicate of this bug. ***
Comment 4 Theo Van Dinter 2002-10-12 15:23:36 UTC
My test, BTW, was run using:

$ pcregrep '^(?i:to):\s+[A-Z\@\.\-\_0-9\s<>()]+$' non-spam/* | wc -l
     75
$ pcregrep '^(?i:to):\s+[A-Z\@\.\-\_0-9\s<>()]+$' spam/* | wc -l
     35

So this test doesn't look very good.  Somewhere around a .70 with a small 
match percentage (110 out of 20000).
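
For anyone without pcregrep, the count above can be approximated in Python. This is only a sketch: the helper name count_allcaps_to and the sample lines are made up for illustration, and it checks one line at a time the way pcregrep does.

```python
import re

# Same pattern as the pcregrep command above, applied one line at a time.
# (?i:to) makes only the header name case-insensitive; the value part must
# not contain any lowercase letters.
ALLCAPS_TO = re.compile(r"^(?i:to):\s+[A-Z@.\-_0-9\s<>()]+$")

def count_allcaps_to(lines):
    """Count lines whose To: value has no lowercase letters (hypothetical helper)."""
    return sum(1 for line in lines if ALLCAPS_TO.match(line.rstrip("\r\n")))

# Made-up sample lines, not taken from any corpus.
sample = [
    "To: BOBO@FOO.BIZ",
    "TO: <USER1@EXAMPLE.COM>",
    "To: bobo@foo.biz",
    "Subject: HELLO",
]
print(count_allcaps_to(sample))
```

Running it over every line of a spam folder and a non-spam folder separately should give counts comparable to the two wc -l results.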
Comment 5 Theo Van Dinter 2002-10-12 15:26:51 UTC
Oh, BTW...  Please don't add a new bug saying "revise this old bug", it's
extremely annoying.  You can just add a comment to the actual bug...
Comment 6 Matthew Cline 2002-10-12 15:35:59 UTC
I get an S/O of 0.112; not so hot.
Comment 7 Doug McCasland 2002-10-12 15:52:34 UTC
Hi, sorry about the dupes.  I think I get it now ;-o

Anyway, this is the main point, that "To:" (or other upper+lower tag) is not 
getting caught by the ALL_CAPS_HEADER test.

  TO: BOBO@FOO.BIZ   [the entire header is all caps]

gets an ALL_CAPS_HEADER score, fine.  But

  To: BOBO@FOO.BIZ  ["To:" is not all caps]

does not.  

Aside: my email server (postfix) converts incoming TO: to To:.
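
The difference between the two cases can be shown with a small Python sketch. It is not the ALL_CAPS_HEADER implementation; both function names are hypothetical, and it only contrasts "whole line is upper-case" with "value is upper-case".

```python
def whole_line_is_caps(header_line):
    """True when the line has cased letters and none of them are lowercase."""
    # str.isupper() ignores digits and punctuation such as '@' and '.'.
    return header_line.isupper()

def value_is_caps(header_line):
    """Same check, but applied only to the text after the first colon."""
    _name, _sep, value = header_line.partition(":")
    return value.isupper()

for line in ("TO: BOBO@FOO.BIZ", "To: BOBO@FOO.BIZ", "To: bobo@foo.biz"):
    print(line, whole_line_is_caps(line), value_is_caps(line))
```

Only the value-based check flags the second form, which is the case a server rewriting TO: to To: would produce.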
Comment 8 Daniel Quinlan 2002-10-13 01:27:51 UTC
I think I found a variation of this test that works well, but I had to
tweak it a lot, so I'm afraid it may be fragile outside of my corpus.  In
fact, I will be surprised if this test survives.

Notes:

- the test semantics are a bit different than the original suggestion too.
- if there are specific headers that cause more FPs than not, we can
  exclude them (I already excluded one.)

It's in CVS now as T_HEADER_ALL_CAPS.

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  12402     4708     7694    0.38    0.00    0.00  (all messages)
100.000   37.962   62.038    0.38    0.00    0.00  (all messages as %)
  2.104    5.523    0.013    1.00    0.71    1.00  T_HEADER_ALL_CAPS

(assigning to me)
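
For readers unfamiliar with the S/O column: the values in the table above are consistent with SPAM% divided by (SPAM% + NONSPAM%). The sketch below assumes that formula; the function name s_over_o is hypothetical and the script that produced the table may compute it differently.

```python
def s_over_o(spam_pct, nonspam_pct):
    """Share of a rule's hit rate that comes from spam (assumed formula)."""
    total = spam_pct + nonspam_pct
    return spam_pct / total if total else 0.0

# (SPAM%, NONSPAM%) pairs copied from the table above.
rows = {
    "(all messages as %)": (37.962, 62.038),
    "T_HEADER_ALL_CAPS": (5.523, 0.013),
}
for name, (spam_pct, nonspam_pct) in rows.items():
    print(f"{s_over_o(spam_pct, nonspam_pct):.2f}  {name}")
```

An S/O near 1.00 means almost all hits are spam; a value near the corpus baseline (0.38 here) means the rule tells you nothing.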
Comment 9 Theo Van Dinter 2002-10-13 09:28:03 UTC
Subject: Re: [SAdev]  all caps headers

On Sun, Oct 13, 2002 at 01:27:52AM -0700, bugzilla-daemon@hughes-family.org wrote:
> OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
>   12402     4708     7694    0.38    0.00    0.00  (all messages)
> 100.000   37.962   62.038    0.38    0.00    0.00  (all messages as %)
>   2.104    5.523    0.013    1.00    0.71    1.00  T_HEADER_ALL_CAPS

Not even close on mine:

 73.235   41.268   95.311    0.30    0.20    1.00  T_HEADER_ALL_CAPS

I'll poke around a little bit to see if there's a common cause for this,
but ...

Comment 10 Theo Van Dinter 2002-10-13 10:33:36 UTC
Subject: Re: [SAdev]  all caps headers

On Sun, Oct 13, 2002 at 09:28:04AM -0700, bugzilla-daemon@hughes-family.org wrote:
> I'll poke around a little bit to see if there's a common cause for this,
> but ...

I found the main cause was the X-UID header.  After removing that,
the scores at least look a little better:

  0.856    1.120    0.675    0.62    0.35    1.00  T_HEADER_ALL_CAPS

I found a number of mails which had semi-random "X-<something>: <no
letters in text>", so I modified the code to only check headers that have
[a-zA-Z] in the value portion:

  0.530    0.882    0.288    0.75    0.41    1.00  T_HEADER_ALL_CAPS

I then found a few more headers to remove:
        /^X-UIDL?:/
        /^X-IMDB-MAILFILTER:/
        /^X-SMTP-\w+:/
        /^X-AUTH:/
        /^X-EPRI-ID:/
        /^X-FID:/
        /^X-BG:/
        /^X-AS[HN]:/
        /^X-AUTO:/
        /^X-A[DNP]:/
        /^X-\d+:/

  0.370    0.870    0.025    0.97    0.62    1.00  T_HEADER_ALL_CAPS

The left over FPs are valid mails with either all-caps TO or CONTENT-TYPE
headers.

To make the list easier, I'm tempted to say ignore /^X-/, and just pay
attention to "real" headers.  Overall, I don't think the benefits justify
the cost of processing, but YMMV.
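
The filtering steps described above can be sketched in Python as follows. This is an assumption about the logic, not the T_HEADER_ALL_CAPS code in CVS: the function name is hypothetical, and "all caps" is taken to mean that the header line contains no lowercase letter.

```python
import re

# The exclusion patterns listed above, merged into one alternation.
EXCLUDED = re.compile(
    r"^X-(?:UIDL?|IMDB-MAILFILTER|SMTP-\w+|AUTH|EPRI-ID|FID|BG|AS[HN]|AUTO|A[DNP]|\d+):"
)

def is_all_caps_header(line):
    """Hypothetical check: all-caps header line, after the exclusions above."""
    _name, sep, value = line.partition(":")
    if not sep or EXCLUDED.match(line):
        return False
    # Only consider headers with at least one letter in the value portion.
    if not re.search(r"[a-zA-Z]", value):
        return False
    # Assumed meaning of "all caps": no lowercase letter anywhere in the line.
    return re.search(r"[a-z]", line) is None

for line in ("TO: BOBO@FOO.BIZ", "X-UID: ABC123", "X-FOO: 12345", "X-X: ABCDEF"):
    print(line, is_all_caps_header(line))
```

Ignoring everything that starts with X- would replace the EXCLUDED pattern with a single check for the "X-" prefix.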

Comment 11 Daniel Quinlan 2002-10-13 13:45:11 UTC
Subject: Re:  all caps headers

felicity@kluge.net writes:

> To make the list easier, I'm tempted to say ignore /^X-/, and just pay
> attention to "real" headers.  Overall, I don't think the benefits justify
> the cost of processing, but YMMV.

I think I tried ignoring /^X-/ and lost almost all of the SPAM%, but I
can't remember which version of the test that was, so I'll try it
again.

Dan

Comment 12 Daniel Quinlan 2002-10-13 17:08:25 UTC
Version that tests only headers not matching /^X-/:

  0.065    0.170    0.000    1.00    0.67    1.00  T_HEADER_ALL_CAPS

I looked at the spam hit header names of my previous version and here
they are:

    195 X-X:
     11 X-POSTMASTER:
      8 X-SLUIDL:
      8 SUBJECT:
      4 X-PLATTER:
      4 X-IONK:
      4 X-CRUNCHERS:
      4 X-CORONNA:
      2 X-REFERER:
      2 X-FROM:

X-X: has another rule, SUBJECT: has (or had, we might have removed it
due to FPs) another rule, and the others aren't particularly high
frequency.  I'm closing this bug as WONTFIX and removing the stuff from
CVS.
Comment 13 Doug McCasland 2002-10-13 18:16:57 UTC
Well, I didn't understand all the hoopla.  ;-)  But here's what I put in the 
system-wide prefs:

header ALLCAPS_TOCC ToCc !~ /[a-z]+\@/
score ALLCAPS_TOCC 2.0

This adds 2.0 if the ToCC headers didn't contain at least one lower-case lhs --
 regardless of whether it's To: or TO: (or Cc: or CC:).  I'm not yet sure 
about any side-effects.  +2.0 might seem like a stiff penalty, but I have 
never seen an upper-case ToCc that wasn't spam.  (Why spammers end up with UC 
addrs, and then use them without conversion, is beyond me.)  Thanks for all 
your work.
Comment 14 Daniel Quinlan 2002-10-13 18:56:41 UTC
Subject: Re:  all caps headers

dougm@bravoecho.net:

> header ALLCAPS_TOCC ToCc !~ /[a-z]+\@/
> score ALLCAPS_TOCC 2.0

That's not a very good rule.  This will match any sort of address
ending in a number.  I get the impression you aren't really testing
these rules.
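
To make the problem concrete, here is a small Python sketch of the quoted rule. The function name is hypothetical and the addresses are made up; it only mirrors the regex, not SpamAssassin's header handling.

```python
import re

# Mirrors the rule body /[a-z]+\@/ : a lowercase letter directly before '@'.
LOWER_BEFORE_AT = re.compile(r"[a-z]+@")

def allcaps_tocc_fires(to_cc_value):
    """True when the negated match (!~) would succeed, i.e. the rule fires."""
    return LOWER_BEFORE_AT.search(to_cc_value) is None

for addr in ("BOBO@FOO.BIZ", "bobo@foo.biz", "bob1@foo.com"):
    print(addr, allcaps_tocc_fires(addr))
```

The third address is entirely lower-case but still fires, because the character before the @ is a digit.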

> This adds 2.0 if the ToCC headers didn't contain at least one
> lower-case lhs -- regardless of whether it's To: or TO: (or Cc: or
> CC:).  I'm not yet sure about any side-effects.  +2.0 might seem
> like a stiff penalty, but I have never seen an upper-case ToCc that
> wasn't spam.  (Why spammers end up with UC addrs, and then use them
> without conversion, is beyond me.)  Thanks for all your work.

2.0 would be WAY too high of a score for that rule.  I only get an S/O
of 0.95.  Most rules with a score of 2.0 have an S/O of 1.00 or very
close to that.  (A few have lower S/O ratios, but they hit relatively
few messages.)

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  12402     4708     7694    0.38    0.00    0.00  (all messages)
100.000   37.962   62.038    0.38    0.00    0.00  (all messages as %)
 10.224   24.894    1.248    0.95    0.50    1.00  T_ALLCAPS_TOCC

Some rules with scores of about 2.0:

  1.121    2.952    0.000    1.00    0.86    2.00  FORGED_MX_HOTMAIL
  0.734    1.933    0.000    1.00    0.83    2.03  FORGED_RCVD_TRAIL
  0.419    1.105    0.000    1.00    0.79    2.01  COMPARE_RATES
  0.298    0.786    0.000    1.00    0.77    1.99  COPY_ACCURATELY
  0.266    0.701    0.000    1.00    0.76    1.93  HR_4176
  0.250    0.658    0.000    1.00    0.76    1.98  LIVE_PORN
  0.161    0.425    0.000    1.00    0.73    1.99  CBYI
  0.814    2.124    0.013    0.99    0.65    2.01  BULK_EMAIL
  0.016    0.042    0.000    1.00    0.57    2.00  LYING_EYES
  0.177    0.446    0.013    0.97    0.54    1.94  DRASTIC_REDUCED
  0.403    0.913    0.091    0.91    0.46    2.06  CHARSET_FARAWAY_HEADERS
  0.185    0.382    0.065    0.85    0.42    1.96  US_DOLLARS

Comment 15 Justin Mason 2002-10-14 09:44:02 UTC
Subject: Re: [SAdev]  all caps headers 


bugzilla-daemon@hughes-family.org said:

> $ pcregrep '^(?i:to):\s+[A-Z\@\.\-\_0-9\s<>()]+$' non-spam/* | wc -l
>      75
> $ pcregrep '^(?i:to):\s+[A-Z\@\.\-\_0-9\s<>()]+$' spam/* | wc -l
>      35
> 
> So this test doesn't look very good.  Somewhere around a .70 with a small 
> match percentage (110 out of 20000).

Yep, about the same here.  It's worth noting that the default 
setup for LISTSERV mailing lists uses upper-case for the From
and To addrs, so there's a big pool of false positives right
there.

--j.

Comment 16 Daniel Quinlan 2002-10-14 19:51:43 UTC
(reopening bug)

I wrote some rules with good stats (for me) that are worth testing.  They're
a bit more complicated and we'll need to figure out which ones work the best
for everyone (some mailing list software might be a problem).

Here are my results for the new rules.  The best RANK for previous rules
was about 0.50.  These are all much better (some lower than others, but
this is the set that looks worthy of testing on other corpuses).

OVERALL%   SPAM% NONSPAM%     S/O    RANK   SCORE  NAME
  12402     4708     7694    0.38    0.00    0.00  (all messages)
100.000   37.962   62.038    0.38    0.00    0.00  (all messages as %)
  0.508    1.338    0.000    1.00    0.80    1.00  T_NO_LOWER_TO_CC_ALL_E
  0.508    1.338    0.000    1.00    0.80    1.00  T_NO_LOWER_TO_ALL_E
  0.484    1.274    0.000    1.00    0.80    1.00  T_NO_LOWER_TOCC_ALL_E
  0.169    0.446    0.000    1.00    0.73    1.00  T_NO_LOWER_FROM_1
  0.129    0.340    0.000    1.00    0.71    1.00  T_NO_LOWER_FROM_2
  4.636   12.107    0.065    0.99    0.66    1.00  T_NO_LOWER_TOCC_USER_E
  4.112   10.599    0.143    0.99    0.59    1.00  T_NO_LOWER_TOCC_HOST_E
  4.733   12.192    0.169    0.99    0.59    1.00  T_NO_LOWER_TOCC_EITHER_E
  7.523   19.329    0.299    0.98    0.59    1.00  T_NO_LOWER_TOCC_USER
  4.185   10.705    0.195    0.98    0.57    1.00  T_NO_LOWER_TO_HOST_E
  7.620   19.414    0.403    0.98    0.57    1.00  T_NO_LOWER_TOCC_EITHER
  6.999   17.821    0.377    0.98    0.56    1.00  T_NO_LOWER_TOCC_HOST
  4.241   10.790    0.234    0.98    0.56    1.00  T_NO_LOWER_TO_CC_HOST_E
  4.814   12.213    0.286    0.98    0.56    1.00  T_NO_LOWER_TO_USER_E
  4.870   12.319    0.312    0.98    0.55    1.00  T_NO_LOWER_TO_CC_USER_E
  3.370    8.496    0.234    0.97    0.55    1.00  T_NO_LOWER_TOCC_ALL
  7.394   18.585    0.546    0.97    0.54    1.00  T_NO_LOWER_TO_HOST
  8.023   20.093    0.637    0.97    0.54    1.00  T_NO_LOWER_TO_USER
  4.943   12.298    0.442    0.97    0.53    1.00  T_NO_LOWER_TO_EITHER_E
  3.717    9.218    0.351    0.96    0.52    1.00  T_NO_LOWER_TO_ALL
  8.152   20.178    0.793    0.96    0.52    1.00  T_NO_LOWER_TO_EITHER
  5.023   12.404    0.507    0.96    0.52    1.00  T_NO_LOWER_TO_CC_EITHER_E

Comment 17 Daniel Quinlan 2002-11-17 17:14:24 UTC
Only one rule was worth keeping: one of the T_FROM_NO_LOWER variants is
now T_FROM_NO_LOWER.  The rest are DELETED.