Bug 2970 - "longwords" rules
Summary: "longwords" rules
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 enhancement
Target Milestone: 3.1.0
Assignee: Daniel Quinlan
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-01-26 23:59 UTC by Justin Mason
Modified: 2004-04-20 06:43 UTC (History)



Attachment Type Modified Status Actions Submitter/CLA Status
sample rule text/plain None Fred T [HasCLA]

Description Justin Mason 2004-01-26 23:59:17 UTC
Robert Menschel <Robert at Menschel.net> said:

Date: Sun, 25 Jan 2004 22:37:03 -0800
Subject: [SAtalk] Longwords

Received an email this morning which reminded me about my longwords
rules, which apparently got lost when I migrated my mass-check system
from my mail server to my PC.

This was my exploration of the random words spammers have been including
at the bottom of their emails, or in their text portions, or in their
invisible text, to confuse some anti-spam software. (I call these words
Bayes Fodder, since over time it seems they are helping my Bayes identify
spam better and better and better.)

Anyway, I rebuilt, reran, refined, and:

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL     SPAM      HAM     S/O    RANK   SCORE  NAME
  91714    74113    17601    0.808   0.00    0.00  (all messages)
   7431     7429        2    0.999   1.00   3.00  RM_bpt_longwords68a
   6596     6595        1    0.999   0.98   1.00  RM_bpt_longwords69a
   4163     4163        0    1.000   0.71   2.00  RM_bpt_longwords78a
   8761     8753        8    0.996   0.51   3.00  RM_bpt_longwords59a
   2950     2950        0    1.000   0.48   1.00  RM_bpt_longwords79a
   1162     1162        0    1.000   0.15   4.00  RM_bpt_longwords96a
   1025     1025        0    1.000   0.13   4.00  RM_bpt_longwords88a
    590      590        0    1.000   0.05   1.00  RM_bpt_longwords89a
    545      545        0    1.000   0.04   3.00  RM_bpt_longwords97
    442      442        0    1.000   0.02   1.00  RM_bpt_longwords98
    330      330        0    1.000   0.00   1.00  RM_bpt_longwords99

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  91714    74113    17601    0.808   0.00    0.00  (all messages)
100.000  80.8088  19.1912    0.808   0.00    0.00  (all messages as %)
  8.102  10.0239   0.0114    0.999   1.00    3.00  RM_bpt_longwords68a
  7.192   8.8986   0.0057    0.999   0.98    1.00  RM_bpt_longwords69a
  4.539   5.6171   0.0000    1.000   0.71    2.00  RM_bpt_longwords78a
  9.553  11.8103   0.0455    0.996   0.51    3.00  RM_bpt_longwords59a
  3.217   3.9804   0.0000    1.000   0.48    1.00  RM_bpt_longwords79a
  1.267   1.5679   0.0000    1.000   0.15    4.00  RM_bpt_longwords96a
  1.118   1.3830   0.0000    1.000   0.13    4.00  RM_bpt_longwords88a
  0.643   0.7961   0.0000    1.000   0.05    1.00  RM_bpt_longwords89a
  0.594   0.7354   0.0000    1.000   0.04    3.00  RM_bpt_longwords97
  0.482   0.5964   0.0000    1.000   0.02    1.00  RM_bpt_longwords98
  0.360   0.4453   0.0000    1.000   0.00    1.00  RM_bpt_longwords99
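
The message does not define the S/O column. The per-rule figures above are consistent with S/O being the spam hit percentage divided by the sum of the spam and ham hit percentages, rather than a ratio of raw counts (7429/7431 would round to 1.000, not the 0.999 shown for RM_bpt_longwords68a). A minimal Python sketch under that assumption; the function name `s_over_o` is hypothetical and not part of any SpamAssassin tool:

```python
# Corpus sizes taken from the "(all messages)" row of the table above.
SPAM_TOTAL = 74113
HAM_TOTAL = 17601


def s_over_o(spam_hits, ham_hits, spam_total=SPAM_TOTAL, ham_total=HAM_TOTAL):
    """Assumed definition: SPAM% / (SPAM% + HAM%), using per-class hit rates."""
    spam_pct = 100.0 * spam_hits / spam_total
    ham_pct = 100.0 * ham_hits / ham_total
    if spam_pct + ham_pct == 0:
        return 0.0  # rule hit nothing at all
    return spam_pct / (spam_pct + ham_pct)


# (spam hits, ham hits) copied from the numeric table above.
rows = {
    "RM_bpt_longwords68a": (7429, 2),
    "RM_bpt_longwords59a": (8753, 8),
    "RM_bpt_longwords78a": (4163, 0),
}

for name, (spam_hits, ham_hits) in rows.items():
    print(f"{name}: S/O = {s_over_o(spam_hits, ham_hits):.3f}")
```

Because the ham corpus is much smaller than the spam corpus, a handful of ham hits lowers S/O more than the raw counts alone would suggest, which is why 59a shows 0.996 despite only 8 ham hits.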

Scores of course are set to my 9.0 required hits, so you'll probably want
to lower these scores. Depending on your system, an initial score of 0.5
or 1.0 for each rule might be worthwhile, and then you can increase the
scores slowly if this spam continues to sneak past your system.

In my 19k corpus, one ham matches three of these rules, two of which I've
scored at 3.0, and so that ham gets a score of 7.0 of 9. I may be
reducing those rules to 2.5 or 2.0 instead of 3.0 once I complete my next
global mass-check. So yes, caution is advised.
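
The arithmetic in the paragraph above can be checked against the table. This sketch assumes that the three rules the ham matched are the three rules that show any ham hits in the table (59a, 68a, 69a); the message does not name them. The helper names are hypothetical, and the `cap` parameter only models the proposed reduction of the 3.0 scores:

```python
REQUIRED_HITS = 9.0  # the author's threshold, per the message

# name: (score, ham hits), copied from the numeric table above.
RULES = {
    "RM_bpt_longwords68a": (3.0, 2),
    "RM_bpt_longwords69a": (1.0, 1),
    "RM_bpt_longwords78a": (2.0, 0),
    "RM_bpt_longwords59a": (3.0, 8),
    "RM_bpt_longwords79a": (1.0, 0),
    "RM_bpt_longwords96a": (4.0, 0),
    "RM_bpt_longwords88a": (4.0, 0),
    "RM_bpt_longwords89a": (1.0, 0),
    "RM_bpt_longwords97": (3.0, 0),
    "RM_bpt_longwords98": (1.0, 0),
    "RM_bpt_longwords99": (1.0, 0),
}


def rules_with_ham_hits(rules):
    """Names of rules that hit at least one ham message."""
    return [name for name, (_, ham_hits) in rules.items() if ham_hits > 0]


def combined_score(rules, names, cap=None):
    """Sum the scores of the named rules, optionally capping each score."""
    total = 0.0
    for name in names:
        score = rules[name][0]
        if cap is not None:
            score = min(score, cap)
        total += score
    return total


hit = rules_with_ham_hits(RULES)
for cap in (None, 2.5, 2.0):
    total = combined_score(RULES, hit, cap)
    print(f"cap={cap}: ham scores {total} of {REQUIRED_HITS}")
```

With the current scores the total is 7.0; capping the 3.0 rules at 2.5 or 2.0 brings it down to 6.0 or 5.0, further from the 9.0 threshold.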

Bob Menschel

body     RM_bpt_longwords68a /\b(?:[a-z]{6,}\s+){8}/
describe RM_bpt_longwords68a Long string of long words
score    RM_bpt_longwords68a 3.000  # 7429s/2h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list,
                                    # "improving compatibility between computer platforms demands certain levels "
body     RM_bpt_longwords69a /\b(?:[a-z]{6,}\s+){9}/
describe RM_bpt_longwords69a Long string of long words
score    RM_bpt_longwords69a 1.000  # type=max:1 (add to 59a,68a) - 6595s/1h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list
body     RM_bpt_longwords78a /\b(?:[a-z]{7,}\s+){8}/
describe RM_bpt_longwords78a Long string of long words
score    RM_bpt_longwords78a 2.000  # type=max:2 (add to 68a) - 4163s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords59a /\b(?:[a-z]{5,}\s+){9}/
describe RM_bpt_longwords59a Long string of long words
score    RM_bpt_longwords59a 3.000  # 8753s/8h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list
body     RM_bpt_longwords79a /\b(?:[a-z]{7,}\s+){9}/
describe RM_bpt_longwords79a Long string of long words
score    RM_bpt_longwords79a 1.000  # type=max:1 (add to 78a) - 2950s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords96a /\b(?:[a-z]{9,}\s+){6}/
describe RM_bpt_longwords96a Long string of long words
score    RM_bpt_longwords96a 4.000  # 1162s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords88a /\b(?:[a-z]{8,}\s+){8}/
describe RM_bpt_longwords88a Long string of long words
score    RM_bpt_longwords88a 4.000  # 1025s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords89a /\b(?:[a-z]{8,}\s+){9}/
describe RM_bpt_longwords89a Long string of long words
score    RM_bpt_longwords89a 1.000  # type=max:1 (add to 88a) - 590s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords97 /\b(?:\w{9,}\s+){7}/
describe RM_bpt_longwords97 Long string of long words
score    RM_bpt_longwords97 3.000  # 545s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords98 /\b(?:\w{9,}\s+){8}/
describe RM_bpt_longwords98 Long string of long words
score    RM_bpt_longwords98 1.000  # type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
describe RM_bpt_longwords99 Long string of long words
score    RM_bpt_longwords99 1.000  # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04
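
To see what these patterns actually catch, here is a rough Python sketch that runs four of the regexes from the rules above against sample text. SpamAssassin evaluates body rules as Perl regular expressions against its own rendering of the message body, so this is only an approximation; the sample strings and the `matched_rules` helper are made up for illustration:

```python
import re

# Patterns copied from the body rules above (no /i flag, so lowercase only).
RULES = {
    "RM_bpt_longwords68a": re.compile(r"\b(?:[a-z]{6,}\s+){8}"),
    "RM_bpt_longwords69a": re.compile(r"\b(?:[a-z]{6,}\s+){9}"),
    "RM_bpt_longwords78a": re.compile(r"\b(?:[a-z]{7,}\s+){8}"),
    "RM_bpt_longwords59a": re.compile(r"\b(?:[a-z]{5,}\s+){9}"),
}


def matched_rules(text):
    """Return the names of the rules whose pattern is found in text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


# Hypothetical "Bayes fodder": nine unrelated lowercase words of 6+ letters.
fodder = ("orchard lantern pebbles whisper gardens thimble mariner "
          "copper harvest now")
# Hypothetical ordinary sentence with a normal mix of short words.
plain = "Please send me the report by Friday."

print(matched_rules(fodder))
print(matched_rules(plain))
```

The fodder string hits 68a, 69a and 59a but not 78a, because "copper" has only six letters and breaks the run of seven-letter words after seven of them. Each word must be followed by whitespace, so a run that ends exactly at the end of the text needs one extra token after it to count.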




Given a hitrate of 10% with an S/O of 0.999, we gotta apply them ;)
adding to SVN now.

(PS: as I read the new Apache 2.0 license, we no longer need to verify CLA
receipt for patches/new rules sent by non-committers.  right?)
Comment 1 Daniel Quinlan 2004-02-23 22:04:09 UTC
working on this, apparently -- got tired of waiting for it to move forward
Comment 2 Fred T 2004-04-09 15:29:32 UTC
I have an important addition for this set: RM_bpt_longwords512.
It scored higher than all the rest.
Attaching the rule next.

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL     SPAM      HAM     S/O    RANK   SCORE  NAME
 125083   104895    20188    0.839   0.00    0.00  (all messages)
  20911    20908        3    0.999   1.00   2.00  RM_bpt_longwords512
  20129    20126        3    0.999   1.00   1.00  RM_bpt_longwords68a
  23236    23225       11    0.998   0.99   1.00  RM_bpt_longwords59a
  18138    18136        2    0.999   0.99   1.00  RM_bpt_longwords69a
  11766    11766        0    1.000   0.96   2.00  RM_bpt_longwords78a
   8662     8662        0    1.000   0.94   1.00  RM_bpt_longwords79a
Comment 3 Fred T 2004-04-09 15:30:16 UTC
Created attachment 1884
sample rule
Comment 4 Justin Mason 2004-04-20 14:24:44 UTC
I think there's already a form of this in SVN trunk. closing...
Comment 5 Fred T 2004-04-20 14:43:19 UTC
That rule:
  20911    20908        3    0.999   1.00   2.00  RM_bpt_longwords512

It was developed after the initial longwords rules and is doing better than all
the rest. If you added the previous rules, this one is worth considering as
well.