Bug 369 - DOUBLE_CAPSWORD test is no good
Summary: DOUBLE_CAPSWORD test is no good
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: PC Linux
Importance: P2 major
Target Milestone: ---
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-05-30 22:19 UTC by Daniel Quinlan
Modified: 2002-07-10 06:48 UTC



Attachment Type Modified Status Actions Submitter/CLA Status
This is being tagged as DOUBLE_CAPSWORD. Why? text/plain None Rod Begbie [HasCLA]
A non-spam that triggers DOUBLE_CAPSWORD, but shouldn't. text/plain None Rod Begbie [HasCLA]

Description Daniel Quinlan 2002-05-30 22:19:35 UTC
This test matches such non-spammy phrases as:

  I think I will be on vacation next week.

and

  We will release LSB 1.3 six months after LSB 1.2.

In my corpus 3133/4824 non-spam messages match (65%) and 1030/1322
spam messages match (78%).
Comment 1 Daniel Quinlan 2002-06-05 22:47:34 UTC
It's worse now!

With ok_languages turned on, DOUBLE_CAPSWORD is the second slowest test!

Total Elapsed Time = 56.54792 Seconds
  User+System Time = 56.54792 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 20.2   11.43 11.430     25   0.4572 0.4572  Mail::SpamAssassin::TextCat::create_lm
 19.7   11.19 11.187    752   0.0149 0.0149  Mail::SpamAssassin::PerMsgStatus::DOUBLE_CAPSWORD_body_test

It's not really clear that this rule works especially well:

$ egrep -c DOUBLE_CAPSWORD *.log
nonspam.log:1563
spam.log:840

Relative to the size of each corpus, the rule matches only about twice as
often on spam as it does on non-spam.  Or, put another way, about 1/3 of the
matches would be on non-spam if the two corpora were the same size.

Comment 2 Craig Hughes 2002-06-06 20:04:20 UTC
Are you sure you're using the latest CVS rules?

bash2.05 craig@balam ~/code/spamassassin % perl -d:DProf tools/speedtest t/data/nice/* t/data/spam/*
2.6 t/data/nice/001 PLING,DOUBLE_CAPSWORD,LINES_OF_YELLING_2,LINES_OF_YELLING
-2.2 t/data/nice/002 IN_REP_TO,X_AUTH_WARNING,MSG_ID_ADDED_BY_MTA_3
6.8 t/data/nice/CVS INVALID_DATE,FROM_MISSING,X_NOT_PRESENT,DATE_MISSING,SUBJ_MISSING,MISSING_HEADERS
-3.3 t/data/nice/base64.txt IN_REP_TO,SUBJ_ENDS_IN_Q_MARK,DOUBLE_CAPSWORD,MIME_NULL_BLOCK
25.3 t/data/spam/001 ALL_CAPS_HEADER,FROM_HAS_MIXED_NUMS,INVALID_MSGID,INVALID_DATE,MAY_BE_FORGED,MSGID_HAS_NO_AT,UNDISC_RECIPS,FROM_ENDS_IN_NUMS,NO_REAL_NAME,PLING,X_NOT_PRESENT,FOR_FREE,CLICK_BELOW,TO_BE_REMOVED_REPLY,EXCUSE_12,REMOVE_SUBJ,REMOVE_IN_QUOTES,EXCUSE_4,NORMAL_HTTP_TO_IP,FREQ_SPAM_PHRASE,FORGED_YAHOO_RCVD,DATE_IN_FUTURE_03_06
11.7 t/data/spam/002 INVALID_DATE,UNDISC_RECIPS,ADVERT_CODE,NO_REAL_NAME,FORGED_RCVD_FOUND,X_NOT_PRESENT,EXCUSE_4,REMOVE_PAGE,SUBJ_ALL_CAPS
16 t/data/spam/003 FROM_ENDS_IN_NUMS,NO_REAL_NAME,DOUBLE_CAPSWORD,ALL_NATURAL,CLICK_BELOW,EXCUSE_3,NUMERIC_HTTP_ADDR,MAILTO_TO_SPAM_ADDR,CLICK_HERE_LINK,SUBJ_ALL_CAPS,FORGED_HOTMAIL_RCVD,DATE_IN_PAST_12_24
8.6 t/data/spam/004 SUBJ_HAS_SPACES,VERY_SUSP_CC_RECIPS,INVALID_DATE,PLING,X_NOT_PRESENT,MAILTO_WITH_SUBJ,SUBJ_HAS_UNIQ_ID,DATE_IN_FUTURE_06_12
6.8 t/data/spam/CVS INVALID_DATE,FROM_MISSING,X_NOT_PRESENT,DATE_MISSING,SUBJ_MISSING,MISSING_HEADERS
7 t/data/spam/base64.txt NO_REAL_NAME,EXCUSE_3,DOUBLE_CAPSWORD,MAILTO_TO_REMOVE,BASE64_ENC_TEXT
bash2.05 craig@balam ~/code/spamassassin % dprofpp -O30
Total Elapsed Time = 3.738573 Seconds
  User+System Time = 2.184542 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 16.2   0.355  0.960     10   0.0355 0.0960  Mail::SpamAssassin::PerMsgStatus::_body_tests
 8.79   0.192  0.205      1   0.1923 0.2046  Mail::SpamAssassin::Conf::_parse
 8.70   0.190  0.169   1373   0.0001 0.0001  Mail::SpamAssassin::NoMailAudit::_get_header_list
 8.24   0.180  0.180     10   0.0180 0.0180  Mail::SpamAssassin::PerMsgStatus::porn_word_test
 7.74   0.169  0.391   1311   0.0001 0.0003  Mail::SpamAssassin::PerMsgStatus::get
 6.87   0.150  1.109     10   0.0150 0.1109  Mail::SpamAssassin::PerMsgStatus::do_body_tests
 6.82   0.149  1.727     32   0.0046 0.0540  Mail::SpamAssassin::PerMsgStatus::BEGIN
 5.91   0.129  0.586     40   0.0032 0.0146  Mail::SpamAssassin::PerMsgStatus::run_eval_tests
 5.58   0.122  0.148     10   0.0122 0.0148  Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
 5.04   0.110  0.236   1373   0.0001 0.0002  Mail::SpamAssassin::NoMailAudit::get_header
 5.04   0.110  0.757     13   0.0084 0.0582  Mail::SpamAssassin::BEGIN
 4.07   0.089  0.104     30   0.0030 0.0035  Net::DNS::RR::BEGIN
 3.66   0.080  0.461     10   0.0080 0.0461  Mail::SpamAssassin::PerMsgStatus::do_head_tests
 2.88   0.063  0.050     10   0.0063 0.0050  Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
 2.75   0.060  0.059     80   0.0007 0.0007  Mail::SpamAssassin::NoMailAudit::get_all_headers
 2.75   0.060  0.059     10   0.0060 0.0059  Mail::SpamAssassin::PerMsgStatus::RATWARE_head_test
 2.75   0.060  0.098      6   0.0100 0.0164  FindBin::BEGIN
 2.75   0.060  0.184     14   0.0043 0.0132  Razor::Client::BEGIN
 2.29   0.050  0.050     10   0.0050 0.0050  Exporter::heavy_export
 2.29   0.050  0.904      5   0.0099 0.1808  main::BEGIN
 2.29   0.050  0.058      6   0.0083 0.0097  IO::Socket::BEGIN
 1.83   0.040  0.038    129   0.0003 0.0003  Mail::SpamAssassin::PerMsgStatus::ONE_HUNDRED_PC_FREE_body_test
 1.83   0.040  0.072     10   0.0040 0.0072  Mail::SpamAssassin::PerMsgStatus::do_body_uri_tests
 1.37   0.030  0.028    129   0.0002 0.0002  Mail::SpamAssassin::PerMsgStatus::FREE_MONEY_body_test
 1.37   0.030  0.030      2   0.0150 0.0149  Mail::SpamAssassin::read_cf
 1.37   0.030  0.028    129   0.0002 0.0002  Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test
 1.37   0.030  0.030      3   0.0100 0.0100  Cwd::abs_path
 1.37   0.030  0.028    129   0.0002 0.0002  Mail::SpamAssassin::PerMsgStatus::EXCUSE_17_body_test
 1.37   0.030  0.028    129   0.0002 0.0002  Mail::SpamAssassin::PerMsgStatus::MAIL_IN_ORDER_FORM_body_test
 1.37   0.030  0.020    660   0.0000 0.0000  Mail::SpamAssassin::PerMsgStatus::clear_test_state
Comment 3 Daniel Quinlan 2002-06-06 20:30:04 UTC
Subject: Re: [SAdev]  DOUBLE_CAPSWORD test is no good


> Are you sure you're using the latest CVS rules?

Yes.  I think the problem is the \1 backreference.  No other body
rules have backreferences.

Several possibilities off the top of my head:

(1) I run my tests over all messages because I want to make sure that
some rules don't completely fall apart on large messages.  It could be
that not running it over large messages avoids the problem.

(2) My machine only has 176MB of RAM.  It could be that you have
enough RAM that DOUBLE_CAPSWORD doesn't thrash the machine -- like it
did mine.

(3) I'm using perl v5.6.1.

Also, the test does not appear to be very good at catching spam.
Perhaps revise to something like this?

/(?:FREE|FDA|OTC|SEC|CEO|REPORT|PLEASE|ORDER).*(?:FREE|FDA|OTC|SEC|CEO|REPORT|PLEASE|ORDER)/
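A rough way to see what the proposed keyword-pair pattern would and would not catch is to run it over phrases quoted elsewhere in this bug. This is only a sketch: Python's `re` module stands in for Perl's regex engine (an assumption), and the helper name `hits` is made up for illustration.

```python
import re

# Pattern copied from the comment above. Python's `re` is used purely as a
# stand-in for Perl here, which is an assumption of this sketch.
KEYWORD_PAIR = re.compile(
    r"(?:FREE|FDA|OTC|SEC|CEO|REPORT|PLEASE|ORDER)"
    r".*"
    r"(?:FREE|FDA|OTC|SEC|CEO|REPORT|PLEASE|ORDER)"
)


def hits(text: str) -> bool:
    """Return True if two of the listed all-caps keywords occur on one line."""
    return KEYWORD_PAIR.search(text) is not None


# Sample phrases taken from other comments in this bug.
samples = [
    "I think I will be on vacation next week.",
    "We will release LSB 1.3 six months after LSB 1.2.",
    "Get your FREE credit report with a FREE CreditCheck",
    "FREE OFFER!!!!   FREE OFFER!!!!",
]

for s in samples:
    print(hits(s), s)
```

Note that the pattern has no word boundaries, so a keyword can also match inside a longer all-caps word (for example "SEC" inside "SECOND"), and it would still hit a footer that happens to say FREE twice.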

Comment 5 Craig Hughes 2002-06-06 22:38:03 UTC
Ok, well I ran current CVS against about 200 messages from the spam corpus, and got this:

bash2.05 craig@balam ~/code/spamassassin % dprofpp
Total Elapsed Time = 37.72486 Seconds
  User+System Time = 21.85131 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 64.6   14.12 34.417    111   0.1272 0.3101  Mail::SpamAssassin::PerMsgStatus::_body_tests
 18.3   4.008  4.008    111   0.0361 0.0361  Mail::SpamAssassin::PerMsgStatus::porn_word_test
 15.3   3.361  6.483    111   0.0303 0.0584  Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
 12.2   2.686  2.456  15626   0.0002 0.0002  Mail::SpamAssassin::NoMailAudit::_get_header_list
 8.02   1.752  5.329  14978   0.0001 0.0004  Mail::SpamAssassin::PerMsgStatus::get
 7.99   1.745  3.737  15626   0.0001 0.0002  Mail::SpamAssassin::NoMailAudit::get_header
 6.88   1.504  2.352    111   0.0135 0.0212  Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
 6.58   1.438  0.831  40571   0.0000 0.0000  Mail::SpamAssassin::PhraseFreqs::test_word_pair
 6.48   1.417  1.755    111   0.0128 0.0158  Mail::SpamAssassin::PerMsgStatus::get_decoded_stripped_body_text_array
 5.16   1.127 10.990    444   0.0025 0.0248  Mail::SpamAssassin::PerMsgStatus::run_eval_tests
 4.89   1.068  1.006   4271   0.0003 0.0002  Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test
 3.93   0.858  0.902    111   0.0077 0.0081  Mail::SpamAssassin::PerMsgStatus::RATWARE_head_test
 3.41   0.746  1.291    111   0.0067 0.0116  Mail::SpamAssassin::PerMsgStatus::do_body_uri_tests
 3.29   0.719  0.656   4271   0.0002 0.0002  Mail::SpamAssassin::PerMsgStatus::PORN_9_body_test
 3.15   0.689  0.626   4271   0.0002 0.0001  Mail::SpamAssassin::PerMsgStatus::PORN_12_body_test
Comment 6 Daniel Quinlan 2002-06-07 00:33:43 UTC
Craig, for your 200 message run, it kind of looks like it only ran
on 111 messages.

I just ran this on the current CVS with no changes whatsoever.  I do
have "ok_languages en" in "masses/spamassassin.prefs".  Maybe there's
an interaction.

$ perl -d:DProf ./mass-check --mh --head=200 --sort --all mail/spam > /dev/null

and got this:

$ dprofpp
Total Elapsed Time = 227.3120 Seconds
  User+System Time = 222.6920 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 21.4   47.79 85.035    191   0.2503 0.4452  Mail::SpamAssassin::TextCat::classify
 16.7   37.21 37.219    191   0.1949 0.1949  Mail::SpamAssassin::TextCat::create_lm
 9.14   20.35 20.359    201   0.1013 0.1013  Mail::SpamAssassin::PerMsgStatus::porn_word_test
 8.83   19.66 19.677   7740   0.0025 0.0025  Mail::SpamAssassin::PerMsgStatus::DOUBLE_CAPSWORD_body_test
 8.78   19.54 20.531    201   0.0973 0.1021  Mail::SpamAssassin::PerMsgStatus::get_decoded_stripped_body_text_array
 6.22   13.84 75.161    201   0.0689 0.3739  Mail::SpamAssassin::PerMsgStatus::_body_tests
 2.96   6.588 12.700    201   0.0328 0.0632  Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
 1.97   4.388  5.707    201   0.0218 0.0284  Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
 1.56   3.479  3.374  30323   0.0001 0.0001  Mail::SpamAssassin::NoMailAudit::_get_header_list
 1.06   2.369  2.342   7740   0.0003 0.0003  Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test

Here's another identical run with an empty "masses/spamassassin.prefs".
Seems to shoot down the interaction theory.  Note that PORN_10 is run
just as many times, but doesn't take nearly as long.  I'll try to figure
out which messages are taking the longest.

$ dprofpp
Total Elapsed Time = 141.9831 Seconds
  User+System Time = 141.3831 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 14.4   20.43 20.439    201   0.1017 0.1017  Mail::SpamAssassin::PerMsgStatus::porn_word_test
 13.8   19.64 19.672   7740   0.0025 0.0025  Mail::SpamAssassin::PerMsgStatus::DOUBLE_CAPSWORD_body_test
 13.7   19.40 20.383    201   0.0966 0.1014  Mail::SpamAssassin::PerMsgStatus::get_decoded_stripped_body_text_array
 10.6   14.99 76.498    201   0.0746 0.3806  Mail::SpamAssassin::PerMsgStatus::_body_tests
 4.61   6.523 13.036    201   0.0325 0.0649  Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
 3.10   4.385  5.766    201   0.0218 0.0287  Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
 2.55   3.609  3.519  30292   0.0001 0.0001  Mail::SpamAssassin::NoMailAudit::_get_header_list
 1.58   2.239  2.226   7740   0.0003 0.0003  Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test
 1.47   2.079  2.167    201   0.0103 0.0108  Mail::SpamAssassin::PerMsgStatus::RATWARE_head_test
 1.47   2.079  5.383  29642   0.0001 0.0002  Mail::SpamAssassin::NoMailAudit::get_header

Comment 7 Craig Hughes 2002-06-07 01:42:05 UTC
Out of curiosity, what version of perl are you using?

[craig@belphegore craig]$ perl -V:version
version='5.6.1';
Comment 8 Craig Hughes 2002-06-07 01:45:56 UTC
Ok, ran again, this time using mass-check instead of speedtest, and checking against a different 
corpus chunk, some of which includes some large hex messages, etc:

[craig@belphegore spamassassin]$ dprofpp 
Total Elapsed Time = 65.86101 Seconds
  User+System Time = 65.06101 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 23.8   15.52 38.793    201   0.0772 0.1930  Mail::SpamAssassin::PerMsgStatus::_body_tests
 19.1   12.43 13.966    201   0.0619 0.0695  Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
 4.32   2.810  2.810    201   0.0140 0.0140  Mail::SpamAssassin::PerMsgStatus::porn_word_test
 3.34   2.170  4.119    201   0.0108 0.0205  Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
 2.94   1.910  1.878  31656   0.0001 0.0001  Mail::SpamAssassin::NoMailAudit::_get_header_list
 2.43   1.580  1.484  95709   0.0000 0.0000  Mail::SpamAssassin::PhraseFreqs::test_word_pair
 1.60   1.040  1.031   8462   0.0001 0.0001  Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test
 1.40   0.910  2.676  31218   0.0000 0.0001  Mail::SpamAssassin::NoMailAudit::get_header
 1.21   0.790  0.781   8462   0.0001 0.0001  Mail::SpamAssassin::PerMsgStatus::PORN_12_body_test
 1.13   0.733 20.282    804   0.0009 0.0252  Mail::SpamAssassin::PerMsgStatus::run_eval_tests
 1.12   0.730  0.978    201   0.0036 0.0049  Mail::SpamAssassin::PerMsgStatus::do_body_uri_tests
 1.03   0.670  0.662   8462   0.0001 0.0001  Mail::SpamAssassin::PerMsgStatus::PORN_9_body_test
 1.00   0.650  3.529  27165   0.0000 0.0001  Mail::SpamAssassin::PerMsgStatus::get
 0.89   0.580  0.702   1608   0.0004 0.0004  Mail::SpamAssassin::NoMailAudit::get_all_headers
 0.75   0.490  0.482   8462   0.0001 0.0001  Mail::SpamAssassin::PerMsgStatus::NIGERIAN_SCAM_7_body_test


If it's not a perl version thing, maybe you could attach a tarball of some sample spams for which 
you're seeing DOUBLE_CAPSWORD take a long time on your machine.  I'll try running those same 
messages here...
Comment 9 Rod Begbie 2002-06-07 07:04:12 UTC
These were my results, running over 200 spams.  No problems with usage here.  
I'll post an update with Spam/Nonspam hits after running a full mass-check to 
see if it's a *useful* rule.

rOD.


Total Elapsed Time = 132.5689 Seconds
  User+System Time = 125.1589 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 31.5   39.51 49.518    186   0.2125 0.2662  Mail::SpamAssassin::TextCat::classify
 11.2   14.02 32.811    201   0.0698 0.1632  Mail::SpamAssassin::PerMsgStatus::_body_tests
 7.98   9.990  9.990    186   0.0537 0.0537  Mail::SpamAssassin::TextCat::create_lm
 4.84   6.060  6.060    201   0.0301 0.0301  Mail::SpamAssassin::PerMsgStatus::porn_word_test
 3.97   4.969  5.432    201   0.0247 0.0270  Mail::SpamAssassin::PerMsgStatus::get_decoded_stripped_body_text_array
 3.59   4.499  4.452  32261   0.0001 0.0001  Mail::SpamAssassin::NoMailAudit::_get_header_list
 3.50   4.377  5.759    201   0.0218 0.0287  Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
 2.67   3.344  7.208    201   0.0166 0.0359  Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
 1.68   2.099  9.500  27001   0.0001 0.0004  Mail::SpamAssassin::PerMsgStatus::get
 1.65   2.069  6.378  31541   0.0001 0.0002  Mail::SpamAssassin::NoMailAudit::get_header
 1.35   1.690 69.165    804   0.0021 0.0860  Mail::SpamAssassin::PerMsgStatus::run_eval_tests
 1.14   1.430  1.352  52259   0.0000 0.0000  Mail::SpamAssassin::PhraseFreqs::test_word_pair
 1.13   1.420  1.708   1608   0.0009 0.0011  Mail::SpamAssassin::NoMailAudit::get_all_headers
 0.74   0.930  1.207    201   0.0046 0.0060  Mail::SpamAssassin::PerMsgStatus::RATWARE_head_test
 0.56   0.700  0.693   4667   0.0001 0.0001  Mail::SpamAssassin::PerMsgStatus::PORN_12_body_test
[rod@blazing masses]$
Comment 10 Rod Begbie 2002-06-07 09:20:22 UTC
It matched 50% of spams, but 36% of non-spams in my corpus.  Doesn't strike me 
as a terribly useful rule.

Just a thought -- Is it catching HTML code?  eg <A HREF="foo">bar</A><A 
HREF="fred">sheila</A>

[rod@blazing masses]$ wc < spam.log
   2083    8329  536660
[rod@blazing masses]$ grep DOUBL spam.log | wc
   1032    4128  296857
[rod@blazing masses]$ wc < nonspam.log
   4270   16711  571530
[rod@blazing masses]$ grep DOUBLE nonspam.log |wc
   1532    6128  255381
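The percentages quoted in this comment can be re-derived from the `wc` line counts. A small sketch, assuming each line of `spam.log` and `nonspam.log` corresponds to one message (the thread does not state this explicitly):

```python
# Line counts copied from the `wc` output above (first column).
# Assumption: one log line per message.
spam_total, spam_hits = 2083, 1032
nonspam_total, nonspam_hits = 4270, 1532

# Fraction of each corpus that triggered the rule.
spam_rate = spam_hits / spam_total
nonspam_rate = nonspam_hits / nonspam_total

print(f"spam hit rate:     {spam_rate:.1%}")
print(f"non-spam hit rate: {nonspam_rate:.1%}")
print(f"spam/non-spam rate ratio: {spam_rate / nonspam_rate:.2f}")
```

The rates round to the roughly 50% and 36% quoted above, so on this corpus the rule fires less than one and a half times as often on spam as on non-spam.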
Comment 11 Daniel Quinlan 2002-06-07 10:35:31 UTC
I am using perl 5.6.1 from Debian 3.0 testing:

ii  perl           5.6.1-7        Larry Wall's Practical Extraction and Report
Comment 12 Rod Begbie 2002-06-07 11:06:41 UTC
Further to my prior comment, it looks like HTML tags are to blame.  eg.

<STYLE></STYLE>
<DIV>&nbsp;</DIV>

(Both of which are hugely common occurrences in HTML emails created in Outlook)

We need to ignore HTML tags or dispose of this rule.
Comment 13 Craig Hughes 2002-06-07 21:54:00 UTC
It's a body rule, so it ought to be working on text which has been HTML
stripped...

Comment 14 Daniel Quinlan 2002-06-07 22:09:26 UTC
Here's the problem.

file: lib/Mail/SpamAssassin/PerMsgStatus.pm
function: get_decoded_stripped_body_text_array()
code:

  # join all consecutive whitespace into a single space
  $text =~ s/\s+/ /sg;

This has the effect of making lines longer.  In fact, the only newlines
are the paragraph breaks added later in the function.  Since uuencoded text
has no paragraph breaks at all, uuencoded text turns into SUPER-long lines.

Craig already found out that backtracking is slow for long lines.  I tried
changing the above line to:

  $text =~ s/[ \t]+/ /sg;

and it sure did speed up DOUBLE_CAPSWORD, but things as a whole got slower.

It seems like we need to:

a) solve uuencoded text in the decoding functions (regardless)
b) if we want to leave lines joined up, remove DOUBLE_CAPSWORD or make it an
   eval function.
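The effect of the two substitutions on line length can be sketched as follows. This is an illustration only: Python's `re.sub` stands in for the Perl `s///` operators, and `uu_like` is a made-up stand-in for a uuencoded block (many short lines, no blank lines), not real message data.

```python
import re

# Hypothetical stand-in for uuencoded text: 1000 lines of 61 characters each,
# with no paragraph breaks between them.
uu_like = "\n".join("M" + "X" * 60 for _ in range(1000))


def longest_line(text: str) -> int:
    """Length of the longest newline-delimited line in text."""
    return max(len(line) for line in text.split("\n"))


# Like s/\s+/ /sg: newlines count as whitespace, so every line is joined.
joined = re.sub(r"\s+", " ", uu_like)

# Like s/[ \t]+/ /sg: only spaces and tabs collapse, newlines survive.
kept = re.sub(r"[ \t]+", " ", uu_like)

print(longest_line(joined))
print(longest_line(kept))
```

With `\s+` the whole block becomes a single line tens of thousands of characters long, which is the kind of input a backtracking body rule then has to scan.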

Comment 15 Craig Hughes 2002-06-09 19:47:47 UTC
Ok, the long lines are the problem for DOUBLE_CAPSWORD, so let's constrain the rule so it can't 
backtrack over huge chunks of long-line text:

body DOUBLE_CAPS_WORD    /\b([A-Z]{3,})\b.{,30}\b\1\b/

How about that?  Or is 30 too much?  I think intuitively it's about right, given an average word 
length in English of 4.5 characters, that's two identical ALL CAPS words, separated by up to 6 or 
so other words.
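To illustrate what bounding the gap changes, here is a sketch that compares a bounded pattern with an unbounded one. Assumptions: Python's `re` stands in for Perl, the gap is written with an explicit lower bound as `{0,30}`, and the `.*` variant is a hypothetical contrast case, not a rule quoted in this bug.

```python
import re

# Bounded gap: at most 30 characters between the two identical words.
# (Explicit lower bound; Python's `re` is a stand-in for Perl here.)
BOUNDED = re.compile(r"\b([A-Z]{3,})\b.{0,30}\b\1\b")

# Hypothetical unbounded variant, shown only for contrast.
UNBOUNDED = re.compile(r"\b([A-Z]{3,})\b.*\b\1\b")

near = "FREE OFFER!!!!   FREE OFFER!!!!"
far = "FREE " + "filler " * 20 + "FREE"  # repeat is about 140 characters away

for name, text in (("near", near), ("far", far)):
    print(name, bool(BOUNDED.search(text)), bool(UNBOUNDED.search(text)))
```

The bounded form still catches the close repeat but ignores one that is a long way down the line, so the engine never has to scan the rest of a very long line looking for the second word. At 4.5 letters plus a space per word, 30 characters is a window of about five words.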
Comment 16 Duncan Findlay 2002-06-09 20:00:37 UTC
> How about that?  Or is 30 too much?  I think intuitively it's about right, given an average word 

I think we could do more without serious problems, given that mail
without long lines works almost instantly. I'd say 50. But, that might
not make any difference, since spammers tend to like to use stuff
like:

FREE FREE FREE

FREE OFFER!!!!   FREE OFFER!!!!

etc.


Comment 17 Duncan Findlay 2002-06-09 20:05:54 UTC
> body DOUBLE_CAPS_WORD    /\b([A-Z]{3,})\b.{,30}\b\1\b/

I think you need to make that /\b([A-Z]{3,})\b.{0,30}\b\1\b/.  30 is
fine; I think more would also be fine (in terms of performance), but
probably wouldn't be effective.

Comment 18 Craig Hughes 2002-06-10 03:15:54 UTC
Ok, checked in {0,50}
Comment 19 Rod Begbie 2002-06-13 12:33:56 UTC
Created attachment 175 [details]
This is being tagged as DOUBLE_CAPSWORD.  Why?
Comment 20 Rod Begbie 2002-06-13 12:40:00 UTC
I've just posted an example of a spam that has been tagged DOUBLE_CAPSWORD, but 
I don't know why.  The only three all-caps words in the body of the mail 
are "BRING IT ON".

Apart from that, "MIME" appears several times in heading information.  And 
there's loads of HTML, which we think is getting stripped.

Just strikes me that there is room for many false positives with this rule -- 
It's tripping on cases that it wasn't designed for.  I'll go find a non-spam to 
post too.
Comment 21 Rod Begbie 2002-06-13 12:52:18 UTC
Created attachment 176 [details]
A non-spam that triggers DOUBLE_CAPSWORD, but shouldn't.
Comment 22 Rod Begbie 2002-06-13 12:55:18 UTC
OK, here's another one.  I can see a variety of ways that this could 
*mistakenly* trigger DOUBLE_CAPSWORD, but I don't think it should.  (Of course, 
I add the disclaimer that I only slept four hours last night, so probably 
missed something obvious).

Can someone take a look and work it out?
Comment 23 Daniel Quinlan 2002-06-13 17:40:41 UTC
I still believe this rule is of questionable value (even if it worked halfway
correctly).  Anyway, I figured out what was causing your false positives:

06/13/02 12:33 - MIME appears twice in the body
06/13/02 12:52 - <X-TAB> and </X-TAB> include a '-' character and are not
                 standard HTML tags so SA "fails" to strip them out.

Problems:

1) since it is a body test, it is per-paragraph and not per-line!
   ANY paragraph that includes an acronym twice will match.
2) can HTML/XML tags include a hyphen?

In my nonspam corpus, I have hundreds of matches because of problem #1.
Most are computer acronyms: LSB, BIOS, FPGA, CERT, IBM, IDE, NFS, USB, HTML,
and so on.
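Both causes can be reproduced with a short sketch. Assumptions: the pattern below is the rule from earlier in this bug with the checked-in `{0,50}` gap, Python's `re` stands in for Perl, and the helper `repeated_caps_word` is made up for illustration.

```python
import re

# Assumed form of the rule after the {0,50} change discussed above.
RULE = re.compile(r"\b([A-Z]{3,})\b.{0,50}\b\1\b")


def repeated_caps_word(text: str):
    """Return the repeated all-caps word that triggers the rule, or None."""
    m = RULE.search(text)
    return m.group(1) if m else None


# An unstripped non-standard tag pair: "TAB" appears in both tags.
print(repeated_caps_word("<X-TAB>some text</X-TAB>"))

# An ordinary sentence that mentions an acronym twice.
print(repeated_caps_word("We will release LSB 1.3 six months after LSB 1.2."))

# Three different all-caps words, none repeated.
print(repeated_caps_word("BRING IT ON"))
```

The hyphen splits `X-TAB` into two words as far as `\b` is concerned, so `TAB` alone satisfies the three-letter minimum in both the opening and the closing tag.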
Comment 24 Rod Begbie 2002-06-13 20:45:18 UTC
Agreed on the dubious value -- Other matches I was getting in my nonspam corpus 
were on the "words" DVD, USA and the yahoogroups footer which advertises "Get 
your FREE credit report with a FREE CreditCheck".

Can't wait to see what the GA makes of it :)
Comment 25 Justin Mason 2002-07-10 06:57:37 UTC
I've made it a rawbody test, so it actually matches line-by-line. Otherwise
words in all caps (eg. a headline) repeated anywhere in the *next few* lines (eg.
story leader) were getting a hit.  Also made min wordsize 4 letters for a bit more
sanity.
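The difference between paragraph and line-by-line matching can be sketched like this. The final rule is not quoted in this bug, so the pattern below (4-letter minimum, `{0,50}` gap) is a hypothetical reconstruction, the sample text is invented, and Python's `re` again stands in for Perl.

```python
import re

# Hypothetical reconstruction of the rule with a 4-letter minimum.
RULE4 = re.compile(r"\b([A-Z]{4,})\b.{0,50}\b\1\b")

# Invented example: a headline followed by a story leader.
headline = "HEADLINE about something"
leader = "The HEADLINE story continues here."

# Paragraph-style matching: whitespace, including the newline, is collapsed
# first, so both lines are tested as one string.
paragraph = re.sub(r"\s+", " ", headline + "\n" + leader)
para_hit = RULE4.search(paragraph) is not None

# Line-by-line matching: each line is tested on its own.
line_hit = any(RULE4.search(line) for line in (headline, leader))

print(para_hit, line_hit)

# A three-letter acronym no longer qualifies under the 4-letter minimum.
lsb_hit = RULE4.search("We will release LSB 1.3 six months after LSB 1.2.") is not None
print(lsb_hit)
```

Under these assumptions the repeated headline word only hits when the lines are joined, and short acronyms such as LSB stop matching at all.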

But I reckon it needs low, low points!
Comment 26 Malte S. Stretz 2002-07-10 14:48:51 UTC
Do rawbody tests have HTML removed? Maybe it should stay a body test but have 
the description changed to 
| A word in all caps repeated in the paragraph