SA Bugzilla – Bug 369
DOUBLE_CAPSWORD test is no good
Last modified: 2002-07-10 06:48:51 UTC
This test matches such non-spammy phrases as: I think I will be on vacation next week. and We will release LSB 1.3 six months after LSB 1.2. In my corpus 3133/4824 non-spam messages match (65%) and 1030/1322 spam messages match (78%).
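The percentages above follow directly from the raw counts. A minimal Python sketch of that arithmetic, assuming nothing beyond the four numbers quoted in the report (the helper name `hit_stats` is illustrative and not part of SpamAssassin); it also shows what share of all hits in this corpus land on non-spam:

```python
def hit_stats(spam_hits, spam_total, ham_hits, ham_total):
    """Return (spam hit rate, non-spam hit rate, non-spam share of all hits)."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    ham_share = ham_hits / (spam_hits + ham_hits)
    return spam_rate, ham_rate, ham_share

# Counts quoted in the report: 1030/1322 spam, 3133/4824 non-spam.
spam_rate, ham_rate, ham_share = hit_stats(1030, 1322, 3133, 4824)
print(f"spam hit rate:              {spam_rate:.1%}")
print(f"non-spam hit rate:          {ham_rate:.1%}")
print(f"non-spam share of all hits: {ham_share:.1%}")
```

A rule that fires on most non-spam carries little signal even when it also fires on most spam, which is the point the report is making.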
It's worse now! With ok_languages turned on, DOUBLE_CAPSWORD is the second slowest test!

Total Elapsed Time = 56.54792 Seconds
User+System Time = 56.54792 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
20.2 11.43 11.430 25 0.4572 0.4572 Mail::SpamAssassin::TextCat::create_lm
19.7 11.19 11.187 752 0.0149 0.0149 Mail::SpamAssassin::PerMsgStatus::DOUBLE_CAPSWORD_body_test

It's not really clear that this rule works especially well:

$ egrep -c DOUBLE_CAPSWORD *.log
nonspam.log:1563
spam.log:840

The rule matches only about twice as often on spam as it does on non-spam. Or, another way to put it is that 1/3 of the matches are on non-spam.
Are you sure you're using the latest CVS rules?

bash2.05 craig@balam ~/code/spamassassin % perl -d:DProf tools/speedtest t/data/nice/* t/data/spam/*
2.6 t/data/nice/001 PLING,DOUBLE_CAPSWORD,LINES_OF_YELLING_2,LINES_OF_YELLING
-2.2 t/data/nice/002 IN_REP_TO,X_AUTH_WARNING,MSG_ID_ADDED_BY_MTA_3
6.8 t/data/nice/CVS INVALID_DATE,FROM_MISSING,X_NOT_PRESENT,DATE_MISSING,SUBJ_MISSING,MISSING_HEADERS
-3.3 t/data/nice/base64.txt IN_REP_TO,SUBJ_ENDS_IN_Q_MARK,DOUBLE_CAPSWORD,MIME_NULL_BLOCK
25.3 t/data/spam/001 ALL_CAPS_HEADER,FROM_HAS_MIXED_NUMS,INVALID_MSGID,INVALID_DATE,MAY_BE_FORGED,MSGID_HAS_NO_AT,UNDISC_RECIPS,FROM_ENDS_IN_NUMS,NO_REAL_NAME,PLING,X_NOT_PRESENT,FOR_FREE,CLICK_BELOW,TO_BE_REMOVED_REPLY,EXCUSE_12,REMOVE_SUBJ,REMOVE_IN_QUOTES,EXCUSE_4,NORMAL_HTTP_TO_IP,FREQ_SPAM_PHRASE,FORGED_YAHOO_RCVD,DATE_IN_FUTURE_03_06
11.7 t/data/spam/002 INVALID_DATE,UNDISC_RECIPS,ADVERT_CODE,NO_REAL_NAME,FORGED_RCVD_FOUND,X_NOT_PRESENT,EXCUSE_4,REMOVE_PAGE,SUBJ_ALL_CAPS
16 t/data/spam/003 FROM_ENDS_IN_NUMS,NO_REAL_NAME,DOUBLE_CAPSWORD,ALL_NATURAL,CLICK_BELOW,EXCUSE_3,NUMERIC_HTTP_ADDR,MAILTO_TO_SPAM_ADDR,CLICK_HERE_LINK,SUBJ_ALL_CAPS,FORGED_HOTMAIL_RCVD,DATE_IN_PAST_12_24
8.6 t/data/spam/004 SUBJ_HAS_SPACES,VERY_SUSP_CC_RECIPS,INVALID_DATE,PLING,X_NOT_PRESENT,MAILTO_WITH_SUBJ,SUBJ_HAS_UNIQ_ID,DATE_IN_FUTURE_06_12
6.8 t/data/spam/CVS INVALID_DATE,FROM_MISSING,X_NOT_PRESENT,DATE_MISSING,SUBJ_MISSING,MISSING_HEADERS
7 t/data/spam/base64.txt NO_REAL_NAME,EXCUSE_3,DOUBLE_CAPSWORD,MAILTO_TO_REMOVE,BASE64_ENC_TEXT

bash2.05 craig@balam ~/code/spamassassin % dprofpp -O30
Total Elapsed Time = 3.738573 Seconds
User+System Time = 2.184542 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
16.2 0.355 0.960 10 0.0355 0.0960 Mail::SpamAssassin::PerMsgStatus::_body_tests
8.79 0.192 0.205 1 0.1923 0.2046 Mail::SpamAssassin::Conf::_parse
8.70 0.190 0.169 1373 0.0001 0.0001 Mail::SpamAssassin::NoMailAudit::_get_header_list
8.24 0.180 0.180 10 0.0180 0.0180 Mail::SpamAssassin::PerMsgStatus::porn_word_test
7.74 0.169 0.391 1311 0.0001 0.0003 Mail::SpamAssassin::PerMsgStatus::get
6.87 0.150 1.109 10 0.0150 0.1109 Mail::SpamAssassin::PerMsgStatus::do_body_tests
6.82 0.149 1.727 32 0.0046 0.0540 Mail::SpamAssassin::PerMsgStatus::BEGIN
5.91 0.129 0.586 40 0.0032 0.0146 Mail::SpamAssassin::PerMsgStatus::run_eval_tests
5.58 0.122 0.148 10 0.0122 0.0148 Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
5.04 0.110 0.236 1373 0.0001 0.0002 Mail::SpamAssassin::NoMailAudit::get_header
5.04 0.110 0.757 13 0.0084 0.0582 Mail::SpamAssassin::BEGIN
4.07 0.089 0.104 30 0.0030 0.0035 Net::DNS::RR::BEGIN
3.66 0.080 0.461 10 0.0080 0.0461 Mail::SpamAssassin::PerMsgStatus::do_head_tests
2.88 0.063 0.050 10 0.0063 0.0050 Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
2.75 0.060 0.059 80 0.0007 0.0007 Mail::SpamAssassin::NoMailAudit::get_all_headers
2.75 0.060 0.059 10 0.0060 0.0059 Mail::SpamAssassin::PerMsgStatus::RATWARE_head_test
2.75 0.060 0.098 6 0.0100 0.0164 FindBin::BEGIN
2.75 0.060 0.184 14 0.0043 0.0132 Razor::Client::BEGIN
2.29 0.050 0.050 10 0.0050 0.0050 Exporter::heavy_export
2.29 0.050 0.904 5 0.0099 0.1808 main::BEGIN
2.29 0.050 0.058 6 0.0083 0.0097 IO::Socket::BEGIN
1.83 0.040 0.038 129 0.0003 0.0003 Mail::SpamAssassin::PerMsgStatus::ONE_HUNDRED_PC_FREE_body_test
1.83 0.040 0.072 10 0.0040 0.0072 Mail::SpamAssassin::PerMsgStatus::do_body_uri_tests
1.37 0.030 0.028 129 0.0002 0.0002 Mail::SpamAssassin::PerMsgStatus::FREE_MONEY_body_test
1.37 0.030 0.030 2 0.0150 0.0149 Mail::SpamAssassin::read_cf
1.37 0.030 0.028 129 0.0002 0.0002 Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test
1.37 0.030 0.030 3 0.0100 0.0100 Cwd::abs_path
1.37 0.030 0.028 129 0.0002 0.0002 Mail::SpamAssassin::PerMsgStatus::EXCUSE_17_body_test
1.37 0.030 0.028 129 0.0002 0.0002 Mail::SpamAssassin::PerMsgStatus::MAIL_IN_ORDER_FORM_body_test
1.37 0.030 0.020 660 0.0000 0.0000 Mail::SpamAssassin::PerMsgStatus::clear_test_state
Subject: Re: [SAdev] DOUBLE_CAPSWORD test is no good

> Are you sure you're using the latest CVS rules?

Yes. I think the problem is the \1 backreference. No other body rules have backreferences. Several possibilities off the top of my head:

(1) I run my tests over all messages because I want to make sure that some rules don't completely fall apart on large messages. It could be that not running it over large messages avoids the problem.
(2) My machine only has 176MB of RAM. It could be that you have enough RAM that DOUBLE_CAPSWORD doesn't thrash the machine -- like it did mine.
(3) I'm using perl v5.6.1.

Also, the test does not appear to be very good at catching spam. Perhaps revise to something like this?

/(?:FREE|FDA|OTC|SEC|CEO|REPORT|PLEASE|ORDER).*(?:FREE|FDA|OTC|SEC|CEO|REPORT|PLEASE|ORDER)/
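To see what the suggested keyword pattern would and would not catch, here is a rough Python sketch. It is an approximation only: the rule itself would be a Perl regex inside SpamAssassin, and the helper name `has_keyword_pair` is made up for illustration. The keyword list is copied from the suggestion above.

```python
import re

# Keyword list copied verbatim from the suggested pattern.
KEYWORDS = r"(?:FREE|FDA|OTC|SEC|CEO|REPORT|PLEASE|ORDER)"

# No backreference: any listed word, anything in between, any listed word.
KEYWORD_PAIR = re.compile(KEYWORDS + r".*" + KEYWORDS)

def has_keyword_pair(line):
    """True if two listed keywords (not necessarily the same one) appear in order."""
    return KEYWORD_PAIR.search(line) is not None

print(has_keyword_pair("FREE offer, ORDER today"))
print(has_keyword_pair("We will release LSB 1.3 six months after LSB 1.2."))
print(has_keyword_pair("PLEASED TO REPORT"))
```

Two things stand out in this form: the two words no longer have to be identical, and because the suggestion has no `\b` anchors, a keyword embedded in a longer word (as in the third sample) still counts.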
Ok, well I ran current CVS against about 200 messages from the spam corpus, and got this:

bash2.05 craig@balam ~/code/spamassassin % dprofpp
Total Elapsed Time = 37.72486 Seconds
User+System Time = 21.85131 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
64.6 14.12 34.417 111 0.1272 0.3101 Mail::SpamAssassin::PerMsgStatus::_body_tests
18.3 4.008 4.008 111 0.0361 0.0361 Mail::SpamAssassin::PerMsgStatus::porn_word_test
15.3 3.361 6.483 111 0.0303 0.0584 Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
12.2 2.686 2.456 15626 0.0002 0.0002 Mail::SpamAssassin::NoMailAudit::_get_header_list
8.02 1.752 5.329 14978 0.0001 0.0004 Mail::SpamAssassin::PerMsgStatus::get
7.99 1.745 3.737 15626 0.0001 0.0002 Mail::SpamAssassin::NoMailAudit::get_header
6.88 1.504 2.352 111 0.0135 0.0212 Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
6.58 1.438 0.831 40571 0.0000 0.0000 Mail::SpamAssassin::PhraseFreqs::test_word_pair
6.48 1.417 1.755 111 0.0128 0.0158 Mail::SpamAssassin::PerMsgStatus::get_decoded_stripped_body_text_array
5.16 1.127 10.990 444 0.0025 0.0248 Mail::SpamAssassin::PerMsgStatus::run_eval_tests
4.89 1.068 1.006 4271 0.0003 0.0002 Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test
3.93 0.858 0.902 111 0.0077 0.0081 Mail::SpamAssassin::PerMsgStatus::RATWARE_head_test
3.41 0.746 1.291 111 0.0067 0.0116 Mail::SpamAssassin::PerMsgStatus::do_body_uri_tests
3.29 0.719 0.656 4271 0.0002 0.0002 Mail::SpamAssassin::PerMsgStatus::PORN_9_body_test
3.15 0.689 0.626 4271 0.0002 0.0001 Mail::SpamAssassin::PerMsgStatus::PORN_12_body_test
Subject: Re: [SAdev] DOUBLE_CAPSWORD test is no good

Craig, for your 200 message run, it kinda looks like it only got run on 111 messages.

I just ran this on the current CVS with no changes whatsoever. I do have "ok_languages en" in "masses/spamassassin.prefs". Maybe there's an interaction.

$ perl -d:DProf ./mass-check --mh --head=200 --sort --all mail/spam > /dev/null

and got this:

$ dprofpp
Total Elapsed Time = 227.3120 Seconds
User+System Time = 222.6920 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
21.4 47.79 85.035 191 0.2503 0.4452 Mail::SpamAssassin::TextCat::classify
16.7 37.21 37.219 191 0.1949 0.1949 Mail::SpamAssassin::TextCat::create_lm
9.14 20.35 20.359 201 0.1013 0.1013 Mail::SpamAssassin::PerMsgStatus::porn_word_test
8.83 19.66 19.677 7740 0.0025 0.0025 Mail::SpamAssassin::PerMsgStatus::DOUBLE_CAPSWORD_body_test
8.78 19.54 20.531 201 0.0973 0.1021 Mail::SpamAssassin::PerMsgStatus::get_decoded_stripped_body_text_array
6.22 13.84 75.161 201 0.0689 0.3739 Mail::SpamAssassin::PerMsgStatus::_body_tests
2.96 6.588 12.700 201 0.0328 0.0632 Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
1.97 4.388 5.707 201 0.0218 0.0284 Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
1.56 3.479 3.374 30323 0.0001 0.0001 Mail::SpamAssassin::NoMailAudit::_get_header_list
1.06 2.369 2.342 7740 0.0003 0.0003 Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test

Here's another identical run with an empty "masses/spamassassin.prefs". Seems to shoot down the interaction theory. Note that PORN_10 is run just as many times, but doesn't take nearly as long. I'll try to figure out which messages are taking the longest.

$ dprofpp
Total Elapsed Time = 141.9831 Seconds
User+System Time = 141.3831 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
14.4 20.43 20.439 201 0.1017 0.1017 Mail::SpamAssassin::PerMsgStatus::porn_word_test
13.8 19.64 19.672 7740 0.0025 0.0025 Mail::SpamAssassin::PerMsgStatus::DOUBLE_CAPSWORD_body_test
13.7 19.40 20.383 201 0.0966 0.1014 Mail::SpamAssassin::PerMsgStatus::get_decoded_stripped_body_text_array
10.6 14.99 76.498 201 0.0746 0.3806 Mail::SpamAssassin::PerMsgStatus::_body_tests
4.61 6.523 13.036 201 0.0325 0.0649 Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
3.10 4.385 5.766 201 0.0218 0.0287 Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
2.55 3.609 3.519 30292 0.0001 0.0001 Mail::SpamAssassin::NoMailAudit::_get_header_list
1.58 2.239 2.226 7740 0.0003 0.0003 Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test
1.47 2.079 2.167 201 0.0103 0.0108 Mail::SpamAssassin::PerMsgStatus::RATWARE_head_test
1.47 2.079 5.383 29642 0.0001 0.0002 Mail::SpamAssassin::NoMailAudit::get_header
Out of curiosity, what version of perl are you using?

[craig@belphegore craig]$ perl -V:version
version='5.6.1';
Ok, ran again, this time using mass-check instead of speedtest, and checking against a different corpus chunk, some of which includes some large hex messages, etc:

[craig@belphegore spamassassin]$ dprofpp
Total Elapsed Time = 65.86101 Seconds
User+System Time = 65.06101 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
23.8 15.52 38.793 201 0.0772 0.1930 Mail::SpamAssassin::PerMsgStatus::_body_tests
19.1 12.43 13.966 201 0.0619 0.0695 Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
4.32 2.810 2.810 201 0.0140 0.0140 Mail::SpamAssassin::PerMsgStatus::porn_word_test
3.34 2.170 4.119 201 0.0108 0.0205 Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
2.94 1.910 1.878 31656 0.0001 0.0001 Mail::SpamAssassin::NoMailAudit::_get_header_list
2.43 1.580 1.484 95709 0.0000 0.0000 Mail::SpamAssassin::PhraseFreqs::test_word_pair
1.60 1.040 1.031 8462 0.0001 0.0001 Mail::SpamAssassin::PerMsgStatus::PORN_10_body_test
1.40 0.910 2.676 31218 0.0000 0.0001 Mail::SpamAssassin::NoMailAudit::get_header
1.21 0.790 0.781 8462 0.0001 0.0001 Mail::SpamAssassin::PerMsgStatus::PORN_12_body_test
1.13 0.733 20.282 804 0.0009 0.0252 Mail::SpamAssassin::PerMsgStatus::run_eval_tests
1.12 0.730 0.978 201 0.0036 0.0049 Mail::SpamAssassin::PerMsgStatus::do_body_uri_tests
1.03 0.670 0.662 8462 0.0001 0.0001 Mail::SpamAssassin::PerMsgStatus::PORN_9_body_test
1.00 0.650 3.529 27165 0.0000 0.0001 Mail::SpamAssassin::PerMsgStatus::get
0.89 0.580 0.702 1608 0.0004 0.0004 Mail::SpamAssassin::NoMailAudit::get_all_headers
0.75 0.490 0.482 8462 0.0001 0.0001 Mail::SpamAssassin::PerMsgStatus::NIGERIAN_SCAM_7_body_test

If it's not a perl version thing, maybe you could attach a tarball of some sample spams for which you're seeing DOUBLE_CAPSWORD take a long time on your machine. I'll try running those same messages here...
These were my results, running over 200 spams. No problems with usage here. I'll post an update with Spam/Nonspam hits after running a full mass-check to see if it's a *useful* rule.

rOD.

Total Elapsed Time = 132.5689 Seconds
User+System Time = 125.1589 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
31.5 39.51 49.518 186 0.2125 0.2662 Mail::SpamAssassin::TextCat::classify
11.2 14.02 32.811 201 0.0698 0.1632 Mail::SpamAssassin::PerMsgStatus::_body_tests
7.98 9.990 9.990 186 0.0537 0.0537 Mail::SpamAssassin::TextCat::create_lm
4.84 6.060 6.060 201 0.0301 0.0301 Mail::SpamAssassin::PerMsgStatus::porn_word_test
3.97 4.969 5.432 201 0.0247 0.0270 Mail::SpamAssassin::PerMsgStatus::get_decoded_stripped_body_text_array
3.59 4.499 4.452 32261 0.0001 0.0001 Mail::SpamAssassin::NoMailAudit::_get_header_list
3.50 4.377 5.759 201 0.0218 0.0287 Mail::SpamAssassin::PhraseFreqs::check_phrase_freqs
2.67 3.344 7.208 201 0.0166 0.0359 Mail::SpamAssassin::PerMsgStatus::_rawbody_tests
1.68 2.099 9.500 27001 0.0001 0.0004 Mail::SpamAssassin::PerMsgStatus::get
1.65 2.069 6.378 31541 0.0001 0.0002 Mail::SpamAssassin::NoMailAudit::get_header
1.35 1.690 69.165 804 0.0021 0.0860 Mail::SpamAssassin::PerMsgStatus::run_eval_tests
1.14 1.430 1.352 52259 0.0000 0.0000 Mail::SpamAssassin::PhraseFreqs::test_word_pair
1.13 1.420 1.708 1608 0.0009 0.0011 Mail::SpamAssassin::NoMailAudit::get_all_headers
0.74 0.930 1.207 201 0.0046 0.0060 Mail::SpamAssassin::PerMsgStatus::RATWARE_head_test
0.56 0.700 0.693 4667 0.0001 0.0001 Mail::SpamAssassin::PerMsgStatus::PORN_12_body_test
[rod@blazing masses]$
It matched 50% of spams, but 36% of non-spams in my corpus. Doesn't strike me as a terribly useful rule.

Just a thought -- Is it catching HTML code? eg <A HREF="foo">bar</A><A HREF="fred">sheila</A>

[rod@blazing masses]$ wc < spam.log
2083 8329 536660
[rod@blazing masses]$ grep DOUBL spam.log | wc
1032 4128 296857
[rod@blazing masses]$ wc < nonspam.log
4270 16711 571530
[rod@blazing masses]$ grep DOUBLE nonspam.log |wc
1532 6128 255381
I am using perl 5.6.1 from Debian 3.0 testing: ii perl 5.6.1-7 Larry Wall's Practical Extraction and Report
Further to my prior comment, it looks like HTML tags are to blame. eg. <STYLE></STYLE> <DIV> </DIV> (Both of which are hugely common occurrences in HTML emails created in Outlook) We need to ignore HTML tags or dispose of this rule.
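A quick way to check the HTML theory is to run a repeated-caps-word pattern over the tag examples mentioned in this thread. The Python sketch below is an approximation under stated assumptions: it uses the bounded-gap form of the rule quoted later in the thread (`{0,30}`), which may differ from the rule in effect when this comment was written, and Python's `re` stands in for Perl.

```python
import re

# Assumption: bounded-gap form of the rule as quoted later in this thread.
DOUBLE_CAPS = re.compile(r"\b([A-Z]{3,})\b.{0,30}\b\1\b")

samples = [
    "<STYLE></STYLE>",
    "<DIV> </DIV>",
    '<A HREF="foo">bar</A><A HREF="fred">sheila</A>',
    "Plain text with no markup at all.",
]

for s in samples:
    m = DOUBLE_CAPS.search(s)
    # Show which repeated word triggered the match, if any.
    print(repr(s), "->", m.group(1) if m else None)
```

Any open/close tag pair written in capitals repeats the tag name, so unstripped markup would trigger the rule regardless of the visible text.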
Subject: Re: [SAdev] DOUBLE_CAPSWORD test is no good It's a body rule, so it ought to be working on text which has been HTML stripped...
Here's the problem.

file: lib/Mail/SpamAssassin/PerMsgStatus.pm
function: get_decoded_stripped_body_text_array()
code:

# join all consecutive whitespace into a single space
$text =~ s/\s+/ /sg;

This has the effect of making lines longer. In fact, the only newlines are the paragraph breaks added later in the function. Since uuencoded text has no paragraph breaks at all, uuencoded text turns into SUPER-long lines. Craig already found out that backtracking is slow for long lines.

I tried changing the above line to:

$text =~ s/[ \t]+/ /sg;

and it sure did speed up DOUBLE_CAPSWORD, but things as a whole got slower.

It seems like we need to:

a) solve uuencoded text in the decoding functions (regardless)
b) if we want to leave lines joined up, remove DOUBLE_CAPSWORD or make it an eval function.
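The effect of the two substitutions on line length can be reproduced outside SpamAssassin. This is a minimal Python sketch, assuming a made-up stand-in for a uuencoded attachment (1000 lines of 61 characters with no blank lines); `re.sub` plays the role of the Perl `s///` operators quoted above.

```python
import re

# Hypothetical stand-in for uuencoded text: many short lines, no blank lines.
encoded = "\n".join(["M" + "ABCD" * 15] * 1000)

collapse_all = re.sub(r"\s+", " ", encoded)       # analogue of s/\s+/ /sg
collapse_blank = re.sub(r"[ \t]+", " ", encoded)  # analogue of s/[ \t]+/ /sg

def longest_line(text):
    """Length of the longest newline-delimited line in text."""
    return max(len(line) for line in text.split("\n"))

print("after \\s+ collapse:    ", longest_line(collapse_all))
print("after [ \\t]+ collapse: ", longest_line(collapse_blank))
```

Because `\s` also matches newlines, the first form folds the whole block into a single line, which is what a per-line regex with backtracking then has to chew through.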
Ok, the long lines are the problem for DOUBLE_CAPS_WORD, so let's constrain the rule to not backtrack over huge chunks of long line text:

body DOUBLE_CAPS_WORD /\b([A-Z]{3,})\b.{,30}\b\1\b/

How about that? Or is 30 too much? I think intuitively it's about right: given an average word length in English of 4.5 characters, that's two identical ALL CAPS words, separated by up to 6 or so other words.
Subject: Re: [SAdev] DOUBLE_CAPSWORD test is no good

> How about that? Or is 30 too much? I think intuitively it's about right, given an average word

I think we could do more without serious problems, given that mail without long lines works almost instantly. I'd say 50. But, that might not make any difference, since spammers tend to like to use stuff like:

FREE FREE FREE
FREE OFFER!!!! FREE OFFER!!!!

etc.
Subject: Re: [SAdev] DOUBLE_CAPSWORD test is no good

> body DOUBLE_CAPS_WORD /\b([A-Z]{3,})\b.{,30}\b\1\b/

I think you need to make that

/\b([A-Z]{3,})\b.{0,30}\b\1\b/

30 is fine, I think more would also be fine (in terms of performance), but probably wouldn't be effective.
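The bounded gap is what keeps the backreference from scanning the rest of a long line. A small Python sketch of the corrected pattern follows, with two caveats: Python's `re` accepts `{,30}` as shorthand for `{0,30}`, whereas, as far as I know, Perl releases of that period treated `{,30}` as literal text, so the sketch uses the explicit form; and the `UNBOUNDED` variant is hypothetical, included only for comparison, since the original rule text is not quoted in this thread.

```python
import re

# Corrected pattern from the comment above: repeat must occur within 30 chars.
BOUNDED = re.compile(r"\b([A-Z]{3,})\b.{0,30}\b\1\b")

# Hypothetical unbounded variant, for comparison only.
UNBOUNDED = re.compile(r"\b([A-Z]{3,})\b.*\b\1\b")

near = "FREE offer!!! FREE"            # 10 characters between the two words
far = "FREE " + "x" * 40 + " FREE"     # 42 characters between the two words

for name, text in (("near", near), ("far", far)):
    print(name,
          "bounded:", bool(BOUNDED.search(text)),
          "unbounded:", bool(UNBOUNDED.search(text)))
```

With the bound in place, each candidate word only needs to be compared against a short window, so the cost per line grows roughly with line length instead of with every possible gap.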
Ok, checked in {0,50}
Created attachment 175: This is being tagged as DOUBLE_CAPSWORD. Why?
I've just posted an example of a spam that has been tagged DOUBLE_CAPSWORD, but I don't know why. The only three all-caps words in the body of the mail are "BRING IT ON". Apart from that, "MIME" appears several times in heading information. And there's loads of HTML, which we think is getting stripped. Just strikes me that there is room for many false positives with this rule -- It's tripping on cases that it wasn't designed for. I'll go find a non-spam to post too.
Created attachment 176: A non-spam that triggers DOUBLE_CAPSWORD, but shouldn't.
OK, here's another one. I can see a variety of ways that this could *mistakenly* trigger DOUBLE_CAPSWORD, but I don't think it should. (Of course, I add the disclaimer that I only slept four hours last night, so probably missed something obvious). Can someone take a look and work it out?
I still believe this rule is of questionable value (even if it worked halfway correctly). Anyway, I figured out what was causing your false positives:

06/13/02 12:33 - MIME appears twice in the body
06/13/02 12:52 - <X-TAB> and </X-TAB> include a '-' character and are not standard HTML tags so SA "fails" to strip them out.

Problems:

1) since it is a body test, it is per-paragraph and not per-line! ANY paragraph that includes an acronym twice will match.
2) can HTML/XML tags include a hyphen?

In my nonspam corpus, I have hundreds of matches because of problem #1. Most are computer acronyms: LSB, BIOS, FPGA, CERT, IBM, IDE, NFS, USB, HTML, and so on.
Agreed on the dubious value -- Other matches I was getting in my nonspam corpus were on the "words" DVD, USA and the yahoogroups footer which advertises "Get your FREE credit report with a FREE CreditCheck". Can't wait to see what the GA makes of it :)
I've made it a rawbody test, so it actually matches line-by-line. Otherwise words in all caps (eg. a headline) repeated anywhere in the *next few* lines (eg. story leader) were getting a hit. Also made min wordsize 4 letters for a bit more sanity. But I reckon it needs low, low points!
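The difference between matching per paragraph and per line can be sketched in Python. This is an illustration under assumptions: the pattern combines the `{0,50}` gap checked in earlier with the 4-letter minimum described above, but the committed rule text is not quoted in this thread, and the helper names and sample text are made up.

```python
import re

# Assumption: {0,50} gap plus 4-letter minimum; the real committed rule
# is not shown in the thread.
RULE = re.compile(r"\b([A-Z]{4,})\b.{0,50}\b\1\b")

# Hypothetical message: an all-caps headline followed by a story leader.
paragraph = "BIOS UPDATE\nThe new BIOS fixes the USB and IDE issues.\nUSB now works."

def hits_per_line(text):
    """Apply the rule to each line separately (line-by-line matching)."""
    hits = []
    for line in text.split("\n"):
        m = RULE.search(line)
        if m:
            hits.append(m.group(1))
    return hits

def hits_joined(text):
    """Apply the rule to the paragraph joined into one line (paragraph matching)."""
    m = RULE.search(text.replace("\n", " "))
    return [m.group(1)] if m else []

print("per line: ", hits_per_line(paragraph))
print("joined:   ", hits_joined(paragraph))
```

The headline word only repeats across a line break, so it matches once the paragraph is joined but not when each line is tested on its own; the 4-letter minimum additionally keeps short acronyms such as USB out of play.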
Do rawbody tests have HTML removed? Maybe it should stay a body test but have the description changed to | A word in all caps repeated in the paragraph