SA Bugzilla – Bug 2970
"longwords" rules
Last modified: 2004-04-20 06:43:19 UTC
Robert Menschel <Robert at Menschel.net> said:

Date: Sun, 25 Jan 2004 22:37:03 -0800
Subject: [SAtalk] Longwords

Received an email this morning which reminded me about my longwords rules, which apparently got lost when I migrated my mass-check system from my mail server to my PC.

This was my exploration of the random words spammers have been including at the bottom of their emails, or in their text portions, or in their invisible text, to confuse some anti-spam software. (I call these words Bayes Fodder, since over time it seems they are helping my Bayes identify spam better and better and better.)

Anyway, I rebuilt, reran, refined, and:

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL   SPAM    HAM    S/O   RANK  SCORE  NAME
  91714  74113  17601  0.808   0.00   0.00  (all messages)
   7431   7429      2  0.999   1.00   3.00  RM_bpt_longwords68a
   6596   6595      1  0.999   0.98   1.00  RM_bpt_longwords69a
   4163   4163      0  1.000   0.71   2.00  RM_bpt_longwords78a
   8761   8753      8  0.996   0.51   3.00  RM_bpt_longwords59a
   2950   2950      0  1.000   0.48   1.00  RM_bpt_longwords79a
   1162   1162      0  1.000   0.15   4.00  RM_bpt_longwords96a
   1025   1025      0  1.000   0.13   4.00  RM_bpt_longwords88a
    590    590      0  1.000   0.05   1.00  RM_bpt_longwords89a
    545    545      0  1.000   0.04   3.00  RM_bpt_longwords97
    442    442      0  1.000   0.02   1.00  RM_bpt_longwords98
    330    330      0  1.000   0.00   1.00  RM_bpt_longwords99

OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
   91714    74113    17601  0.808   0.00   0.00  (all messages)
 100.000  80.8088  19.1912  0.808   0.00   0.00  (all messages as %)
   8.102  10.0239   0.0114  0.999   1.00   3.00  RM_bpt_longwords68a
   7.192   8.8986   0.0057  0.999   0.98   1.00  RM_bpt_longwords69a
   4.539   5.6171   0.0000  1.000   0.71   2.00  RM_bpt_longwords78a
   9.553  11.8103   0.0455  0.996   0.51   3.00  RM_bpt_longwords59a
   3.217   3.9804   0.0000  1.000   0.48   1.00  RM_bpt_longwords79a
   1.267   1.5679   0.0000  1.000   0.15   4.00  RM_bpt_longwords96a
   1.118   1.3830   0.0000  1.000   0.13   4.00  RM_bpt_longwords88a
   0.643   0.7961   0.0000  1.000   0.05   1.00  RM_bpt_longwords89a
   0.594   0.7354   0.0000  1.000   0.04   3.00  RM_bpt_longwords97
   0.482   0.5964   0.0000  1.000   0.02   1.00  RM_bpt_longwords98
   0.360   0.4453   0.0000  1.000   0.00   1.00  RM_bpt_longwords99

Scores of course are set to my 9.0 required hits, so you'll probably want to lower these scores. Depending on your system, an initial score of 0.5 or 1.0 for each rule might be worthwhile, and then you can increase the scores slowly if this spam continues to sneak past your system.

In my 19k corpus, one ham matches three of these rules, two of which I've scored at 3.0, and so that ham gets a score of 7.0 of 9. I may be reducing those rules to 2.5 or 2.0 instead of 3.0 once I complete my next global mass-check. So yes, caution is advised.

Bob Menschel

body      RM_bpt_longwords68a  /\b(?:[a-z]{6,}\s+){8}/
describe  RM_bpt_longwords68a  Long string of long words
score     RM_bpt_longwords68a  3.000
# 7429s/2h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list,
# "improving compatibility between computer platforms demands certain levels "

body      RM_bpt_longwords69a  /\b(?:[a-z]{6,}\s+){9}/
describe  RM_bpt_longwords69a  Long string of long words
score     RM_bpt_longwords69a  1.000
# type=max:1 (add to 59a,68a) - 6595s/1h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list

body      RM_bpt_longwords78a  /\b(?:[a-z]{7,}\s+){8}/
describe  RM_bpt_longwords78a  Long string of long words
score     RM_bpt_longwords78a  2.000
# type=max:2 (add to 68a) - 4163s/0h of 91714 corpus (74113s/17601h) 01/23/04

body      RM_bpt_longwords59a  /\b(?:[a-z]{5,}\s+){9}/
describe  RM_bpt_longwords59a  Long string of long words
score     RM_bpt_longwords59a  3.000
# 8753s/8h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list

body      RM_bpt_longwords79a  /\b(?:[a-z]{7,}\s+){9}/
describe  RM_bpt_longwords79a  Long string of long words
score     RM_bpt_longwords79a  1.000
# type=max:1 (add to 78a) - 2950s/0h of 91714 corpus (74113s/17601h) 01/23/04

body      RM_bpt_longwords96a  /\b(?:[a-z]{9,}\s+){6}/
describe  RM_bpt_longwords96a  Long string of long words
score     RM_bpt_longwords96a  4.000
# 1162s/0h of 91714 corpus (74113s/17601h) 01/23/04

body      RM_bpt_longwords88a  /\b(?:[a-z]{8,}\s+){8}/
describe  RM_bpt_longwords88a  Long string of long words
score     RM_bpt_longwords88a  4.000
# 1025s/0h of 91714 corpus (74113s/17601h) 01/23/04

body      RM_bpt_longwords89a  /\b(?:[a-z]{8,}\s+){9}/
describe  RM_bpt_longwords89a  Long string of long words
score     RM_bpt_longwords89a  1.000
# type=max:1 (add to 88a) - 590s/0h of 91714 corpus (74113s/17601h) 01/23/04

body      RM_bpt_longwords97  /\b(?:\w{9,}\s+){7}/
describe  RM_bpt_longwords97  Long string of long words
score     RM_bpt_longwords97  3.000
# 545s/0h of 91714 corpus (74113s/17601h) 01/23/04

body      RM_bpt_longwords98  /\b(?:\w{9,}\s+){8}/
describe  RM_bpt_longwords98  Long string of long words
score     RM_bpt_longwords98  1.000
# type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04

body      RM_bpt_longwords99  /\b(?:\w{9,}\s+){9}/
describe  RM_bpt_longwords99  Long string of long words
score     RM_bpt_longwords99  1.000
# type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04

Given a hitrate of 10% with an S/O of 0.999, we gotta apply them ;) Adding to SVN now.

(PS: as I read the new Apache 2.0 license, we no longer need to verify CLA receipt for patches/new rules sent by non-committers. Right?)
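For anyone who wants to sanity-check the patterns before loading them, here is a rough sketch that runs four of the posted regexes against the ham phrase quoted in the 68a rule comment. It uses Python's `re` module only as a stand-in for Perl; SpamAssassin applies body rules to its own rendered version of the message text, which this does not model. The names `LONGWORDS`, `matching_rules` and `HAM_SAMPLE` are made up for the illustration.

```python
import re

# Patterns copied from the rules posted in this bug. Python's re is used
# here as an approximation of Perl's regex engine (an assumption of this
# sketch, not how SpamAssassin itself evaluates body rules).
LONGWORDS = {
    "RM_bpt_longwords68a": r"\b(?:[a-z]{6,}\s+){8}",
    "RM_bpt_longwords69a": r"\b(?:[a-z]{6,}\s+){9}",
    "RM_bpt_longwords78a": r"\b(?:[a-z]{7,}\s+){8}",
    "RM_bpt_longwords59a": r"\b(?:[a-z]{5,}\s+){9}",
}


def matching_rules(text):
    """Return the sorted names of the rules whose pattern matches text."""
    return sorted(
        name for name, pattern in LONGWORDS.items() if re.search(pattern, text)
    )


# The ham phrase quoted in the 68a rule comment: eight lowercase words of
# six or more letters, each followed by whitespace.
HAM_SAMPLE = (
    "improving compatibility between computer platforms demands certain levels "
)

print(matching_rules(HAM_SAMPLE))  # ['RM_bpt_longwords68a']
```

The quoted phrase has exactly eight words, so it trips 68a but not the nine-word variants, and "levels" (six letters) keeps it out of 78a. That is the kind of ordinary technical prose behind the ham hits, which is why starting with low scores seems sensible.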
working on this, apparently -- got tired of waiting for it to move forward
I have an important addition for this set: RM_bpt_longwords512. It scored higher than all the rest. Attaching rule next.

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL    SPAM    HAM    S/O   RANK  SCORE  NAME
 125083  104895  20188  0.839   0.00   0.00  (all messages)
  20911   20908      3  0.999   1.00   2.00  RM_bpt_longwords512
  20129   20126      3  0.999   1.00   1.00  RM_bpt_longwords68a
  23236   23225     11  0.998   0.99   1.00  RM_bpt_longwords59a
  18138   18136      2  0.999   0.99   1.00  RM_bpt_longwords69a
  11766   11766      0  1.000   0.96   2.00  RM_bpt_longwords78a
   8662    8662      0  1.000   0.94   1.00  RM_bpt_longwords79a
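A note on reading the S/O column: the figures above look consistent with S/O being the spam hit rate divided by the sum of the spam and ham hit rates (each rate taken against its own corpus), rather than a ratio of raw hit counts. That reading is inferred from the numbers in this bug, not taken from the mass-check tooling itself, so treat the sketch below as an assumption. The function name `spam_over_overall` is made up for the example.

```python
def spam_over_overall(spam_hits, ham_hits, spam_total, ham_total):
    """Approximate S/O as spam_rate / (spam_rate + ham_rate).

    Assumption: each rate is the fraction of that corpus the rule hits.
    This is inferred from the tables in this bug, not from the tool's source.
    """
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return 0.0  # arbitrary choice for a rule that never fires
    return spam_rate / (spam_rate + ham_rate)


# Corpus sizes from the "(all messages)" row: 104895 spam, 20188 ham.
print(round(spam_over_overall(20908, 3, 104895, 20188), 3))   # RM_bpt_longwords512
print(round(spam_over_overall(23225, 11, 104895, 20188), 3))  # RM_bpt_longwords59a
```

Under that assumption the two calls give 0.999 and 0.998, matching the S/O column for RM_bpt_longwords512 and RM_bpt_longwords59a. Normalizing by corpus size matters here because the corpus is about 84% spam; a rule hitting a handful of ham is penalized more than the raw counts would suggest.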
Created attachment 1884: sample rule
I think there's already a form of this in SVN trunk. closing...
That rule:

  20911  20908  3  0.999  1.00  2.00  RM_bpt_longwords512

was developed after the initial longwords rules, and it's doing better than all the rest. If you added the previous rules, this one is worth considering as well.