SA Bugzilla – Bug 668
Some rules
Last modified: 2002-08-22 20:08:47 UTC
Some rules. Could somebody with a big corpus test these?

uri LINK_TO_EXE /\.exe$/i
describe LINK_TO_EXE Contains link to a windows executable
test LINK_TO_EXE ok http://cut/member46/test-zugang/free-sexsoftware.exe

body NIGERIAN_SCAM_ATTN /^ATTN:.{0,20}\/C\W{0,3}E\W{0,3}O/mi
describe NIGERIAN_SCAM_ATTN Frequent introduction in Nigerian scam
test NIGERIAN_SCAM_ATTN ok Attn:managing director/CEO
test NIGERIAN_SCAM_ATTN ok ATTN: DIRECTOR/C.E.O

body LIVE_SOMETHING_CAPS /\bLIVE\s*[A-Z]{3,}\b/
describe LIVE_SOMETHING_CAPS Talks about LIVE[...] in all caps
test LIVE_SOMETHING_CAPS ok zur geilsten LIVE LESBEN SHOW???
test LIVE_SOMETHING_CAPS fail live on CNN
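The "test" lines above can be sanity-checked outside SpamAssassin with a rough Python sketch. This is only an approximation: the patterns are hand-translated from Perl (/i becomes re.I, /m becomes re.M), and the sketch matches against the raw sample string, whereas SpamAssassin applies body rules to rendered message text and uri rules to extracted URIs. The names RULES, TESTS and run_tests are made up for this illustration.

```python
import re

# Hand-translated versions of the three proposed patterns (an assumption:
# Perl and Python agree on these particular constructs).
RULES = {
    "LINK_TO_EXE": re.compile(r"\.exe$", re.I),
    "NIGERIAN_SCAM_ATTN": re.compile(r"^ATTN:.{0,20}/C\W{0,3}E\W{0,3}O", re.I | re.M),
    "LIVE_SOMETHING_CAPS": re.compile(r"\bLIVE\s*[A-Z]{3,}\b"),
}

# (rule name, expected outcome, sample text), copied from the "test" lines.
TESTS = [
    ("LINK_TO_EXE", "ok", "http://cut/member46/test-zugang/free-sexsoftware.exe"),
    ("NIGERIAN_SCAM_ATTN", "ok", "Attn:managing director/CEO"),
    ("NIGERIAN_SCAM_ATTN", "ok", "ATTN: DIRECTOR/C.E.O"),
    ("LIVE_SOMETHING_CAPS", "ok", "zur geilsten LIVE LESBEN SHOW???"),
    ("LIVE_SOMETHING_CAPS", "fail", "live on CNN"),
]


def run_tests():
    """Return (rule, text, passed) for every sample line."""
    results = []
    for name, expected, text in TESTS:
        hit = RULES[name].search(text) is not None
        # "ok" means the rule should hit; "fail" means it should not.
        results.append((name, text, hit == (expected == "ok")))
    return results


if __name__ == "__main__":
    for name, text, passed in run_tests():
        print("PASS" if passed else "FAIL", name, repr(text))
```

A check like this only shows that the patterns match their own samples; it says nothing about false positives, which is what the corpus runs below are for.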
Subject: Re: [SAdev] New: Some rules

On Wed, Aug 07, 2002 at 10:17:49AM -0700, bugzilla-daemon@hughes-family.org wrote:
> Some rules. Could somebody with a big corpus test these?
>
> uri LINK_TO_EXE /\.exe$/i
> describe LINK_TO_EXE Contains link to a windows executable
> body NIGERIAN_SCAM_ATTN /^ATTN:.{0,20}\/C\W{0,3}E\W{0,3}O/mi
> describe NIGERIAN_SCAM_ATTN Frequent introduction in Nigerian scam
> body LIVE_SOMETHING_CAPS /\bLIVE\s*[A-Z]{3,}\b/
> describe LIVE_SOMETHING_CAPS Talks about LIVE[...] in all caps

OVERALL   SPAM  NONSPAM   S/O  SCORE  NAME
  13027   4446     8581  0.34   0.00  (all messages)
     19     19        0  1.00   1.00  LINK_TO_EXE
     16     14        2  0.93   1.00  LIVE_SOMETHING_CAPS
      9      9        0  1.00   1.00  NIGERIAN_SCAM_ATTN
OVERALL   SPAM  NONSPAM   S/O  SCORE  NAME
  11744   3414     8330  0.29   0.00  (all messages)
    101     38       63  0.60   1.00  LIVE_SOMETHING_CAPS
     32      5       27  0.31   1.00  LINK_TO_EXE

NIGERIAN_SCAM_ATTN didn't trigger.

LINK_TO_EXE hit in my nonspam on a lot of software release announcements.
(eg "get it now: http://www.ephpod.com/ephpod240.exe")

LIVE_SOMETHING_CAPS hit on CD new-release newsletters.
(eg "322114 R ASIA - LIVE AT BUDOKAN CD 8.99")
Subject: Re: Some rules

BDFO> ------- Additional Comments From rOD-spamassassin@arsecandle.org 2002-08-07 14:56 -------
BDFO> OVERALL   SPAM  NONSPAM   S/O  SCORE  NAME
BDFO>   11744   3414     8330  0.29   0.00  (all messages)
BDFO>     101     38       63  0.60   1.00  LIVE_SOMETHING_CAPS
BDFO>      32      5       27  0.31   1.00  LINK_TO_EXE
BDFO>
BDFO> NIGERIAN_SCAM_ATTN didn't trigger.

Hmm, doesn't sound very good. Better INVALID this bug.
These didn't fare too well on my corpus.

OVERALL   SPAM  NONSPAM   S/O  SCORE  NAME
  12121   7739     4382  0.64   0.00  (all messages)
     34     33        1  0.95   1.00  LINK_TO_EXE
     10     10        0  1.00   1.00  NIGERIAN_SCAM_ATTN
     52     42       10  0.70   1.00  LIVE_SOMETHING_CAPS
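For anyone comparing the tables above: the per-rule S/O figures are consistent with a ratio of hit rates rather than of raw hit counts, i.e. spam hit rate divided by the sum of the spam and nonspam hit rates. That reading is inferred from the posted numbers, not taken from the hit-frequencies script. The "(all messages)" row does not follow it and looks like the plain spam fraction of the corpus. A small Python sketch, with s_o_ratio as a made-up helper name:

```python
def s_o_ratio(spam_hits, nonspam_hits, spam_total, nonspam_total):
    """Spam hit rate over the sum of spam and nonspam hit rates.

    Assumption: this matches how the per-rule S/O column was computed;
    it reproduces the posted figures but is not copied from the script.
    """
    spam_rate = spam_hits / spam_total
    nonspam_rate = nonspam_hits / nonspam_total
    if spam_rate + nonspam_rate == 0:
        return None  # the rule never hit, so the ratio is undefined
    return spam_rate / (spam_rate + nonspam_rate)


if __name__ == "__main__":
    # LIVE_SOMETHING_CAPS on the three corpora posted in this bug.
    for counts in [(14, 2, 4446, 8581), (38, 63, 3414, 8330), (42, 10, 7739, 4382)]:
        print(counts, round(s_o_ratio(*counts), 2))
```

Normalising by corpus size is why 14 spam hits out of 16 total shows up as 0.93 in the first table and not as 14/16.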
Guys -- thanks a million for doing rule-QA on these. But would it be possible to use "hit-frequencies -x -p"? The extra stats, and the normalisation to percentages, make it easier to compare the results across corpora.
Will do in future. Is there some place where the hit-frequencies options are documented? I've seen people posting percentages but had no idea how to do so.
er, no, just in the script itself :(
Is there a way to get hit-frequencies to give the %ages *and* the raw numbers? Percentages are nice, but the raw numbers give a much better sense of significance to the percentages.
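Until someone confirms whether the script can do this itself, the two views can be combined by hand from the raw counts. The Python sketch below is a stand-alone illustration, not a hit-frequencies option; format_row and its column layout are invented here.

```python
def format_row(name, spam_hits, nonspam_hits, spam_total, nonspam_total):
    """Show each hit rate as a percentage with the raw count beside it."""
    spam_pct = 100.0 * spam_hits / spam_total
    nonspam_pct = 100.0 * nonspam_hits / nonspam_total
    return (
        f"{spam_pct:7.3f}% ({spam_hits:5d})  "
        f"{nonspam_pct:7.3f}% ({nonspam_hits:5d})  {name}"
    )


if __name__ == "__main__":
    # Counts taken from the 12121-message corpus posted earlier in this bug.
    print("  SPAM%   (hits)  NONSPAM% (hits)  NAME")
    print(format_row("LINK_TO_EXE", 33, 1, 7739, 4382))
    print(format_row("LIVE_SOMETHING_CAPS", 42, 10, 7739, 4382))
```

Keeping the count next to the percentage makes it obvious when a rate rests on only a handful of messages.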