Bug 584 - More general rule cleanup
Summary: More general rule cleanup
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 2.40CVS
Hardware: Other other
Importance: P2 normal
Target Milestone: ---
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-07-20 14:20 UTC by Theo Van Dinter
Modified: 2002-08-13 08:16 UTC



Attachment Type Modified Status Actions Submitter/CLA Status
Patch that implements all the proposed changes patch None Theo Van Dinter [HasCLA]

Description Theo Van Dinter 2002-07-20 14:20:16 UTC
I poked through the 20*.cf rules today and have a few changes to clean up,
improve performance, or hopefully make better matches:

The first are hopeful performance improvements.  Doing (xx|xy|xz) is more
expensive than x(x|y|z).  (I know I could do x[xyz], but assume the string is
longer...)  There were also some () without the ?:, which made the engine
capture groups for backreferences that are never used; those are fixed:

-body SENT_IN_COMPLIANCE                /message .{0,10}sen(?:d|t) in compliance (?:of|with)/i
+body SENT_IN_COMPLIANCE                /message .{0,10}sen[dt] in compliance (?:of|with)/i

-body EU_EMAIL_OPTOUT           /EU (?:e-?mail opt.?out|e.?commerce) directive/i
+body EU_EMAIL_OPTOUT           /EU e(?:-?mail opt.?out|.?commerce) directive/i

-body NO_COST                    /\bno (?:cost|charge)\b/i
+body NO_COST                    /\bno c(?:ost|harge)\b/i

-body EXCUSE_6                  /\b(?:wish to|click to|To) remove yourself/i
+body EXCUSE_6                  /\b(?:wish |click )?to remove yourself/i

-body EXCUSE_18                 /we do not (?:spam|send unsolicited)/i
+body EXCUSE_18                 /we do not s(?:pam|end unsolicited)/i

-body PRINT_FORM_SIGNATURE      /Sign(ature)?(?:\s*here|\s*please)?:.{0,30}___*/i
+body PRINT_FORM_SIGNATURE      /Sign(?:ature)?\s*(?:here|please)?:.{0,30}___*/i

-body DOMAIN_BODY               /\s(\.|dot\s+)(info|biz|name)\s/i
+body DOMAIN_BODY               /\s(?:\.|dot\s+)(?:info|biz|name)\s/i

-rawbody MONSTERHUT             /monsterhut.com/
+rawbody MONSTERHUT             /monsterhut\.com/

-body JODY                      /\b(?:My wife, Jody|Mi esposa, Jody)/
+body JODY                      /\bM(?:y wife|i esposa), Jody/

-body MYCASINOBUILDER           /MYCASINOBUILDER.COM/i
+body MYCASINOBUILDER           /MYCASINOBUILDER\.COM/i

-body     NO_DISSAPOINTMENT      /You won'?t be diss?app?ointed/i
+body     NO_DISSAPOINTMENT      /You won'?t be dis+ap+ointed/i

-body SEARCH_ENGINE_PROMO       /\b(?:(?:submitt?|list)(?:ed|ing|s)?|place(?:d|ment))\s+.{0,15}\b(?:in|to)\b.{0,15}\b(?:(?:top|best|major|largest|biggest).{0,15}\b)?(?:search(?:ing)?\s*(?:engine|site)|director(?:y|ies))\b/is
+body SEARCH_ENGINE_PROMO       /\b(?:(?:submit+|list)(?:ed|ing|s)?|place(?:d|ment))\s+.{0,15}\b(?:in|to)\b.{0,15}\b(?:(?:top|best|major|largest|biggest).{0,15}\b)?(?:search(?:ing)?\s*(?:engine|site)|director(?:y|ies))\b/is

-body WHY_WAIT                  /\b(?:why wait|what are you waiting for)\b/i
+body WHY_WAIT                  /\bw(?:hy wait|hat are you waiting for)\b/i

-body NAME_BRAND                        /\b(?:famous name brand|major brand)/i
+body NAME_BRAND                        /\b(?:famous name |major ) brand/i

-body HAIR_LOSS                 /\b(?:thinn?ing|restore|grow|new) hair|hair loss/i
+body HAIR_LOSS                 /\b(?:thin+ing|restore|grow|new) hair|hair loss/i

-body UNCENSORED                 /\buncensored (?:pics|photo)/i
+body UNCENSORED                 /\buncensored p(?:ics|hoto)/i

-header FROM_MALFORMED          From !~ /(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(\!\S+){1,}>/ [if-unset: unset@unset.unset]
+header FROM_MALFORMED          From !~ /(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(?:\!\S+)+>/ [if-unset: unset@unset.unset]

-header PLING_QUERY             Subject =~ /(?:\?.*!|!.*\?)/
+header PLING_QUERY             Subject =~ /\?.*!|!.*\?/

-header SUBJ_HAS_SPACES         Subject =~ /(?:\s{6,}|\t)/
+header SUBJ_HAS_SPACES         Subject =~ /\s{6,}|\t/

-header INVALID_DATE            Date !~ /^((Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?([\d ]?\d) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d\d|\d\d\d\d) \d\d:\d\d(:\d\d)? (UT|[A-Z]{3,5}|[+-]\d\d\d\d)(\s+\(.*\))?\s*$/
+header INVALID_DATE            Date !~ /^(?:(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?(?:[\d ]?\d) (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:\d{2}|\d{4}) \d{2}:\d{2}(?:\:\d{2})? (?:UT|[A-Z]{3,5}|[+-]\d{4})(?:\s+\(.*\))?\s*$/

-header INVALID_DATE_TZ_ABSURD  Date =~ /[-+](?:1[4-9]\d\d|[2-9]\d\d\d)$/
+header INVALID_DATE_TZ_ABSURD  Date =~ /[-+](?:1[4-9]\d{2}|[2-9]\d{3})$/

-header DATE_YEAR_ZERO_FIRST    Date =~ /[a-z]\s+0\d\d\d(\s|$)/
+header DATE_YEAR_ZERO_FIRST    Date =~ /[a-z]\s+0\d{3}(?:\s|$)/

-header FRIEND_AT_PUBLIC        To =~ /(yourdomain|you|your|public).(com|org|net)/i
+header FRIEND_AT_PUBLIC        To =~ /(?:yourdomain|you|your|public)\.(?:com|org|net)/i

-header DOMAIN_SUBJECT          Subject =~ /(\s(\.|dot\s+)(info|biz|name)|domain)\b.*(extension|info|regist(ry|ration|er)|submission)/i
+header DOMAIN_SUBJECT          Subject =~ /(?:\s(?:\.|dot\s+)(?:info|biz|name)|domain)\b.*(?:extension|info|regist(?:ry|ration|er)|submission)/i

-header FAKED_IP_IN_RCVD                Received =~ /from [-0-9a-z\._]+_\[\d+\.\d+\.\d+\.\d+\] /i
+header FAKED_IP_IN_RCVD                Received =~ /from [-0-9a-z\._]+_\[(?:\d+\.){3}\d+\] /i

-header YAHOO_MSGID_ADDED       ALL =~ /Message-Id: <\S+\.mail.yahoo.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s
+header YAHOO_MSGID_ADDED       ALL =~ /Message-Id: <\S+\.mail\.yahoo\.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s

-header FROM_BTAMAIL            From =~ /\@btamail.net.cn/i
+header FROM_BTAMAIL            From =~ /\@btamail\.net\.cn/i

-header FROM_UGETMORE           From =~ /\@ugetmore4less.net/i
+header FROM_UGETMORE           From =~ /\@ugetmore4less\.net/i

-header FROM_TOPICA             From =~ /\@(?:\w\.)*email-publisher.com/i
+header FROM_TOPICA             From =~ /\@(?:\w\.)*email-publisher\.com/i

-header Q_FOR_SELLER            Subject =~ /Question.*(for|to|from eBay).*(seller|Member)/
+header Q_FOR_SELLER            Subject =~ /Question.*(?:for|to|from eBay).*(?:seller|Member)/

-uri NORMAL_HTTP_TO_IP   /^https?\:\/\/\d+\.\d+\.\d+\.\d+/is
+uri NORMAL_HTTP_TO_IP   /^https?\:\/\/(?:\d+\.){3}\d+/is

-uri UNSUB_SCRIPT        /^https?:\/\/.*?cgi.*?(unsubscribe|remove)/i
+uri UNSUB_SCRIPT        /^https?:\/\/.*?cgi.*?(?:unsubscribe|remove)/i
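
The rewrites in this first section are meant to leave each rule's behaviour unchanged. A minimal sketch of how that could be spot-checked follows. It uses Python's re module as a stand-in for Perl's regex engine (an assumption; SpamAssassin evaluates these rules in Perl), re.I in place of /i, and a handful of made-up sample lines rather than a real corpus.

```python
import re

# (original, rewritten) pairs copied from the diff above; re.I mirrors /i.
PAIRS = [
    (r"\bno (?:cost|charge)\b", r"\bno c(?:ost|harge)\b"),
    (r"we do not (?:spam|send unsolicited)", r"we do not s(?:pam|end unsolicited)"),
    (r"\s(\.|dot\s+)(info|biz|name)\s", r"\s(?:\.|dot\s+)(?:info|biz|name)\s"),
]

# Hypothetical sample lines, chosen only for illustration.
SAMPLES = [
    "There is no cost to you",
    "No charge, ever",
    "no chance of that",
    "We do not spam",
    "we do not send unsolicited mail",
    "register your dot  biz today",
    "get a .info name",
]


def same_verdicts(original, rewritten, samples):
    """Return True if both patterns agree (match or no match) on every sample."""
    old = re.compile(original, re.I)
    new = re.compile(rewritten, re.I)
    return all(bool(old.search(s)) == bool(new.search(s)) for s in samples)


for original, rewritten in PAIRS:
    print(rewritten, same_verdicts(original, rewritten, SAMPLES))

# Adding ?: changes how many groups are captured, not what matches.
print(re.compile(PAIRS[2][0]).groups, re.compile(PAIRS[2][1]).groups)
```

Agreement on a few samples is not a proof of equivalence; running both versions of a rule over a real mail corpus would be the stronger check.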




This next section has various (|foo) and the like.  I can't figure out why
that's better than (foo)?, so I rewrote them:


-body EXCUSE_15                 /this (?:|e?-?mail|message) (?:is|was) (?:not|never) (?:spam|(?:sent |)unsolicited)/i
+body EXCUSE_15                 /this\s*(?:e?-?mail|message)? (?:is|was) n(?:ot|ever) (?:spam|(?:sent )?unsolicited)/i

-body FINANCIAL                 /\bfinancial(?:ly|) free/i
+body FINANCIAL                 /\bfinancial(?:ly)? free/i

-body REFINANCE_YOUR_HOME        /\brefinance your (?:current|) (?:home|house)\b/i
+body REFINANCE_YOUR_HOME        /\brefinance your (?:current)? h(?:ome|ouse)\b/i
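
For a single optional word, (?:foo|) and (?:foo)? accept the same strings. The EXCUSE_15 rewrite goes a little further because it also replaces the space after "this" with \s*. The sketch below compares old and new forms on a few hypothetical phrases, again with Python's re module standing in for Perl (an assumption). Whether the extra match on "this is not spam" was intended is not stated in the report.

```python
import re

# Patterns copied from the diff above; re.I mirrors the /i flag.
FINANCIAL_OLD = re.compile(r"\bfinancial(?:ly|) free", re.I)
FINANCIAL_NEW = re.compile(r"\bfinancial(?:ly)? free", re.I)

EXCUSE_15_OLD = re.compile(
    r"this (?:|e?-?mail|message) (?:is|was) (?:not|never) "
    r"(?:spam|(?:sent |)unsolicited)", re.I)
EXCUSE_15_NEW = re.compile(
    r"this\s*(?:e?-?mail|message)? (?:is|was) n(?:ot|ever) "
    r"(?:spam|(?:sent )?unsolicited)", re.I)


def hit(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None


# Hypothetical phrases, for illustration only.
for text in ("financially free", "financial free", "financial advice"):
    print(repr(text), hit(FINANCIAL_OLD, text), hit(FINANCIAL_NEW, text))

for text in ("this message was never sent unsolicited",
             "this  is not spam",   # two spaces: the empty alternative applies
             "this is not spam"):   # one space: only the rewritten rule matches
    print(repr(text), hit(EXCUSE_15_OLD, text), hit(EXCUSE_15_NEW, text))
```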



Now for the improved rules.  Fix spelling errors, try to match more things, etc.

# This went from matching 0 of my corpus to at least matching 2.
# I block obvious ADV subject mails at SMTP, so I don't have a lot of these...
-header ADVERT_CODE             Subject =~ /(^\s*|\s+)ADV([\s:-]|$)/i
+header ADVERT_CODE             Subject =~ /\bADV\b/i

# Have gotten FPs off this, and whitespace can't be in the host...
-uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/]*%/
+uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/\s]*%/

-body SATISFACTION              /\bsatisfaction .{0,9}gauranteed|not .{0,9}satisfied\b/i
+body SATISFACTION              /\bsatisfaction .{0,9}g(?:au|ua)ranteed|not .{0,9}satisfied\b/i

-body HARDCORE_PORN              /\bhard[ -]?core .{0,9}(?:teen|virgin|cheerleader|amat[eu][eu]r)|\bextreme hardcore/i
+body HARDCORE_PORN              /\bhard[ -]?core .{0,9}(?:teen|virgin|cheerleader|amat(?:eu|ue)r)|\bextreme hardcore/i

-body HOT_NASTY         /\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat[eu][eu]r|slut|adult|cum|xxx|sites?)\b/i
+body HOT_NASTY         /\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat(?:eu|ue)r|slut|adult|cum|xxx|sites?)\b/i

-body AMATUER_PORN               /\bamat[eu][eu]r .{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat[eu][eu]r/i
+body AMATUER_PORN               /\bamat(?:eu|ue)r .{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat(?:eu|ue)r/i

-body RAPE                       /\b(?:virgin|gang|teen|amat[eu][eu]r) rape|rape (?:sites?|sex)\b/i
+body RAPE                       /\b(?:virgin|gang|teen|amat(?:eu|ue)r) rape|rape s(?:ites?|ex)\b/i
Comment 1 Theo Van Dinter 2002-07-20 14:21:04 UTC
Created attachment 234
Patch that implements all the proposed changes
Comment 2 Theo Van Dinter 2002-07-20 14:31:40 UTC
Subject: Re: [SAdev]  New: More general rule cleanup

On Sat, Jul 20, 2002 at 02:20:16PM -0700, bugzilla-daemon@hughes-family.org wrote:
> The first are hopeful performance improvements.  Doing (xx|xy|xz) is more

The first section was actually any general changes that shouldn't change
what the rule does.  I added ?: to places, removed parens where they
didn't need to be, moved common (pre|suf)fix outside of parens, escaped
. where appropriate, etc.
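
Two details of this kind of cleanup are easy to get wrong: an unescaped "." matches any character, and an escaped "\(" is a literal parenthesis rather than a group, so it must not be given a "?:". The sketch below illustrates both with Python's re module standing in for Perl (an assumption). The ZONE_* patterns are a shortened, hypothetical tail of a Date-style rule, not a rule from the ruleset.

```python
import re

# An unescaped "." matches any character, so the looser pattern also hits
# strings that merely resemble the domain.
LOOSE_DOT = re.compile(r"monsterhut.com")
STRICT_DOT = re.compile(r"monsterhut\.com")

# Numeric time zone followed by an optional "(comment)".
# "\(" and "\)" are literal parentheses here.
ZONE_OK = re.compile(r"[+-]\d{4}(?:\s+\(.*\))?\s*$")
# Over-eager conversion: "\(?:" now means an optional "(" followed by ":".
ZONE_BROKEN = re.compile(r"[+-]\d{4}(?:\s+\(?:.*\))?\s*$")


def matches(pattern, text):
    """Return True if the pattern matches at the start of text."""
    return pattern.match(text) is not None


print(matches(LOOSE_DOT, "monsterhut-com"), matches(STRICT_DOT, "monsterhut-com"))
print(matches(ZONE_OK, "+0000 (GMT)"), matches(ZONE_BROKEN, "+0000 (GMT)"))
print(matches(ZONE_OK, "+0000"), matches(ZONE_BROKEN, "+0000"))
```

The broken variant still accepts a bare zone, so the difference only shows up on headers that carry a trailing comment.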

> # This went from matching 0 of my corpus to at least matching 2.
> # I block obvious ADV subject mails at SMTP, so I don't have a lot of these...
> -header ADVERT_CODE             Subject =~ /(^\s*|\s+)ADV([\s:-]|$)/i
> +header ADVERT_CODE             Subject =~ /\bADV\b/i

After poking around some more, I found this also increased my FPs a
little bit, all abbreviations of "advanced" -> "adv.".  I'd like to know
what it does against other folks' corpora.
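
A small sketch of that trade-off, using Python's re module as a stand-in for Perl (an assumption) and hypothetical subject lines: the \b version picks up tagged forms the original missed, but also the "Adv." abbreviation.

```python
import re

# Patterns copied from the diff quoted above; re.I mirrors the /i flag.
OLD = re.compile(r"(^\s*|\s+)ADV([\s:-]|$)", re.I)
NEW = re.compile(r"\bADV\b", re.I)

# Hypothetical subject lines, for illustration only.
SUBJECTS = [
    "ADV: lose weight now",   # classic tag, matched by both
    "[ADV] cheap toner",      # bracketed tag, matched only by the new rule
    "Adv. Calculus notes",    # abbreviation of "advanced": a false positive
    "Advanced routing",       # no word boundary after "Adv", matched by neither
]

for subject in SUBJECTS:
    print(f"{subject!r}: old={bool(OLD.search(subject))} new={bool(NEW.search(subject))}")
```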

> # Have gotten FPs off this, and whitespace can't be in the host...
> -uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/]*%/
> +uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/\s]*%/

I believe it was something in a signature like:

%   My homepage: http://someplace.domain.com    %
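
A sketch of why excluding whitespace helps, under the assumption that the rule saw the URL together with the trailing spaces and "%" of such a signature line. Python's re module stands in for Perl, and the Perl delimiter escapes (\: and \/) are dropped because Python does not need them; the second URL is hypothetical.

```python
import re

# Same patterns as the uri rule above, without the Perl delimiter escapes.
OLD = re.compile(r"^https?://[^/]*%")
NEW = re.compile(r"^https?://[^/\s]*%")

signature_uri = "http://someplace.domain.com    %"    # from the signature line
escaped_host = "http://www.example%2Ecom/index.html"  # hypothetical escaped host

for uri in (signature_uri, escaped_host):
    print(f"{uri!r}: old={bool(OLD.search(uri))} new={bool(NEW.search(uri))}")
```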

> -body SATISFACTION              /\bsatisfaction .{0,9}gauranteed|not
> .{0,9}satisfied\b/i
> +body SATISFACTION              /\bsatisfaction .{0,9}g(?:au|ua)ranteed|not
> .{0,9}satisfied\b/i

The description has "guaranteed", but the rule had "gauranteed", so I
figure we might as well search for both. :)

> -body HARDCORE_PORN              /\bhard[ -]?core
> .{0,9}(?:teen|virgin|cheerleader|amat[eu][eu]r)|\bextreme hardcore/i
> +body HARDCORE_PORN              /\bhard[ -]?core
> .{0,9}(?:teen|virgin|cheerleader|amat(?:eu|ue)r)|\bextreme hardcore/i

There were a bunch of 'amat[eu][eu]r' which would match 'amateer' and
'amatuur'.  I figured it was better to search for 'amat(eu|ue)r' instead.
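
The difference between the character-class form and the alternation can be shown in a few lines (Python's re module standing in for Perl, which is an assumption):

```python
import re

LOOSE = re.compile(r"amat[eu][eu]r", re.I)    # any two letters drawn from e/u
STRICT = re.compile(r"amat(?:eu|ue)r", re.I)  # only "eu" or "ue"

for word in ("amateur", "amatuer", "amateer", "amatuur"):
    print(word, bool(LOOSE.search(word)), bool(STRICT.search(word)))
```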

Comment 3 Marc Perkel 2002-07-20 18:30:26 UTC
Subject: Re: [SAdev]  New: More general rule cleanup

I have a few comments about your improvements. Several do technically 
increase the accuracy but would not catch any more or less spam. Some 
save one character at the expense of readability and expandability. I 
think that readability is very important.

bugzilla-daemon@hughes-family.org wrote:

>
>-body SENT_IN_COMPLIANCE                /message .{0,10}sen(?:d|t) in compliance
>(?:of|with)/i
>+body SENT_IN_COMPLIANCE                /message .{0,10}sen[dt] in compliance
>(?:of|with)/i
>
>-body EU_EMAIL_OPTOUT           /EU (?:e-?mail opt.?out|e.?commerce) directive/i
>+body EU_EMAIL_OPTOUT           /EU e(?:-?mail opt.?out|.?commerce) directive/i
>  
>
The above 2 look good.

>-body NO_COST                    /\bno (?:cost|charge)\b/i
>+body NO_COST                    /\bno c(?:ost|harge)\b/i
>  
>
Saves one byte but at the cost of good clean readability. I think 
readability and simplicity are more important. Especially if we want to 
add a third item that doesn't begin with a C.

>-body EXCUSE_6                  /\b(?:wish to|click to|To) remove yourself/i
>+body EXCUSE_6                  /\b(?:wish |click )?to remove yourself/i
>  
>
Isn't this just the same as  /to remove yourself/i ?

Also - I think this is a bad rule because of FP. This rule should die!

>-body EXCUSE_18                 /we do not (?:spam|send unsolicited)/i
>+body EXCUSE_18                 /we do not s(?:pam|end unsolicited)/i
>  
>
Not clean and readable.

>-body PRINT_FORM_SIGNATURE      /Sign(ature)?(?:\s*here|\s*please)?:.{0,30}___*/i
>+body PRINT_FORM_SIGNATURE      /Sign(?:ature)?\s*(?:here|please)?:.{0,30}___*/i
>
>-body DOMAIN_BODY               /\s(\.|dot\s+)(info|biz|name)\s/i
>+body DOMAIN_BODY               /\s(?:\.|dot\s+)(?:info|biz|name)\s/i
>
>-rawbody MONSTERHUT             /monsterhut.com/
>+rawbody MONSTERHUT             /monsterhut\.com/
>
>-body JODY                      /\b(?:My wife, Jody|Mi esposa, Jody)/
>+body JODY                      /\bM(?:y wife|i esposa), Jody/
>
>-body MYCASINOBUILDER           /MYCASINOBUILDER.COM/i
>+body MYCASINOBUILDER           /MYCASINOBUILDER\.COM/i
>  
>
>-body     NO_DISSAPOINTMENT      /You won'?t be diss?app?ointed/i
>+body     NO_DISSAPOINTMENT      /You won'?t be dis+ap+ointed/i
>
>-body SEARCH_ENGINE_PROMO       
>/\b(?:(?:submitt?|list)(?:ed|ing|s)?|place(?:d|ment))\s+.{0,15}\b(?:in|to)\b.{0,15}\b(?:(?:top|best|major|largest|biggest).{0,15}\b)?(?:search(?:ing)?\s*(?:engine|site)|director(?:y|ies))\b/is
>+body SEARCH_ENGINE_PROMO       
>/\b(?:(?:submit+|list)(?:ed|ing|s)?|place(?:d|ment))\s+.{0,15}\b(?:in|to)\b.{0,15}\b(?:(?:top|best|major|largest|biggest).{0,15}\b)?(?:search(?:ing)?\s*(?:engine|site)|director(?:y|ies))\b/is
>
>-body WHY_WAIT                  /\b(?:why wait|what are you waiting for)\b/i
>+body WHY_WAIT                  /\bw(?:hy wait|hat are you waiting for)\b/i
>  
>
Again - I think readability is more important.

>-body NAME_BRAND                        /\b(?:famous name brand|major brand)/i
>+body NAME_BRAND                        /\b(?:famous name |major ) brand/i
>  
>
Might have broken this rule. Why trailing spaces?

>-body HAIR_LOSS                 /\b(?:thinn?ing|restore|grow|new) hair|hair loss/i
>+body HAIR_LOSS                 /\b(?:thin+ing|restore|grow|new) hair|hair loss/i
>  
>
OK

>-body UNCENSORED                 /\buncensored (?:pics|photo)/i
>+body UNCENSORED                 /\buncensored p(?:ics|hoto)/i
>  
>
Again - Readability - Suppose I wanted to add "movies" to the list?

>-header FROM_MALFORMED          From !~
>/(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(\!\S+){1,}>/ [if-unset: unset@unset.unset]
>+header FROM_MALFORMED          From !~
>/(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(?:\!\S+)+>/ [if-unset: unset@unset.unset]
>
>-header PLING_QUERY             Subject =~ /(?:\?.*!|!.*\?)/
>+header PLING_QUERY             Subject =~ /\?.*!|!.*\?/
>
>-header SUBJ_HAS_SPACES         Subject =~ /(?:\s{6,}|\t)/
>+header SUBJ_HAS_SPACES         Subject =~ /\s{6,}|\t/
>
>-header INVALID_DATE            Date !~ /^((Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?([\d ]?\d) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d\d|\d\d\d\d) \d\d:\d\d(:\d\d)? (UT|[A-Z]{3,5}|[+-]\d\d\d\d)(\s+\(.*\))?\s*$/
>+header INVALID_DATE            Date !~ /^(?:(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?(?:[\d ]?\d) (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:\d{2}|\d{4}) \d{2}:\d{2}(?:\:\d{2})? (?:UT|[A-Z]{3,5}|[+-]\d{4})(?:\s+\(.*\))?\s*$/
>
>-header INVALID_DATE_TZ_ABSURD  Date =~ /[-+](?:1[4-9]\d\d|[2-9]\d\d\d)$/
>+header INVALID_DATE_TZ_ABSURD  Date =~ /[-+](?:1[4-9]\d{2}|[2-9]\d{3})$/
>
>-header DATE_YEAR_ZERO_FIRST    Date =~ /[a-z]\s+0\d\d\d(\s|$)/
>+header DATE_YEAR_ZERO_FIRST    Date =~ /[a-z]\s+0\d{3}(?:\s|$)/
>
>-header FRIEND_AT_PUBLIC        To =~ /(yourdomain|you|your|public).(com|org|net)/i
>+header FRIEND_AT_PUBLIC        To =~
>/(?:yourdomain|you|your|public)\.(?:com|org|net)/i
>
>-header DOMAIN_SUBJECT          Subject =~
>/(\s(\.|dot\s+)(info|biz|name)|domain)\b.*(extension|info|regist(ry|ration|er)|submission)/i
>+header DOMAIN_SUBJECT          Subject =~
>/(?:\s(?:\.|dot\s+)(?:info|biz|name)|domain)\b.*(?:extension|info|regist(?:ry|ration|er)|submission)/i
>
>-header FAKED_IP_IN_RCVD                Received =~ /from
>[-0-9a-z\._]+_\[\d+\.\d+\.\d+\.\d+\] /i
>+header FAKED_IP_IN_RCVD                Received =~ /from
>[-0-9a-z\._]+_\[(?:\d+\.){3}\d+\] /i
>
>-header YAHOO_MSGID_ADDED       ALL =~ /Message-Id:
><\S+\.mail.yahoo.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s
>+header YAHOO_MSGID_ADDED       ALL =~ /Message-Id:
><\S+\.mail\.yahoo\.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s
>
>-header FROM_BTAMAIL            From =~ /\@btamail.net.cn/i
>+header FROM_BTAMAIL            From =~ /\@btamail\.net\.cn/i
>
>-header FROM_UGETMORE           From =~ /\@ugetmore4less.net/i
>+header FROM_UGETMORE           From =~ /\@ugetmore4less\.net/i
>
>-header FROM_TOPICA             From =~ /\@(?:\w\.)*email-publisher.com/i
>+header FROM_TOPICA             From =~ /\@(?:\w\.)*email-publisher\.com/i
>
>-header Q_FOR_SELLER            Subject =~ /Question.*(for|to|from
>eBay).*(seller|Member)/
>+header Q_FOR_SELLER            Subject =~ /Question.*(?:for|to|from
>eBay).*(?:seller|Member)/
>
>-uri NORMAL_HTTP_TO_IP   /^https?\:\/\/\d+\.\d+\.\d+\.\d+/is
>+uri NORMAL_HTTP_TO_IP   /^https?\:\/\/(?:\d+\.){3}\d+/is
>
>-uri UNSUB_SCRIPT        /^https?:\/\/.*?cgi.*?(unsubscribe|remove)/i
>+uri UNSUB_SCRIPT        /^https?:\/\/.*?cgi.*?(?:unsubscribe|remove)/i
>
>
>
>
>This next section has various (|foo) and the like.  I can't figure out why
>that's better than (foo)?, so I rewrote them:
>
>
>-body EXCUSE_15                 /this (?:|e?-?mail|message) (?:is|was)
>(?:not|never) (?:spam|(?:sent |)unsolicited)/i
>+body EXCUSE_15                 /this\s*(?:e?-?mail|message)? (?:is|was)
>n(?:ot|ever) (?:spam|(?:sent )?unsolicited)/i
>
>-body FINANCIAL                 /\bfinancial(?:ly|) free/i
>+body FINANCIAL                 /\bfinancial(?:ly)? free/i
>
>-body REFINANCE_YOUR_HOME        /\brefinance your (?:current|) (?:home|house)\b/i
>+body REFINANCE_YOUR_HOME        /\brefinance your (?:current)? h(?:ome|ouse)\b/i
>
>  
>
Again - Readability.

>
>Now are the improved rules.  Fix spelling errors, try to match more things, etc.
>
># This went from matching 0 of my corpus to at least matching 2.
># I block obvious ADV subject mails at SMTP, so I don't have a lot of these...
>-header ADVERT_CODE             Subject =~ /(^\s*|\s+)ADV([\s:-]|$)/i
>+header ADVERT_CODE             Subject =~ /\bADV\b/i
>
># Have gotten FPs off this, and whitespace can't be in the host...
>-uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/]*%/
>+uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/\s]*%/
>
>-body SATISFACTION              /\bsatisfaction .{0,9}gauranteed|not
>.{0,9}satisfied\b/i
>+body SATISFACTION              /\bsatisfaction .{0,9}g(?:au|ua)ranteed|not
>.{0,9}satisfied\b/i
>
>-body HARDCORE_PORN              /\bhard[ -]?core
>.{0,9}(?:teen|virgin|cheerleader|amat[eu][eu]r)|\bextreme hardcore/i
>+body HARDCORE_PORN              /\bhard[ -]?core
>.{0,9}(?:teen|virgin|cheerleader|amat(?:eu|ue)r)|\bextreme hardcore/i
>
>-body HOT_NASTY         
>/\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat[eu][eu]r|slut|adult|cum|xxx|sites?)\b/i
>+body HOT_NASTY         
>/\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat(?:eu|ue)r|slut|adult|cum|xxx|sites?)\b/i
>
>-body AMATUER_PORN               /\bamat[eu][eu]r
>.{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat[eu][eu]r/i
>+body AMATUER_PORN               /\bamat(?:eu|ue)r
>.{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat(?:eu|ue)r/i
>
>-body RAPE                       /\b(?:virgin|gang|teen|amat[eu][eu]r) rape|rape
>(?:sites?|sex)\b/i
>+body RAPE                       /\b(?:virgin|gang|teen|amat(?:eu|ue)r)
>rape|rape s(?:ites?|ex)\b/i
>
>  
>
Again - readability. Suppose I wanted to add "rape movies" ?


Comment 4 Theo Van Dinter 2002-07-20 19:19:17 UTC
Subject: Re: [SAdev]  New: More general rule cleanup

On Sat, Jul 20, 2002 at 05:51:09PM -0700, Marc Perkel wrote:
> >-body NO_COST                    /\bno (?:cost|charge)\b/i
> >+body NO_COST                    /\bno c(?:ost|harge)\b/i
> Saves one byte but at the cost of good clean readability. I think 
> readability and simplicity are more important. Especially if we want to 
> add a third item that doesn't begin with a C.

Yes, but it's more efficient than the original -- it's not about saving
bytes, it's about performance.  Say "\bno " occurs X times in a mail --
"\bno c" will likely occur <X times, so the RE engine doesn't need to
look at all the other locations.

If we later want to add in another word that doesn't start with a 'c',
then it would change back to the original form with another "|word" on it.
But that's a different pattern. ;)

Overall, I don't think these are a huge speed improvement, but in total
overall time it may add up.  For those of us running older/slower machines,
every cycle we can save is a definite win.

> >-body EXCUSE_6                  /\b(?:wish to|click to|To) remove yourself/i
> >+body EXCUSE_6                  /\b(?:wish |click )?to remove yourself/i
> Isn't this just the same as  /to remove yourself/i
> Also - I think this is a bad rule because of FP. This rule should die!

Good point -- /\bto remove yourself\b/i is more efficient.  I don't know
what to do about the FPs...  It's essentially a test for mailing lists
and spam.  I think we need a more specific text if we want to make it
more spammy (or have some other tests with enough negativity ...)
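
A sketch of both points, using Python's re module as a stand-in for Perl (an assumption) and hypothetical sample sentences: the three-way alternation and the simplified pattern agree, and an ordinary mailing-list footer matches just as readily as spam does.

```python
import re

ORIGINAL = re.compile(r"\b(?:wish to|click to|To) remove yourself", re.I)
SIMPLIFIED = re.compile(r"\bto remove yourself\b", re.I)

# Hypothetical sample sentences, for illustration only.
SAMPLES = [
    "If you wish to remove yourself from future mailings, reply now",
    "Click to remove yourself",
    "To remove yourself from this mailing list, send mail to the list owner",
    "Please remove me from the list",
]

for text in SAMPLES:
    print(bool(ORIGINAL.search(text)), bool(SIMPLIFIED.search(text)), text)
```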

> >-body WHY_WAIT                  /\b(?:why wait|what are you waiting for)\b/i
> >+body WHY_WAIT                  /\bw(?:hy wait|hat are you waiting for)\b/i
> Again - I think readability is more important.

See my first comment again.  It's actually more efficient -- instead of
the RE stopping on every word boundary (\b) and trying to determine if
either set of following strings match, it'll only stop on '\bw' which
is much less common.

> >-body NAME_BRAND                        /\b(?:famous name brand|major brand)/i
> >+body NAME_BRAND                        /\b(?:famous name |major ) brand/i
> >
> Might have broken this rule. Why trailing spaces?

Should be /\b(?:famous name|major) brand/i ...  Good eye.
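
The effect of the stray trailing spaces can be checked directly (Python's re module standing in for Perl, which is an assumption; the sample phrases are hypothetical):

```python
import re

ORIGINAL = re.compile(r"\b(?:famous name brand|major brand)", re.I)
PATCHED = re.compile(r"\b(?:famous name |major ) brand", re.I)  # needs two spaces
CORRECTED = re.compile(r"\b(?:famous name|major) brand", re.I)

for text in ("major brand watches", "famous name brand perfume", "major  brand"):
    print(repr(text),
          bool(ORIGINAL.search(text)),
          bool(PATCHED.search(text)),
          bool(CORRECTED.search(text)))
```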

Comment 5 Daniel Quinlan 2002-07-21 01:13:47 UTC
Subject: Re: [SAdev]  New: More general rule cleanup

Theo Van Dinter <felicity@kluge.net> wrote:

>>> -body NO_COST                    /\bno (?:cost|charge)\b/i
>>> +body NO_COST                    /\bno c(?:ost|harge)\b/i

Marc Perkel wrote:

>> Saves one byte but at the cost of good clean readability. I think 
>> readability and simplicity are more important. Especially if we want to 
>> add a third item that doesn't begin with a C.
 
Theo Van Dinter <felicity@kluge.net> writes:

> Yes, but it's more efficient than the original -- it's not about saving
> bytes, it's about performance.  Say "\bno " occurs X times in a mail --
> "\bno c" will likely occur <X times, so the RE engine doesn't need to
> look at all the other locations.

This is not a big deal, but I think Marc has a good point.  The
performance difference is probably insignificant.  On the other hand, we
continually have errors in regular expressions, often when "excessive
cleverness" has been applied.

This seems like a pretty good example of premature/excessive
optimization.  There is no data showing that the relevant code is run
for any significant period of time or that these changes produce a
measurable improvement in performance.  Maybe they do, but it would be
nice to know before we complicate every regular expression.

In contrast, your changes to the eval loops in PerMsgStatus.pm were
great.  The code was responsible for a lot of our execution time and
there was a huge speed improvement.  Even better, the code was just as
easy to understand as the original.

Dan
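
One way to gather the kind of data asked for here is a small timing harness. The following is only a sketch: it uses Python's re module and timeit rather than Perl (an assumption, so the numbers say nothing definite about SpamAssassin itself), and the body text is synthetic. The same idea could be repeated in Perl against a real corpus.

```python
import re
import timeit

ORIGINAL = re.compile(r"\bno (?:cost|charge)\b", re.I)
FACTORED = re.compile(r"\bno c(?:ost|harge)\b", re.I)

# Synthetic body text (an assumption): many "no ..." near-misses, one real hit.
BODY = ("there is no way, no time and no reason to wait. " * 200) + "no cost to you"


def seconds_per_search(pattern, text, repeats=200):
    """Average wall-clock seconds for one pattern.search(text) call."""
    return timeit.timeit(lambda: pattern.search(text), number=repeats) / repeats


for name, pattern in (("original", ORIGINAL), ("factored", FACTORED)):
    print(f"{name}: {seconds_per_search(pattern, BODY):.6f} s per search")
```

Results will vary by engine, version, and input, so any conclusion should come from measurements on the actual rules and mail.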

Comment 6 Marc Perkel 2002-07-21 07:28:44 UTC
Subject: Re: [SAdev]  New: More general rule cleanup

Thanks Dan. And the changes were very clever. Some of his rule changes 
actually did make things more readable and he is highly skilled at 
regular expressions.  I have found that readability and 
understandability are important in maintaining code. I think in a group 
project it's even more important.

Daniel Quinlan wrote:

>This is not a big deal, but I think Marc has a good point.  The
>performance difference is probably insignificant.  On the other hand, we
>continually have errors in regular expressions, often when "excessive
>cleverness" has been applied.
>
>This seems like a pretty good example of premature/excessive
>optimization.  There is no data showing that the relevant code is run
>for any significant period of time or that these changes produce a
>measurable improvement in performance.  Maybe they do, but it would be
>nice to know before we complicate every regular expression.
>
>In contrast, your changes to the eval loops in PerMsgStatus.pm were
>great.  The code was responsible for a lot of our execution time and
>there was a huge speed improvement.  Even better, the code was just as
>easy to understand as the original.
>
>Dan
>
>  
>


Comment 7 Theo Van Dinter 2002-07-21 09:09:38 UTC
Subject: Re: [SAdev]  New: More general rule cleanup

On Sun, Jul 21, 2002 at 07:28:30AM -0700, Marc Perkel wrote:
> Thanks Dan. And the changes were very clever. Some of his rule changes 
> actually did make things more readable and he is highly skilled at 
> regular expressions.  I have found that readability and 
> understandability are important in maintaining code. I think in a group 
> project it's even more important.

Ok, the changes were all just suggestions anyway.  Personally, I'm
more interested in performance and accuracy than anything else at the
moment, so ...  I did find them more readable, but then again I'm
fairly comfortable with regular expressions.  At least we got some good
discussion out of it. :)

So, ignoring the single character commonality changes and changing things
like \d\d\d to \d{3}, these all need consideration:

# Have gotten FPs off this, and whitespace can't be in the host, so...
# %    Visit my homepage: http://i.like.foo.com    %
-uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/]*%/
+uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/\s]*%/

# "gauranteed" is misspelled, and the description has it correct, so search for both.
-body SATISFACTION              /\bsatisfaction .{0,9}gauranteed|not .{0,9}satisfied\b/i
+body SATISFACTION              /\bsatisfaction .{0,9}g(?:au|ua)ranteed|not .{0,9}satisfied\b/i

# doing "|" with a blank is confusing and inefficient since you mean "(...)?"
-body EXCUSE_15                 /this (?:|e?-?mail|message) (?:is|was) (?:not|never) (?:spam|(?:sent |)unsolicited)/i
+body EXCUSE_15                 /this\s*(?:e?-?mail|message)? (?:is|was) (?:not|never) (?:spam|(?:sent )?unsolicited)/i

# doing "|" with a blank is confusing and inefficient since you mean "(...)?"
-body FINANCIAL                 /\bfinancial(?:ly|) free/i
+body FINANCIAL                 /\bfinancial(?:ly)? free/i

# doing "|" with a blank is confusing and inefficient since you mean "(...)?"
-body REFINANCE_YOUR_HOME        /\brefinance your (?:current|) (?:home|house)\b/i
+body REFINANCE_YOUR_HOME        /\brefinance your (?:current)? (?:home|house)\b/i

# If you're looking for a single character use [], more readable and efficient
-body SENT_IN_COMPLIANCE                /message .{0,10}sen(?:d|t) in compliance (?:of|with)/i
+body SENT_IN_COMPLIANCE                /message .{0,10}sen[dt] in compliance (?:of|with)/i

# "to remove yourself" matches all three, so remove the unnecessary parts
-body EXCUSE_6                  /\b(?:wish to|click to|To) remove yourself/i
+body EXCUSE_6                  /\bto remove yourself/i

# Pulled out the common section
-body JODY                      /\b(?:My wife, Jody|Mi esposa, Jody)/
+body JODY                      /\b(?:My wife|Mi esposa), Jody/

# pulled out the common section
-body NAME_BRAND                        /\b(?:famous name brand|major brand)/i
+body NAME_BRAND                        /\b(?:famous name|major) brand/i


# Added ?: and pulled unnecessary duplicate \s* out from the second one
-body PRINT_FORM_SIGNATURE      /Sign(ature)?(?:\s*here|\s*please)?:.{0,30}___*/i
+body PRINT_FORM_SIGNATURE      /Sign(?:ature)?\s*(?:here|please)?:.{0,30}___*/i

# added ?:
-body DOMAIN_BODY               /\s(\.|dot\s+)(info|biz|name)\s/i
+body DOMAIN_BODY               /\s(?:\.|dot\s+)(?:info|biz|name)\s/i

# escaped the .
-rawbody MONSTERHUT             /monsterhut.com/
+rawbody MONSTERHUT             /monsterhut\.com/

# escaped the .
-body MYCASINOBUILDER           /MYCASINOBUILDER.COM/i
+body MYCASINOBUILDER           /MYCASINOBUILDER\.COM/i

# Added ?: and replaced the {1,} with + since they're equivalent
-header FROM_MALFORMED          From !~ /(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(\!\S+){1,}>/ [if-unset: unset@unset.unset]
+header FROM_MALFORMED          From !~ /(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(?:\!\S+)+>/ [if-unset: unset@unset.unset]

# Don't need to enclose it in parens
-header PLING_QUERY             Subject =~ /(?:\?.*!|!.*\?)/
+header PLING_QUERY             Subject =~ /\?.*!|!.*\?/

# Don't need to enclose it in parens
-header SUBJ_HAS_SPACES         Subject =~ /(?:\s{6,}|\t)/
+header SUBJ_HAS_SPACES         Subject =~ /\s{6,}|\t/

# added ?:
-header INVALID_DATE            Date !~ /^((Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?([\d ]?\d) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d\d|\d\d\d\d) \d\d:\d\d(:\d\d)? (UT|[A-Z]{3,5}|[+-]\d\d\d\d)(\s+\(.*\))?\s*$/
+header INVALID_DATE            Date !~ /^(?:(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?(?:[\d ]?\d) (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:\d\d|\d\d\d\d) \d\d:\d\d(?:\:\d\d)? (?:UT|[A-Z]{3,5}|[+-]\d\d\d\d)(?:\s+\(.*\))?\s*$/

# added ?:
-header DATE_YEAR_ZERO_FIRST    Date =~ /[a-z]\s+0\d\d\d(\s|$)/
+header DATE_YEAR_ZERO_FIRST    Date =~ /[a-z]\s+0\d\d\d(?:\s|$)/

# added ?: and escaped .
-header FRIEND_AT_PUBLIC        To =~ /(yourdomain|you|your|public).(com|org|net)/i
+header FRIEND_AT_PUBLIC        To =~ /(?:yourdomain|you|your|public)\.(?:com|org|net)/i

# added ?:
-header DOMAIN_SUBJECT          Subject =~ /(\s(\.|dot\s+)(info|biz|name)|domain)\b.*(extension|info|regist(ry|ration|er)|submission)/i
+header DOMAIN_SUBJECT          Subject =~ /(?:\s(?:\.|dot\s+)(?:info|biz|name)|domain)\b.*(?:extension|info|regist(?:ry|ration|er)|submission)/i

# escaped .
-header YAHOO_MSGID_ADDED       ALL =~ /Message-Id: <\S+\.mail.yahoo.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s
+header YAHOO_MSGID_ADDED       ALL =~ /Message-Id: <\S+\.mail\.yahoo\.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s

# escaped .
-header FROM_BTAMAIL            From =~ /\@btamail.net.cn/i
+header FROM_BTAMAIL            From =~ /\@btamail\.net\.cn/i

# escaped .
-header FROM_UGETMORE           From =~ /\@ugetmore4less.net/i
+header FROM_UGETMORE           From =~ /\@ugetmore4less\.net/i

# escaped .
-header FROM_TOPICA             From =~ /\@(?:\w\.)*email-publisher.com/i
+header FROM_TOPICA             From =~ /\@(?:\w\.)*email-publisher\.com/i

# added ?:
-header Q_FOR_SELLER            Subject =~ /Question.*(for|to|from eBay).*(seller|Member)/
+header Q_FOR_SELLER            Subject =~ /Question.*(?:for|to|from eBay).*(?:seller|Member)/

# added ?:
-uri UNSUB_SCRIPT        /^https?:\/\/.*?cgi.*?(unsubscribe|remove)/i
+uri UNSUB_SCRIPT        /^https?:\/\/.*?cgi.*?(?:unsubscribe|remove)/i
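
For anyone wondering what the ?: actually buys: the match itself is unchanged, but the engine no longer has to record a capture for the group. A minimal sketch, assuming Python's re mirrors Perl on this point (the URL is a made-up example):

```python
import re

CAPTURING = re.compile(r"^https?://.*?cgi.*?(unsubscribe|remove)", re.I)
NON_CAPTURING = re.compile(r"^https?://.*?cgi.*?(?:unsubscribe|remove)", re.I)

# Hypothetical URL for illustration.
url = "http://example.com/cgi-bin/remove?id=1"

old = CAPTURING.search(url)
new = NON_CAPTURING.search(url)

# Same overall match either way.
print(old.group(0) == new.group(0))

# Pattern.groups is the number of capture groups the pattern defines.
print(CAPTURING.groups, NON_CAPTURING.groups)
```

Since none of these rules use the captured text afterwards, the capture is pure overhead.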


# the rest of these replace [eu][eu] with (?:eu|ue) to restrict what we match
-body HARDCORE_PORN              /\bhard[ -]?core .{0,9}(?:teen|virgin|cheerleader|amat[eu][eu]r)|\bextreme hardcore/i
+body HARDCORE_PORN              /\bhard[ -]?core .{0,9}(?:teen|virgin|cheerleader|amat(?:eu|ue)r)|\bextreme hardcore/i

-body HOT_NASTY                  /\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat[eu][eu]r|slut|adult|cum|xxx|sites?)\b/i
+body HOT_NASTY                  /\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat(?:eu|ue)r|slut|adult|cum|xxx|sites?)\b/i

-body AMATUER_PORN               /\bamat[eu][eu]r .{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat[eu][eu]r/i
+body AMATUER_PORN               /\bamat(?:eu|ue)r .{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat(?:eu|ue)r/i

-body RAPE                       /\b(?:virgin|gang|teen|amat[eu][eu]r) rape|rape (?:sites?|sex)\b/i
+body RAPE                       /\b(?:virgin|gang|teen|amat(?:eu|ue)r) rape|rape (?:sites?|sex)\b/i
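
The difference between [eu][eu] and (?:eu|ue) in these rules can be seen on just the word fragment. The sketch below is a simplified illustration (Python standing in for Perl, and only the one word rather than the full rules): two character classes accept four spellings, the alternation accepts the two that were intended.

```python
import re

CHAR_CLASSES = re.compile(r"\bamat[eu][eu]r\b", re.I)
ALTERNATION = re.compile(r"\bamat(?:eu|ue)r\b", re.I)

# The correct spelling, the common misspelling, and two spellings that
# only the character-class version lets through.
WORDS = ["amateur", "amatuer", "amateer", "amatuur"]

for word in WORDS:
    old = bool(CHAR_CLASSES.search(word))
    new = bool(ALTERNATION.search(word))
    print(f"{word:8} [eu][eu]={old} (?:eu|ue)={new}")
```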

Comment 8 Theo Van Dinter 2002-07-21 09:27:15 UTC
Subject: Re: [SAdev]  New: More general rule cleanup

On Sun, Jul 21, 2002 at 12:09:35PM -0400, Theo Van Dinter wrote:
> # Have gotten FPs off this, and whitespace can't be in the host, so...
> # %    Visit my homepage: http://i.like.foo.com    %
> -uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/]*%/
> +uri HTTP_ESCAPED_HOST       /^https?\:\/\/[^\/\s]*%/

Actually, this one was fixed in the code (how the URLs are parsed out
of the messages), so we can ignore this rule change.  I forgot about
that when I was looking at the rules.  :)

Comment 9 Justin Mason 2002-07-29 11:07:22 UTC
can we resolve this bug?
Comment 10 Theo Van Dinter 2002-07-29 12:33:06 UTC
Subject: Re:  More general rule cleanup

On Mon, Jul 29, 2002 at 11:07:22AM -0700, bugzilla-daemon@hughes-family.org wrote:
> can we resolve this bug?

The discussion about the rule changes just stopped.  If there are
no problems with the remaining changes, I'll make up a patch and we
can apply.

Comment 11 Marc Perkel 2002-07-29 13:11:01 UTC
Subject: Re: [SAdev]  More general rule cleanup

I think that the changes were already added, and many of them were
dismissed as being bad suggestions.

Comment 12 Theo Van Dinter 2002-07-29 13:38:21 UTC
Subject: Re:  More general rule cleanup

On Mon, Jul 29, 2002 at 01:11:01PM -0700, bugzilla-daemon@hughes-family.org wrote:
> I think that the changes were already added, and many of them were
> dismissed as being bad suggestions.

Well, some people claimed some of the changes were "unreadable".  I then
posted the list that didn't fit that category and there were no comments.
In a quick look at current CVS, they weren't applied, so ...

Comment 13 Justin Mason 2002-08-13 16:16:38 UTC
ok, now checked in. sorry about the delay, but there
were quite a lot of changes to verify...