SA Bugzilla – Bug 584
More general rule cleanup
Last modified: 2002-08-13 08:16:38 UTC
I poked through the 20*.cf rules today and have a few changes to clean up, improve performance, or hopefully make better matches:

The first are hopeful performance improvements. Doing (xx|xy|xz) is more expensive than x(x|y|z) (I know I could do x[xyz] but assume the string is longer...) There were also some () without the ?: so it would try to do backreferences that are fixed:

-body SENT_IN_COMPLIANCE /message .{0,10}sen(?:d|t) in compliance (?:of|with)/i
+body SENT_IN_COMPLIANCE /message .{0,10}sen[dt] in compliance (?:of|with)/i

-body EU_EMAIL_OPTOUT /EU (?:e-?mail opt.?out|e.?commerce) directive/i
+body EU_EMAIL_OPTOUT /EU e(?:-?mail opt.?out|.?commerce) directive/i

-body NO_COST /\bno (?:cost|charge)\b/i
+body NO_COST /\bno c(?:ost|harge)\b/i

-body EXCUSE_6 /\b(?:wish to|click to|To) remove yourself/i
+body EXCUSE_6 /\b(?:wish |click )?to remove yourself/i

-body EXCUSE_18 /we do not (?:spam|send unsolicited)/i
+body EXCUSE_18 /we do not s(?:pam|end unsolicited)/i

-body PRINT_FORM_SIGNATURE /Sign(ature)?(?:\s*here|\s*please)?:.{0,30}___*/i
+body PRINT_FORM_SIGNATURE /Sign(?:ature)?\s*(?:here|please)?:.{0,30}___*/i

-body DOMAIN_BODY /\s(\.|dot\s+)(info|biz|name)\s/i
+body DOMAIN_BODY /\s(?:\.|dot\s+)(?:info|biz|name)\s/i

-rawbody MONSTERHUT /monsterhut.com/
+rawbody MONSTERHUT /monsterhut\.com/

-body JODY /\b(?:My wife, Jody|Mi esposa, Jody)/
+body JODY /\bM(?:y wife|i esposa), Jody/

-body MYCASINOBUILDER /MYCASINOBUILDER.COM/i
+body MYCASINOBUILDER /MYCASINOBUILDER\.COM/i

-body NO_DISSAPOINTMENT /You won'?t be diss?app?ointed/i
+body NO_DISSAPOINTMENT /You won'?t be dis+ap+ointed/i

-body SEARCH_ENGINE_PROMO /\b(?:(?:submitt?|list)(?:ed|ing|s)?|place(?:d|ment))\s+.{0,15}\b(?:in|to)\b.{0,15}\b(?:(?:top|best|major|largest|biggest).{0,15}\b)?(?:search(?:ing)?\s*(?:engine|site)|director(?:y|ies))\b/is
+body SEARCH_ENGINE_PROMO /\b(?:(?:submit+|list)(?:ed|ing|s)?|place(?:d|ment))\s+.{0,15}\b(?:in|to)\b.{0,15}\b(?:(?:top|best|major|largest|biggest).{0,15}\b)?(?:search(?:ing)?\s*(?:engine|site)|director(?:y|ies))\b/is

-body WHY_WAIT /\b(?:why wait|what are you waiting for)\b/i
+body WHY_WAIT /\bw(?:hy wait|hat are you waiting for)\b/i

-body NAME_BRAND /\b(?:famous name brand|major brand)/i
+body NAME_BRAND /\b(?:famous name |major ) brand/i

-body HAIR_LOSS /\b(?:thinn?ing|restore|grow|new) hair|hair loss/i
+body HAIR_LOSS /\b(?:thin+ing|restore|grow|new) hair|hair loss/i

-body UNCENSORED /\buncensored (?:pics|photo)/i
+body UNCENSORED /\buncensored p(?:ics|hoto)/i

-header FROM_MALFORMED From !~ /(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(\!\S+){1,}>/ [if-unset: unset@unset.unset]
+header FROM_MALFORMED From !~ /(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(?:\!\S+)+>/ [if-unset: unset@unset.unset]

-header PLING_QUERY Subject =~ /(?:\?.*!|!.*\?)/
+header PLING_QUERY Subject =~ /\?.*!|!.*\?/

-header SUBJ_HAS_SPACES Subject =~ /(?:\s{6,}|\t)/
+header SUBJ_HAS_SPACES Subject =~ /\s{6,}|\t/

-header INVALID_DATE Date !~ /^((Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?([\d ]?\d) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d\d|\d\d\d\d) \d\d:\d\d(:\d\d)? (UT|[A-Z]{3,5}|[+-]\d\d\d\d)(\s+\(.*\))?\s*$/
+header INVALID_DATE Date !~ /^(?:(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?(?:[\d ]?\d) (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:\d{2}|\d{4}) \d{2}:\d{2}(?:\:\d{2})? (?:UT|[A-Z]{3,5}|[+-]\d{4})(?:\s+\(?:.*\))?\s*$/

-header INVALID_DATE_TZ_ABSURD Date =~ /[-+](?:1[4-9]\d\d|[2-9]\d\d\d)$/
+header INVALID_DATE_TZ_ABSURD Date =~ /[-+](?:1[4-9]\d{2}|[2-9]\d{3})$/

-header DATE_YEAR_ZERO_FIRST Date =~ /[a-z]\s+0\d\d\d(\s|$)/
+header DATE_YEAR_ZERO_FIRST Date =~ /[a-z]\s+0\d{3}(?:\s|$)/

-header FRIEND_AT_PUBLIC To =~ /(yourdomain|you|your|public).(com|org|net)/i
+header FRIEND_AT_PUBLIC To =~ /(?:yourdomain|you|your|public)\.(?:com|org|net)/i

-header DOMAIN_SUBJECT Subject =~ /(\s(\.|dot\s+)(info|biz|name)|domain)\b.*(extension|info|regist(ry|ration|er)|submission)/i
+header DOMAIN_SUBJECT Subject =~ /(?:\s(?:\.|dot\s+)(?:info|biz|name)|domain)\b.*(?:extension|info|regist(?:ry|ration|er)|submission)/i

-header FAKED_IP_IN_RCVD Received =~ /from [-0-9a-z\._]+_\[\d+\.\d+\.\d+\.\d+\] /i
+header FAKED_IP_IN_RCVD Received =~ /from [-0-9a-z\._]+_\[(?:\d+\.){3}\d+\] /i

-header YAHOO_MSGID_ADDED ALL =~ /Message-Id: <\S+\.mail.yahoo.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s
+header YAHOO_MSGID_ADDED ALL =~ /Message-Id: <\S+\.mail\.yahoo\.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s

-header FROM_BTAMAIL From =~ /\@btamail.net.cn/i
+header FROM_BTAMAIL From =~ /\@btamail\.net\.cn/i

-header FROM_UGETMORE From =~ /\@ugetmore4less.net/i
+header FROM_UGETMORE From =~ /\@ugetmore4less\.net/i

-header FROM_TOPICA From =~ /\@(?:\w\.)*email-publisher.com/i
+header FROM_TOPICA From =~ /\@(?:\w\.)*email-publisher\.com/i

-header Q_FOR_SELLER Subject =~ /Question.*(for|to|from eBay).*(seller|Member)/
+header Q_FOR_SELLER Subject =~ /Question.*(?:for|to|from eBay).*(?:seller|Member)/

-uri NORMAL_HTTP_TO_IP /^https?\:\/\/\d+\.\d+\.\d+\.\d+/is
+uri NORMAL_HTTP_TO_IP /^https?\:\/\/(?:\d+\.){3}\d+/is

-uri UNSUB_SCRIPT /^https?:\/\/.*?cgi.*?(unsubscribe|remove)/i
+uri UNSUB_SCRIPT /^https?:\/\/.*?cgi.*?(?:unsubscribe|remove)/i

This next section has various (|foo) and the like. I can't figure out why that's better than (foo)?, so I rewrote them:

-body EXCUSE_15 /this (?:|e?-?mail|message) (?:is|was) (?:not|never) (?:spam|(?:sent |)unsolicited)/i
+body EXCUSE_15 /this\s*(?:e?-?mail|message)? (?:is|was) n(?:ot|ever) (?:spam|(?:sent )?unsolicited)/i

-body FINANCIAL /\bfinancial(?:ly|) free/i
+body FINANCIAL /\bfinancial(?:ly)? free/i

-body REFINANCE_YOUR_HOME /\brefinance your (?:current|) (?:home|house)\b/i
+body REFINANCE_YOUR_HOME /\brefinance your (?:current)? h(?:ome|ouse)\b/i

Now are the improved rules. Fix spelling errors, try to match more things, etc.

# This went from matching 0 of my corpus to at least matching 2.
# I block obvious ADV subject mails at SMTP, so I don't have a lot of these...
-header ADVERT_CODE Subject =~ /(^\s*|\s+)ADV([\s:-]|$)/i
+header ADVERT_CODE Subject =~ /\bADV\b/i

# Have gotten FPs off this, and whitespace can't be in the host...
-uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/]*%/
+uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/\s]*%/

-body SATISFACTION /\bsatisfaction .{0,9}gauranteed|not .{0,9}satisfied\b/i
+body SATISFACTION /\bsatisfaction .{0,9}g(?:au|ua)ranteed|not .{0,9}satisfied\b/i

-body HARDCORE_PORN /\bhard[ -]?core .{0,9}(?:teen|virgin|cheerleader|amat[eu][eu]r)|\bextreme hardcore/i
+body HARDCORE_PORN /\bhard[ -]?core .{0,9}(?:teen|virgin|cheerleader|amat(?:eu|ue)r)|\bextreme hardcore/i

-body HOT_NASTY /\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat[eu][eu]r|slut|adult|cum|xxx|sites?)\b/i
+body HOT_NASTY /\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat(?:eu|ue)r|slut|adult|cum|xxx|sites?)\b/i

-body AMATUER_PORN /\bamat[eu][eu]r .{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat[eu][eu]r/i
+body AMATUER_PORN /\bamat(?:eu|ue)r .{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat(?:eu|ue)r/i

-body RAPE /\b(?:virgin|gang|teen|amat[eu][eu]r) rape|rape (?:sites?|sex)\b/i
+body RAPE /\b(?:virgin|gang|teen|amat(?:eu|ue)r) rape|rape s(?:ites?|ex)\b/i
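Two of the rewrite styles in this list (escaping the dot, and replacing an empty alternative with an optional group) can be sanity-checked in a few lines. The sketch below uses Python's `re` module as a stand-in for Perl's regex engine, which is an assumption; the constructs involved should behave the same in both. The sample strings are invented for illustration.

```python
import re

# Empty alternative vs. optional group: both should accept the same strings.
old_financial = re.compile(r"\bfinancial(?:ly|) free", re.IGNORECASE)
new_financial = re.compile(r"\bfinancial(?:ly)? free", re.IGNORECASE)

# An unescaped dot matches any character; an escaped dot matches only ".".
old_monsterhut = re.compile(r"monsterhut.com")
new_monsterhut = re.compile(r"monsterhut\.com")

# Hypothetical sample texts, not taken from any real corpus.
samples = [
    "be financially free",
    "Financial free at last",
    "financial freedom",
    "financialy free",
]

# True when the old and new FINANCIAL patterns agree on every sample.
financial_agree = all(
    bool(old_financial.search(s)) == bool(new_financial.search(s))
    for s in samples
)

# The unescaped pattern also fires on text that is not the domain name.
loose_hit = bool(old_monsterhut.search("monsterhut-com"))
strict_hit = bool(new_monsterhut.search("monsterhut-com"))
```

Here `financial_agree` checks that the `(?:ly)?` form is a drop-in replacement, while `loose_hit` and `strict_hit` show that escaping the dot is a (small) change in what the rule matches rather than a pure cleanup.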
Created attachment 234 [details] Patch that implements all the proposed changes
Subject: Re: [SAdev] New: More general rule cleanup On Sat, Jul 20, 2002 at 02:20:16PM -0700, bugzilla-daemon@hughes-family.org wrote: > The first are hopeful performance improvements. Doing (xx|xy|xz) is more The first section was actually any general changes that shouldn't change what the rule does. I added ?: to places, removed parens where they didn't need to be, moved common (pre|suf)fix outside of parens, escaped . where appropriate, etc. > # This went from matching 0 of my corpus to at least matching 2. > # I block obvious ADV subject mails at SMTP, so I don't have a lot of these... > -header ADVERT_CODE Subject =~ /(^\s*|\s+)ADV([\s:-]|$)/i > +header ADVERT_CODE Subject =~ /\bADV\b/i After poking around some more, I found this also increased my FPs a little bit, all abbreviations of "advanced" -> "adv.". I'd like to know what it does against other folks' corpus. > # Have gotten FPs off this, and whitespace can't be in the host... > -uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/]*%/ > +uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/\s]*%/ I believe it was something in a signature like: % My homepage: http://someplace.domain.com % > -body SATISFACTION /\bsatisfaction .{0,9}gauranteed|not > .{0,9}satisfied\b/i > +body SATISFACTION /\bsatisfaction .{0,9}g(?:au|ua)ranteed|not > .{0,9}satisfied\b/i The description has "guaranteed", but the rule had "gauranteed", so I figure we might as well search for both. :) > -body HARDCORE_PORN /\bhard[ -]?core > .{0,9}(?:teen|virgin|cheerleader|amat[eu][eu]r)|\bextreme hardcore/i > +body HARDCORE_PORN /\bhard[ -]?core > .{0,9}(?:teen|virgin|cheerleader|amat(?:eu|ue)r)|\bextreme hardcore/i There were a bunch of 'amat[eu][eu]r' which would match 'amateer' and 'amatuur'. I figured it was better to search for 'amat(eu|ue)r' instead.
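The ADVERT_CODE trade-off described above (more hits, but also "adv." false positives) can be illustrated with a small sketch. It uses Python's `re` in place of Perl, which is an assumption, and the subject lines are made up rather than taken from anyone's corpus.

```python
import re

old_adv = re.compile(r"(^\s*|\s+)ADV([\s:-]|$)", re.IGNORECASE)
new_adv = re.compile(r"\bADV\b", re.IGNORECASE)

# Hypothetical subject lines for illustration only.
subjects = [
    "ADV: lose weight now",     # classic tag: both rules should fire
    "[ADV] special offer",      # bracketed tag: only the new rule fires
    "Adv. Calculus syllabus",   # abbreviation of "advanced": new-rule FP
    "Advanced topics",          # full word: neither rule fires
]

def compare(subject):
    """Return (old_rule_hit, new_rule_hit) for one subject line."""
    return bool(old_adv.search(subject)), bool(new_adv.search(subject))

results = {s: compare(s) for s in subjects}
for subject, (old_hit, new_hit) in results.items():
    print(f"{subject!r}: old={old_hit} new={new_hit}")
```

The third sample shows the kind of false positive mentioned above: `\b` treats the period after "Adv" as a word boundary, so the abbreviation matches.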
Subject: Re: [SAdev] New: More general rule cleanup I have a few comments about your improvements. Several do technically increase the accuracy but would not catch any more or less spam. Some save one character at the expense of readability and expandability. I think that readability is very important. bugzilla-daemon@hughes-family.org wrote: > >-body SENT_IN_COMPLIANCE /message .{0,10}sen(?:d|t) in compliance >(?:of|with)/i >+body SENT_IN_COMPLIANCE /message .{0,10}sen[dt] in compliance >(?:of|with)/i > >-body EU_EMAIL_OPTOUT /EU (?:e-?mail opt.?out|e.?commerce) directive/i >+body EU_EMAIL_OPTOUT /EU e(?:-?mail opt.?out|.?commerce) directive/i > > The above 2 look good. >-body NO_COST /\bno (?:cost|charge)\b/i >+body NO_COST /\bno c(?:ost|harge)\b/i > > Saves one byte but at the cost of good clean readability. I think readability and simplicity are more important. Especially if we want to add a third item that doesn't begin with a C. >-body EXCUSE_6 /\b(?:wish to|click to|To) remove yourself/i >+body EXCUSE_6 /\b(?:wish |click )?to remove yourself/i > > Isn't this just the same as /to remove yourself/i Also - I think this is a bad rule because of FP. This rule should die! >-body EXCUSE_18 /we do not (?:spam|send unsolicited)/i >+body EXCUSE_18 /we do not s(?:pam|end unsolicited)/i > > Not clean and readable.
>-body PRINT_FORM_SIGNATURE /Sign(ature)?(?:\s*here|\s*please)?:.{0,30}___*/i >+body PRINT_FORM_SIGNATURE /Sign(?:ature)?\s*(?:here|please)?:.{0,30}___*/i > >-body DOMAIN_BODY /\s(\.|dot\s+)(info|biz|name)\s/i >+body DOMAIN_BODY /\s(?:\.|dot\s+)(?:info|biz|name)\s/i > >-rawbody MONSTERHUT /monsterhut.com/ >+rawbody MONSTERHUT /monsterhut\.com/ > >-body JODY /\b(?:My wife, Jody|Mi esposa, Jody)/ >+body JODY /\bM(?:y wife|i esposa), Jody/ > >-body MYCASINOBUILDER /MYCASINOBUILDER.COM/i >+body MYCASINOBUILDER /MYCASINOBUILDER\.COM/i > > >-body NO_DISSAPOINTMENT /You won'?t be diss?app?ointed/i >+body NO_DISSAPOINTMENT /You won'?t be dis+ap+ointed/i > >-body SEARCH_ENGINE_PROMO >/\b(?:(?:submitt?|list)(?:ed|ing|s)?|place(?:d|ment))\s+.{0,15}\b(?:in|to)\b.{0,15}\b(?:(?:top|best|major|largest|biggest).{0,15}\b)?(?:search(?:ing)?\s*(?:engine|site)|director(?:y|ies))\b/is >+body SEARCH_ENGINE_PROMO >/\b(?:(?:submit+|list)(?:ed|ing|s)?|place(?:d|ment))\s+.{0,15}\b(?:in|to)\b.{0,15}\b(?:(?:top|best|major|largest|biggest).{0,15}\b)?(?:search(?:ing)?\s*(?:engine|site)|director(?:y|ies))\b/is > >-body WHY_WAIT /\b(?:why wait|what are you waiting for)\b/i >+body WHY_WAIT /\bw(?:hy wait|hat are you waiting for)\b/i > > Again - I think readability is more important. >-body NAME_BRAND /\b(?:famous name brand|major brand)/i >+body NAME_BRAND /\b(?:famous name |major ) brand/i > > Might have broken this rule. Why trailing spaces? >-body HAIR_LOSS /\b(?:thinn?ing|restore|grow|new) hair|hair loss/i >+body HAIR_LOSS /\b(?:thin+ing|restore|grow|new) hair|hair loss/i > > OK >-body UNCENSORED /\buncensored (?:pics|photo)/i >+body UNCENSORED /\buncensored p(?:ics|hoto)/i > > Again - Readability - Suppose I wanted to add "movies" to the list? 
>-header FROM_MALFORMED From !~ >/(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(\!\S+){1,}>/ [if-unset: unset@unset.unset] >+header FROM_MALFORMED From !~ >/(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(?:\!\S+)+>/ [if-unset: unset@unset.unset] > >-header PLING_QUERY Subject =~ /(?:\?.*!|!.*\?)/ >+header PLING_QUERY Subject =~ /\?.*!|!.*\?/ > >-header SUBJ_HAS_SPACES Subject =~ /(?:\s{6,}|\t)/ >+header SUBJ_HAS_SPACES Subject =~ /\s{6,}|\t/ > >-header INVALID_DATE Date !~ /^((Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?([\d >]?\d) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d\d|\d\d\d\d) >\d\d:\d\d(:\d\d)? (UT|[A-Z]{3,5}|[+-]\d\d\d\d)(\s+\(.*\))?\s*$/ >+header INVALID_DATE Date !~ /^(?:(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat), >)?(?:[\d ]?\d) (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) >(?:\d{2}|\d{4}) \d{2}:\d{2}(?:\:\d{2})? >(?:UT|[A-Z]{3,5}|[+-]\d{4})(?:\s+\(?:.*\))?\s*$/ > >-header INVALID_DATE_TZ_ABSURD Date =~ /[-+](?:1[4-9]\d\d|[2-9]\d\d\d)$/ >+header INVALID_DATE_TZ_ABSURD Date =~ /[-+](?:1[4-9]\d{2}|[2-9]\d{3})$/ > >-header DATE_YEAR_ZERO_FIRST Date =~ /[a-z]\s+0\d\d\d(\s|$)/ >+header DATE_YEAR_ZERO_FIRST Date =~ /[a-z]\s+0\d{3}(?:\s|$)/ > >-header FRIEND_AT_PUBLIC To =~ /(yourdomain|you|your|public).(com|org|net)/i >+header FRIEND_AT_PUBLIC To =~ >/(?:yourdomain|you|your|public)\.(?:com|org|net)/i > >-header DOMAIN_SUBJECT Subject =~ >/(\s(\.|dot\s+)(info|biz|name)|domain)\b.*(extension|info|regist(ry|ration|er)|submission)/i >+header DOMAIN_SUBJECT Subject =~ >/(?:\s(?:\.|dot\s+)(?:info|biz|name)|domain)\b.*(?:extension|info|regist(?:ry|ration|er)|submission)/i > >-header FAKED_IP_IN_RCVD Received =~ /from >[-0-9a-z\._]+_\[\d+\.\d+\.\d+\.\d+\] /i >+header FAKED_IP_IN_RCVD Received =~ /from >[-0-9a-z\._]+_\[(?:\d+\.){3}\d+\] /i > >-header YAHOO_MSGID_ADDED ALL =~ /Message-Id: ><\S+\.mail.yahoo.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s >+header YAHOO_MSGID_ADDED ALL =~ /Message-Id: ><\S+\.mail\.yahoo\.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s > >-header FROM_BTAMAIL 
From =~ /\@btamail.net.cn/i >+header FROM_BTAMAIL From =~ /\@btamail\.net\.cn/i > >-header FROM_UGETMORE From =~ /\@ugetmore4less.net/i >+header FROM_UGETMORE From =~ /\@ugetmore4less\.net/i > >-header FROM_TOPICA From =~ /\@(?:\w\.)*email-publisher.com/i >+header FROM_TOPICA From =~ /\@(?:\w\.)*email-publisher\.com/i > >-header Q_FOR_SELLER Subject =~ /Question.*(for|to|from >eBay).*(seller|Member)/ >+header Q_FOR_SELLER Subject =~ /Question.*(?:for|to|from >eBay).*(?:seller|Member)/ > >-uri NORMAL_HTTP_TO_IP /^https?\:\/\/\d+\.\d+\.\d+\.\d+/is >+uri NORMAL_HTTP_TO_IP /^https?\:\/\/(?:\d+\.){3}\d+/is > >-uri UNSUB_SCRIPT /^https?:\/\/.*?cgi.*?(unsubscribe|remove)/i >+uri UNSUB_SCRIPT /^https?:\/\/.*?cgi.*?(?:unsubscribe|remove)/i > > > > >This next section has various (|foo) and the like. I can't figure out why >that's better than (foo)?, so I rewrote them: > > >-body EXCUSE_15 /this (?:|e?-?mail|message) (?:is|was) >(?:not|never) (?:spam|(?:sent |)unsolicited)/i >+body EXCUSE_15 /this\s*(?:e?-?mail|message)? (?:is|was) >n(?:ot|ever) (?:spam|(?:sent )?unsolicited)/i > >-body FINANCIAL /\bfinancial(?:ly|) free/i >+body FINANCIAL /\bfinancial(?:ly)? free/i > >-body REFINANCE_YOUR_HOME /\brefinance your (?:current|) (?:home|house)\b/i >+body REFINANCE_YOUR_HOME /\brefinance your (?:current)? h(?:ome|ouse)\b/i > > > Again - Readability. > >Now are the improved rules. Fix spelling errors, try to match more things, etc. > ># This went from matching 0 of my corpus to at least matching 2. ># I block obvious ADV subject mails at SMTP, so I don't have a lot of these... >-header ADVERT_CODE Subject =~ /(^\s*|\s+)ADV([\s:-]|$)/i >+header ADVERT_CODE Subject =~ /\bADV\b/i > ># Have gotten FPs off this, and whitespace can't be in the host... 
>-uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/]*%/ >+uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/\s]*%/ > >-body SATISFACTION /\bsatisfaction .{0,9}gauranteed|not >.{0,9}satisfied\b/i >+body SATISFACTION /\bsatisfaction .{0,9}g(?:au|ua)ranteed|not >.{0,9}satisfied\b/i > >-body HARDCORE_PORN /\bhard[ -]?core >.{0,9}(?:teen|virgin|cheerleader|amat[eu][eu]r)|\bextreme hardcore/i >+body HARDCORE_PORN /\bhard[ -]?core >.{0,9}(?:teen|virgin|cheerleader|amat(?:eu|ue)r)|\bextreme hardcore/i > >-body HOT_NASTY >/\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat[eu][eu]r|slut|adult|cum|xxx|sites?)\b/i >+body HOT_NASTY >/\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat(?:eu|ue)r|slut|adult|cum|xxx|sites?)\b/i > >-body AMATUER_PORN /\bamat[eu][eu]r >.{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat[eu][eu]r/i >+body AMATUER_PORN /\bamat(?:eu|ue)r >.{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat(?:eu|ue)r/i > >-body RAPE /\b(?:virgin|gang|teen|amat[eu][eu]r) rape|rape >(?:sites?|sex)\b/i >+body RAPE /\b(?:virgin|gang|teen|amat(?:eu|ue)r) >rape|rape s(?:ites?|ex)\b/i > > > Again - readability. Suppose I wanted to add "rape movies" ?
Subject: Re: [SAdev] New: More general rule cleanup On Sat, Jul 20, 2002 at 05:51:09PM -0700, Marc Perkel wrote: > >-body NO_COST /\bno (?:cost|charge)\b/i > >+body NO_COST /\bno c(?:ost|harge)\b/i > Saves one byte but at the cost of good clean readability. I think > readability and simplicity are more important. Especially if we want to > add a third item that doesn't begin with a C. Yes, but it's more efficient than the original -- it's not about saving bytes, it's about performance. Say "\bno " occurs X times in a mail -- "\bno c" will likely occur <X times, so the RE engine doesn't need to look at all the other locations. If we later want to add in another word that doesn't start with a 'c', then it would change back to the original form with another "|word" on it. But that's a different pattern. ;) Overall, I don't think these are a huge speed improvement, but in total overall time it may add up. For those of us running older/slower machines, every cycle we can save is a definite win. > >-body EXCUSE_6 /\b(?:wish to|click to|To) remove > >yourself/i > >+body EXCUSE_6 /\b(?:wish |click )?to remove yourself/i > Isn't this just the same as /to remove yourself/i > Also - I think this is a bad rule because of FP. This rule should die! Good point -- /\bto remove yourself\b/i is more efficient. I don't know what to do about the FPs... It's essentially a test for mailing lists and spam. I think we need a more specific text if we want to make it more spammy (or have some other tests with enough negativity ...) > >-body WHY_WAIT /\b(?:why wait|what are you waiting > >for)\b/i > >+body WHY_WAIT /\bw(?:hy wait|hat are you waiting for)\b/i > Again - I think readability is more important. See my first comment again. It's actually more efficient -- instead of the RE stopping on every word boundary (\b) and trying to determine if either set of following strings match, it'll only stop on '\bw' which is much less common.
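The EXCUSE_6 point, that all three alternatives collapse to "to remove yourself", can be spot-checked as follows. This is a sketch in Python's `re` (assumed to behave like Perl for these constructs), run on invented sample strings; it demonstrates agreement on those samples only and is not a proof of equivalence.

```python
import re

original = re.compile(r"\b(?:wish to|click to|To) remove yourself", re.IGNORECASE)
proposed = re.compile(r"\b(?:wish |click )?to remove yourself", re.IGNORECASE)
simplified = re.compile(r"\bto remove yourself\b", re.IGNORECASE)

# Hypothetical message fragments.
samples = [
    "If you wish to remove yourself from this list",
    "Click to remove yourself",
    "To remove yourself, reply with REMOVE",
    "How to remove yourself",
    "auto remove yourself",  # "to" is inside "auto", so no word boundary
]

def hits(pattern):
    """Return one boolean per sample: did the pattern match?"""
    return [bool(pattern.search(s)) for s in samples]

original_hits = hits(original)
proposed_hits = hits(proposed)
simplified_hits = hits(simplified)
```

On these samples the three lists come out identical, which is consistent with the observation that the optional prefixes add nothing once the case-insensitive "To" alternative is present.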
> >-body NAME_BRAND /\b(?:famous name brand|major > >brand)/i > >+body NAME_BRAND /\b(?:famous name |major ) brand/i > > > Might have broken this rule. Why trailing spaces? Should be /\b(?:famous name|major) brand/i ... Good eye.
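The trailing-space problem is easy to reproduce. The sketch below (Python's `re` standing in for Perl, sample text invented) compares the pattern as first proposed with the corrected one.

```python
import re

original = re.compile(r"\b(?:famous name brand|major brand)", re.IGNORECASE)
# As first proposed: the space inside the group plus the space before
# "brand" means two consecutive spaces are required.
broken = re.compile(r"\b(?:famous name |major ) brand", re.IGNORECASE)
fixed = re.compile(r"\b(?:famous name|major) brand", re.IGNORECASE)

text = "All the major brand watches at famous name brand prices"

original_count = len(original.findall(text))
broken_count = len(broken.findall(text))
fixed_count = len(fixed.findall(text))

# The broken form only matches when the text itself has a double space.
double_space_hit = bool(broken.search("major  brand"))
```

With normally spaced text the broken pattern finds nothing, while the corrected pattern finds the same two phrases as the original rule.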
Subject: Re: [SAdev] New: More general rule cleanup Theo Van Dinter <felicity@kluge.net> wrote: >>> -body NO_COST /\bno (?:cost|charge)\b/i >>> +body NO_COST /\bno c(?:ost|harge)\b/i Marc Perkel wrote: >> Saves one byte but at the cost of good clean readability. I think >> readability and simplicity are more important. Especially if we want to >> add a third item that doesn't begin with a C. Theo Van Dinter <felicity@kluge.net> writes: > Yes, but it's more efficient than the original -- it's not about saving > bytes, it's about performance. Say "\bno " occurs X times in a mail -- > "\bno c" will likely occur <X times, so the RE engine doesn't need to > look at all the other locations. This is not a big deal, but I think Marc has a good point. The performance difference is probably insignificant. On the other hand, we continually have errors in regular expressions, often when "excessive cleverness" has been applied. This seems like a pretty good example of premature/excessive optimization. There is no data showing that the relevant code is run for any significant period of time or that these changes produce a measurable improvement in performance. Maybe they do, but it would be nice to know before we complicate every regular expression. In contrast, your changes to the eval loops in PerMsgStatus.pm were great. The code was responsible for a lot of our execution time and there was a huge speed improvement. Even better, the code was just as easy to understand as the original. Dan
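One way to get the kind of data asked for here is a small timing harness. The sketch below is only an illustration: it uses Python's `re` and `timeit`, not Perl, so the numbers say nothing about how Perl's regex engine (and its own optimizations) would behave, and the message body is synthetic. The same idea would need to be repeated in Perl against a real mail corpus before drawing conclusions.

```python
import re
import timeit

original = re.compile(r"\bno (?:cost|charge)\b", re.IGNORECASE)
factored = re.compile(r"\bno c(?:ost|harge)\b", re.IGNORECASE)

# Synthetic body with several "no ..." phrases, only two of which match.
body = (
    "There is no doubt that no one pays, at no cost or no charge, "
    "for a notice with no catch. "
) * 200

def count_hits(pattern, text):
    """Number of non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))

# First confirm that the rewrite does not change what is matched.
original_hits = count_hits(original, body)
factored_hits = count_hits(factored, body)

# Then time both forms on the same input.
t_original = timeit.timeit(lambda: original.findall(body), number=50)
t_factored = timeit.timeit(lambda: factored.findall(body), number=50)

print(f"hits: original={original_hits} factored={factored_hits}")
print(f"time: original={t_original:.4f}s factored={t_factored:.4f}s")
```

Checking the hit counts first matters as much as the timing: a faster pattern that matches something different is a rule change, not an optimization.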
Subject: Re: [SAdev] New: More general rule cleanup Thanks Dan. And the changes were very clever. Some of his rule changes actually did make things more readable and he is highly skilled at regular expressions. I have found that readability and understandability are important in maintaining code. I think in a group project that it's more important. Daniel Quinlan wrote: >This is not a big deal, but I think Marc has a good point. The >performance difference is probably insignificant. On the other hand, we >continually have errors in regular expressions, often when "excessive >cleverness" has been applied. > >This seems like a pretty good example of premature/excessive >optimization. There is no data showing that the relevant code is run >for any significant period of time or that these changes produce a >measurable improvement in performance. Maybe they do, but it would be >nice to know before we complicate every regular expression. > >In contrast, your changes to the eval loops in PerMsgStatus.pm were >great. The code was responsible for a lot of our execution time and >there was a huge speed improvement. Even better, the code was just as >easy to understand as the original. > >Dan
Subject: Re: [SAdev] New: More general rule cleanup

On Sun, Jul 21, 2002 at 07:28:30AM -0700, Marc Perkel wrote:
> Thanks Dan. And the changes were very clever. Some of his rule changes
> actually did make things more readable and he is highly skilled at
> regular expressions. I have found that readability and
> understandability are important in maintaining code. I think in a group
> project that it's more important.

Ok, the changes were all just suggestions anyway. Personally, I'm more interested in performance and accuracy than anything else at the moment, so ... I did find them more readable, but then again I'm fairly comfortable with regular expressions. At least we got some good discussion out of it. :)

So, ignoring the single character commonality changes and changing things like \d\d\d to \d{3}, these all need consideration:

# Have gotten FPs off this, and whitespace can't be in the host, so...
# % Visit my homepage: http://i.like.foo.com %
-uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/]*%/
+uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/\s]*%/

# "gauranteed" is misspelled, and the description has it correct, so search for both.
-body SATISFACTION /\bsatisfaction .{0,9}gauranteed|not .{0,9}satisfied\b/i
+body SATISFACTION /\bsatisfaction .{0,9}g(?:au|ua)ranteed|not .{0,9}satisfied\b/i

# doing "|" with a blank is confusing and non-efficient since you mean "(...)?"
-body EXCUSE_15 /this (?:|e?-?mail|message) (?:is|was) (?:not|never) (?:spam|(?:sent |)unsolicited)/i
+body EXCUSE_15 /this\s*(?:e?-?mail|message)? (?:is|was) (?:not|never) (?:spam|(?:sent )?unsolicited)/i

# doing "|" with a blank is confusing and non-efficient since you mean "(...)?"
-body FINANCIAL /\bfinancial(?:ly|) free/i
+body FINANCIAL /\bfinancial(?:ly)? free/i

# doing "|" with a blank is confusing and non-efficient since you mean "(...)?"
-body REFINANCE_YOUR_HOME /\brefinance your (?:current|) (?:home|house)\b/i
+body REFINANCE_YOUR_HOME /\brefinance your (?:current)? (?:home|house)\b/i

# If you're looking for a single character use [], more readable and efficient
-body SENT_IN_COMPLIANCE /message .{0,10}sen(?:d|t) in compliance (?:of|with)/i
+body SENT_IN_COMPLIANCE /message .{0,10}sen[dt] in compliance (?:of|with)/i

# "to remove yourself" matches all three, so remove the unnecessary parts
-body EXCUSE_6 /\b(?:wish to|click to|To) remove yourself/i
+body EXCUSE_6 /\bto remove yourself/i

# Pulled out the common section
-body JODY /\b(?:My wife, Jody|Mi esposa, Jody)/
+body JODY /\b(?:My wife|Mi esposa), Jody/

# pulled out the common section
-body NAME_BRAND /\b(?:famous name brand|major brand)/i
+body NAME_BRAND /\b(?:famous name|major) brand/i

# Added ?: and pulled unnecessary duplicate \s* out from the second one
-body PRINT_FORM_SIGNATURE /Sign(ature)?(?:\s*here|\s*please)?:.{0,30}___*/i
+body PRINT_FORM_SIGNATURE /Sign(?:ature)?\s*(?:here|please)?:.{0,30}___*/i

# added ?:
-body DOMAIN_BODY /\s(\.|dot\s+)(info|biz|name)\s/i
+body DOMAIN_BODY /\s(?:\.|dot\s+)(?:info|biz|name)\s/i

# escaped the .
-rawbody MONSTERHUT /monsterhut.com/
+rawbody MONSTERHUT /monsterhut\.com/

# escaped the .
-body MYCASINOBUILDER /MYCASINOBUILDER.COM/i
+body MYCASINOBUILDER /MYCASINOBUILDER\.COM/i

# Added ?: and replaced the {1,} with + since they're equivalent
-header FROM_MALFORMED From !~ /(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(\!\S+){1,}>/ [if-unset: unset@unset.unset]
+header FROM_MALFORMED From !~ /(?:\"[^\"]+\"|\S+)\@\S+\.\S+|<\S+(?:\!\S+)+>/ [if-unset: unset@unset.unset]

# Don't need to enclose it in parens
-header PLING_QUERY Subject =~ /(?:\?.*!|!.*\?)/
+header PLING_QUERY Subject =~ /\?.*!|!.*\?/

# Don't need to enclose it in parens
-header SUBJ_HAS_SPACES Subject =~ /(?:\s{6,}|\t)/
+header SUBJ_HAS_SPACES Subject =~ /\s{6,}|\t/

# added ?:
-header INVALID_DATE Date !~ /^((Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?([\d ]?\d) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d\d|\d\d\d\d) \d\d:\d\d(:\d\d)? (UT|[A-Z]{3,5}|[+-]\d\d\d\d)(\s+\(.*\))?\s*$/
+header INVALID_DATE Date !~ /^(?:(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat), )?(?:[\d ]?\d) (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:\d\d|\d\d\d\d) \d\d:\d\d(?:\:\d\d)? (?:UT|[A-Z]{3,5}|[+-]\d\d\d\d)(?:\s+\(.*\))?\s*$/

# added ?:
-header DATE_YEAR_ZERO_FIRST Date =~ /[a-z]\s+0\d\d\d(\s|$)/
+header DATE_YEAR_ZERO_FIRST Date =~ /[a-z]\s+0\d\d\d(?:\s|$)/

# added ?: and escaped .
-header FRIEND_AT_PUBLIC To =~ /(yourdomain|you|your|public).(com|org|net)/i
+header FRIEND_AT_PUBLIC To =~ /(?:yourdomain|you|your|public)\.(?:com|org|net)/i

# added ?:
-header DOMAIN_SUBJECT Subject =~ /(\s(\.|dot\s+)(info|biz|name)|domain)\b.*(extension|info|regist(ry|ration|er)|submission)/i
+header DOMAIN_SUBJECT Subject =~ /(?:\s(?:\.|dot\s+)(?:info|biz|name)|domain)\b.*(?:extension|info|regist(?:ry|ration|er)|submission)/i

# escaped .
-header YAHOO_MSGID_ADDED ALL =~ /Message-Id: <\S+\.mail.yahoo.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s
+header YAHOO_MSGID_ADDED ALL =~ /Message-Id: <\S+\.mail\.yahoo\.com>\nReceived: .*by \S+mail.yahoo.com via HTTP;/s

# escaped .
-header FROM_BTAMAIL From =~ /\@btamail.net.cn/i
+header FROM_BTAMAIL From =~ /\@btamail\.net\.cn/i

# escaped .
-header FROM_UGETMORE From =~ /\@ugetmore4less.net/i
+header FROM_UGETMORE From =~ /\@ugetmore4less\.net/i

# escaped .
-header FROM_TOPICA From =~ /\@(?:\w\.)*email-publisher.com/i
+header FROM_TOPICA From =~ /\@(?:\w\.)*email-publisher\.com/i

# added ?:
-header Q_FOR_SELLER Subject =~ /Question.*(for|to|from eBay).*(seller|Member)/
+header Q_FOR_SELLER Subject =~ /Question.*(?:for|to|from eBay).*(?:seller|Member)/

# added ?:
-uri UNSUB_SCRIPT /^https?:\/\/.*?cgi.*?(unsubscribe|remove)/i
+uri UNSUB_SCRIPT /^https?:\/\/.*?cgi.*?(?:unsubscribe|remove)/i

# the rest of these replace [eu][eu] with (?:eu|ue) to restrict what we match
-body HARDCORE_PORN /\bhard[ -]?core .{0,9}(?:teen|virgin|cheerleader|amat[eu][eu]r)|\bextreme hardcore/i
+body HARDCORE_PORN /\bhard[ -]?core .{0,9}(?:teen|virgin|cheerleader|amat(?:eu|ue)r)|\bextreme hardcore/i

-body HOT_NASTY /\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat[eu][eu]r|slut|adult|cum|xxx|sites?)\b/i
+body HOT_NASTY /\b(?:horny|nasty|hot|wild|young|horniest|nasiest|hottest|wildest|youngest|best|biggest|largest|naughty)\b.{0,9}\b(?:virgin|asian|cheerleader|sex|selection|fuck|fucking|anal|lesbian|incest|chick|pics|movies|video|gay|porn|hardcore|schoolgirls|amat(?:eu|ue)r|slut|adult|cum|xxx|sites?)\b/i

-body AMATUER_PORN /\bamat[eu][eu]r .{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat[eu][eu]r/i
+body AMATUER_PORN /\bamat(?:eu|ue)r .{0,9}(?:sex|porn|star|sites?|college|babes|action|pics)|real amat(?:eu|ue)r/i

-body RAPE /\b(?:virgin|gang|teen|amat[eu][eu]r) rape|rape (?:sites?|sex)\b/i
+body RAPE /\b(?:virgin|gang|teen|amat(?:eu|ue)r) rape|rape (?:sites?|sex)\b/i
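The effect of swapping `[eu][eu]` for `(?:eu|ue)` in the last group of rules can be seen on the bare word. This is a sketch in Python's `re` (assumed equivalent to Perl for these constructs); the misspellings are the ones mentioned earlier in the thread.

```python
import re

loose = re.compile(r"amat[eu][eu]r", re.IGNORECASE)
strict = re.compile(r"amat(?:eu|ue)r", re.IGNORECASE)

# The correct spelling, the common misspelling, and two strings that
# only the character-class version accepts.
words = ["amateur", "amatuer", "amateer", "amatuur"]

loose_matches = [w for w in words if loose.fullmatch(w)]
strict_matches = [w for w in words if strict.fullmatch(w)]
```

The character-class version accepts all four strings, while the alternation keeps only the real spelling and its usual transposition.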
Subject: Re: [SAdev] New: More general rule cleanup On Sun, Jul 21, 2002 at 12:09:35PM -0400, Theo Van Dinter wrote: > # Have gotten FPs off this, and whitespace can't be in the host, so... > # % Visit my homepage: http://i.like.foo.com % > -uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/]*%/ > +uri HTTP_ESCAPED_HOST /^https?\:\/\/[^\/\s]*%/ Actually, this one was fixed in the code (how the URLs are parsed out of the messages), so we can ignore this rule change. I forgot about that when I was looking at the rules. :)
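For the record, the false positive behind this rule change can be reproduced if one assumes that the older URI extraction passed the rule the URL together with the trailing " %" from the signature line. That assumption is mine and is not confirmed anywhere in this thread; the sketch uses Python's `re`, with the Perl escapes `\:` and `\/` written as plain characters.

```python
import re

old_rule = re.compile(r"^https?://[^/]*%")
new_rule = re.compile(r"^https?://[^/\s]*%")

# Hypothetical extracted "URI" that still carries the signature decoration.
signature_uri = "http://someplace.domain.com %"
# A host that really does contain a percent escape.
escaped_uri = "http://www%2Eexample.com/"

old_fp = bool(old_rule.search(signature_uri))
new_fp = bool(new_rule.search(signature_uri))
old_escaped = bool(old_rule.search(escaped_uri))
new_escaped = bool(new_rule.search(escaped_uri))
```

Under that assumption the old pattern fires on the signature line and the whitespace-excluding pattern does not, while both still catch a genuinely escaped host. Once the extraction itself stops at whitespace, the two patterns see the same input and the rule change is unnecessary, as noted above.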
can we resolve this bug?
Subject: Re: More general rule cleanup On Mon, Jul 29, 2002 at 11:07:22AM -0700, bugzilla-daemon@hughes-family.org wrote: > can we resolve this bug? The discussion about the rule changes just stopped. If there are no problems with the remaining changes, I'll make up a patch and we can apply.
Subject: Re: [SAdev] More general rule cleanup I think that the changes we already added and many of them were dismissed as being bad suggestions. bugzilla-daemon@hughes-family.org wrote: >http://www.hughes-family.org/bugzilla/show_bug.cgi?id=584 > >------- Additional Comments From felicity@kluge.net 2002-07-29 12:33 ------- >Subject: Re: More general rule cleanup > >On Mon, Jul 29, 2002 at 11:07:22AM -0700, bugzilla-daemon@hughes-family.org wrote: > >>can we resolve this bug? > >The discussion about the rule changes just stopped. If there are >no problems with the remaining changes, I'll make up a patch and we >can apply.
Subject: Re: More general rule cleanup On Mon, Jul 29, 2002 at 01:11:01PM -0700, bugzilla-daemon@hughes-family.org wrote: > I think that the changes we already added and many of them were > dismissed as being bad suggestions. Well, some people claimed some of the changes were "unreadable". I then posted the list that didn't fit that category and there were no comments. In a quick look at current CVS, they weren't applied, so ...
OK, now checked in. Sorry about the delay, but there were quite a lot of changes to verify...